Thread - Nostr Hypermedia

r/LocalLLaMA is the best resource for this question.

tl;dr: For agentic-type tasks running in the background, probably an Apple M-series machine with lots of VRAM. Qwen 3.5 27b has reached a level of agentic effectiveness (something like Opus 4.1) that is kind of staggering for a model that can run on a single 3090.

Rough breakdown:

Most cost-effective + fastest: stack 3090s until you have enough VRAM to run the model size class you want.

Most cost-effective / easiest / most power-efficient: buy an M1 Max with as much VRAM as you need to run the models you want. I recently got a 64 GB M1 Max for $1200 that runs Qwen 3.5 122b at about 200 t/s prompt processing and 20 t/s generation. It runs continuous OpenClaw cron jobs in the background, sipping power, not heating up the room or making any noise. Love it.

Bigger budget + fastest: stack 5090s (32 GB each), or if you don't want so many GPUs to physically manage, an NVIDIA RTX Pro 6000 Blackwell (96 GB).

Bigger budget + easiest: an M3 Ultra or M4 Max with lots of VRAM, or wait for the M5 Ultra/Max.

Performance comparison:

GitHub

Performance of llama.cpp on Apple Silicon M-series · ggml-org/llama.cpp · Discussion #4167

Summary LLaMA 7B BW [GB/s] GPU Cores F16 PP [t/s] F16 TG [t/s] Q8_0 PP [t/s] Q8_0 TG [t/s] Q4_0 PP [t/s] Q4_0 TG [t/s] ✅ M1 1 68 7 108.21 7.92 10...
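For a feel of what prompt-processing (PP) and text-generation (TG) rates mean in practice, here is a rough sketch that turns the two numbers into a per-request time. It assumes the two phases simply run one after the other at constant speed, which ignores caching, batching and slowdown at long context; the function name and the token counts are made up for illustration.

```python
def estimate_request_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Rough wall-clock estimate for one request.

    Assumption: prompt processing and generation run sequentially at
    constant rates (pp_tps and tg_tps, both in tokens per second).
    """
    if pp_tps <= 0 or tg_tps <= 0:
        raise ValueError("token rates must be positive")
    prompt_seconds = prompt_tokens / pp_tps
    generation_seconds = output_tokens / tg_tps
    return prompt_seconds + generation_seconds


# Hypothetical request: 4000-token prompt, 500-token reply,
# using the ~200 t/s PP and ~20 t/s TG figures quoted in the post.
seconds = estimate_request_seconds(4000, 500, pp_tps=200, tg_tps=20)
print(f"about {seconds:.0f} s per request")
```

For unattended cron-style jobs this kind of latency is usually fine; for interactive use, the TG rate tends to be the number you feel most.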

AMD is the place to go for better cost effectiveness, in exchange for more software-management headaches. The Ryzen AI Max+ 395 (128 GB) is an interesting alternative to large-VRAM Mac setups for running big models with minimal hardware and maximum power efficiency. Comparison of relevant models:

Comparison of AI Models across Intelligence, Performance, and Price

Comparison and analysis of AI models across key performance metrics including quality, price, output speed, latency, context window & others.
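The "get enough VRAM for the model size class you want" rule from the breakdown can be sketched as back-of-the-envelope arithmetic. Everything numeric here is an assumption for illustration: roughly 4.5 bits per weight for a typical 4-bit quant, and a flat 20% on top for KV cache and runtime overhead. Real usage depends on the quantization, context length and runtime, and more aggressive quants shrink the footprint considerably.

```python
import math


def estimate_memory_gb(params_billion, bits_per_weight=4.5, overhead=1.2):
    """Rough memory footprint in GB for a quantized model.

    Assumptions (not measured values): bits_per_weight=4.5 approximates a
    4-bit quant, and overhead=1.2 adds 20% for KV cache and runtime buffers.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead


def gpus_needed(params_billion, vram_per_gpu_gb, **kwargs):
    """How many identical GPUs to stack so the estimate fits in total VRAM."""
    needed_gb = estimate_memory_gb(params_billion, **kwargs)
    return math.ceil(needed_gb / vram_per_gpu_gb)


# A 27b model on 24 GB cards (3090-class), under the assumptions above.
print(f"{estimate_memory_gb(27):.1f} GB estimated, "
      f"{gpus_needed(27, 24)} x 24 GB GPU(s)")

# The same helper with a lower bits-per-weight setting for a larger model.
print(f"{estimate_memory_gb(122, bits_per_weight=3.0):.1f} GB estimated at ~3 bits")
```

Swap in 32 for 5090s, 96 for an RTX Pro 6000 Blackwell, or the memory size of a Mac or Ryzen AI Max+ box to compare options at a given model size.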
