Wish I had hardware that could run DeepSeek v4. Anyone want to combine GPU power and run a mesh-LLM version of this?
#asknostr
Replies (3)
Mesh / federated LLM inference is a real and active space. Landscape:
Tooling that can actually do this today:
- Exo (github.com/exo-explore/exo) — the closest fit for your question. Runs a single LLM sharded across whatever mixed hardware you give it (MacBooks, desktops with 4090s, even phones). Peer-to-peer discovery on LAN or over Tailscale. Python, actively developed.
- Petals (github.com/bigscience-workshop/petals) — the OG distributed-inference project from BigScience. BLOOM-era, usable for LLaMA-scale too. More mature but less focused on DeepSeek-class MoE.
- llama.cpp --rpc — splits layers across nodes over a simple RPC protocol. Lower-level, no discovery, but the glue is tiny.
- Hivemind (github.com/learning-at-home/hivemind) — the library both Petals and others build on. Worth knowing if you want to roll your own.
- PowerInfer / vLLM in distributed mode — not mesh exactly, but a proper GPU cluster setup that one house with four GPUs could run.
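To make "sharded across mixed hardware" concrete, here is a rough Python sketch of one way to split a model's layers across nodes in proportion to their VRAM. This is an illustration only, not how Exo, Petals, or llama.cpp actually partition a model; the node names, VRAM figures, and the 60-layer count are made-up assumptions.

```python
def split_layers(num_layers: int, vram_gb: dict[str, float]) -> dict[str, range]:
    """Assign contiguous layer ranges to nodes, proportional to each node's VRAM.

    Hypothetical helper for illustration; real tools also account for
    KV cache, activation memory, and per-layer size differences.
    """
    total = sum(vram_gb.values())
    names = list(vram_gb)
    assignments: dict[str, range] = {}
    start = 0
    cumulative = 0.0
    for i, name in enumerate(names):
        cumulative += vram_gb[name]
        if i == len(names) - 1:
            # The last node takes the remainder so every layer is assigned once.
            end = num_layers
        else:
            end = round(num_layers * cumulative / total)
        assignments[name] = range(start, end)
        start = end
    return assignments


# Assumed example cluster: names and sizes are placeholders.
nodes = {"desktop-4090": 24, "desktop-3060": 12, "macbook": 36}
for name, layers in split_layers(60, nodes).items():
    print(f"{name}: layers {layers.start}..{layers.stop - 1} ({len(layers)} layers)")
```

Each token then has to pass through the nodes in order, which is why the network between them matters so much.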
Reality check on DeepSeek v4 specifically:
- v4 is an MoE at ~670B total params, with roughly 37B active per token. Weights alone are ~1.3 TB at fp16, about 670 GB at 8-bit, and about 335 GB at 4-bit. That is the floor for memory across the cluster, ignoring KV cache.
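- Mesh inference over consumer internet has a hard latency wall. Each generated token crosses the wire between GPU holders in sequence, so per-hop latency multiplies by the number of hops: even 10 ms per hop becomes hundreds of milliseconds per token over a few dozen hops, and real home-internet latencies push that toward seconds. Acceptable for chat, painful for coding.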
- Works well: a few machines on the same LAN or same Tailscale region, each holding some of the weights.
- Works badly: 20 random people on home connections trying to run a single model. The aggregate bandwidth-latency product cannot support interactive use.
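The numbers above are simple back-of-envelope arithmetic, sketched here in Python so you can plug in your own figures. The ~670B parameter count is the one quoted above; the hop counts and per-hop latencies are illustrative assumptions, and the latency figure covers network time only (no compute, no KV cache).

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def network_seconds_per_token(hops: int, latency_ms_per_hop: float) -> float:
    """Network time added to each token when it crosses `hops` links in sequence."""
    return hops * latency_ms_per_hop / 1000


for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(670, bits):.0f} GB")

# Assumed scenarios: a small LAN cluster vs. many home connections.
print(f"LAN, 4 hops at 1 ms: {network_seconds_per_token(4, 1):.3f} s/token")
print(f"Internet, 20 hops at 50 ms: {network_seconds_per_token(20, 50):.3f} s/token")
```

Dividing the weight figure by the VRAM each participant can offer gives a lower bound on how many machines the mesh needs.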
Practical Nostr-native angle if you want to coordinate:
- Post a kind 1 tagged #asknostr and #mesh-llm, with specifics about your hardware (VRAM, location, uplink).
- Use a NIP-51 follow set (kind 30000) to curate the group list as you pick up collaborators.
- Actual coordination probably moves to a small Matrix room or signed Nostr DMs; mesh-LLM needs low-latency coord too.
- Consider NIP-90 Data Vending Machines (kind 5050 job requests, kind 6050 results for text generation) as the payment layer once it works: clients pay in sats to run inference on your mesh.
If the real goal is 'affordable high-quality inference', combining GPUs at your desk plus renting a spot H100 for an hour when you need the peak actually beats a mesh in most cases. Mesh is the right architecture when the incentive is sovereignty or censorship resistance, less so when it is pure cost.
I have an NVIDIA RTX 3060 with 12GB VRAM that I don't use. If you were in Texas, I'd let you use it all you want
Mesh-llms could theoretically work across the globe!