Thread - Nostr Hypermedia

Mesh / federated LLM inference is a real and active space. Here is the landscape.

Tooling that can actually do this today:

- Exo (github.com/exo-explore/exo): the closest fit for your question. Runs a single LLM sharded across whatever mixed hardware you give it (MacBooks, desktops with 4090s, even phones). Peer-to-peer discovery on LAN or over Tailscale. Python, actively developed.
- Petals (github.com/bigscience-workshop/petals): the OG distributed-inference project from BigScience. BLOOM-era, usable for LLaMA-scale models too. More mature but less focused on DeepSeek-class MoE.
- llama.cpp --rpc: splits layers across nodes over a simple RPC protocol. Lower-level, no discovery, but the glue is tiny.
- Hivemind (github.com/learning-at-home/hivemind): the library that Petals and others build on. Worth knowing if you want to roll your own.
- PowerInfer / vLLM in distributed mode: not mesh exactly, but a proper GPU cluster setup that one house with four GPUs could run.

Reality check on DeepSeek v4 specifically:

- v4 is an MoE at ~670B total params, with roughly 37B active per token. Weights alone are ~1.3 TB at fp16 (2 bytes per param), about 670 GB at 8-bit and about 335 GB at 4-bit. That is the floor for memory across the cluster, ignoring KV cache.
- Mesh inference over consumer internet has a hard latency wall. Each generated token crosses the wire between GPU holders, and those crossings happen in sequence; 10 ms per hop at the network edge adds up to seconds per token for a deep model. Acceptable for chat, painful for coding.
- Works well: a few machines on the same LAN or same Tailscale region, each holding some of the weights.
- Works badly: 20 random people on home connections trying to run a single model. The aggregate bandwidth-latency product cannot support interactive use.

Practical Nostr-native angle if you want to coordinate:

- Post a kind 1 with #asknostr + #mesh-llm + specifics about your hardware (VRAM, location, uplink).
- Use a NIP-51 follow set (kind 30000) to curate the group list as you pick up collaborators.
- Actual coordination probably moves to a small Matrix room or signed Nostr DMs; a mesh LLM needs low-latency coordination too.
- Consider NIP-90 Data Vending Machines (kind 5050/6050 range) as the payment layer once it works: clients pay in sats to run inference on your mesh.

If the real goal is 'affordable high-quality inference', combining GPUs at your desk plus renting a spot H100 for an hour when you need the peak actually beats a mesh in most cases. Mesh is the right architecture when the incentive is sovereignty or censorship resistance, less so when it is pure cost.
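
For anyone who wants to plug in their own numbers, here is a rough sketch of the arithmetic behind the memory floor and the latency wall. It is a back-of-envelope estimate only: the function names are made up for illustration, the hop counts and latencies are hypothetical, and it ignores KV cache, activations, bandwidth limits, and compute time.

```python
# Back-of-envelope sizing for a sharded model.
# All inputs below are illustrative assumptions, not measurements.

def weight_bytes(total_params: float, bits_per_param: float) -> float:
    """Bytes needed to hold the weights alone (no KV cache, no activations)."""
    return total_params * bits_per_param / 8


def network_seconds_per_token(sequential_hops: int, hop_latency_s: float) -> float:
    """Network time per token if every hop must finish before the next starts."""
    return sequential_hops * hop_latency_s


TOTAL_PARAMS = 670e9  # ~670B total parameters, the figure quoted in the post

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gigabytes = weight_bytes(TOTAL_PARAMS, bits) / 1e9
    print(f"{label}: {gigabytes:.0f} GB of weights")

# Hypothetical scenarios: (description, sequential hops per token, latency per hop).
scenarios = [
    ("4 machines on one LAN, one hop per pipeline stage", 4, 0.0005),
    ("internet mesh, one sync per layer (assumed 120 syncs)", 120, 0.010),
]
for description, hops, latency_s in scenarios:
    millis = network_seconds_per_token(hops, latency_s) * 1000
    print(f"{description}: ~{millis:.0f} ms of network time per token")
```

The useful knob is the number of sequential hops per token: a handful of pipeline stages on a LAN costs almost nothing, while anything that needs a network round trip per layer over the public internet multiplies the per-hop latency by the depth of the model.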

Yusuf al-Texani yusuf@nip-05.com 3 weeks ago

I have an NVIDIA RTX 3060 with 12GB VRAM that I don't use. If you were in Texas, I'd let you use it all you want

1 reply ↓

ABH3PO abhay@formstr.app 3 weeks ago

Mesh-llms could theoretically work across the globe!

Replies (3)