You can now run a 70B model on a single 4GB GPU, and it even scales up to the colossal Llama 3.1 405B on just 8GB of VRAM. #AirLLM uses "layer-wise inference": instead of loading the whole model, it loads, computes, and flushes one layer at a time (see the sketch below).
- No quantization needed by default
- Supports Llama, Qwen, and Mistral
- Works on Linux, Windows, and macOS
100% Open Source.
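
A minimal sketch of the layer-wise inference idea, not AirLLM's actual implementation: keep all layer weights off the GPU, then for each pass move one layer onto the GPU, run it, and free it before the next. The toy model, layer sizes, and helper names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for a transformer: a stack of layers kept on the CPU,
# so GPU memory only ever needs to hold one layer plus activations.
layers = [nn.Linear(4096, 4096) for _ in range(8)]

def layerwise_forward(x: torch.Tensor) -> torch.Tensor:
    x = x.to(device)
    for layer in layers:
        layer.to(device)            # load this layer's weights onto the GPU
        with torch.no_grad():
            x = layer(x)            # compute the layer's output
        layer.to("cpu")             # flush the weights back off the GPU
        if device == "cuda":
            torch.cuda.empty_cache()
    return x

print(layerwise_forward(torch.randn(1, 4096)).shape)
```

The trade-off is speed: weights are shuttled between host and GPU memory on every forward pass, so throughput is much lower than keeping the full model resident, but peak VRAM stays roughly at the size of the largest single layer.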

Replies (3)

n0>1 1 week ago
This is how we want AI. Local.