You can now run a 70B model on a single 4GB GPU, and it even scales up to the colossal Llama 3.1 405B on just 8GB of VRAM. #AirLLM uses "layer-wise inference": instead of loading the whole model, it loads, computes, and flushes one layer at a time (see the sketch below).
- No quantization needed by default
- Supports Llama, Qwen, and Mistral
- Works on Linux, Windows, and macOS
100% Open Source.
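
A minimal sketch of the layer-wise inference idea, not AirLLM's actual implementation: keep all layer weights off the GPU, then for each pass move one layer onto the GPU, run it, and free it before the next. The toy model, layer sizes, and helper names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for a transformer: a stack of layers kept on the CPU,
# so GPU memory only ever needs to hold one layer plus activations.
layers = [nn.Linear(4096, 4096) for _ in range(8)]

def layerwise_forward(x: torch.Tensor) -> torch.Tensor:
    x = x.to(device)
    for layer in layers:
        layer.to(device)            # load this layer's weights onto the GPU
        with torch.no_grad():
            x = layer(x)            # compute the layer's output
        layer.to("cpu")             # flush the weights back off the GPU
        if device == "cuda":
            torch.cuda.empty_cache()
    return x

print(layerwise_forward(torch.randn(1, 4096)).shape)
```

The trade-off is speed: weights are shuttled between host and GPU memory on every forward pass, so throughput is much lower than keeping the full model resident, but peak VRAM stays roughly at the size of the largest single layer.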

Replies (3)

n0>1 1 week ago
This is how we want AI. Local.