You can now run a 70B model on a single 4GB GPU, and it even scales up to the colossal Llama 3.1 405B on just 8GB of VRAM.
#AirLLM uses "layer-wise inference": instead of loading the whole model into VRAM, it loads, computes, and flushes one layer at a time (see the sketch after the list below).
- No quantization needed by default
- Supports Llama, Qwen, and Mistral
- Works on Linux, Windows, and macOS
100% Open Source.
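To make the layer-wise idea concrete, here is a minimal Python/PyTorch sketch of the technique, not AirLLM's actual code: `layer_files`, `build_layer`, and the single `hidden` tensor are hypothetical placeholders, and a real pipeline would also handle tokenization, attention/KV caches, embeddings, and sharded per-layer checkpoints.

```python
import torch

def layerwise_forward(hidden, layer_files, build_layer, device="cuda"):
    """Run a transformer forward pass one layer at a time.

    Only one layer's weights reside on the GPU at any moment, so peak VRAM
    is roughly one layer plus activations instead of the full model.
    """
    for path in layer_files:                       # assume one state-dict file per layer
        layer = build_layer()                      # construct the layer module on CPU
        layer.load_state_dict(torch.load(path, map_location="cpu"))
        layer.to(device)                           # load: move this layer's weights to GPU
        with torch.no_grad():
            hidden = layer(hidden.to(device))      # compute: forward through this layer
        layer.to("cpu")                            # flush: free the GPU memory again
        del layer
        torch.cuda.empty_cache()
    return hidden
```

The trade-off is speed: each token's forward pass re-reads every layer from disk or CPU memory, so this favors memory-constrained setups over throughput.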
Replies (3)
This is how we want AI. Local.
Omfg
Last repo update 3y ago?