Sharing a library I built to solve the “model too big for GPU” problem automatically.
Problem: Loading large models requires knowing which combination of device_map, quantization, and offloading to use, and the right combination varies by hardware. FP8 doesn’t work with CPU offload on Windows. INT4 needs bitsandbytes. Sequential offload and attention_slicing crash together.
Solution:
import overflowml
# Detects your hardware, picks strategy, loads with optimal config
model, tokenizer = overflowml.load_model("meta-llama/Llama-3-70B")
Under the hood it:
- Detects GPU type, VRAM, RAM, FP8/BF16 support
- Estimates model size from config (no weight download needed)
- Picks the best strategy: direct load, FP8, BitsAndBytes INT4/INT8, model_cpu_offload, or sequential_cpu_offload
- Sets up device_map, max_memory, quantization_config automatically
- Avoids known incompatibilities
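To make the size-estimation and strategy-selection steps above concrete, here is a minimal sketch (my own, not overflowml’s actual code; `estimate_params`, `choose_strategy`, and all thresholds are hypothetical). The config values are from the published Llama-3-70B config.json:

```python
def estimate_params(hidden, layers, intermediate, vocab, heads, kv_heads):
    """Approximate parameter count of a Llama-style model from config fields."""
    head_dim = hidden // heads
    # attention: q and o projections are hidden x hidden; k and v are
    # shrunk by grouped-query attention (kv_heads < heads)
    attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
    # gated MLP: gate, up, and down projections
    mlp = 3 * hidden * intermediate
    # input embeddings plus untied LM head
    embed = 2 * vocab * hidden
    return layers * (attn + mlp) + embed

def choose_strategy(weight_gb, vram_gb, ram_gb, fp8_ok=False):
    """Mirror of the strategy list above, with made-up thresholds."""
    if weight_gb < vram_gb * 0.9:
        return "direct"                      # fits in VRAM as-is
    if fp8_ok and weight_gb / 2 < vram_gb * 0.9:
        return "fp8"                         # halves bf16 weights
    if weight_gb / 2 < vram_gb * 0.9:
        return "bnb_int8"                    # bitsandbytes 8-bit
    if weight_gb / 4 < vram_gb * 0.9:
        return "bnb_int4"                    # bitsandbytes 4-bit
    if weight_gb < vram_gb + ram_gb:
        return "model_cpu_offload"
    return "sequential_cpu_offload"

params = estimate_params(hidden=8192, layers=80, intermediate=28672,
                         vocab=128256, heads=64, kv_heads=8)
weight_gb = params * 2 / 1e9                 # bf16 = 2 bytes per parameter
print(round(params / 1e9), "B params,", round(weight_gb), "GB in bf16")
print(choose_strategy(weight_gb, vram_gb=24, ram_gb=64))
```

This is why no weight download is needed: the handful of integers in config.json pin down the footprint to within a percent or two.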
Also works with diffusers pipelines:
overflowml.optimize_pipeline(pipe, model_size_gb=40)
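For diffusers, the underlying knobs are diffusers’ own enable_model_cpu_offload() and enable_sequential_cpu_offload(); a wrapper presumably just decides between them. A hypothetical sketch of that decision (my function name and thresholds, not overflowml’s code):

```python
def pick_diffusers_offload(model_size_gb: float, vram_gb: float) -> str:
    """Hypothetical choice between diffusers' real offload modes:
    - "direct": pipeline fits in VRAM, no offload needed
    - pipe.enable_model_cpu_offload(): one component on GPU at a time,
      modest slowdown
    - pipe.enable_sequential_cpu_offload(): layer-by-layer streaming,
      much slower but minimal VRAM
    """
    if model_size_gb < vram_gb * 0.9:
        return "direct"
    if model_size_gb < vram_gb * 3:
        return "enable_model_cpu_offload"
    return "enable_sequential_cpu_offload"

print(pick_diffusers_offload(40, 24))  # a 40 GB pipeline on a 24 GB GPU
# -> enable_model_cpu_offload
```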
CLI tool included:
$ overflowml benchmark # shows what models your hardware can run
$ overflowml plan 70 # detailed strategy for a 70GB model
$ overflowml detect # show hardware capabilities
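The kind of probing behind a detect command can be sketched with plain torch and os calls (again my sketch, assuming a POSIX system; `detect_hardware` is a hypothetical name, not the tool’s API):

```python
import os

def detect_hardware():
    """Rough hardware probe: RAM, GPU name, VRAM, bf16 support."""
    info = {"ram_gb": 0.0, "gpu": None, "vram_gb": 0.0, "bf16": False}
    # total system RAM (POSIX; Windows would need a different path)
    if hasattr(os, "sysconf"):
        info["ram_gb"] = (os.sysconf("SC_PAGE_SIZE")
                          * os.sysconf("SC_PHYS_PAGES")) / 1e9
    try:
        import torch
        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            info["gpu"] = props.name
            info["vram_gb"] = props.total_memory / 1e9
            info["bf16"] = torch.cuda.is_bf16_supported()
        elif torch.backends.mps.is_available():
            info["gpu"] = "Apple Silicon (MPS)"
            info["vram_gb"] = info["ram_gb"]   # unified memory
    except ImportError:
        pass  # no torch installed: CPU-only report
    return info

print(detect_hardware())
```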
Cross-platform: NVIDIA (CUDA), Apple Silicon (MPS/MLX unified memory), AMD (ROCm planned).
pip install overflowml[transformers]
GitHub: github.com/Khaeldur/overflowml
PyPI: pypi.org/project/overflowml