OverflowML: Auto-optimal model loading for any hardware

Sharing a library I built to solve the “model too big for GPU” problem automatically.

Problem: Loading large models requires knowing which combination of device_map, quantization, and offloading to use, and the right answer varies by hardware. FP8 doesn't work with CPU offload on Windows. INT4 needs bitsandbytes. Sequential offload and attention_slicing crash together.
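Incompatibilities like these can be encoded as explicit checks. A minimal sketch of the idea (the function and rule names here are mine, not overflowml's internals):

```python
import platform
from typing import List, Optional

def check_config(quant: Optional[str], offload: Optional[str],
                 attention_slicing: bool = False) -> List[str]:
    """Return reasons a (quant, offload) combination is invalid; empty if OK.
    Rules mirror the examples above -- illustrative, not exhaustive."""
    errors = []
    if quant == "fp8" and offload and platform.system() == "Windows":
        errors.append("FP8 does not work with CPU offload on Windows")
    if quant in ("int4", "int8"):
        try:
            import bitsandbytes  # noqa: F401 -- INT4/INT8 need this package
        except ImportError:
            errors.append(f"{quant} requires the bitsandbytes package")
    if offload == "sequential" and attention_slicing:
        errors.append("sequential offload and attention_slicing crash together")
    return errors
```

Catching these before load time beats a CUDA OOM or a crash 20 minutes into weight download.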

Solution:

import overflowml

# Detects your hardware, picks strategy, loads with optimal config
model, tokenizer = overflowml.load_model("meta-llama/Llama-3-70B")

Under the hood it:

  • Detects GPU type, VRAM, RAM, FP8/BF16 support
  • Estimates model size from config (no weight download needed)
  • Picks the best strategy: direct load, FP8, BitsAndBytes INT4/INT8, model_cpu_offload, or sequential_cpu_offload
  • Sets up device_map, max_memory, quantization_config automatically
  • Avoids known incompatibilities
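To make the selection concrete, here's a rough sketch of how size estimation and strategy picking could work. The thresholds, names, and headroom factor are my assumptions, not overflowml's actual internals:

```python
def estimate_size_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Footprint from config metadata alone: params x dtype width.
    E.g. a 70B model in BF16 (2 bytes/param) is ~140 GB before overhead."""
    return num_params_billion * bytes_per_param

def pick_strategy(model_gb: float, vram_gb: float, ram_gb: float,
                  fp8_ok: bool = False) -> str:
    """Pick the cheapest strategy that fits, assuming model_gb is BF16 size."""
    budget = vram_gb * 0.9                  # leave headroom for activations
    if model_gb <= budget:
        return "direct"
    if fp8_ok and model_gb / 2 <= budget:   # FP8 halves BF16 weights
        return "fp8"
    if model_gb / 4 <= budget:              # INT4 is ~4x smaller than BF16
        return "bnb_int4"
    if model_gb <= (vram_gb + ram_gb) * 0.9:
        return "model_cpu_offload"          # split across GPU and RAM
    return "sequential_cpu_offload"         # layer-by-layer; slow but fits

print(pick_strategy(estimate_size_gb(70), vram_gb=24, ram_gb=64))
# -> sequential_cpu_offload
```

In the real library the result would then be translated into device_map, max_memory, and quantization_config arguments for transformers.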

Also works with diffusers pipelines:

overflowml.optimize_pipeline(pipe, model_size_gb=40)
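For pipelines, the decision presumably reduces to picking one of diffusers' built-in memory helpers. A hedged sketch of that mapping (the threshold logic is my guess; the enable_* methods named in the comments are real diffusers APIs):

```python
def pipeline_strategy(model_size_gb: float, vram_gb: float) -> str:
    """Map pipeline size vs. VRAM to a diffusers memory technique."""
    if model_size_gb <= vram_gb * 0.9:
        return "none"                    # fits; load to GPU normally
    if model_size_gb <= vram_gb * 3:     # components swap in one at a time
        return "model_cpu_offload"       # pipe.enable_model_cpu_offload()
    return "sequential_cpu_offload"      # pipe.enable_sequential_cpu_offload()

print(pipeline_strategy(model_size_gb=40, vram_gb=24))  # -> model_cpu_offload
```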

CLI tool included:

$ overflowml benchmark      # show which models your hardware can run
$ overflowml plan 70        # detailed strategy for a 70GB model
$ overflowml detect         # show hardware capabilities

Cross-platform: NVIDIA (CUDA), Apple Silicon (MPS/MLX unified memory), AMD (ROCm planned).
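Backend selection across those platforms might look like the following (illustrative only; availability flags are injected so the sketch stays hardware-independent, but in practice they'd come from torch.cuda.is_available() and torch.backends.mps.is_available()):

```python
def pick_backend(cuda: bool, mps: bool, rocm: bool = False) -> str:
    """Choose a device backend in preference order."""
    if cuda:
        return "cuda"   # NVIDIA
    if rocm:
        return "rocm"   # AMD (planned)
    if mps:
        return "mps"    # Apple Silicon unified memory
    return "cpu"
```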

pip install "overflowml[transformers]"   # quotes needed in zsh

GitHub: https://github.com/Khaeldur/overflowml
PyPI: https://pypi.org/project/overflowml/
