When you run large models, it is not only the weights that are heavy: the RAM consumed by the context window, KV cache, and runtime buffers grows even more significantly.
What "fits in memory" actually means (for GGUF/llama.cpp-style inference)
When you run a local LLM, you're budgeting for more than just the model file size:
- Model weights (largest chunk; roughly the GGUF "file size" for a quant).
- KV cache (grows with context length; can be multiple GB).
- Runtime overhead (temporary buffers, CUDA/Metal/DirectML workspaces, tokenizer, graphs, fragmentation, etc.).
- OS + background apps.
On Windows, there are two different "limits" that matter:
- Commit limit (RAM + page file). If you exceed this, allocations fail and the load aborts or the process crashes. Microsoft describes the system commit limit as RAM plus all page files and shows how Task Manager reports "Committed". (Microsoft Learn)
- Physical RAM working set (what's actually in RAM right now). If you exceed physical RAM by a lot, Windows pages heavily, causing massive slowdowns even if the model "still runs".
1) "Can a model be a bit bigger than my 32GB RAM if I have 16GB VRAM?"
The key point
VRAM is not an extension of system RAM that you can simply add up as "48GB unified". On a discrete GPU, VRAM ("local") and system memory ("non-local") are different pools; Windows can map/evict GPU allocations, but using non-local/system memory for GPU workloads is far slower and can thrash. (Microsoft Learn)
What "partial GPU offload possible" usually means
Backends in the llama.cpp family can place some weight tensors on the GPU and keep the rest in system RAM (with the KV cache on GPU or CPU depending on settings). You don't need "the entire model in RAM" in the simplistic sense; you need the total allocations to stay within the commit limit, and enough physical RAM to avoid paging cliffs.
LM Studio specifically describes GPU offloading as splitting the model into subgraphs and moving them to the GPU, loading/unloading as needed. (LM Studio)
Your Nemotron example (Q8_0 ≈ 33.6GB)
The file is about 33.6GB for Q8_0 on HF, and smaller quants exist (Q4_K_M ≈ 24.5GB; Q3_K_L ≈ 20.7GB). (Hugging Face)
With 32GB RAM, Q8_0 is right on the edge even before KV cache + overhead. Whether it loads depends on:
- how much weight can be offloaded into dedicated VRAM,
- how big your page file / commit limit is,
- how much overhead and background usage you have,
- and what context length you use (KV cache).
Why Task Manager shows "10.2 / 63.5 committed": the 63.5GB is your commit limit (RAM + page file). Windows explicitly defines that relationship. (Microsoft Learn)
So: yes, a model slightly bigger than physical RAM can still load and run if the commit limit allows it, but performance may collapse if it actually needs to page.
2) "Why does LM Studio say Hermes 4 70B Q3/Q4 is 'likely too large' even though I have RAM+VRAM (or paging)? Where's the line?"
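The budgeting above can be sketched as plain arithmetic. A minimal sanity check, where the KV cache, overhead, OS usage, and VRAM offload figures are illustrative assumptions rather than measurements:

```python
# Rough memory-budget sanity check for loading a quantized model on Windows.
# All non-cited numbers below are illustrative assumptions; measure your own system.
weights_gb      = 33.6   # Nemotron Q8_0 GGUF file size (from the HF listing)
kv_cache_gb     = 1.25   # KV cache at a modest context (assumed)
runtime_ovh_gb  = 2.0    # buffers/workspaces/fragmentation (assumed)
os_apps_gb      = 4.0    # OS + background apps (assumed)

physical_ram_gb = 32.0
commit_limit_gb = 63.5   # RAM + page file, as Task Manager reports it

vram_offload_gb = 12.0   # weights you manage to place in dedicated VRAM (assumed)

# Weights not offloaded to VRAM stay in system RAM, plus KV cache and overhead.
cpu_side_gb = weights_gb - vram_offload_gb + kv_cache_gb + runtime_ovh_gb

fits_commit   = cpu_side_gb + os_apps_gb < commit_limit_gb   # "it loads at all"
fits_physical = cpu_side_gb + os_apps_gb < physical_ram_gb   # "it runs without paging"

print(f"CPU-side footprint: {cpu_side_gb:.2f} GB")
print(f"within commit limit: {fits_commit}, within physical RAM: {fits_physical}")
```

With these assumed numbers the load succeeds and stays inside physical RAM; shrink `vram_offload_gb` or grow the context and `fits_physical` flips first, which is exactly the "loads but pages" regime.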
Why the warning happens
A 70B model is "weights + KV cache + overhead". Even if the weights alone look only slightly above RAM, the KV cache quickly turns "slightly above" into "far above".
Hermes-4-70B GGUF sizes (examples):
- Q3_K_L ~ 37.14GB
- Q4_K_M ~ 42.52GB
…and it also has smaller "importance/mixed" quants like IQ3_M ~ 31.94GB, IQ3_XXS ~ 27.47GB, etc. (Hugging Face)
KV cache: the hidden memory eater (concrete numbers)
Hermes-4-70B is based on Llama-3.1-70B. (Hugging Face)
The Llama-3.1-70B config (notably) has 80 layers, 64 attention heads, 8 KV heads, and hidden size 8192. (Hugging Face)
For fp16 KV cache, memory per token is approximately:
- head_dim = 8192 / 64 = 128
- KV bytes/token ≈ 80 layers × 8 kv_heads × 128 head_dim × 2 bytes × 2 (K+V)
- = 327,680 bytes/token ≈ 320 KB/token
That implies KV cache alone is roughly:
- 2k context: ~0.6 GB
- 4k: ~1.25 GB
- 8k: ~2.5 GB
- 32k: ~10 GB
- 128k: ~40 GB
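The per-token arithmetic above generalizes to any transformer with grouped-query attention. A small estimator, using the Llama-3.1-70B config values cited above:

```python
# KV-cache size estimator for a standard multi-head/GQA transformer.
# bytes/token = layers * kv_heads * head_dim * bytes_per_elem * 2 (K and V)

def kv_bytes_per_token(n_layers, n_heads, n_kv_heads, hidden_size, bytes_per_elem=2):
    head_dim = hidden_size // n_heads
    return n_layers * n_kv_heads * head_dim * bytes_per_elem * 2

# Llama-3.1-70B (the base of Hermes-4-70B): 80 layers, 64 heads, 8 KV heads, hidden 8192
per_tok = kv_bytes_per_token(80, 64, 8, 8192)  # fp16 cache -> 327680 bytes = 320 KB/token

for ctx in (2048, 4096, 8192, 32768, 131072):
    print(f"{ctx:>6} tokens -> {per_tok * ctx / 2**30:.2f} GiB")
```

Running it reproduces the table above (about 0.6 GiB at 2k up to 40 GiB at 128k), which is why context length, not just the quant, decides whether a 70B fits.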
So even if the weights could "barely" fit, a bigger context pushes you over.
Where to draw the line (practical "normal inference" guidance)
On 32GB RAM + 16GB VRAM (Windows), for "normal" usability (not minutes per token), a good rule is:
- Keep the CPU-side resident footprint under ~26-28GB (leaves headroom for the OS).
- Prefer weights that are clearly below RAM, not "barely above".
- Treat paging as a last resort: it is for "it runs eventually", not for "usable".
For Hermes-4-70B specifically, that typically means:
- If you insist on 70B: try IQ3_M (~31.9GB) or the smaller IQ3_XXS (~27.5GB) first (and keep context modest). (Hugging Face)
- Q3_K_L (~37GB) and Q4_* (~40-44GB) are generally "paging territory" on 32GB unless you offload a large fraction of weights to VRAM and keep KV/offload choices conservative.
3) "Extreme modes: VRAM-as-RAM? Model streaming from file? Does LM Studio/Ollama support it?"
3a) "Can I use VRAM like normal RAM to get 48GB usable?"
Not in the way you want on a discrete GPU.
- On Windows (WDDM), discrete GPUs have local (VRAM) and non-local (system memory) budgets; system memory can be used for GPU resources, but it's not equivalent to extra VRAM and is much slower. (Microsoft Learn)
- "Shared GPU memory" shown in Task Manager is basically system RAM that the GPU may borrow, typically as spillover. This is exactly the scenario that causes dramatic slowdowns.
A related real-world pitfall: LM Studio users have reported cases where memory goes into shared GPU memory instead of dedicated VRAM and causes issues. (GitHub)
LM Studio even added an option specifically to limit offload to dedicated GPU memory to avoid spilling into shared memory (performance cliff). (LM Studio)
3b) "Model streaming from file"
In llama.cpp terms, what people often call "streaming" is memory mapping (mmap):
- The model file is mapped into the process address space and pages are pulled from disk on demand by the OS.
llama.cpp exposes this explicitly: --mmap / --no-mmap. (manpages.debian.org)
What mmap does and does not do:
- It can reduce up-front load time and avoid copying the entire file immediately.
- It can let a model "load" even when RAM is tight (because the OS can page).
- During generation, if the model touches most weight pages, your working set trends toward "nearly the whole model anyway".
- If you're over RAM, this often degrades into random NVMe reads plus page faults, i.e. extremely slow.
So: "streaming" is real, but it's not magic; it often just turns into paging.
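To make the mechanism concrete (this is the OS facility llama.cpp uses, not llama.cpp itself), here is a minimal Python sketch: a file stands in for a model, and reading through the mapping faults pages in from disk on demand:

```python
# Minimal demonstration of the mmap mechanism behind "model streaming":
# the file is mapped into the address space, and the OS pages bytes in on first touch.
import mmap
import os
import tempfile

# Stand-in for a model file (a real GGUF would be tens of GB).
path = os.path.join(tempfile.mkdtemp(), "fake_model.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096 * 16)  # sixteen 4 KiB "pages" of zeros

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Mapping is cheap; no page needs to be resident yet.
    first_page = mm[:4096]   # the OS reads this page from disk on demand
    last_byte = mm[-1]       # and this one, skipping untouched pages in between
    mm.close()

print(len(first_page), last_byte)
```

The same logic explains the "paging cliff": if generation ends up touching nearly every page of a 30-40GB mapping on a 32GB machine, the OS is forced to evict and re-read pages continuously.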
3c) Ollama: mmap and memory checks
Ollama has a use_mmap concept, but multiple issues indicate:
- it may not apply to all engine/model types, and
- it doesn't necessarily bypass "needs full memory" behavior for some models. (GitHub)
Also, there are reports that use_mmap=true can have side effects like memory not being released as expected in some scenarios. (GitHub)
So if your primary goal is to "squeeze borderline models into memory", direct llama.cpp (or a tool that exposes llama.cpp flags cleanly) is usually the most controllable.
Concrete best practices for your exact hardware (32GB RAM + 16GB VRAM, Windows)
A) Treat 70B as "special handling", not the default
If you want Hermes-4-70B locally:
1. Pick a smaller 70B quant first
  - Start with IQ3_M (~31.9GB) or even IQ3_XXS (~27.5GB). (Hugging Face)
  - Avoid Q4_* (~40GB+) on 32GB unless you accept heavy paging.
2. Keep context modest
  - 2k-4k is the difference between "maybe works" and "falls over" (KV cache grows fast; see numbers above).
3. Keep KV cache off the GPU if VRAM is tight
  - If weights already consume most of 16GB VRAM, pushing KV there can trigger spill into shared GPU memory (bad cliff).
  - llama.cpp exposes --kv-offload/--no-kv-offload and KV cache quantization (--cache-type-k, --cache-type-v). (manpages.debian.org)
4. Use KV cache quantization when RAM-bound
  - For example, setting the KV cache to q8 or q4 can materially reduce KV memory. (Tradeoff: extra compute/latency; depends on backend/hardware.) (manpages.debian.org)
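The savings from KV-cache quantization can be estimated with the same per-token formula. A sketch, where the effective bytes-per-value for the quantized cache types are my assumption based on common GGUF block layouts (q8_0: 34 bytes per 32 values; q4_0: 18 bytes per 32):

```python
# Approximate KV-cache size for different cache types, reusing the
# Llama-3.1-70B shape (80 layers, 8 KV heads, head_dim 128).
# Bytes-per-value for q8_0/q4_0 are assumed GGUF block layouts, not measured.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_gib(ctx, cache_type, n_layers=80, n_kv_heads=8, head_dim=128):
    per_tok = n_layers * n_kv_heads * head_dim * 2 * BYTES_PER_VALUE[cache_type]
    return ctx * per_tok / 2**30  # K and V, in GiB

for ct in ("f16", "q8_0", "q4_0"):
    print(f"8k ctx, {ct:>5}: {kv_gib(8192, ct):.2f} GiB")
```

At 8k context this takes the fp16 cache's ~2.5 GiB down to roughly half (q8) or a quarter-ish (q4), which is often the difference between spilling and fitting on a RAM-bound setup.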
B) For Nemotron-3-Nano-30B-A3B: use the right quant instead of forcing Q8
You cited Q8_0 (~33.6GB). The same repo lists:
- Q4_K_M ~24.5GB
- Q3_K_L ~20.7GB
- Q6_K ~33.5GB
- Q8_0 ~33.6GB (Hugging Face)
On 32GB RAM, Q4_K_M is the "runs normally" option; Q8 is "edge/paging" unless you can offload a meaningful chunk of weights to VRAM.
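One way to apply this mechanically: pick the largest quant whose file size fits under a CPU-side budget (the ~26-28GB guideline from section 2). A sketch using the repo's listed sizes; the budget choice is an assumption:

```python
# Pick the largest Nemotron quant whose file size fits a CPU-side RAM budget,
# ignoring VRAM offload for simplicity. Sizes in GB from the Hugging Face listing.
quants = {"Q8_0": 33.6, "Q6_K": 33.5, "Q4_K_M": 24.5, "Q3_K_L": 20.7}

def best_fit(budget_gb):
    """Largest listed quant at or under the budget, or None if nothing fits."""
    fitting = {q: s for q, s in quants.items() if s <= budget_gb}
    return max(fitting, key=fitting.get) if fitting else None

print(best_fit(27.0))  # -> Q4_K_M: the "runs normally" pick on 32GB RAM
print(best_fit(19.0))  # -> None: no listed quant fits this budget
```

Offloading weights to VRAM effectively raises the budget, which is how Q6/Q8 can become viable on the same machine.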
C) Page file: make it a conscious tool, not an accident
If you decide to run borderline models with paging:
- Put the page file on a fast SSD/NVMe.
- Ensure the commit limit is comfortably above your worst-case usage (Windows defines commit limit as RAM + page files). (Microsoft Learn)
This prevents hard allocation failures, but it does not prevent slowdowns.
D) If you stay with LM Studio
- Enable the setting to avoid spilling into shared GPU memory (limit to dedicated VRAM) when youâre near VRAM capacity. (LM Studio)
- Use partial offload deliberately (donât assume VRAM âaddsâ to RAM; itâs a placement decision).
E) If youâre open to switching tools
For memory-constrained, "I'll wait longer" workloads:
- Direct llama.cpp / llama-server: most knobs (mmap, KV offload, KV quantization). (manpages.debian.org)
- Ollama: simplest UX, but mmap behavior can be inconsistent across engines/models per issues. (GitHub)
- LM Studio: good GUI; has explicit work on controlling dedicated vs shared GPU memory and supports GPU offload workflows. (LM Studio)
Similar cases / useful issues & references (worth reading)
- Windows commit/page file mechanics (so Task Manager numbers make sense): (Microsoft Learn)
- WDDM GPU memory model (why VRAM doesn't become "extra RAM"): (Microsoft Learn)
- LM Studio dedicated-vs-shared GPU memory control: (LM Studio)
- llama.cpp server flags for mmap + KV controls + KV quantization: (manpages.debian.org)
- Ollama mmap limitations / behavior discussions: (GitHub)
- Hermes-4-70B GGUF quant size table (includes IQ quants): (Hugging Face)
- Nemotron-3-Nano-30B-A3B GGUF file sizes (Q3/Q4/Q6/Q8): (Hugging Face)
Bottom line recommendations for you
- If you want "normal" inference: stick to ≤30B in Q4-ish quants (or 70B only in very small IQ quants).
- If you want Hermes-4-70B anyway: start at IQ3_M (or smaller), keep context 2k-4k, and be deliberate about KV placement/quantization. (Hugging Face)
- Don't plan on "48GB unified"; plan on careful tensor placement and avoiding shared-GPU-memory spill. (Microsoft Learn)