Gemma 4 E2B-it - RotorQuant AWQ 4-bit

A 4-bit AWQ-quantized version of google/gemma-4-E2B-it (instruction-tuned) with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) uses activation statistics to protect the most salient weight channels during quantization and is well suited to GPU inference. RotorQuant reports 5.3x faster prefill and 28% faster decode than TurboQuant, which makes it a strong choice for low-latency chat serving.

Approximate model size: ~1.5 GB

Note: RotorQuant KV cache modes (planar3, iso3) require the RotorQuant fork or the llama-cpp-turboquant fork. The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; RotorQuant KV-cache kernels are opt-in.

Model Specifications

Property | Value
Base Model | google/gemma-4-E2B-it
Parameters | ~2 billion
Architecture | Dense transformer, instruction-tuned
Modality | Multimodal: image + text input, text output
License | Apache 2.0
Weight Quantization | AWQ 4-bit (~1.5 GB)
Group Size | 128
KV-Cache Quantization | RotorQuant (planar3 / iso3)
Framework | transformers + AutoAWQ / vLLM
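To make the Group Size entry concrete: with group-wise quantization, every 128 consecutive weights share one scale and one offset. The sketch below illustrates only that storage scheme in plain Python. It is not AutoAWQ's implementation, it omits AWQ's activation-aware channel scaling, and the function names are hypothetical.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, group minimum)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Asymmetric uniform quantization of one group (min/scale form)."""
    qmax = (1 << bits) - 1          # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row: List[float], group_size: int = 128, bits: int = 4) -> List[Group]:
    """Split a weight row into groups and quantize each one independently."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]
```

Smaller groups track local weight ranges more closely at the cost of storing more scales; the rounding error inside a group is at most half of that group's scale.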

Quickstart

AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load the 4-bit AWQ weights; fuse_layers=True enables AutoAWQ's fused modules.
model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit")

messages = [{"role": "user", "content": "Explain RotorQuant briefly."}]
# Render the chat template to token IDs and append the assistant turn marker.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
# The AutoAWQ wrapper may not expose .device, so read it from the parameters.
inputs = inputs.to(next(model.parameters()).device)
out = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))

vLLM

vllm serve majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit \
  --quantization awq_marlin \
  --max-model-len 8192
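Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the standard library and assumes the server listens on vLLM's default address, http://localhost:8000, with the model name equal to the repo ID passed to vllm serve. The helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit"


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, with the server started as shown above:
#   print(chat("Explain RotorQuant briefly."))
```

Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.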

With RotorQuant KV cache (fork)

from rotorquant import RotorQuantCache
cache = RotorQuantCache(model, mode="iso3")  # or "planar3"

What is RotorQuant?

RotorQuant is a KV-cache quantization method that uses block-diagonal Clifford-algebra rotors. Combined with AWQ 4-bit weights, it compresses both the model weights and the KV cache for GPU inference.

Key advantages over TurboQuant:

  • 5.3x faster prefill
  • 28% faster decode
  • Equivalent memory savings
  • planar3 / iso3 3-bit KV cache modes
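To get a feel for what a 3-bit KV cache saves, the arithmetic can be sketched as follows. The layer count, KV-head count, and head dimension here are hypothetical placeholders, not Gemma 4 E2B's actual configuration, and the estimate ignores any per-block scale or metadata overhead that a real quantized cache stores.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Approximate KV-cache size in bytes for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(num_layers=32, num_kv_heads=4, head_dim=128, seq_len=8192)

fp16 = kv_cache_bytes(**SHAPE, bits_per_value=16)
q3 = kv_cache_bytes(**SHAPE, bits_per_value=3)
print(f"fp16: {fp16 / 2**20:.0f} MiB, 3-bit: {q3 / 2**20:.0f} MiB")
# prints "fp16: 512 MiB, 3-bit: 96 MiB"
```

The ratio is 16/3, roughly 5.3x less memory per cached token before overhead, and it is independent of the model shape.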

KV-Cache Quantization Comparison

Method | Prefill Speed | Decode Speed | Memory Savings | Reference
TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv:2504.19874
RotorQuant | 5.3x faster | 28% faster | High | GitHub (scrya-com/rotorquant)

AWQ vs GGUF vs MLX

Format | Target Hardware | Runtime | Best For
AWQ | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving
GGUF | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload
MLX | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory

This repo ships AWQ. See the "See Also" section for GGUF and MLX siblings.

Memory Estimates (Gemma 4 E2B-it)

Precision | Approximate Size | VRAM Tier
FP16 (original) | ~4 GB | 8 GB+
AWQ 8-bit | ~2 GB | 4 GB+
AWQ 4-bit | ~1.5 GB | 4 GB+

Fits comfortably on entry-level GPUs (RTX 3050 / 4060 / A2000 and up).
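These figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes exactly 2 billion parameters, and for the quantized case one 16-bit scale plus one zero point of the same width as the codes per group of 128 weights. It treats every parameter as quantized, which real checkpoints do not, so it gives a lower bound rather than the on-disk size.

```python
from typing import Optional


def weight_bytes(n_params: int, bits: int, group_size: Optional[int] = None,
                 scale_bits: int = 16) -> float:
    """Estimate weight storage in bytes.

    With group_size set, each group adds one scale (scale_bits) and one
    zero point (bits), spread across the group's weights.
    """
    overhead = (scale_bits + bits) / group_size if group_size else 0.0
    return n_params * (bits + overhead) / 8


N_PARAMS = 2_000_000_000  # assumption: "~2 billion" taken as exactly 2e9

fp16 = weight_bytes(N_PARAMS, bits=16)
awq4 = weight_bytes(N_PARAMS, bits=4, group_size=128)
print(f"fp16: {fp16 / 1e9:.2f} GB")       # prints "fp16: 4.00 GB"
print(f"AWQ 4-bit: {awq4 / 1e9:.2f} GB")  # prints "AWQ 4-bit: 1.04 GB"
```

The 4-bit estimate lands below the listed size. A plausible reason, stated here as an assumption, is that tensors such as embeddings are kept at higher precision in the checkpoint.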

Hardware Requirements

  • NVIDIA GPU with >=4 GB VRAM (RTX 3050, 3060, 4060, A2000, T4)
  • CUDA 12.x recommended
  • For vLLM: compute capability >= 8.0 (Ampere or newer) for the awq_marlin kernels; on Turing cards (7.5, e.g. T4), use --quantization awq instead
  • For RotorQuant KV cache: scrya-com/rotorquant fork
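The choice of vLLM quantization flag can be derived from the GPU's compute capability. The sketch below encodes thresholds that are assumptions about current vLLM releases (8.0 for awq_marlin, 7.5 for plain awq) and may differ by version; the function name is hypothetical.

```python
from typing import Tuple


def pick_vllm_quantization(capability: Tuple[int, int]) -> str:
    """Return a value for vLLM's --quantization flag given (major, minor)."""
    major, minor = capability
    cc = major * 10 + minor
    if cc >= 80:   # assumption: Marlin kernels need Ampere or newer
        return "awq_marlin"
    if cc >= 75:   # assumption: plain AWQ kernels need Turing or newer
        return "awq"
    raise ValueError(f"compute capability {major}.{minor} is too old for AWQ in vLLM")


print(pick_vllm_quantization((8, 6)))  # RTX 3060 class -> awq_marlin
print(pick_vllm_quantization((7, 5)))  # T4 class -> awq
```

With PyTorch installed, torch.cuda.get_device_capability() returns a (major, minor) tuple that can be passed in directly.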

See Also

Quant trade-off (AWQ lane)

Bits | Approx size | Use case | Recommendation
4-bit | ~860 MB | Activation-aware 4-bit weight quant | GPU inference (vLLM, transformers, AutoAWQ)
8-bit | ~1.5 GB | Activation-aware 8-bit weight quant | Quality-sensitive GPU inference

(This repo is the 4-bit variant.)

Variants in this family

(Showing 18 sibling variants under majentik/gemma4-e2b-it-*. This repo is the RotorQuant-AWQ-4bit variant.)

Variant | Runtime | Approx size | Use case
RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic)
RotorQuant-AWQ-4bit | transformers | ~1.2 GB | GPU 4-bit (AutoAWQ)
RotorQuant-AWQ-8bit | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ)
RotorQuant-GGUF-IQ4_XS | llama.cpp | ~1.7 GB | Lossy 4-bit, low-RAM CPU/edge
RotorQuant-GGUF-Q2_K | llama.cpp | ~1.2 GB | Lossy, low-RAM CPU/edge
RotorQuant-GGUF-Q3_K_M | llama.cpp | ~1.6 GB | Smaller 3-bit, CPU-friendly
RotorQuant-GGUF-Q4_K_M | llama.cpp | ~2.2 GB | Balanced default
RotorQuant-GGUF-Q5_K_M | llama.cpp | ~2.6 GB | Higher fidelity, more RAM
RotorQuant-GGUF-Q8_0 | llama.cpp | ~4.2 GB | Near-lossless reference
RotorQuant-MLX-2bit | mlx-lm | ~655 MB | Apple Silicon, smallest
RotorQuant-MLX-4bit | mlx-lm | ~1.2 GB | Apple Silicon balanced
RotorQuant-MLX-8bit | mlx-lm | ~2.4 GB | Apple Silicon reference
TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic)
TurboQuant-AWQ-4bit | transformers | ~1.2 GB | GPU 4-bit (AutoAWQ)
TurboQuant-AWQ-8bit | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ)
TurboQuant-MLX-2bit | mlx-lm | ~655 MB | Apple Silicon, smallest
TurboQuant-MLX-4bit | mlx-lm | ~1.2 GB | Apple Silicon balanced
TurboQuant-MLX-8bit | mlx-lm | ~2.4 GB | Apple Silicon reference