Gemma 4 E2B-it - RotorQuant AWQ 4-bit
A 4-bit AWQ-quantized version of google/gemma-4-E2B-it (instruction-tuned), paired with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) uses activation statistics to decide which weights to protect during quantization and is aimed at GPU inference. RotorQuant delivers 5.3x faster prefill and 28% faster decode than TurboQuant, making it a strong choice for low-latency chat serving.
Approximate model size: ~1.5 GB
Note: RotorQuant KV cache modes (planar3, iso3) require the RotorQuant fork or the llama-cpp-turboquant fork. The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; RotorQuant KV-cache kernels are opt-in.
Model Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-E2B-it |
| Parameters | ~2 billion |
| Architecture | Dense transformer, instruction-tuned |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | AWQ 4-bit (~1.5 GB) |
| Group Size | 128 |
| KV-Cache Quantization | RotorQuant (planar3 / iso3) |
| Framework | transformers + AutoAWQ / vLLM |
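To make the "AWQ 4-bit" and "Group Size 128" rows concrete, here is a minimal sketch of group-wise 4-bit storage: every run of 128 weights shares one scale and one zero-point. It uses plain round-to-nearest and deliberately leaves out AWQ's activation-aware scaling step, so it illustrates the storage layout only, not how this checkpoint was produced. The function names are hypothetical.

```python
import numpy as np

GROUP_SIZE = 128  # matches the "Group Size" row in the table above
BITS = 4


def quantize_groupwise(weights, group_size=GROUP_SIZE, bits=BITS):
    """Round-to-nearest quantization with one scale and zero-point per group.

    Assumption: plain asymmetric quantization. AWQ additionally rescales
    channels using activation statistics, which is omitted here.
    """
    levels = 2 ** bits - 1                       # 15 for 4-bit
    groups = weights.reshape(-1, group_size)     # one row per group
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / levels
    zero = np.round(-w_min / scale)
    codes = np.clip(np.round(groups / scale) + zero, 0, levels).astype(np.uint8)
    return codes, scale, zero


def dequantize_groupwise(codes, scale, zero, shape):
    """Map integer codes back to approximate float weights."""
    return ((codes.astype(np.float32) - zero) * scale).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 256)).astype(np.float32)   # toy weight matrix
codes, scale, zero = quantize_groupwise(w)
w_hat = dequantize_groupwise(codes, scale, zero, w.shape)
max_err = float(np.abs(w - w_hat).max())
```

Each weight is stored as a code in 0..15, and the reconstruction error stays on the order of one quantization step (`scale`) for its group.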
Quickstart
AutoAWQ
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit")

messages = [{"role": "user", "content": "Explain RotorQuant briefly."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
vLLM
```shell
vllm serve majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit \
  --quantization awq_marlin \
  --max-model-len 8192
```
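Once the server above is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the default address `http://localhost:8000`; `build_chat_request` and `chat` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port
MODEL_ID = "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit"


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Explain RotorQuant briefly."))` prints the model's reply. Adjust `BASE_URL` if the server was started with a different `--host` or `--port`.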
With RotorQuant KV cache (fork)
```python
from rotorquant import RotorQuantCache

cache = RotorQuantCache(model, mode="iso3")  # or "planar3"
```
What is RotorQuant?
RotorQuant is a KV-cache quantization method built on block-diagonal Clifford-algebra rotors. Combined with AWQ 4-bit weights, it compresses both the model weights and the KV cache for GPU inference.
Key advantages over TurboQuant:
- 5.3x faster prefill
- 28% faster decode
- Equivalent memory savings
- planar3 / iso3 3-bit KV cache modes
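As a rough illustration of the "block-diagonal rotation, then low-bit quantization" idea, the sketch below rotates a vector with independent 2x2 rotations, rounds it to 3 bits, and undoes the rotation. It is not the RotorQuant kernel: the 2-D block size, the random angles, and the symmetric 8-level quantizer are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np


def rotate_blocks(x, angles, inverse=False):
    """Apply a block-diagonal rotation made of independent 2x2 rotations.

    Assumption: 2-D blocks. The actual rotor layout is not described here.
    """
    a = -angles if inverse else angles
    pairs = x.reshape(-1, 2)                 # one row per 2-D block
    c, s = np.cos(a), np.sin(a)
    out = np.stack(
        [c * pairs[:, 0] - s * pairs[:, 1],
         s * pairs[:, 0] + c * pairs[:, 1]],
        axis=1,
    )
    return out.reshape(x.shape)


def quantize_3bit(x):
    """Symmetric uniform quantization to 8 levels (integer codes -4..3)."""
    scale = max(float(np.abs(x).max()), 1e-8) / 4.0
    codes = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return codes, scale


rng = np.random.default_rng(0)
x = rng.normal(size=64)                      # stand-in for one key/value vector
angles = rng.uniform(0.0, 2.0 * np.pi, size=x.size // 2)

rotated = rotate_blocks(x, angles)           # orthogonal, so the norm is unchanged
codes, scale = quantize_3bit(rotated)        # 3 bits per stored value
x_hat = rotate_blocks(codes * scale, angles, inverse=True)
```

Because the rotation is orthogonal, it changes which coordinates get rounded without changing vector norms, and the block-diagonal structure keeps the cost linear in the vector length rather than quadratic.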
KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
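To see what the two speedups could mean for a single request, the arithmetic below splits a baseline latency into prefill and decode time and scales each part. It assumes "28% faster decode" means 1.28x throughput and that the speedups apply to the whole phase; the 530 ms and 1280 ms timings are made-up inputs, not measurements.

```python
PREFILL_SPEEDUP = 5.3   # "5.3x faster" prefill, from the table above
DECODE_SPEEDUP = 1.28   # "28% faster" decode, read as 1.28x throughput (assumption)


def projected_latency_ms(prefill_ms, decode_ms):
    """Scale a baseline (TurboQuant) latency split by the reported speedups."""
    return prefill_ms / PREFILL_SPEEDUP + decode_ms / DECODE_SPEEDUP


# Hypothetical baseline request: 530 ms of prefill plus 1280 ms of decode.
baseline_ms = 530.0 + 1280.0
rotor_ms = projected_latency_ms(530.0, 1280.0)
print(f"baseline: {baseline_ms:.0f} ms, projected: {rotor_ms:.0f} ms")
```

The decode term dominates for long generations, so the end-to-end gain in chat workloads would sit closer to the 28% figure than to 5.3x.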
AWQ vs GGUF vs MLX
| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| AWQ | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| GGUF | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| MLX | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |
This repo ships AWQ. See the "See Also" section for GGUF and MLX siblings.
Memory Estimates (Gemma 4 E2B-it)
| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~4 GB | 8 GB+ |
| AWQ 8-bit | ~2 GB | 4 GB+ |
| AWQ 4-bit | ~1.5 GB | 4 GB+ |
Fits comfortably on entry-level GPUs (RTX 3050 / 4060 / A2000 and up).
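The sizes in the table can be sanity-checked with a back-of-the-envelope estimate: parameters times bits per weight, divided by 8. The sketch below assumes ~2 billion parameters (from the specifications table) and, for grouped quantization, one fp16 scale plus one zero-point per group of 128 weights; that overhead model and the function name are assumptions for illustration.

```python
def weight_size_gb(n_params, bits, group_size=None):
    """Estimate weight storage in GB (1 GB = 1e9 bytes)."""
    bits_per_weight = float(bits)
    if group_size is not None:
        # Assumption: one fp16 scale and one `bits`-wide zero-point per group.
        bits_per_weight += (16 + bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 2e9  # "~2 billion" from the specifications table

fp16_gb = weight_size_gb(N_PARAMS, 16)                  # 4.0
awq8_gb = weight_size_gb(N_PARAMS, 8, group_size=128)   # about 2.05
awq4_gb = weight_size_gb(N_PARAMS, 4, group_size=128)   # about 1.04
```

The FP16 and 8-bit estimates match the table. The 4-bit estimate lands near 1.0 GB, below the table's ~1.5 GB; a gap like that would be expected if some tensors are stored unquantized, though the card does not say which ones are.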
Hardware Requirements
- NVIDIA GPU with >=4 GB VRAM (RTX 3050, 3060, 4060, A2000, T4)
- CUDA 12.x recommended
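- For vLLM: compute capability >= 7.5 (Turing or newer) for the plain awq kernels; the awq_marlin kernels have generally required >= 8.0 (Ampere or newer), so check the requirement for your vLLM version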
- For RotorQuant KV cache: scrya-com/rotorquant fork
See Also
- google/gemma-4-E2B-it -- Base model
- majentik/gemma-4-E2B-it-RotorQuant -- RotorQuant KV-cache only (transformers)
- majentik/gemma-4-E2B-it-RotorQuant-AWQ-8bit -- AWQ 8-bit variant
- majentik/gemma-4-E2B-it-TurboQuant-AWQ-4bit -- TurboQuant AWQ 4-bit variant
- majentik/gemma-4-E2B-it-RotorQuant-MLX-4bit -- MLX variant (Apple Silicon)
- RotorQuant GitHub
- llama-cpp-turboquant fork
- AutoAWQ
- vLLM
Quant trade-off (AWQ lane)
| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| **4-bit** | ~860 MB | Activation-aware 4-bit weight quant | GPU inference (vLLM, transformers, AutoAWQ) |
| 8-bit | ~1.5 GB | Activation-aware 8-bit weight quant | Quality-sensitive GPU inference |
(The current variant, 4-bit, is bolded.)
Variants in this family
(Showing 18 sibling variants under majentik/gemma-4-E2B-it-*. The current variant, RotorQuant-AWQ-4bit, is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **RotorQuant-AWQ-4bit** | transformers | ~1.2 GB | GPU 4-bit (AutoAWQ) |
| RotorQuant-AWQ-8bit | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~1.7 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~1.2 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~1.6 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~2.2 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~2.6 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~4.2 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~655 MB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~1.2 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~2.4 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-AWQ-4bit | transformers | ~1.2 GB | GPU 4-bit (AutoAWQ) |
| TurboQuant-AWQ-8bit | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ) |
| TurboQuant-MLX-2bit | mlx-lm | ~655 MB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~1.2 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~2.4 GB | Apple Silicon reference |