
Libraries

How to use majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit", dtype="auto")

Local Apps

vLLM

How to use majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit

SGLang

How to use majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit with Docker Model Runner:
```
docker model run hf.co/majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit
```

Gemma 4 E2B-it - RotorQuant AWQ 4-bit

4-bit AWQ-quantized version of google/gemma-4-E2B-it (instruction-tuned) with RotorQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) uses activation statistics to protect the most salient weight channels during quantization and is well suited to GPU inference. RotorQuant reports 5.3x faster prefill and 28% faster decode than TurboQuant, making it a strong choice for low-latency chat serving.

Approximate model size: ~1.5 GB

Note: RotorQuant KV cache modes (planar3, iso3) require the RotorQuant fork or the llama-cpp-turboquant fork. The AWQ weights themselves load cleanly in stock AutoAWQ / vLLM; RotorQuant KV-cache kernels are opt-in.

Model Specifications

| Property | Value |
| --- | --- |
| Base Model | google/gemma-4-E2B-it |
| Parameters | ~2 billion |
| Architecture | Dense transformer, instruction-tuned |
| Modality | Multimodal: image + text input, text output |
| License | Apache 2.0 |
| Weight Quantization | AWQ 4-bit (~1.5 GB) |
| Group Size | 128 |
| KV-Cache Quantization | RotorQuant (`planar3` / `iso3`) |
| Framework | transformers + AutoAWQ / vLLM |

Quickstart

AutoAWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit")

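messages = [{"role": "user", "content": "Explain RotorQuant briefly."}]
# return_dict=True returns input_ids and attention_mask together,
# so the result can be unpacked straight into generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# The decoded text includes the prompt followed by the model's reply.
print(tokenizer.decode(out[0], skip_special_tokens=True))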

vLLM

vllm serve majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit \
  --quantization awq_marlin \
  --max-model-len 8192
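
Because the model is instruction-tuned, requests normally go through the chat endpoint so that the server applies the chat template. The following is a minimal client sketch using only the Python standard library. It assumes the server is listening on `http://localhost:8000`, as in the curl example earlier; `build_chat_payload` and `chat` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit"


def build_chat_payload(prompt, image_url=None, max_tokens=512, temperature=0.5):
    """Build an OpenAI-style chat completion request body (illustrative helper)."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        # Image parts use the OpenAI "image_url" content format.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat(build_chat_payload("Explain RotorQuant briefly."))` returns the reply text. Passing `image_url=...` adds an image part to the same user message.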

With RotorQuant KV cache (fork)

from rotorquant import RotorQuantCache
cache = RotorQuantCache(model, mode="iso3")  # or "planar3"

What is RotorQuant?

RotorQuant is a KV-cache quantization method that uses block-diagonal Clifford-algebra rotors. Combined with AWQ 4-bit weights, it compresses both the model weights and the KV cache for GPU inference.

Key advantages over TurboQuant:

5.3x faster prefill
28% faster decode
Equivalent memory savings
planar3 / iso3 3-bit KV cache modes
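
The general idea of rotating a vector with a cheap block-diagonal orthogonal transform before low-bit quantization, then undoing the rotation after dequantization, can be sketched in a few lines. This is not the RotorQuant implementation: plain 2x2 rotation blocks stand in for rotors, and the random angles, the single per-vector scale, and all function names are illustrative assumptions.

```python
import math
import random


def rotate_pairs(vec, angles, inverse=False):
    """Apply a block-diagonal rotation: one 2x2 rotation per consecutive pair."""
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta  # the inverse of a rotation is the opposite angle
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [c * x - s * y, s * x + c * y]
    return out


def quantize(vec, bits=3):
    """Uniform quantization to 2**bits levels with one scale per vector."""
    levels = 2 ** bits  # 8 levels for 3 bits
    scale = max(abs(v) for v in vec) or 1.0
    step = 2 * scale / (levels - 1)
    codes = [round((v + scale) / step) for v in vec]  # integers in [0, levels - 1]
    return codes, scale, step


def dequantize(codes, scale, step):
    """Map integer codes back to approximate real values."""
    return [c * step - scale for c in codes]


random.seed(0)
dim = 8
key = [random.gauss(0, 1) for _ in range(dim)]  # stand-in for one key vector
angles = [random.uniform(0, 2 * math.pi) for _ in range(dim // 2)]

rotated = rotate_pairs(key, angles)
codes, scale, step = quantize(rotated, bits=3)
restored = rotate_pairs(dequantize(codes, scale, step), angles, inverse=True)

max_error = max(abs(a - b) for a, b in zip(key, restored))
print(codes, round(max_error, 3))
```

A block-diagonal rotation touches each pair of coordinates independently, so its cost grows linearly with the vector length, whereas a dense rotation matrix would cost quadratically.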

KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
| --- | --- | --- | --- | --- |
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
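
The table gives memory savings only qualitatively. A rough way to put numbers on it is to count the raw bits stored per cached value, as in the sketch below. The layer count, KV head count, and head dimension are hypothetical placeholders rather than the real Gemma 4 E2B configuration, and the estimate ignores any per-block scales or other metadata a real 3-bit cache would carry.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to store keys and values for one sequence."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8


# Hypothetical dimensions for illustration only, not the real model config.
# seq_len matches the --max-model-len 8192 used in the vLLM quickstart.
shape = dict(num_layers=30, num_kv_heads=2, head_dim=256, seq_len=8192)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
three_bit = kv_cache_bytes(**shape, bits_per_value=3)

print(f"FP16 cache:  {fp16 / 2**20:.0f} MiB")
print(f"3-bit cache: {three_bit / 2**20:.0f} MiB")
print(f"Reduction:   {fp16 / three_bit:.2f}x")
```

Whatever the real dimensions are, the raw ratio between a 16-bit and a 3-bit cache is 16/3, since every other factor cancels.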

AWQ vs GGUF vs MLX

| Format | Target Hardware | Runtime | Best For |
| --- | --- | --- | --- |
| AWQ | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| GGUF | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| MLX | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |

This repo ships AWQ. See the "Variants in this family" section for GGUF and MLX siblings.

Memory Estimates (Gemma 4 E2B-it)

| Precision | Approximate Size | VRAM Tier |
| --- | --- | --- |
| FP16 (original) | ~4 GB | 8 GB+ |
| AWQ 8-bit | ~2 GB | 4 GB+ |
| AWQ 4-bit | ~1.5 GB | 4 GB+ |

The 4-bit weights fit comfortably on entry-level GPUs (RTX 3050 / 4060 / A2000 and up).
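
These figures can be sanity-checked with back-of-the-envelope arithmetic: parameter count times bits per weight, plus per-group metadata. The sketch below assumes exactly 2e9 parameters, an FP16 scale per group, and a zero-point as wide as the weights; those are common choices for grouped quantization but are assumptions here, not details confirmed for this checkpoint.

```python
def weight_bytes(num_params, bits, group_size=None, scale_bits=16):
    """Approximate storage for weights, including per-group scale and zero-point."""
    total_bits = num_params * bits
    if group_size:
        groups = num_params / group_size
        # Assumed metadata per group: one scale plus one zero-point of `bits` width.
        total_bits += groups * (scale_bits + bits)
    return total_bits / 8


PARAMS = 2e9  # "~2 billion" from the specifications table

for label, size in [
    ("FP16", weight_bytes(PARAMS, 16)),
    ("8-bit, group 128", weight_bytes(PARAMS, 8, group_size=128)),
    ("4-bit, group 128", weight_bytes(PARAMS, 4, group_size=128)),
]:
    print(f"{label:>17}: {size / 1e9:.2f} GB")
```

The FP16 and 8-bit estimates land close to the table. The 4-bit estimate comes out nearer 1 GB than the listed ~1.5 GB; one plausible reason, not verified here, is that some tensors such as embeddings are kept at higher precision, which this estimate does not model.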

Hardware Requirements

NVIDIA GPU with >=4 GB VRAM (RTX 3050, 3060, 4060, A2000, T4)
CUDA 12.x recommended
For vLLM: compute capability >= 7.5 (Turing or newer) for Marlin kernels
For RotorQuant KV cache: scrya-com/rotorquant fork

Quant trade-off (AWQ lane)

| Bits | Approx size | Use case | Recommendation |
| --- | --- | --- | --- |
| **4-bit** | **~860 MB** | **Activation-aware 4-bit weight quant** | **GPU inference (vLLM, transformers, AutoAWQ)** |
| 8-bit | ~1.5 GB | Activation-aware 8-bit weight quant | Quality-sensitive GPU inference |

(The current variant, 4-bit, is bolded.)

Variants in this family

(Showing 18 sibling variants under majentik/gemma4-e2b-it-*. The current variant, RotorQuant-AWQ-4bit, is bolded.)

| Variant | Runtime | Approx size | Use case |
| --- | --- | --- | --- |
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **RotorQuant-AWQ-4bit** | **transformers** | **~1.2 GB** | **GPU 4-bit (AutoAWQ)** |
| RotorQuant-AWQ-8bit | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~1.7 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~1.2 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~1.6 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~2.2 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~2.6 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~4.2 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~655 MB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~1.2 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~2.4 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-AWQ-4bit | transformers | ~1.2 GB | GPU 4-bit (AutoAWQ) |
| TurboQuant-AWQ-8bit | transformers | ~2.2 GB | GPU 8-bit (AutoAWQ) |
| TurboQuant-MLX-2bit | mlx-lm | ~655 MB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~1.2 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~2.4 GB | Apple Silicon reference |


Model tree for majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit

Base model: google/gemma-4-E2B
Finetuned: google/gemma-4-E2B-it
Finetuned (178): this model

Paper for majentik/gemma-4-E2B-it-RotorQuant-AWQ-4bit

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

arXiv: 2504.19874, published Apr 28, 2025