gemma-3-4b-it-qat-4bit-lite
Paper: On-Device Multimodal LLM Optimization: Fitting Gemma 3 into 2 GB
Optimized version of gemma-3-4b-it-qat-4bit for Apple Silicon edge devices. Reduces model size from 2.8 GB to 2.3 GB with lower runtime memory and significantly reduced thermal output, while preserving text and image understanding quality.
For an even smaller version (2.1 GB) with weight splitting and neuron pruning, see gemma-3-4b-it-qat-4bit-mobile.
Optimizations Applied
| Step | Optimization | Effect |
|------|--------------|--------|
| 1 | Vocabulary pruning 262K → 144K tokens | -170 MB disk, token_map remapping |
| 2 | Vision fc2 bf16 → 4-bit (pad 4304 → 4352) | -191 MB disk |
| 3 | Remove text layers 31, 32, 33 (34 → 31 layers) | -159 MB disk, faster inference |
| 4 | Image resolution 896 → 672 | ~3x less vision attention compute |
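The ~3x figure in step 4 follows from the patch count: with SigLIP's 14-pixel patches, a 672px image produces 48 × 48 = 2304 patch tokens versus 64 × 64 = 4096 at 896px, and self-attention cost grows roughly with the square of the token count. A quick sanity check of that arithmetic:

```python
# Rough estimate of the vision-attention savings from step 4.
# Tokens per image = (image_size / patch_size)^2; self-attention
# cost scales roughly with tokens^2.
def attn_cost(image_size: int, patch_size: int = 14) -> int:
    tokens = (image_size // patch_size) ** 2
    return tokens ** 2

ratio = attn_cost(896) / attn_cost(672)
print(f"{ratio:.1f}x")  # -> 3.2x
```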
Architecture
Text model:
- vocab_size: 262,208 (token_map → 144,257 compact embeddings)
- hidden_size: 2560
- intermediate_size: 10240
- num_hidden_layers: 31
- num_attention_heads: 8 (GQA, 4 KV heads)
- head_dim: 256
- quantization: 4-bit, group_size=64

Vision model (SigLIP):
- hidden_size: 1152
- intermediate_size: 4352 (padded from 4304, fc2 4-bit quantized)
- num_hidden_layers: 27
- image_size: 672
- patch_size: 14
- mm_tokens_per_image: 144
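The 4304 → 4352 padding of the vision fc2 layer is consistent with the group_size=64 quantization listed for the text model: group-wise 4-bit quantization needs the quantized dimension to be a multiple of the group size, and 4304 is not. A minimal sketch of that rounding, assuming group-size divisibility is indeed the constraint:

```python
# Why fc2 is padded from 4304 to 4352 before 4-bit quantization
# (assumption: the vision tower uses the same group_size=64 as the
# text model, so the quantized dim must be a multiple of 64).
GROUP_SIZE = 64

def pad_to_group(dim: int, group: int = GROUP_SIZE) -> int:
    """Round dim up to the next multiple of group."""
    return ((dim + group - 1) // group) * group

print(4304 % GROUP_SIZE)   # -> 16 (not divisible)
print(pad_to_group(4304))  # -> 4352
```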
Model Files
| File | Size | Description |
|------|------|-------------|
| model.safetensors | 2.3 GB | All weights (language + vision) |
| config.json | - | Model configuration with vocab_pruning metadata |
Requirements
This model uses a token_map for vocabulary pruning. The inference engine must:
- Read vocab_pruning.compact_vocab_size (144,257) from config.json and initialize the embedding table at the compact size.
- Load language_model.model.embed_tokens.token_map (int32[262208]) and remap token IDs before lookup: embedding(token_map[input_ids]).
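The remapping above can be sketched at toy scale (illustrative NumPy, not the actual engine; real sizes are noted in the comments):

```python
import numpy as np

# Toy-sized sketch of the token_map remapping described above.
# Real values: compact_vocab_size=144_257, full vocab=262_208, hidden_size=2560.
compact_vocab, full_vocab, hidden_size = 8, 16, 4

rng = np.random.default_rng(0)
# The embedding table is allocated at the compact size...
embedding = rng.standard_normal((compact_vocab, hidden_size))
# ...and token_map (int32[full_vocab]) maps original ids to compact rows.
token_map = rng.integers(0, compact_vocab, size=full_vocab, dtype=np.int32)

input_ids = np.array([0, 5, 15])          # ids in the original vocabulary
hidden = embedding[token_map[input_ids]]  # embedding(token_map[input_ids])
print(hidden.shape)  # -> (3, 4)
```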
Usage
swift-gemma-cli is a native Swift CLI for running this model on Apple Silicon, with full token_map support:

```bash
git clone https://github.com/AtomGradient/swift-gemma-cli.git
cd swift-gemma-cli
swift build -c release

# Text generation
swift run -c release gemma-cli <model-path> \
  --prompt "Hello, how are you?" --max-tokens 100 --temperature 0.0

# Image understanding
swift run -c release gemma-cli <model-path> \
  --image photo.jpg \
  --prompt "Describe this image in detail." --max-tokens 200 --temperature 0.0
```
Benchmarks (Apple Silicon)
| Metric | Original | This Model |
|--------|----------|------------|
| Disk size | 2.8 GB | 2.3 GB |
| Peak memory (image) | ~5500 MB | 4590 MB |
| Prompt speed (text) | 109 t/s | ~120 t/s |
| Generation speed (text) | 90 t/s | ~110 t/s |
| Prompt speed (image) | 54 t/s | 123 t/s |
| Generation speed (image) | 27 t/s | 66 t/s |
| Image understanding | Correct | Correct |
| Text quality | Perfect | Good |
Quality Notes
- Image understanding is fully preserved: correctly identifies objects, colors, composition
- Text quality is better than the mobile variant, since no neuron pruning is applied
- Recommended when text quality is prioritized over minimum model size
Base Model
gemma-3-4b-it-qat-4bit
License
Same as the base model. See Gemma Terms of Use.