gemma-3-4b-it-qat-4bit-lite
Paper: On-Device Multimodal LLM Optimization: Fitting Gemma 3 into 2 GB
Optimized version of gemma-3-4b-it-qat-4bit for Apple Silicon edge devices. Reduces model size from 2.8 GB to 2.3 GB with lower runtime memory and significantly reduced thermal output, while preserving text and image understanding quality.
For an even smaller version (2.1 GB) with weight splitting and neuron pruning, see gemma-3-4b-it-qat-4bit-mobile.
Optimizations Applied
| Step | Optimization | Effect |
|------|--------------|--------|
| 1 | Vocabulary pruning 262K → 144K tokens | -170 MB disk, token_map remapping |
| 2 | Vision fc2 bf16 → 4-bit (pad 4304 → 4352) | -191 MB disk |
| 3 | Remove text layers 31, 32, 33 (34 → 31 layers) | -159 MB disk, faster inference |
| 4 | Image resolution 896 → 672 | ~3x less vision attention compute |
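The ~3x figure in step 4 follows from the patch count: with SigLIP's 14-pixel patches, a 672px image produces 48 × 48 = 2304 patch tokens versus 64 × 64 = 4096 at 896px, and self-attention cost grows roughly with the square of the token count. A quick sanity check of that arithmetic:

```python
# Rough estimate of the vision-attention savings from step 4.
# Tokens per image = (image_size / patch_size)^2; self-attention
# cost scales roughly with tokens^2.
def attn_cost(image_size: int, patch_size: int = 14) -> int:
    tokens = (image_size // patch_size) ** 2
    return tokens ** 2

ratio = attn_cost(896) / attn_cost(672)
print(f"{ratio:.1f}x")  # -> 3.2x
```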
Architecture
Text model:
- vocab_size: 262,208 (token_map → 144,257 compact embeddings)
- hidden_size: 2560
- intermediate_size: 10240
- num_hidden_layers: 31
- num_attention_heads: 8 (GQA, 4 KV heads)
- head_dim: 256
- quantization: 4-bit, group_size=64

Vision model (SigLIP):
- hidden_size: 1152
- intermediate_size: 4352 (padded from 4304, fc2 4-bit quantized)
- num_hidden_layers: 27
- image_size: 672
- patch_size: 14
- mm_tokens_per_image: 144
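The 4304 → 4352 padding of the vision fc2 layer is consistent with the group_size=64 quantization listed for the text model: group-wise 4-bit quantization needs the quantized dimension to be a multiple of the group size, and 4304 is not. A minimal sketch of that rounding, assuming group-size divisibility is indeed the constraint:

```python
# Why fc2 is padded from 4304 to 4352 before 4-bit quantization
# (assumption: the vision tower uses the same group_size=64 as the
# text model, so the quantized dim must be a multiple of 64).
GROUP_SIZE = 64

def pad_to_group(dim: int, group: int = GROUP_SIZE) -> int:
    """Round dim up to the next multiple of group."""
    return ((dim + group - 1) // group) * group

print(4304 % GROUP_SIZE)   # -> 16 (not divisible)
print(pad_to_group(4304))  # -> 4352
```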
Model Files
| File | Size | Description |
|------|------|-------------|
| model.safetensors | 2.3 GB | All weights (language + vision) |
| config.json | - | Model configuration with vocab_pruning metadata |
Requirements
This model uses a token_map for vocabulary pruning. The inference engine must:
- Read vocab_pruning.compact_vocab_size (144,257) from config.json and initialize the embedding table at the compact size.
- Load language_model.model.embed_tokens.token_map (int32[262208]) and remap token IDs before lookup: embedding(token_map[input_ids]).
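The remapping above can be sketched at toy scale (illustrative NumPy, not the actual engine; real sizes are noted in the comments):

```python
import numpy as np

# Toy-sized sketch of the token_map remapping described above.
# Real values: compact_vocab_size=144_257, full vocab=262_208, hidden_size=2560.
compact_vocab, full_vocab, hidden_size = 8, 16, 4

rng = np.random.default_rng(0)
# The embedding table is allocated at the compact size...
embedding = rng.standard_normal((compact_vocab, hidden_size))
# ...and token_map (int32[full_vocab]) maps original ids to compact rows.
token_map = rng.integers(0, compact_vocab, size=full_vocab, dtype=np.int32)

input_ids = np.array([0, 5, 15])          # ids in the original vocabulary
hidden = embedding[token_map[input_ids]]  # embedding(token_map[input_ids])
print(hidden.shape)  # -> (3, 4)
```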
Usage
swift-gemma-cli is a native Swift CLI for running this model on Apple Silicon, with full token_map support:

```bash
git clone https://github.com/AtomGradient/swift-gemma-cli.git
cd swift-gemma-cli
swift build -c release

# Text generation
swift run -c release gemma-cli <model-path> \
  --prompt "Hello, how are you?" --max-tokens 100 --temperature 0.0

# Image understanding
swift run -c release gemma-cli <model-path> \
  --image photo.jpg \
  --prompt "Describe this image in detail." --max-tokens 200 --temperature 0.0
```
Benchmarks (Apple Silicon)
| Metric | Original | This Model |
|--------|----------|------------|
| Disk size | 2.8 GB | 2.3 GB |
| Peak memory (image) | ~5500 MB | 4590 MB |
| Prompt speed (text) | 109 t/s | ~120 t/s |
| Generation speed (text) | 90 t/s | ~110 t/s |
| Prompt speed (image) | 54 t/s | 123 t/s |
| Generation speed (image) | 27 t/s | 66 t/s |
| Image understanding | Correct | Correct |
| Text quality | Perfect | Good |
Quality Notes
- Image understanding is fully preserved: correctly identifies objects, colors, composition
- Text quality is better than the mobile variant, since no neuron pruning is applied
- Recommended when text quality is prioritized over minimum model size
Base Model
gemma-3-4b-it-qat-4bit
License
Same as the base model. See Gemma Terms of Use.