# Qwen3-1.7B (ExecuTorch, XNNPACK, 8da4w)
This folder contains an ExecuTorch .pte export of Qwen/Qwen3-1.7B for CPU inference via the XNNPACK backend, with post-training quantization enabled.
## Contents
- `model.pte`: ExecuTorch program
- `tokenizer.json`, `vocab.json`, `merges.txt`: tokenizer artifacts
- `config.json`, `generation_config.json`, `tokenizer_config.json`: metadata files from the upstream Hugging Face repo
## Quantization
Export flags:
- `--qlinear 8da4w`: linear layers use 8-bit dynamic activations with 4-bit weights
- `--qembedding 8w`: embeddings use 8-bit weights
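To make the "8da4w" label concrete: weights are quantized offline to 4 bits with one scale per small group, while activations are quantized to 8 bits dynamically, i.e. with a scale computed at inference time. The sketch below illustrates only the arithmetic; the helper names are hypothetical, and the actual export performs this via torchao inside optimum-executorch.

```python
def quantize_symmetric(values, n_bits):
    """Symmetrically quantize a list of floats to signed n_bits integers.

    Returns the quantized integers and the scale (dequantize: q * scale).
    """
    qmax = 2 ** (n_bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid zero scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantize_weights_4bit(weights, group_size=32):
    """The '4w' part: static 4-bit weights, one scale per group."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_symmetric(g, 4) for g in groups]


def quantize_activations_8bit(activations):
    """The '8da' part: 8-bit activations, scale computed per call at runtime."""
    return quantize_symmetric(activations, 8)
```

Group-wise weight scales keep the 4-bit quantization error local to each group, which is why group size matters for accuracy at this bit width.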
Other export settings:
- Task: `text-generation`
- Recipe: `xnnpack`
- Flags: `--use_custom_sdpa --use_custom_kv_cache`
Tooling versions used to generate these artifacts:
- ExecuTorch: `executorch==1.2.0a0+d265acf` (git `d265acfb63`)
- Optimum ExecuTorch: `optimum-executorch==0.2.0.dev0` (git `4c62ed77`)
Command used:
```shell
optimum-cli export executorch \
  --model "Qwen/Qwen3-1.7B" \
  --task "text-generation" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear "8da4w" \
  --qembedding "8w" \
  --output_dir "<output_dir>"
```
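After the export finishes, it is worth confirming that the artifacts listed above actually landed in the output directory before shipping them. A minimal sketch (the helper name and file list are assumptions based on the Contents section):

```python
import pathlib


def check_export_dir(output_dir):
    """Return the list of expected export artifacts missing from output_dir."""
    out = pathlib.Path(output_dir)
    expected = [
        "model.pte",            # ExecuTorch program
        "tokenizer.json",       # tokenizer artifact
        "config.json",          # upstream metadata
        "generation_config.json",
    ]
    return [name for name in expected if not (out / name).exists()]
```

An empty return value means all expected files are present; anything else names what is missing.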
## Run with the ExecuTorch Llama runner
Build the runner from the ExecuTorch repo root:
```shell
make llama-cpu
```
The binary is produced at `cmake-out/examples/models/llama/llama_main`:
```shell
cmake-out/examples/models/llama/llama_main \
  --model_path "model.pte" \
  --tokenizer_path "tokenizer.json" \
  --prompt "Simply put, the theory of relativity states that" \
  --max_new_tokens 48 \
  --temperature 0
```
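Note that `--temperature 0` makes decoding deterministic: the runner always picks the highest-scoring token instead of sampling. A minimal sketch of that distinction (hypothetical helper, illustrative logits only):

```python
import math
import random


def sample_next_token(logits, temperature):
    """Pick a next-token id: argmax when temperature == 0, else softmax sampling."""
    if temperature == 0:
        # Greedy decoding: deterministic, always the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax sampling (higher temperature = flatter).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    return random.choices(range(len(logits)), weights=[e / total for e in exps], k=1)[0]
```

Greedy decoding is handy here because repeated runs on the same prompt produce identical output, which makes it easy to compare builds or quantization settings.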