Qwen3-1.7B (ExecuTorch, XNNPACK, 8da4w)

This repository contains an ExecuTorch .pte export of Qwen/Qwen3-1.7B for CPU inference via the XNNPACK backend, with post-training quantization applied.

Contents

  • model.pte: ExecuTorch program
  • tokenizer.json, vocab.json, merges.txt: tokenizer artifacts
  • config.json, generation_config.json, tokenizer_config.json: metadata files from the upstream Hugging Face repo

Quantization

Export flags:

  • --qlinear 8da4w: linear layers use 8-bit dynamic activations + 4-bit weights
  • --qembedding 8w: embeddings use 8-bit weights
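Conceptually, 8da4w means the linear weights are quantized offline to 4-bit integers with per-group scales, while activations are quantized dynamically to 8 bits at runtime just before each matmul. A minimal NumPy sketch of that idea (illustrative only; the group size of 32 and the symmetric schemes are assumptions, not values read from this export):

```python
import numpy as np

def quantize_weights_4bit(w, group_size=32):
    """Offline: symmetric per-group 4-bit weight quantization (the 'w4' part)."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def quantize_activations_8bit(x):
    """Runtime: symmetric per-tensor 8-bit dynamic activation quantization (the '8da' part)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

# Dequantizing and multiplying roughly reproduces the fp32 result.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)

qw, w_scale = quantize_weights_4bit(w)
qx, x_scale = quantize_activations_8bit(x)

w_dq = (qw.astype(np.float32) * w_scale).reshape(w.shape)
x_dq = qx.astype(np.float32) * x_scale
print(np.max(np.abs(w_dq @ x_dq - w @ x)))  # small quantization error
```

The real kernels keep the weights packed in int4 and compute in the integer domain; the dequantize-then-matmul above is only to show the numerics.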

Other export settings:

  • Task: text-generation
  • Recipe: xnnpack
  • Flags: --use_custom_sdpa --use_custom_kv_cache

Tooling versions used to generate these artifacts:

  • ExecuTorch: executorch==1.2.0a0+d265acf (git d265acfb63)
  • Optimum ExecuTorch: optimum-executorch==0.2.0.dev0 (git 4c62ed77)

Command used:

optimum-cli export executorch \
  --model "Qwen/Qwen3-1.7B" \
  --task "text-generation" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear "8da4w" \
  --qembedding "8w" \
  --output_dir "<output_dir>"

Run with the ExecuTorch Llama runner

Build the runner from the ExecuTorch repo root:

make llama-cpu

Binary: cmake-out/examples/models/llama/llama_main

cmake-out/examples/models/llama/llama_main \
  --model_path "model.pte" \
  --tokenizer_path "tokenizer.json" \
  --prompt "Simply put, the theory of relativity states that" \
  --max_new_tokens 48 \
  --temperature 0
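Passing --temperature 0 selects greedy decoding: at each step the runner emits the highest-probability token, so output is deterministic for a given prompt. A minimal sketch of that sampling rule (illustrative; not the runner's actual implementation):

```python
import math
import random

def sample_next_token(logits, temperature):
    """Greedy argmax when temperature == 0, otherwise temperature-scaled softmax sampling."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                   # subtract max for numerical stability
    weights = [math.exp(l - m) for l in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]

print(sample_next_token([1.0, 3.5, 2.0], 0))  # greedy: index of the largest logit -> 1
```

Raising the temperature flattens the distribution and makes generations more varied but non-reproducible.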