Qwen3-1.7B (ExecuTorch, XNNPACK, 8da4w)

This repository contains an ExecuTorch .pte export of Qwen/Qwen3-1.7B for CPU inference via the XNNPACK backend, with post-training quantization applied.

Contents

  • model.pte: ExecuTorch program
  • tokenizer.json, vocab.json, merges.txt: tokenizer artifacts
  • config.json, generation_config.json, tokenizer_config.json: metadata files from the upstream Hugging Face repo

Quantization

Export flags:

  • --qlinear 8da4w: linear layers use 8-bit dynamic activations + 4-bit weights
  • --qembedding 8w: embeddings use 8-bit weights
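Conceptually, 8da4w means the linear weights are quantized offline to 4-bit integers with per-group scales, while activations are quantized dynamically to 8 bits at runtime just before each matmul. A minimal NumPy sketch of that idea (illustrative only; the group size of 32 and the symmetric schemes are assumptions, not values read from this export):

```python
import numpy as np

def quantize_weights_4bit(w, group_size=32):
    """Offline: symmetric per-group 4-bit weight quantization (the 'w4' part)."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def quantize_activations_8bit(x):
    """Runtime: symmetric per-tensor 8-bit dynamic activation quantization (the '8da' part)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

# Dequantizing and multiplying roughly reproduces the fp32 result.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)

qw, w_scale = quantize_weights_4bit(w)
qx, x_scale = quantize_activations_8bit(x)

w_dq = (qw.astype(np.float32) * w_scale).reshape(w.shape)
x_dq = qx.astype(np.float32) * x_scale
print(np.max(np.abs(w_dq @ x_dq - w @ x)))  # small quantization error
```

The real kernels keep the weights packed in int4 and compute in the integer domain; the dequantize-then-matmul above is only to show the numerics.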

Other export settings:

  • Task: text-generation
  • Recipe: xnnpack
  • Flags: --use_custom_sdpa --use_custom_kv_cache

Tooling versions used to generate these artifacts:

  • ExecuTorch: executorch==1.2.0a0+d265acf (git d265acfb63)
  • Optimum ExecuTorch: optimum-executorch==0.2.0.dev0 (git 4c62ed77)

Command used:

optimum-cli export executorch \
  --model "Qwen/Qwen3-1.7B" \
  --task "text-generation" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear "8da4w" \
  --qembedding "8w" \
  --output_dir "<output_dir>"

Run with the ExecuTorch Llama runner

Build the runner from the ExecuTorch repo root:

make llama-cpu

Binary: cmake-out/examples/models/llama/llama_main

cmake-out/examples/models/llama/llama_main \
  --model_path "model.pte" \
  --tokenizer_path "tokenizer.json" \
  --prompt "Simply put, the theory of relativity states that" \
  --max_new_tokens 48 \
  --temperature 0
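Passing --temperature 0 selects greedy decoding: at each step the runner emits the highest-probability token, so output is deterministic for a given prompt. A minimal sketch of that sampling rule (illustrative; not the runner's actual implementation):

```python
import math
import random

def sample_next_token(logits, temperature):
    """Greedy argmax when temperature == 0, otherwise temperature-scaled softmax sampling."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                   # subtract max for numerical stability
    weights = [math.exp(l - m) for l in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]

print(sample_next_token([1.0, 3.5, 2.0], 0))  # greedy: index of the largest logit -> 1
```

Raising the temperature flattens the distribution and makes generations more varied but non-reproducible.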