Tarka-Embedding-150M-V1-ONNX-Q (Quantized)

Quantized ONNX version of Tarka-AIR/Tarka-Embedding-150M-V1.

This model uses dynamic INT8 quantization for faster CPU inference; a sketch of how such quantization is typically produced follows the list below.

  • Embedding dimension: 768
  • Context length: 2048
  • Model size: ~150MB (75% smaller than FP32)
  • Quantization: Dynamic INT8 (QUInt8)
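
For reference, dynamic INT8 quantization of an ONNX model is normally done with ONNX Runtime's quantization tooling. A minimal sketch, assuming a locally exported FP32 model named model_fp32.onnx (the file name is illustrative; this is not necessarily the exact command used to produce this model):

from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights are converted to INT8 offline; activations are quantized
# on the fly at inference time (hence "dynamic").
quantize_dynamic(
    model_input="model_fp32.onnx",  # hypothetical path to the FP32 export
    model_output="model.onnx",
    weight_type=QuantType.QUInt8,
)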

Performance

  • ✅ 3-4x faster on CPU than the full-precision model (see the micro-benchmark sketch after this list)
  • ⚠️ Slight accuracy trade-off (~1-2% on MTEB benchmarks)
  • 💾 75% smaller file size
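
The speedup can be sanity-checked with a simple micro-benchmark. A sketch, assuming local copies of the quantized file (model.onnx) and the full-precision file (named model_fp32.onnx here for illustration):

import time
import numpy as np
import onnxruntime as ort

# Dummy batch: one sequence of 256 tokens (token id 1 throughout)
feed = {
    "input_ids": np.ones((1, 256), dtype=np.int64),
    "attention_mask": np.ones((1, 256), dtype=np.int64),
}

for path in ("model_fp32.onnx", "model.onnx"):
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    session.run(None, feed)  # warm-up run
    start = time.perf_counter()
    for _ in range(20):
        session.run(None, feed)
    print(path, f"{(time.perf_counter() - start) / 20 * 1000:.1f} ms/run")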

For GPU inference, use the full-precision model instead: permutans/Tarka-Embedding-150M-V1-ONNX

Usage

With ONNX Runtime (Python)

import onnxruntime as ort
from transformers import AutoTokenizer

# Assumes model.onnx has been downloaded from this repo into the working directory
session = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained("permutans/Tarka-Embedding-150M-V1-ONNX-Q")

text = "Your text here"
inputs = tokenizer(text, return_tensors="np")

onnx_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
}

# Outputs are [token_embeddings, sentence_embedding]; keep the pooled vector
_, sentence_embedding = session.run(None, onnx_inputs)
print(sentence_embedding.shape)  # (1, 768)
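
As a follow-on usage example, reusing the session and tokenizer from above, two texts can be compared via the cosine similarity of their pooled embeddings (the texts here are just placeholders):

import numpy as np

def embed(text):
    enc = tokenizer(text, return_tensors="np")
    _, emb = session.run(None, {
        "input_ids": enc["input_ids"],
        "attention_mask": enc["attention_mask"],
    })
    return emb[0]  # (768,)

a = embed("How do I reset my password?")
b = embed("Steps to change an account password")
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cos:.3f}")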

With FastEmbed (Rust)

Compatible with fastembed-rs for high-performance CPU embedding generation.

Model Outputs

  • token_embeddings: Token-level embeddings with shape (batch_size, sequence_length, 768)
  • sentence_embedding: Pooled sentence embeddings with shape (batch_size, 768); use this for most tasks (see the pooling sketch below)
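
If you only have token_embeddings (e.g. after slicing a batch), a pooled vector can be computed with attention-mask-aware mean pooling. Note the assumption here: that sentence_embedding is mean-pooled, which should be verified against the model's configuration before relying on this as a substitute.

import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # Zero out padding positions, then average over the sequence axis
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (B, S, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (B, 768)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # (B, 1)
    return summed / counts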
