# Tarka-Embedding-150M-V1-ONNX-Q (Quantized)

Quantized ONNX version of Tarka-AIR/Tarka-Embedding-150M-V1. This model uses dynamic INT8 quantization for optimized CPU inference performance.

- Embedding dimension: 768
- Context length: 2048
- Model size: ~150 MB (75% smaller than FP32)
- Quantization: dynamic INT8 (QUInt8)
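For reference, a dynamically quantized QUInt8 model like this one can be produced from an FP32 ONNX export with ONNX Runtime's quantization tooling. This is a minimal sketch, not necessarily the exact command used for this repo; the input filename is an assumption:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization converts weights to 8-bit integers offline;
# activations are quantized on the fly at inference time, so no
# calibration dataset is required.
quantize_dynamic(
    model_input="model_fp32.onnx",  # assumed name of the FP32 export
    model_output="model.onnx",      # the quantized model loaded in Usage below
    weight_type=QuantType.QUInt8,   # matches the QUInt8 spec above
)
```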
## Performance

- ⚡ 3-4x faster on CPU compared to the full-precision model
- ⚠️ Slight accuracy trade-off (~1-2% on MTEB benchmarks)
- 💾 75% smaller file size
For GPU inference, use the full-precision model: permutans/Tarka-Embedding-150M-V1-ONNX.
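As a hedged sketch of what that looks like (assuming the `onnxruntime-gpu` package is installed and the full-precision `model.onnx` has been downloaded from that repo), GPU inference only changes the provider list passed to the session:

```python
import onnxruntime as ort

# Prefer CUDA when available; ONNX Runtime falls back to CPU otherwise.
session = ort.InferenceSession(
    "model.onnx",  # full-precision export from the -ONNX repo (assumed filename)
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # shows which providers were actually enabled
```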
## Usage

### With ONNX Runtime (Python)

```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the quantized model and the matching tokenizer.
session = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained("permutans/Tarka-Embedding-150M-V1-ONNX-Q")

text = "Your text here"
inputs = tokenizer(text, return_tensors="np")

onnx_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
}

# The model returns (token_embeddings, sentence_embedding); see "Model Outputs" below.
_, sentence_embedding = session.run(None, onnx_inputs)
print(sentence_embedding.shape)  # (1, 768)
```
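For a quick end-to-end check, the same session can embed a batch of texts and score their similarity. A minimal sketch, reusing the `session` and `tokenizer` from above and plain NumPy for cosine similarity:

```python
import numpy as np

texts = ["How do I reset my password?", "Steps to change a forgotten password"]
# Pad to a common length so the batch forms one rectangular array.
batch = tokenizer(texts, padding=True, return_tensors="np")

_, embeddings = session.run(
    None,
    {"input_ids": batch["input_ids"], "attention_mask": batch["attention_mask"]},
)  # embeddings has shape (2, 768)

# Cosine similarity between the two sentence embeddings.
a, b = embeddings
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```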
### With FastEmbed (Rust)

Compatible with fastembed-rs for high-performance CPU embedding generation.
## Model Outputs

- `token_embeddings`: token-level embeddings, shape `(batch_size, sequence_length, 768)`
- `sentence_embedding`: pooled sentence embeddings, shape `(batch_size, 768)` - use this for most tasks
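If you prefer to select outputs by name rather than by position, the session exposes its output metadata. A small sketch, reusing the `session` and `onnx_inputs` from the Usage section and assuming the graph outputs carry the names listed above:

```python
# List the graph's declared outputs (names and shapes).
for out in session.get_outputs():
    print(out.name, out.shape)

# Fetch only the pooled embedding by name.
(sentence_embedding,) = session.run(["sentence_embedding"], onnx_inputs)
```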
## Recommendation

- CPU deployment: use this quantized model
- GPU deployment: use permutans/Tarka-Embedding-150M-V1-ONNX (35% faster)