Tarka-Embedding-150M-V1-ONNX-Q (Quantized)

Quantized ONNX version of Tarka-AIR/Tarka-Embedding-150M-V1.

This model uses dynamic INT8 quantization for faster CPU inference; a sketch of how such quantization is typically produced follows the list below.

  • Embedding dimension: 768
  • Context length: 2048
  • Model size: ~150MB (75% smaller than FP32)
  • Quantization: Dynamic INT8 (QUInt8)
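
For reference, dynamic INT8 quantization of an ONNX model is normally done with ONNX Runtime's quantization tooling. A minimal sketch, assuming a locally exported FP32 model named model_fp32.onnx (the file name is illustrative; this is not necessarily the exact command used to produce this model):

from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights are converted to INT8 offline; activations are quantized
# on the fly at inference time (hence "dynamic").
quantize_dynamic(
    model_input="model_fp32.onnx",  # hypothetical path to the FP32 export
    model_output="model.onnx",
    weight_type=QuantType.QUInt8,
)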

Performance

  • ✅ 3-4x faster on CPU than the full-precision model (see the micro-benchmark sketch after this list)
  • ⚠️ Slight accuracy trade-off (~1-2% on MTEB benchmarks)
  • 💾 75% smaller file size
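
The speedup can be sanity-checked with a simple micro-benchmark. A sketch, assuming local copies of the quantized file (model.onnx) and the full-precision file (named model_fp32.onnx here for illustration):

import time
import numpy as np
import onnxruntime as ort

# Dummy batch: one sequence of 256 tokens (token id 1 throughout)
feed = {
    "input_ids": np.ones((1, 256), dtype=np.int64),
    "attention_mask": np.ones((1, 256), dtype=np.int64),
}

for path in ("model_fp32.onnx", "model.onnx"):
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    session.run(None, feed)  # warm-up run
    start = time.perf_counter()
    for _ in range(20):
        session.run(None, feed)
    print(path, f"{(time.perf_counter() - start) / 20 * 1000:.1f} ms/run")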

For GPU inference, use the full-precision model instead: permutans/Tarka-Embedding-150M-V1-ONNX

Usage

With ONNX Runtime (Python)

import onnxruntime as ort
from transformers import AutoTokenizer

# Assumes model.onnx has been downloaded from this repo into the working directory
session = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained("permutans/Tarka-Embedding-150M-V1-ONNX-Q")

text = "Your text here"
inputs = tokenizer(text, return_tensors="np")

onnx_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
}

# Outputs are [token_embeddings, sentence_embedding]; keep the pooled vector
_, sentence_embedding = session.run(None, onnx_inputs)
print(sentence_embedding.shape)  # (1, 768)
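
As a follow-on usage example, reusing the session and tokenizer from above, two texts can be compared via the cosine similarity of their pooled embeddings (the texts here are just placeholders):

import numpy as np

def embed(text):
    enc = tokenizer(text, return_tensors="np")
    _, emb = session.run(None, {
        "input_ids": enc["input_ids"],
        "attention_mask": enc["attention_mask"],
    })
    return emb[0]  # (768,)

a = embed("How do I reset my password?")
b = embed("Steps to change an account password")
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cos:.3f}")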

With FastEmbed (Rust)

Compatible with fastembed-rs for high-performance CPU embedding generation.

Model Outputs

  • token_embeddings: Token-level embeddings with shape (batch_size, sequence_length, 768)
  • sentence_embedding: Pooled sentence embeddings with shape (batch_size, 768); use this for most tasks (see the pooling sketch below)
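
If you only have token_embeddings (e.g. after slicing a batch), a pooled vector can be computed with attention-mask-aware mean pooling. Note the assumption here: that sentence_embedding is mean-pooled, which should be verified against the model's configuration before relying on this as a substitute.

import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # Zero out padding positions, then average over the sequence axis
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (B, S, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (B, 768)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # (B, 1)
    return summed / counts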
