
Distilled Embedding Model (1536-d) — ONNX / INT8 / ORT

This repository folder contains a distilled SentenceTransformer embedding model exported to ONNX and additionally provided in INT8 ONNX and INT8 ORT formats for efficient deployment (including client-side scenarios).

Evaluation on Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K:


  • Loss around 0.46–0.50 implies cosine similarity around 0.50–0.54 between student and teacher for the same input text.
  • That is meaningfully better than random, but it is not a tight alignment.
  • The quantized INT8 model is just 57 MB!

The model was trained to approximate OpenAI text-embedding-3-small embeddings (1536-d) on the Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K dataset plus texts from the Tabotea corpus (in all languages): 3.5 million unique records in total, about 80 GB of data.

Intended Use

  • Primary: fast embedding generation where you want a local approximation of text-embedding-3-small at 1536 dimensions (e.g., in a web-browser client).
  • Common deployment patterns:
  1. All-local retrieval: both queries and documents embedded by this model.
  2. Hybrid: queries embedded locally, documents embedded server-side by OpenAI.

Pattern (1) is generally more stable. Pattern (2) is possible but has specific risks (see below).


Input / Output Contract

Use the same tokeniser as the base mdbr-leaf-mt model to generate inputs; the model returns OpenAI-style embedding vectors.

Inputs

  • input_ids (int64) — shape [batch, seq]
  • attention_mask (int64) — shape [batch, seq]
  • token_type_ids (int64) — only if the exported model expects it (some backbones do)

Output

  • sentence_embedding (float32) — shape [batch, 1536]
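A minimal Python sketch of this contract with onnxruntime (the tokenizer repo id and file path here are assumptions; use the tokenizer files that ship with this folder):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Assumptions: the tokenizer is the base model's, and model.onnx
# sits in the current directory; adjust both to your setup.
tokenizer = AutoTokenizer.from_pretrained("MongoDB/mdbr-leaf-mt")
session = ort.InferenceSession("model.onnx")

texts = ["What is DBpedia?", "An open knowledge graph extracted from Wikipedia."]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="np")

# Feed only the inputs the exported graph declares
# (token_type_ids exists for some backbones only).
input_names = {i.name for i in session.get_inputs()}
feeds = {k: v.astype(np.int64) for k, v in enc.items() if k in input_names}

(embeddings,) = session.run(["sentence_embedding"], feeds)
print(embeddings.shape)  # (2, 1536)
```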

Normalization

Exports are typically configured to output L2-normalized embeddings (normalized_output: true in export_meta.json).

  • If you store embeddings for dot-product search, keep a consistent convention: normalized embeddings + dot product ≈ cosine similarity.
  • If your database expects raw vectors, normalize both query and document vectors consistently.
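A small helper illustrating that convention (random vectors stand in for real embeddings):

```python
import numpy as np

def l2_normalize(v: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Normalize along the last axis so that dot product equals cosine similarity.
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)

rng = np.random.default_rng(0)
queries = l2_normalize(rng.standard_normal((4, 1536)).astype(np.float32))
docs = l2_normalize(rng.standard_normal((100, 1536)).astype(np.float32))
scores = queries @ docs.T  # cosine similarity, since both sides are unit-length
```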

Max sequence length

The model was trained/exported with a fixed truncation policy (max_length=512).

  • Always use the same max_length and truncation settings on all clients to avoid drift.

Quantized Variants

  • model.onnx (float):
    • best fidelity
    • larger file, slower than quantized on CPU/WASM
  • model.int8.onnx (dynamic weight-only INT8):
    • smaller, faster on many CPU/WASM setups
    • slightly reduced fidelity vs float (usually small, but measurable)
  • model.int8.ort:
    • same intent as INT8 ONNX, packaged for ORT-format workflows
    • typically used for minimal runtime deployments
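For reference, a dynamic weight-only INT8 file like model.int8.onnx can be produced with onnxruntime's quantization tooling; this is a sketch of the general recipe, not the exact export script used for this repo:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic weight-only INT8: weights are stored as INT8 and dequantized
# on the fly, while activations stay in float at runtime.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    weight_type=QuantType.QInt8,
)
```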


Validation Recommendations

If you plan production use, validate at the level that matters:

  • Retrieval metrics: Recall@K / nDCG@K on a query set that matches your users.
  • Short-query performance: many real prompts are short; measure separately.
  • Domain shift: test on your real documents and prompts, not only DBPedia-style text.
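A minimal hit-rate-style Recall@K helper for such checks (one of several common definitions; relevance judgments are yours to supply):

```python
import numpy as np

def recall_at_k(scores: np.ndarray, relevant: list[set[int]], k: int = 10) -> float:
    # scores: [num_queries, num_docs] similarity matrix.
    # relevant[i]: set of relevant document indices for query i.
    # Returns the fraction of queries with >= 1 relevant doc in the top k.
    hits = 0
    for i, rel in enumerate(relevant):
        top_k = np.argpartition(-scores[i], k)[:k]
        hits += bool(rel & set(top_k.tolist()))
    return hits / len(relevant)
```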

Risks & Limitations (Short List)

1) Teacher–student mismatch (hybrid embedding spaces)

This model approximates OpenAI embeddings; it is not identical. If you embed documents with OpenAI and queries locally, ranking can degrade due to query-space mismatch, especially for near-ties.

Mitigation: Prefer “all-local” (both docs and queries with the same model), or add reranking / calibration.

2) Short prompt sensitivity

User prompts are often short. Distilled models frequently exhibit larger relative errors on short inputs.

Mitigation: Evaluate on a real prompt distribution; consider prompting templates or query expansion if needed.

3) Quantization accuracy loss (INT8)

INT8 generally introduces a small but consistent quality drop vs float.

Mitigation: Use float ONNX for highest fidelity; use INT8 where latency/size dominate and retrieval tolerates minor loss.

4) Preprocessing drift (tokenization, truncation, normalization)

Differences in max_length, truncation, whitespace cleanup, or normalization can materially change embeddings.

Mitigation: Lock preprocessing and document it; enforce consistent L2 normalization across query and doc vectors.
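One way to lock preprocessing is a single pinned contract shared by every client; the values below mirror this card's settings (the tokenizer id is the base model's):

```python
# Versioned preprocessing contract; every client must load and apply this.
PREPROCESS = {
    "tokenizer": "MongoDB/mdbr-leaf-mt",
    "max_length": 512,
    "truncation": True,
    "padding": "longest",  # dynamic padding per batch
    "l2_normalize": True,  # matches normalized_output: true
}
```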

5) Upstream version drift (OpenAI embeddings)

If server-side embeddings are re-generated with a different OpenAI model/version, the student alignment may degrade.

Mitigation: Version embeddings and keep a migration plan; retrain or recalibrate if the teacher changes.

6) Client-side operational constraints

Browsers vary in performance and capabilities. Cold-start cost, memory pressure, and backend differences (WASM/WebGPU) can cause inconsistent UX.

Mitigation: Use caching, lazy loading, fallbacks, and device-aware routing (server-side embedding for low-end clients).


Licensing / Compliance Notes

  • This model derives from a base model (e.g., MongoDB/mdbr-leaf-mt) and training datasets that have their own licenses/terms.
  • Ensure you comply with:
    • the base model license
    • dataset terms
    • any constraints related to the teacher embeddings used for distillation

Practical Deployment Guidance

  • If you need maximum relevance quality: use float ONNX and consider server-side embedding for queries.
  • If you need low latency or offline mode: use INT8 ONNX client-side, but validate retrieval metrics and consider reranking.
  • For hybrid (client query + server doc embeddings):
    • treat this as a performance optimisation that requires A/B validation
    • prefer larger top-K retrieval plus reranking to compensate for the mismatch, as sketched below
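A sketch of that over-retrieve-then-rerank pattern (rerank_fn is a hypothetical callable; substitute any reranker you trust):

```python
import numpy as np

def hybrid_search(query_vec, doc_vecs, rerank_fn, top_k=100, final_k=10):
    # Over-retrieve with the (slightly mismatched) student-query vs.
    # teacher-document similarity, then let a reranker settle near-ties.
    candidates = np.argsort(-(doc_vecs @ query_vec))[:top_k]
    return rerank_fn(candidates)[:final_k]
```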


Training Procedure (Distillation Details)

This model is a student encoder trained to regress toward OpenAI text-embedding-3-small (1536-d) embeddings using embedding-level distillation.

Base Model and Head Modification

  • Base encoder: MongoDB/mdbr-leaf-mt (SentenceTransformer).
  • Projection head: the SentenceTransformer Dense head was replaced (or appended if missing) to output 1536 dimensions:
    • bias=False
    • activation_function=Identity()
    • weights initialised with Xavier uniform
  • Output dimensionality after modification was enforced to be 1536.
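A sketch of that head modification in sentence-transformers (the exact training script is not published in this repo):

```python
import torch
from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer("MongoDB/mdbr-leaf-mt")

# Bias-free Dense head projecting to 1536 dims, Identity activation,
# Xavier-uniform initialisation, as described above.
head = models.Dense(
    in_features=model.get_sentence_embedding_dimension(),
    out_features=1536,
    bias=False,
    activation_function=torch.nn.Identity(),
)
torch.nn.init.xavier_uniform_(head.linear.weight)
model.append(head)

assert model.get_sentence_embedding_dimension() == 1536
```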

Trainable Parameters (Partial Fine-Tuning)

To preserve the base semantic space while adapting the projection toward the teacher embeddings:

  • All parameters were frozen, then:
    • Dense head(s) were unfrozen (trainable)
    • the last 2 transformer blocks were unfrozen (trainable)
    • (optionally) a final normalisation module was unfrozen, if present
  • Gradient checkpointing was enabled on the transformer backbone when supported.
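Continuing the sketch above, the freezing scheme might look like this (module paths assume a BERT-style backbone and are illustrative):

```python
# Freeze everything, then selectively unfreeze.
for p in model.parameters():
    p.requires_grad = False

backbone = model[0].auto_model             # HF transformer inside SentenceTransformer
for block in backbone.encoder.layer[-2:]:  # last 2 transformer blocks
    for p in block.parameters():
        p.requires_grad = True

for p in head.parameters():                # the Dense projection head
    p.requires_grad = True

backbone.gradient_checkpointing_enable()   # when supported
```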

Training Data

The schema was normalised to:

  • text (string)
  • labels (float32[1536]) — teacher embedding

Dataset handling:

  • Concatenation was performed after casting both datasets to an identical fixed-length feature schema.
  • The combined dataset (3+ million records) was shuffled with a fixed seed, then split:
    • Train: 99%
    • Eval/Test: 1%
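In datasets terms, roughly (ds_a and ds_b stand for the two source datasets; the seed value is illustrative):

```python
from datasets import Features, Sequence, Value, concatenate_datasets

# Cast both sources to the identical fixed-length schema, then combine.
features = Features({
    "text": Value("string"),
    "labels": Sequence(Value("float32"), length=1536),  # teacher embedding
})
combined = concatenate_datasets([ds_a.cast(features), ds_b.cast(features)])
split = combined.shuffle(seed=42).train_test_split(test_size=0.01)
train_ds, eval_ds = split["train"], split["test"]
```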

Objective / Loss

The model was trained using a cosine regression objective against the teacher embeddings:

  • Compute the student embedding emb.
  • L2-normalise both student and teacher:
    • emb_n = normalize(emb)
    • lab_n = normalize(labels)
  • Loss per batch:
    • cosine loss: loss = 1 - mean(sum(emb_n * lab_n))

This directly optimises angular alignment between student and teacher vectors.
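As a direct transcription in PyTorch:

```python
import torch
import torch.nn.functional as F

def cosine_distill_loss(emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # 1 minus the mean cosine similarity between student and teacher vectors.
    emb_n = F.normalize(emb, dim=-1)
    lab_n = F.normalize(labels, dim=-1)
    return 1.0 - (emb_n * lab_n).sum(dim=-1).mean()
```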

Tokenisation / Sequence Handling

  • Tokeniser: inherited from the base model.
  • Truncation/padding:
    • max_length = 512
    • dynamic padding per batch, truncation enabled

Optimisation and Schedule

Key hyperparameters:

  • Epochs: 10
  • Learning rate: 1e-3
  • Weight decay: 0.01
  • Warmup ratio: 0.03
  • Scheduler: cosine
  • Optimizer: fused AdamW (adamw_torch_fused)
  • NEFTune: disabled (neftune_noise_alpha = 0.0)

Precision and kernels:

  • BF16 enabled on capable GPUs (e.g., H100), else FP16
  • TF32 is enabled where applicable for throughput
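These map onto Hugging Face TrainingArguments roughly as follows (output_dir is illustrative; sentence-transformers' own training arguments subclass this):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="distill-out",
    num_train_epochs=10,
    learning_rate=1e-3,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    bf16=True,   # on capable GPUs (e.g., H100); use fp16=True otherwise
    tf32=True,
)
```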

On a single A100 80GB, training took about 5 hours.
