Distilled Embedding Model (1536-d) — ONNX / INT8 / ORT
This repository folder contains a distilled SentenceTransformer embedding model exported to ONNX and additionally provided in INT8 ONNX and INT8 ORT formats for efficient deployment (including client-side scenarios).
Evaluation on Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K:
- Loss around 0.46–0.50 implies cosine similarity around 0.50–0.54 between student and teacher for the same input text.
- That is meaningfully better than random, but it is not a tight alignment.
- Quantised INT8 model is just 57 MB!
The model was trained to approximate OpenAI text-embedding-3-small embeddings (1536-d) on the dataset:
Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K, plus texts from the Tabotea corpus (in all languages): 3.5 million unique records in total, 80 GB of data.
Intended Use
- Primary: fast embedding generation where you want a local approximation of text-embedding-3-small at 1536 dimensions (e.g., in a web browser client).
- Common deployment patterns:
  1. All-local retrieval: both queries and documents embedded by this model.
  2. Hybrid: queries embedded locally, documents embedded server-side by OpenAI.

Pattern (1) is generally more stable. Pattern (2) is possible but has specific risks (see below).
Input / Output Contract
Use the same tokeniser as for mdbr-leaf-mt to generate the inputs; the model returns OpenAI-style embedding vectors.
Inputs
- `input_ids` (int64) — shape `[batch, seq]`
- `attention_mask` (int64) — shape `[batch, seq]`
- `token_type_ids` (int64) — only if the exported model expects it (some backbones do)
Output
- `sentence_embedding` (float32) — shape `[batch, 1536]`
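A minimal inference sketch against this contract, using onnxruntime and the base-model tokenizer (the file name `model.onnx` and the provider choice are assumptions):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Same tokeniser as the base model; truncation must match the export (max_length=512).
tokenizer = AutoTokenizer.from_pretrained("MongoDB/mdbr-leaf-mt")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

texts = ["What is the capital of France?"]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="np")

# Feed only the inputs the exported graph actually declares (token_type_ids may be absent).
input_names = {i.name for i in session.get_inputs()}
feed = {k: v.astype(np.int64) for k, v in enc.items() if k in input_names}

(embeddings,) = session.run(["sentence_embedding"], feed)  # float32, shape [batch, 1536]
```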
Normalization
Exports are typically configured to output L2-normalized embeddings (`normalized_output: true` in `export_meta.json`).
- If you store embeddings for dot-product search, keep a consistent convention:
  - normalized embeddings + dot product ≈ cosine similarity
- If your database expects raw vectors, normalize both query and document vectors consistently (see the sketch below).
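A minimal sketch of that convention, assuming embeddings are held as NumPy arrays (the random vectors stand in for model output):

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Divide each row by its Euclidean norm; eps guards against zero vectors.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

query_vecs = l2_normalize(np.random.randn(4, 1536).astype(np.float32))
doc_vecs = l2_normalize(np.random.randn(100, 1536).astype(np.float32))
scores = query_vecs @ doc_vecs.T  # dot product of unit vectors == cosine similarity
```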
Max sequence length
The model was trained/exported with a fixed truncation policy (max_length=512).
- Always use the same `max_length` and truncation settings on all clients to avoid drift.
Quantized Variants
- `model.onnx` (float): best fidelity; larger file, slower than quantized on CPU/WASM.
- `model.int8.onnx` (dynamic weight-only INT8): smaller, faster on many CPU/WASM setups; slightly reduced fidelity vs float (usually small, but measurable).
- `model.int8.ort`: same intent as INT8 ONNX, packaged for ORT-format workflows; typically used for minimal runtime deployments.
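For reference, such variants are typically produced with onnxruntime tooling along these lines (a reconstruction; the exact export commands used for this repo are not documented here):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic, weight-only INT8 quantization of the float export.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    weight_type=QuantType.QInt8,
)

# The ORT-format file can then be generated with:
#   python -m onnxruntime.tools.convert_onnx_models_to_ort model.int8.onnx
```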
Validation Recommendations
If you plan production use, validate at the level that matters:
- Retrieval metrics: Recall@K / nDCG@K on a query set that matches your users.
- Short-query performance: many real prompts are short; measure separately.
- Domain shift: test on your real documents and prompts, not only DBPedia-style text.
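A small sketch of a Recall@K check (hit-rate variant: the fraction of queries with at least one relevant document in the top K); the helper is hypothetical, not part of this repo:

```python
import numpy as np

def recall_at_k(scores: np.ndarray, relevant: list[set[int]], k: int = 10) -> float:
    # scores: [num_queries, num_docs] similarity matrix; relevant[i] holds gold doc indices for query i.
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [len(set(row.tolist()) & gold) > 0 for row, gold in zip(topk, relevant)]
    return float(np.mean(hits))
```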
Risks & Limitations (Short List)
1) Teacher–student mismatch (hybrid embedding spaces)
This model approximates OpenAI embeddings; it is not identical. If you embed documents with OpenAI and queries locally, ranking can degrade due to query-space mismatch, especially for near-ties.
Mitigation: Prefer “all-local” (both docs and queries with the same model), or add reranking / calibration.
2) Short prompt sensitivity
User prompts are often short. Distilled models frequently exhibit larger relative errors on short inputs.
Mitigation: Evaluate on a real prompt distribution; consider prompting templates or query expansion if needed.
3) Quantization accuracy loss (INT8)
INT8 generally introduces a small but consistent quality drop vs float.
Mitigation: Use float ONNX for highest fidelity; use INT8 where latency/size dominate and retrieval tolerates minor loss.
4) Preprocessing drift (tokenization, truncation, normalization)
Differences in max_length, truncation, whitespace cleanup, or normalization can materially change embeddings.
Mitigation: Lock preprocessing and document it; enforce consistent L2 normalization across query and doc vectors.
5) Upstream version drift (OpenAI embeddings)
If server-side embeddings are re-generated with a different OpenAI model/version, the student alignment may degrade.
Mitigation: Version embeddings and keep a migration plan; retrain or recalibrate if the teacher changes.
6) Client-side operational constraints
Browsers vary in performance and capabilities. Cold-start cost, memory pressure, and backend differences (WASM/WebGPU) can cause inconsistent UX.
Mitigation: Use caching, lazy loading, fallbacks, and device-aware routing (server-side embedding for low-end clients).
Licensing / Compliance Notes
- This model derives from a base model (MongoDB/mdbr-leaf-mt) and training datasets that have their own licenses/terms.
- Ensure you comply with:
- base model license
- dataset terms
- any constraints related to the teacher embeddings used for distillation
Practical Deployment Guidance
- If you need maximum relevance quality: use float ONNX and consider server-side embedding for queries.
- If you need low latency or offline mode: use INT8 ONNX client-side, but validate retrieval metrics and consider reranking.
- For hybrid (client query + server doc embeddings):
  - treat this as a performance optimisation that requires A/B validation
  - prefer larger top-K retrieval and reranking to compensate for the mismatch (see the sketch below)
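One way to over-retrieve and rerank, sketched with a cross-encoder (the reranker model named here is an assumption, not part of this repo):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example reranker, swap for your own

def rerank(query: str, candidates: list[str], keep: int = 10) -> list[str]:
    # Score each (query, candidate) pair, then keep the best `keep` candidates.
    scores = reranker.predict([(query, doc) for doc in candidates])
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order[:keep]]
```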
Training Procedure (Distillation Details)
This model is a student encoder trained to regress toward OpenAI text-embedding-3-small (1536-d) embeddings using embedding-level distillation.
Base Model and Head Modification
- Base encoder: `MongoDB/mdbr-leaf-mt` (SentenceTransformer).
- Projection head: the SentenceTransformer `Dense` head was replaced (or appended if missing) to output 1536 dimensions:
  - `bias=False`
  - `activation_function=Identity()`
  - weights initialised with Xavier uniform
- Output dimensionality after modification was enforced to be 1536.
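A simplified reconstruction of this head modification using the sentence-transformers API (not the original training script):

```python
import torch
from sentence_transformers import SentenceTransformer, models

student = SentenceTransformer("MongoDB/mdbr-leaf-mt")

# 1536-d projection head: no bias, identity activation, Xavier-uniform init.
dense = models.Dense(
    in_features=student.get_sentence_embedding_dimension(),
    out_features=1536,
    bias=False,
    activation_function=torch.nn.Identity(),
)
torch.nn.init.xavier_uniform_(dense.linear.weight)
student.append(dense)  # the card notes an existing Dense head would instead be replaced

assert student.get_sentence_embedding_dimension() == 1536
```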
Trainable Parameters (Partial Fine-Tuning)
To preserve the base semantic space while adapting the projection toward the teacher embeddings:
- All parameters were frozen, then:
- Dense head(s) were unfrozen (trainable)
- last 2 transformer blocks were unfrozen (trainable)
- (optionally) a final normalisation module was unfrozen if present
- Gradient checkpointing was enabled on the transformer backbone when supported.
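Continuing the sketch above, the freezing setup could look like this (assumption: the backbone is a BERT-style Hugging Face encoder exposed as `student[0].auto_model`):

```python
from sentence_transformers import models

for p in student.parameters():
    p.requires_grad = False  # freeze everything first

for module in student:
    if isinstance(module, models.Dense):
        for p in module.parameters():
            p.requires_grad = True  # unfreeze Dense head(s)

backbone = student[0].auto_model  # HF model inside the Transformer module
for block in backbone.encoder.layer[-2:]:  # last two transformer blocks
    for p in block.parameters():
        p.requires_grad = True

if backbone.supports_gradient_checkpointing:
    backbone.gradient_checkpointing_enable()
```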
Schema was normalised to:
- `text` (string)
- `labels` (float32[1536]) — teacher embedding
Dataset handling:
- Concatenation was performed after casting both datasets to an identical fixed-length feature schema.
- The combined dataset was shuffled (seeded), then split (3+ million records in total):
- Train: 99%
- Eval/Test: 1%
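A sketch of the combine/shuffle/split step with the Hugging Face datasets API (`ds_a` and `ds_b` stand in for the two sources already mapped to the {text, labels} schema; the seed value is an assumption):

```python
from datasets import Features, Sequence, Value, concatenate_datasets

features = Features({
    "text": Value("string"),
    "labels": Sequence(Value("float32"), length=1536),  # teacher embedding
})

combined = concatenate_datasets([ds_a.cast(features), ds_b.cast(features)])
splits = combined.shuffle(seed=42).train_test_split(test_size=0.01)  # 99% train / 1% eval
train_ds, eval_ds = splits["train"], splits["test"]
```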
Objective / Loss
The model was trained using a cosine regression objective against the teacher embeddings:
- Compute the student embedding `emb`.
- L2-normalise both student and teacher:
  - `emb_n = normalize(emb)`
  - `lab_n = normalize(labels)`
- Loss per batch (cosine loss):
  - `loss = 1 - mean(sum(emb_n * lab_n, dim=-1))`
This directly optimises angular alignment between student and teacher vectors.
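In PyTorch terms, the loss amounts to (a minimal sketch):

```python
import torch
import torch.nn.functional as F

def cosine_distill_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Both tensors: [batch, 1536]. Normalise, take per-row dot products, average over the batch.
    emb_n = F.normalize(student_emb, dim=-1)
    lab_n = F.normalize(teacher_emb, dim=-1)
    return 1.0 - (emb_n * lab_n).sum(dim=-1).mean()
```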
Tokenisation / Sequence Handling
- Tokeniser: inherited from the base model.
- Truncation/padding:
  - `max_length = 512`
  - dynamic padding per batch, truncation enabled
Optimisation and Schedule
Key hyperparameters:
- Epochs: 10
- Learning rate: 1e-3
- Weight decay: 0.01
- Warmup ratio: 0.03
- Scheduler: cosine
- Optimizer: AdamW (fused) (`adamw_torch_fused`)
- NEFTune: disabled (`neftune_noise_alpha = 0.0`)
Precision and kernels:
- BF16 enabled on capable GPUs (e.g., H100), else FP16
- TF32 is enabled where applicable for throughput
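Expressed as transformers TrainingArguments, the configuration would look roughly like this (a reconstruction from the values above; `output_dir` and the bf16/fp16 switch are assumptions):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="distill-1536",      # placeholder
    num_train_epochs=10,
    learning_rate=1e-3,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    neftune_noise_alpha=None,       # NEFTune disabled
    bf16=True,                      # on capable GPUs; use fp16=True otherwise
    tf32=True,
)
```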
On a single A100 80 GB GPU, training took 5 hours.