Distilled Embedding Model (1536-d) — ONNX / INT8 / ORT
This repository folder contains a distilled SentenceTransformer embedding model exported to ONNX and additionally provided in INT8 ONNX and INT8 ORT formats for efficient deployment (including client-side scenarios).
Evaluation on Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K:
- Loss around 0.46–0.50 implies cosine similarity around 0.50–0.54 between student and teacher for the same input text.
- That is meaningfully better than random, but it is not a tight alignment.
- Quantised INT8 model is just 57 MB!
The model was trained to approximate OpenAI text-embedding-3-small embeddings (1536-d) on the dataset:
Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K, plus texts from the Tabotea corpus (in all languages): 3.5 million unique records in total, 80 GB of data.
Intended Use
- Primary: fast embedding generation where you want a local approximation of text-embedding-3-small at 1536 dimensions (e.g., in a web browser client).
- Common deployment patterns:
  1. All-local retrieval: both queries and documents embedded by this model.
  2. Hybrid: queries embedded locally, documents embedded server-side by OpenAI.

Pattern (1) is generally more stable. Pattern (2) is possible but has specific risks (see below).
Input / Output Contract
Use the same tokeniser as for mdbr-leaf-mt to generate the inputs; the model returns OpenAI-style embedding vectors.
Inputs
- `input_ids` (int64) — shape `[batch, seq]`
- `attention_mask` (int64) — shape `[batch, seq]`
- `token_type_ids` (int64) — only if the exported model expects it (some backbones do)
Output
- `sentence_embedding` (float32) — shape `[batch, 1536]`
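A minimal inference sketch against this contract, using onnxruntime and the base-model tokenizer (the file name `model.onnx` and the provider choice are assumptions):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Same tokeniser as the base model; truncation must match the export (max_length=512).
tokenizer = AutoTokenizer.from_pretrained("MongoDB/mdbr-leaf-mt")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

texts = ["What is the capital of France?"]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="np")

# Feed only the inputs the exported graph actually declares (token_type_ids may be absent).
input_names = {i.name for i in session.get_inputs()}
feed = {k: v.astype(np.int64) for k, v in enc.items() if k in input_names}

(embeddings,) = session.run(["sentence_embedding"], feed)  # float32, shape [batch, 1536]
```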
Normalization
Exports are typically configured to output L2-normalized embeddings (`normalized_output: true` in `export_meta.json`).
- If you store embeddings for dot-product search, keep a consistent convention:
  - normalized embeddings + dot product ≈ cosine similarity
- If your database expects raw vectors, normalize both query and document vectors consistently (see the sketch below).
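A minimal sketch of that convention, assuming embeddings are held as NumPy arrays (the random vectors stand in for model output):

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Divide each row by its Euclidean norm; eps guards against zero vectors.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

query_vecs = l2_normalize(np.random.randn(4, 1536).astype(np.float32))
doc_vecs = l2_normalize(np.random.randn(100, 1536).astype(np.float32))
scores = query_vecs @ doc_vecs.T  # dot product of unit vectors == cosine similarity
```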
Max sequence length
The model was trained/exported with a fixed truncation policy (max_length=512).
- Always use the same `max_length` and truncation settings on all clients to avoid drift.
Quantized Variants
- `model.onnx` (float): best fidelity; larger file, slower than quantized on CPU/WASM.
- `model.int8.onnx` (dynamic weight-only INT8): smaller, faster on many CPU/WASM setups; slightly reduced fidelity vs float (usually small, but measurable).
- `model.int8.ort`: same intent as INT8 ONNX, packaged for ORT-format workflows; typically used for minimal runtime deployments.
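For reference, such variants are typically produced with onnxruntime tooling along these lines (a reconstruction; the exact export commands used for this repo are not documented here):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic, weight-only INT8 quantization of the float export.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    weight_type=QuantType.QInt8,
)

# The ORT-format file can then be generated with:
#   python -m onnxruntime.tools.convert_onnx_models_to_ort model.int8.onnx
```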
Validation Recommendations
If you plan production use, validate at the level that matters:
- Retrieval metrics: Recall@K / nDCG@K on a query set that matches your users.
- Short-query performance: many real prompts are short; measure separately.
- Domain shift: test on your real documents and prompts, not only DBPedia-style text.
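A small sketch of a Recall@K check (hit-rate variant: the fraction of queries with at least one relevant document in the top K); the helper is hypothetical, not part of this repo:

```python
import numpy as np

def recall_at_k(scores: np.ndarray, relevant: list[set[int]], k: int = 10) -> float:
    # scores: [num_queries, num_docs] similarity matrix; relevant[i] holds gold doc indices for query i.
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [len(set(row.tolist()) & gold) > 0 for row, gold in zip(topk, relevant)]
    return float(np.mean(hits))
```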
Risks & Limitations (Short List)
1) Teacher–student mismatch (hybrid embedding spaces)
This model approximates OpenAI embeddings; it is not identical. If you embed documents with OpenAI and queries locally, ranking can degrade due to query-space mismatch, especially for near-ties.
Mitigation: Prefer “all-local” (both docs and queries with the same model), or add reranking / calibration.
2) Short prompt sensitivity
User prompts are often short. Distilled models frequently exhibit larger relative errors on short inputs.
Mitigation: Evaluate on a real prompt distribution; consider prompting templates or query expansion if needed.
3) Quantization accuracy loss (INT8)
INT8 generally introduces a small but consistent quality drop vs float.
Mitigation: Use float ONNX for highest fidelity; use INT8 where latency/size dominate and retrieval tolerates minor loss.
4) Preprocessing drift (tokenization, truncation, normalization)
Differences in max_length, truncation, whitespace cleanup, or normalization can materially change embeddings.
Mitigation: Lock preprocessing and document it; enforce consistent L2 normalization across query and doc vectors.
5) Upstream version drift (OpenAI embeddings)
If server-side embeddings are re-generated with a different OpenAI model/version, the student alignment may degrade.
Mitigation: Version embeddings and keep a migration plan; retrain or recalibrate if the teacher changes.
6) Client-side operational constraints
Browsers vary in performance and capabilities. Cold-start cost, memory pressure, and backend differences (WASM/WebGPU) can cause inconsistent UX.
Mitigation: Use caching, lazy loading, fallbacks, and device-aware routing (server-side embedding for low-end clients).
Licensing / Compliance Notes
- This model derives from a base model (MongoDB/mdbr-leaf-mt) and training datasets that have their own licenses/terms.
- Ensure you comply with:
- base model license
- dataset terms
- any constraints related to the teacher embeddings used for distillation
Practical Deployment Guidance
- If you need maximum relevance quality: use float ONNX and consider server-side embedding for queries.
- If you need low latency or offline mode: use INT8 ONNX client-side, but validate retrieval metrics and consider reranking.
- For hybrid (client query + server doc embeddings):
  - treat this as a performance optimisation that requires A/B validation
  - prefer larger top-K retrieval and reranking to compensate for the mismatch (see the sketch below)
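One way to over-retrieve and rerank, sketched with a cross-encoder (the reranker model named here is an assumption, not part of this repo):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example reranker, swap for your own

def rerank(query: str, candidates: list[str], keep: int = 10) -> list[str]:
    # Score each (query, candidate) pair, then keep the best `keep` candidates.
    scores = reranker.predict([(query, doc) for doc in candidates])
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order[:keep]]
```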
Training Procedure (Distillation Details)
This model is a student encoder trained to regress toward OpenAI text-embedding-3-small (1536-d) embeddings using embedding-level distillation.
Base Model and Head Modification
- Base encoder: `MongoDB/mdbr-leaf-mt` (SentenceTransformer).
- Projection head: the SentenceTransformer `Dense` head was replaced (or appended if missing) to output 1536 dimensions:
  - `bias=False`
  - `activation_function=Identity()`
  - weights initialised with Xavier uniform
- Output dimensionality after modification was enforced to be 1536.
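A simplified reconstruction of this head modification using the sentence-transformers API (not the original training script):

```python
import torch
from sentence_transformers import SentenceTransformer, models

student = SentenceTransformer("MongoDB/mdbr-leaf-mt")

# 1536-d projection head: no bias, identity activation, Xavier-uniform init.
dense = models.Dense(
    in_features=student.get_sentence_embedding_dimension(),
    out_features=1536,
    bias=False,
    activation_function=torch.nn.Identity(),
)
torch.nn.init.xavier_uniform_(dense.linear.weight)
student.append(dense)  # the card notes an existing Dense head would instead be replaced

assert student.get_sentence_embedding_dimension() == 1536
```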
Trainable Parameters (Partial Fine-Tuning)
To preserve the base semantic space while adapting the projection toward the teacher embeddings:
- All parameters were frozen, then:
- Dense head(s) were unfrozen (trainable)
- last 2 transformer blocks were unfrozen (trainable)
- (optionally) a final normalisation module was unfrozen if present
- Gradient checkpointing was enabled on the transformer backbone when supported.
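Continuing the sketch above, the freezing setup could look like this (assumption: the backbone is a BERT-style Hugging Face encoder exposed as `student[0].auto_model`):

```python
from sentence_transformers import models

for p in student.parameters():
    p.requires_grad = False  # freeze everything first

for module in student:
    if isinstance(module, models.Dense):
        for p in module.parameters():
            p.requires_grad = True  # unfreeze Dense head(s)

backbone = student[0].auto_model  # HF model inside the Transformer module
for block in backbone.encoder.layer[-2:]:  # last two transformer blocks
    for p in block.parameters():
        p.requires_grad = True

if backbone.supports_gradient_checkpointing:
    backbone.gradient_checkpointing_enable()
```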
Schema was normalised to:
- `text` (string)
- `labels` (float32[1536]) — teacher embedding
Dataset handling:
- Concatenation was performed after casting both datasets to an identical fixed-length feature schema.
- The combined dataset was shuffled (seeded), then split (3+ million records in total):
- Train: 99%
- Eval/Test: 1%
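A sketch of the combine/shuffle/split step with the Hugging Face datasets API (`ds_a` and `ds_b` stand in for the two sources already mapped to the {text, labels} schema; the seed value is an assumption):

```python
from datasets import Features, Sequence, Value, concatenate_datasets

features = Features({
    "text": Value("string"),
    "labels": Sequence(Value("float32"), length=1536),  # teacher embedding
})

combined = concatenate_datasets([ds_a.cast(features), ds_b.cast(features)])
splits = combined.shuffle(seed=42).train_test_split(test_size=0.01)  # 99% train / 1% eval
train_ds, eval_ds = splits["train"], splits["test"]
```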
Objective / Loss
The model was trained using a cosine regression objective against the teacher embeddings:
- Compute the student embedding `emb`.
- L2-normalise both student and teacher:
  - `emb_n = normalize(emb)`
  - `lab_n = normalize(labels)`
- Loss per batch (cosine loss):
  - `loss = 1 - mean(sum(emb_n * lab_n, dim=-1))`
This directly optimises angular alignment between student and teacher vectors.
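In PyTorch terms, the loss amounts to (a minimal sketch):

```python
import torch
import torch.nn.functional as F

def cosine_distill_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Both tensors: [batch, 1536]. Normalise, take per-row dot products, average over the batch.
    emb_n = F.normalize(student_emb, dim=-1)
    lab_n = F.normalize(teacher_emb, dim=-1)
    return 1.0 - (emb_n * lab_n).sum(dim=-1).mean()
```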
Tokenisation / Sequence Handling
- Tokeniser: inherited from the base model.
- Truncation/padding:
  - `max_length = 512`
  - dynamic padding per batch, truncation enabled
Optimisation and Schedule
Key hyperparameters:
- Epochs: 10
- Learning rate: 1e-3
- Weight decay: 0.01
- Warmup ratio: 0.03
- Scheduler: cosine
- Optimizer: AdamW (fused) (`adamw_torch_fused`)
- NEFTune: disabled (`neftune_noise_alpha = 0.0`)
Precision and kernels:
- BF16 enabled on capable GPUs (e.g., H100), else FP16
- TF32 is enabled where applicable for throughput
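Expressed as transformers TrainingArguments, the configuration would look roughly like this (a reconstruction from the values above; `output_dir` and the bf16/fp16 switch are assumptions):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="distill-1536",      # placeholder
    num_train_epochs=10,
    learning_rate=1e-3,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    neftune_noise_alpha=None,       # NEFTune disabled
    bf16=True,                      # on capable GPUs; use fp16=True otherwise
    tf32=True,
)
```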
On a single A100 80 GB GPU, training took 5 hours.