+---
+license: other
+license_name: link-attribution
+license_link: https://dejan.ai/blog/query-length-vs-volume/
+language:
+- en
+library_name: transformers
+pipeline_tag: text-classification
+tags:
+- deberta-v2
+- deberta-v3
+- ecommerce
+- search
+- query-volume
+- seo
+- keyword-research
+- amazon
+base_model: microsoft/deberta-v3-base
+datasets:
+- amazon/AmazonQAC
+metrics:
+- accuracy
+- f1
+model-index:
+- name: ecommerce-query-volume-classifier
+  results:
+  - task:
+      type: text-classification
+      name: Search Query Volume Classification
+    dataset:
+      name: Amazon Shopping Queries (AmazonQAC)
+      type: amazon/AmazonQAC
+    metrics:
+    - name: Accuracy
+      type: accuracy
+      value: 0.721
+    - name: Macro F1
+      type: f1
+      value: 0.6877
+    - name: Spearman Correlation
+      type: spearmanr
+      value: 0.896
+---
+# eCommerce Query Volume Classifier
+A fine-tuned [DeBERTa v3 base](https://huggingface.co/microsoft/deberta-v3-base) model that predicts the search volume class of ecommerce product queries. Trained on 39.6 million unique queries from the [Amazon Shopping Queries](https://huggingface.co/datasets/amazon/AmazonQAC) dataset spanning 395.5 million search sessions.
+**Blog post:** [Is Query Length a Reliable Predictor of Search Volume?](https://dejan.ai/blog/query-length-vs-volume/)
+## Model Description
+This model classifies ecommerce search queries into five volume tiers based on their expected search popularity:
+| Label | Class | Occurrences | Description |
+|-------|-------|-------------|-------------|
+| 0 | `very_high` | 10,000+ | Head terms, major brands (e.g. "airpods", "laptop") |
+| 1 | `high` | 1,000–9,999 | Popular product categories and well-known items |
+| 2 | `medium` | 100–999 | Moderately specific queries |
+| 3 | `low` | 10–99 | Niche or qualified queries |
+| 4 | `very_low` | <10 | Long-tail, highly specific queries |
+The model learns semantic signals — brand recognition, category head terms, specificity markers — rather than superficial features like query length. Simple character/word-count heuristics achieve only ~25% accuracy on this task (barely above the 20% random baseline), while this model achieves **72.1% accuracy**.
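The tier boundaries in the table above can be written as a small lookup. This is a minimal sketch, assuming the boundaries are applied to a raw per-query occurrence count exactly as listed; the names `volume_class`, `THRESHOLDS`, and the toy `query_log` are hypothetical and not taken from the training code.

```python
from bisect import bisect_right
from collections import Counter

# Lower bounds of the four upper tiers, in ascending order (from the label table).
THRESHOLDS = [10, 100, 1_000, 10_000]
# Tier names ordered from fewest to most occurrences.
TIERS_ASCENDING = ["very_low", "low", "medium", "high", "very_high"]
# Label ids as listed in the table (0 = very_high ... 4 = very_low).
LABEL_IDS = {"very_high": 0, "high": 1, "medium": 2, "low": 3, "very_low": 4}


def volume_class(occurrences: int) -> str:
    """Map a raw occurrence count to its volume tier name."""
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    # bisect_right counts how many lower bounds the value has reached.
    return TIERS_ASCENDING[bisect_right(THRESHOLDS, occurrences)]


# Hypothetical toy log: one entry per observed search.
query_log = ["airpods"] * 12 + ["wireless mouse"] * 3
counts = Counter(query_log)
tiers = {query: volume_class(n) for query, n in counts.items()}
```

Keeping the thresholds in one sorted list makes the boundary behaviour (for example, 99 versus 100) easy to check in isolation.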
+## Usage
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+
+model_name = "dejanseo/ecommerce-query-volume-classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+model.eval()  # inference mode: disables dropout
+
+# Position i holds the name of class id i (see the label table above).
+labels = ["very_high", "high", "medium", "low", "very_low"]
+queries = [
+    "airpods",
+    "wireless mouse",
+    "organic flurb capsules",
+    "replacement gasket for instant pot duo 8 quart",
+]
+
+# max_length=32 matches the maximum sequence length used in training.
+inputs = tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=32)
+with torch.no_grad():
+    outputs = model(**inputs)
+    probs = torch.softmax(outputs.logits, dim=-1)  # per-class probabilities
+    preds = torch.argmax(probs, dim=-1)            # most likely class id per query
+
+# Print the predicted tier and its probability for each query.
+for query, pred, prob in zip(queries, preds, probs):
+    label = labels[pred.item()]
+    confidence = prob[pred.item()].item() * 100
+    print(f"{query:50s} → {label:>10s}  ({confidence:.1f}%)")
+```
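Because the five classes are ordered, the probability vector from the usage example can also be collapsed into a single ordinal score, which is convenient for ranking queries. The following is a sketch of one possible post-processing step, not something the model card itself prescribes; `expected_tier` is a hypothetical helper and the probability rows are made up for illustration.

```python
LABELS = ["very_high", "high", "medium", "low", "very_low"]


def expected_tier(probs):
    """Probability-weighted mean class id: 0.0 = very_high, 4.0 = very_low."""
    if len(probs) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} probabilities, got {len(probs)}")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Dividing by the total tolerates rows that do not sum to exactly 1.
    return sum(class_id * p for class_id, p in enumerate(probs)) / total


# Hypothetical probability rows (not real model output).
confident_head = [0.90, 0.08, 0.02, 0.00, 0.00]
uncertain_mid = [0.05, 0.10, 0.50, 0.30, 0.05]

# Sort from highest expected volume to lowest.
ranked = sorted(
    {"query a": confident_head, "query b": uncertain_mid}.items(),
    key=lambda item: expected_tier(item[1]),
)
```

With tensors from the usage example, `probs.tolist()` would provide rows in the same layout.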
+## Performance
+### Evaluation (25K balanced sample, 5K per class)
+| Method | Accuracy | Spearman ρ |
+|--------|----------|------------|
+| **This model** | **72.1%** | **0.896** |
+| Word count heuristic | 25.4% | -0.345 |
+| Char count heuristic | 24.9% | -0.336 |
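For readers who want to reproduce this kind of comparison on their own data, the two metrics in the table can be computed without extra dependencies. This is a minimal sketch: the function names are hypothetical, the toy inputs are invented, and the card does not state how the heuristics map lengths to classes or which sign convention was used for ρ, so none of that is reproduced here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class ids."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: word counts and class ids (0 = very_high ... 4 = very_low).
word_counts = [1, 1, 2, 3, 5, 8]
class_ids = [0, 1, 1, 3, 4, 4]
rho = spearman(word_counts, class_ids)
```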
+### Per-Class F1 Scores (best validation checkpoint)
+| Class | Precision | Recall | F1 |
+|-------|-----------|--------|----|
+| very_high | 0.892 | 0.980 | 0.934 |
+| high | 0.727 | 0.921 | 0.813 |
+| medium | 0.625 | 0.790 | 0.698 |
+| low | 0.496 | 0.335 | 0.400 |
+| very_low | 0.610 | 0.579 | 0.594 |
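+The model performs best on the high-volume classes (`very_high` and `high`), and F1 drops for the lower-volume classes. It struggles most with the `low` class, which sits in an ambiguous zone between `medium` and `very_low`: its recall of 0.335 means about two thirds of true `low` queries are assigned to other classes.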
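The F1 column follows from the precision and recall columns via F1 = 2PR / (P + R), and the macro F1 in the model metadata is the unweighted mean of the five per-class values. A short check using only the numbers from the table above (results agree up to rounding of the published inputs):

```python
# (precision, recall) per class, copied from the per-class table.
PER_CLASS = {
    "very_high": (0.892, 0.980),
    "high": (0.727, 0.921),
    "medium": (0.625, 0.790),
    "low": (0.496, 0.335),
    "very_low": (0.610, 0.579),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_by_class = {label: f1_score(p, r) for label, (p, r) in PER_CLASS.items()}
# Macro F1 weights every class equally, regardless of class size.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

for label, f1 in f1_by_class.items():
    print(f"{label:>10s}  F1 = {f1:.3f}")
print(f"{'macro':>10s}  F1 = {macro_f1:.4f}")
```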
+## Training Details
+### Hyperparameters
+| Parameter | Value |
+|-----------|-------|
+| Base model | `microsoft/deberta-v3-base` |
+| Epochs | 20 |
+| Batch size | 128 |
+| Learning rate | 3e-5 |
+| Max sequence length | 32 |
+| Warmup ratio | 0.1 |
+| Weight decay | 0.01 |
+| Label smoothing | 0.1 |
+| Scheduler | Linear with warmup |
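To make the schedule concrete, the sketch below derives step counts from the table and evaluates a linear warmup followed by linear decay. It assumes one optimizer step per batch (no gradient accumulation), a single device, that the final partial batch is kept, and the 324,000 training samples per epoch stated in the sampling section; the actual training script may differ on any of these points.

```python
import math

EPOCHS = 20
BATCH_SIZE = 128
BASE_LR = 3e-5
WARMUP_RATIO = 0.1
TRAIN_SAMPLES_PER_EPOCH = 324_000  # per-epoch training size from the sampling section

# Assumption: one optimizer step per batch, last partial batch included.
steps_per_epoch = math.ceil(TRAIN_SAMPLES_PER_EPOCH / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def learning_rate(step: int) -> float:
    """Linear ramp from 0 to BASE_LR over warmup, then linear decay to 0."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / (total_steps - warmup_steps)


print(f"steps/epoch={steps_per_epoch}, total={total_steps}, warmup={warmup_steps}")
```

Under these assumptions the peak learning rate is reached at the end of warmup, roughly two epochs into the 20-epoch run.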
+### Sampling Strategy
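+Each epoch draws a new class-stratified sample with a different random seed, capping the three largest classes at 100,000 queries and the two smallest at 30,000: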
+| Class | Samples per epoch |
+|-------|-------------------|
+| very_low | 100,000 |
+| low | 100,000 |
+| medium | 100,000 |
+| high | 30,000 |
+| very_high | 30,000 |
+**Total per epoch:** 324,000 train / 36,000 validation
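One way to implement per-epoch resampling with these quotas is sketched below. Several details are assumptions rather than facts from the card: the seed is derived as `base_seed + epoch`, the 90/10 train/validation split is taken after sampling, and a class whose pool is smaller than its quota is sampled with replacement (the dataset table lists roughly 18K unique `very_high` queries against a 30,000 quota, and the card does not say how that was handled). `sample_epoch` and the toy pools are hypothetical.

```python
import random

QUOTAS = {
    "very_low": 100_000,
    "low": 100_000,
    "medium": 100_000,
    "high": 30_000,
    "very_high": 30_000,
}


def sample_epoch(pools, quotas, epoch, base_seed=0, val_fraction=0.1):
    """Draw one epoch's (query, label) rows and split them into train/validation."""
    rng = random.Random(base_seed + epoch)  # a different seed for every epoch
    rows = []
    for label, quota in quotas.items():
        pool = pools[label]
        if quota <= len(pool):
            picked = rng.sample(pool, quota)      # without replacement
        else:
            picked = rng.choices(pool, k=quota)   # assumption: with replacement
        rows.extend((query, label) for query in picked)
    rng.shuffle(rows)
    n_val = round(len(rows) * val_fraction)
    return rows[n_val:], rows[:n_val]


# Toy demonstration with quotas scaled down by 10,000 (10/10/10/3/3).
toy_pools = {label: [f"{label} query {i}" for i in range(50)] for label in QUOTAS}
toy_quotas = {label: quota // 10_000 for label, quota in QUOTAS.items()}
train_rows, val_rows = sample_epoch(toy_pools, toy_quotas, epoch=0)
```

With the full quotas the same arithmetic gives 360,000 rows per epoch, which a 10% validation split divides into the 324,000 / 36,000 totals stated above.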
+### Hardware
+- **GPU:** NVIDIA GeForce RTX 4090 (24 GB)
+- **RAM:** 128 GB
+- **OS:** Windows 11
+- **Training time:** ~2 hours 16 minutes
+- **Framework:** PyTorch + Transformers 4.57.1
+### Dataset
+[Amazon Shopping Queries (AmazonQAC)](https://huggingface.co/datasets/amazon/AmazonQAC): 395.5 million sessions, 39.6 million unique queries. Volume classes are derived from raw occurrence counts across sessions.
+| Class | Unique Queries |
+|-------|---------------|
+| very_high | ~18K |
+| high | ~30K |
+| medium | ~321K |
+| low | ~4.6M |
+| very_low | ~34.7M |
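The skew in this table is easier to see as shares of the total. The snippet below uses the rounded counts from the table, so the percentages it prints are approximate:

```python
# Approximate unique-query counts per class, as rounded in the table above.
UNIQUE_QUERIES = {
    "very_high": 18_000,
    "high": 30_000,
    "medium": 321_000,
    "low": 4_600_000,
    "very_low": 34_700_000,
}

total_unique = sum(UNIQUE_QUERIES.values())
shares = {label: count / total_unique for label, count in UNIQUE_QUERIES.items()}

for label, share in shares.items():
    print(f"{label:>10s}  {share:8.3%}")
```

A classifier trained on the raw distribution would see `very_low` in the large majority of examples, which is the usual motivation for class-capped sampling of the kind described in the sampling section.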
+## What the Model Learns
+The model captures semantic patterns rather than surface-level features like query length:
+- **Brand recognition:** "airpods" → very high, regardless of character count
+- **Category head terms:** "laptop", "headphones", "dog food" → recognized as high-volume entry points
+- **Specificity markers:** Size specs, compatibility constraints, and material callouts signal niche demand
+- **Nonsense detection:** Gibberish queries like "blorf" and "wireless blorf adapter" are correctly classified as very low volume, confirming the model isn't just counting characters
+## Limitations
+- Trained exclusively on Amazon product search queries — may not generalize well to Google web search, informational queries, or non-English markets
+- The `low` volume class is the weakest (F1 ≈ 0.40), reflecting genuine ambiguity in the boundary between medium and very low volume queries
+- Volume thresholds are based on the AmazonQAC dataset's session counts, which may not map directly to other volume scales (e.g. Google Keyword Planner)
+- Product trends shift over time; queries that were high volume in the training data may not remain so
+## Citation
+```bibtex
+@article{petrovic2026querylength,
+  title={Is Query Length a Reliable Predictor of Search Volume?},
+  author={Petrovic, Dan},
+  year={2026},
+  month={March},
+  url={https://dejan.ai/blog/query-length-vs-volume/}
+}
+```
+## Author
+**Dan Petrovic** — [DEJAN AI](https://dejan.ai/)