---
license: mit
library_name: sklearn
tags:
- multi-label-classification
- movie-genre-classification
- tfidf
- svc
- huggingface
- mlops
- github-actions
- serverless
- pipeline
- tmdb
- sklearn
datasets:
- tmdb
model_name: TMDB Multi-Label Genre Classifier
author: Arjun Varma
language:
- en
pretty_name: TMDB Movie Genre Classifier
task:
- text-classification
- multi-label-classification
---
# 🎬 TMDB Multi-Label Movie Genre Classifier
_Serverless Machine Learning Pipeline — TF-IDF + Linear SVC — Fully Automated & Deployed_
---
## Summary
This project demonstrates the ability to **design, automate, and deploy a real-world Machine Learning system** without relying on paid cloud services.
It showcases strong understanding and application of:
- **MLOps & CI/CD**
- **Automated retraining & scheduled jobs**
- **Model deployment & UI interface**
- **Testing, documentation, reproducibility**
The model predicts **multiple genres** for a movie based on its description — similar to how streaming platforms tag content for recommendations.
➡ Live Demo: https://huggingface.co/spaces/arjun-varma/tmdb-genre-classifier
➡ Model Hub: https://huggingface.co/arjun-varma/tmdb-genre-classifier
---
## 🧠 Problem — Why Multi‑Label Classification?
Movies are **not mutually exclusive**:
| Plot Summary | Correct Genres |
|-------------|----------------|
| Soldier returns from war, struggling with trauma | Drama, War |
| AI becomes sentient and turns against creators | Sci‑Fi, Thriller |
| A musician finds love on tour | Music, Romance |
Single‑label classifiers fail here.
Multi‑label learning predicts **all genres that simultaneously apply**.
This creates challenges:
- Soft labels
- Ambiguity
- Genre co‑occurrence patterns
- Long‑tail imbalance (Documentary vs Thriller vs Music)
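
To make the label structure concrete, here is a small sketch in plain Python that turns the three example rows from the table above into multi-hot vectors and counts how often genres and genre pairs occur. The variable names are illustrative and not taken from the project code; on the real TMDB data the same counts would expose the co-occurrence and long-tail patterns listed above.

```python
from collections import Counter
from itertools import combinations

# Genre lists for the three example plots in the table above.
labels = [
    ["Drama", "War"],
    ["Sci-Fi", "Thriller"],
    ["Music", "Romance"],
]

# Fixed, sorted label vocabulary so every movie maps to a vector of equal length.
classes = sorted({genre for row in labels for genre in row})

# Multi-hot encoding: one column per genre, 1 if the genre applies to the movie.
multi_hot = [[int(genre in row) for genre in classes] for row in labels]

# Per-genre frequency (reveals long-tail imbalance on real data).
genre_counts = Counter(genre for row in labels for genre in row)

# Pair frequency (reveals which genres tend to co-occur).
pair_counts = Counter(
    pair for row in labels for pair in combinations(sorted(row), 2)
)

print(classes)
print(multi_hot)
print(pair_counts.most_common(3))
```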
---
## 🧱 Architecture — Serverless ML Pipeline

No AWS SageMaker, no GCP Vertex AI.
**Infrastructure cost = $0**
---
## ⚙️ Model — Why TF‑IDF + Linear SVC vs Transformers?
| Choice | Reason |
|--------|-------|
| Transformers | Expensive & slow for nightly retraining |
| Neural Networks | Need GPUs / infra |
| Logistic Regression | High precision, low recall |
| **Linear SVC + TF‑IDF** | Fast, scalable, interpretable 👈 Best for pipeline |
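The biggest improvement came from switching classifiers:
- Logistic Regression predicted almost nothing (micro recall 0.006 in the metrics table), playing it “safe” at the cost of coverage
- Linear SVC learned margin-based decision boundaries → better multi‑genre recall
- Applying a sigmoid to the SVC decision scores and thresholding the result → configurable precision/recall trade‑off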
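
The following is a minimal sketch of the TF‑IDF + Linear SVC + sigmoid-threshold idea, not the project's actual training code. It assumes a one-vs-rest wrapper around `LinearSVC`, which this card does not confirm, and trains on the three example plots from the problem section as toy data. `predict_genres` is a hypothetical helper name. Note that the sigmoid of an SVC margin is an uncalibrated score rather than a true probability, so the threshold is best tuned on validation data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy training data: the three example rows from the problem section.
plots = [
    "Soldier returns from war, struggling with trauma",
    "AI becomes sentient and turns against creators",
    "A musician finds love on tour",
]
genres = [["Drama", "War"], ["Sci-Fi", "Thriller"], ["Music", "Romance"]]

# Encode genre lists as a binary indicator matrix (one column per genre).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

# Turn plot text into sparse TF-IDF features.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(plots)

# Assumption: one independent LinearSVC per genre via one-vs-rest.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, Y)


def predict_genres(text, threshold=0.25):
    """Return (genre, score) pairs with score >= threshold, highest first."""
    margins = clf.decision_function(vectorizer.transform([text]))[0]
    scores = 1.0 / (1.0 + np.exp(-margins))  # squash margins into (0, 1)
    ranked = sorted(zip(mlb.classes_, scores), key=lambda p: p[1], reverse=True)
    return [(genre, float(score)) for genre, score in ranked if score >= threshold]


print(predict_genres("A soldier struggles with trauma after the war"))
```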
---
## 📊 Performance Metrics
| Model | Precision_micro | Recall_micro | F1_macro | Result |
|------|----------------|--------------|----------|--------|
| Logistic Regression | 0.83 | 0.006 | ~0.03 | Almost no predictions |
| **Linear SVC + threshold 0.25** | 0.16 | **0.99** | **0.27** | Usable predictions |
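Interpretation:
- At threshold 0.25, Linear SVC recovers nearly every true genre (recall 0.99), but a precision of 0.16 means most of the genres it predicts are wrong
- Raising the threshold trades recall for precision, so *different applications can choose their own operating point*
If this were powering **recommendations**, the threshold would need to be tuned for that use case.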
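
The effect of the threshold on micro-averaged precision and recall can be sketched in plain Python. The labels and scores below are made-up toy values chosen only to show the mechanics; they are not outputs of this model, and `micro_precision_recall` is an illustrative helper name.

```python
def micro_precision_recall(y_true, scores, threshold):
    """Micro-averaged precision and recall: pool TP/FP/FN over all labels."""
    tp = fp = fn = 0
    for true_row, score_row in zip(y_true, scores):
        for truth, score in zip(true_row, score_row):
            predicted = score >= threshold
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (assumed for illustration): 2 movies, 3 genres.
y_true = [[1, 0, 1], [0, 1, 0]]
scores = [[0.9, 0.4, 0.3], [0.2, 0.8, 0.35]]

for threshold in (0.25, 0.5):
    precision, recall = micro_precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```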
---
## 🧪 Testing
This project includes:
- Unit tests for vectorization & data transformation
- Mocked API tests for dataset ingestion
- End‑to‑end pipeline test verifying artifacts & metrics
Tools used:
- `pytest`
- `monkeypatch`
- `tmp_path`
- GitHub Actions CI
This demonstrates **reliability in automation-focused ML environments**.
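
As an illustration of how `monkeypatch` and `tmp_path` combine in a mocked ingestion test, here is a sketch under stated assumptions. `fetch_movies`, `save_movies`, `FakeResponse`, the URL, and the payload shape are all hypothetical stand-ins, not the project's real ingestion code. When run under `pytest`, the two fixtures are injected automatically.

```python
import json
import urllib.request
from pathlib import Path


def fetch_movies(url: str) -> list[dict]:
    """Hypothetical ingestion helper: download JSON and return its movie list."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)["results"]


def save_movies(movies: list[dict], out_dir: Path) -> Path:
    """Hypothetical artifact writer: dump the movies to a JSON file."""
    out_path = out_dir / "movies.json"
    out_path.write_text(json.dumps(movies))
    return out_path


class FakeResponse:
    """Stand-in for the HTTP response so the test never touches the network."""

    def __init__(self, payload):
        self._body = json.dumps(payload).encode()

    def read(self, *args):
        return self._body

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


def test_ingestion_writes_artifact(monkeypatch, tmp_path):
    # Assumed payload shape, for illustration only.
    payload = {"results": [{"title": "Example", "overview": "A plot.", "genres": ["Drama"]}]}
    # Replace the real network call with a function returning the fake response.
    monkeypatch.setattr(urllib.request, "urlopen", lambda url: FakeResponse(payload))

    movies = fetch_movies("https://example.invalid/movies")
    out_path = save_movies(movies, tmp_path)

    # The artifact written to the temporary directory matches the mocked data.
    assert json.loads(out_path.read_text()) == payload["results"]
```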
---
## 🖥 Demo & Integration
| Component | Link |
|----------|------|
| 🔥 Live App (HF Space) | [link](https://arjun-varma-tmdb-genre-demo.hf.space/?__theme=system&deep_link=uuDed8RzLJI) |
| 📁 GitHub repo | [link](https://github.com/ArjunXvarma/Serverless-ML-pipeline.git) |
The model provides:
- ⭐ Ranked genre probabilities
- ⭐ Adjustable confidence threshold
- ⭐ Real‑time inference
---
## 🚀 Future Enhancements
| Idea | Value |
|-----|------|
| Compare vs MiniLM Transformer | Benchmark credibility |
| Add FastAPI inference service | Deployable microservice |
| Visualize confidence & confusion | Explainable AI |
---
## ✍ Author
**Arjun Varma**
Machine Learning Engineer & Systems Developer
Designed for real-world ML infrastructure readiness.
---