---
license: mit
library_name: sklearn
tags:
- multi-label-classification
- movie-genre-classification
- tfidf
- svc
- huggingface
- mlops
- github-actions
- serverless
- pipeline
- tmdb
- sklearn
datasets:
- tmdb
model_name: TMDB Multi-Label Genre Classifier
author: Arjun Varma
language:
- en
pretty_name: TMDB Movie Genre Classifier
task:
- text-classification
- multi-label-classification
---

# 🎬 TMDB Multi-Label Movie Genre Classifier  
_Serverless Machine Learning Pipeline — TF-IDF + Linear SVC — Fully Automated & Deployed_

---

## Summary

This project demonstrates the ability to **design, automate, and deploy a real-world Machine Learning system** without relying on paid cloud services.

It showcases strong understanding and application of:
- **MLOps & CI/CD**
- **Automated retraining & scheduled jobs**
- **Model deployment & UI interface**
- **Testing, documentation, reproducibility**

The model predicts **multiple genres** for a movie based on its description — similar to how streaming platforms tag content for recommendations.

➡ Live Demo: https://huggingface.co/spaces/arjun-varma/tmdb-genre-classifier  
➡ Model Hub: https://huggingface.co/arjun-varma/tmdb-genre-classifier  

---

## 🧠 Problem — Why Multi‑Label Classification?

Movie genres are **not mutually exclusive**:

| Plot Summary | Correct Genres |
|-------------|----------------|
| Soldier returns from war, struggling with trauma | Drama, War |
| AI becomes sentient and turns against creators | Sci‑Fi, Thriller |
| A musician finds love on tour | Music, Romance |

A single‑label classifier can assign only one genre per movie, so it fails here.  
Multi‑label learning predicts **all genres that apply at the same time**.

This creates challenges:
- Soft labels
- Ambiguity
- Genre co‑occurrence patterns
- Long‑tail imbalance (Documentary vs Thriller vs Music)
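
To make the multi-label framing concrete, here is a minimal sketch of how the genre lists from the table above could be turned into a binary indicator matrix, with one column per genre. It assumes scikit-learn's `MultiLabelBinarizer`; the encoding step the project actually uses is not described here, so treat this as an illustration only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Genre lists taken from the example table; each movie carries several labels.
genres = [
    ["Drama", "War"],
    ["Sci-Fi", "Thriller"],
    ["Music", "Romance"],
]

mlb = MultiLabelBinarizer()
# Y has shape (n_movies, n_genres); Y[i, j] == 1 means movie i has genre j.
Y = mlb.fit_transform(genres)

print(list(mlb.classes_))
print(Y)
```

Each row can contain several ones, which is exactly what a single-label target cannot express.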

---

## 🧱 Architecture — Serverless ML Pipeline

![pipeline-architecture](pipeline-architecture.png)

No AWS SageMaker, no GCP Vertex AI.  
**Infrastructure cost = $0**

---

## ⚙️ Model — Why TF‑IDF + Linear SVC vs Transformers?

| Choice | Reason |
|--------|-------|
| Transformers | Expensive & slow for nightly retraining |
| Neural Networks | Need GPUs / infra |
| Logistic Regression | High precision, low recall |
| **Linear SVC + TF‑IDF** | Fast, scalable, interpretable 👈 Best for pipeline |

The biggest improvement came from switching classifiers:
- Logistic Regression predicted almost nothing → it stayed “safe” and rarely assigned a genre
- Linear SVC learned margin-based decision boundaries → better multi‑genre recall
- Applying a sigmoid and a threshold to the scores → configurable precision/recall trade‑off
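
The sigmoid-plus-threshold idea can be sketched as follows. This is not the project's training code: the toy texts, the one-vs-rest wrapper, and the `predict_genres` helper are assumptions made for illustration. Note that a sigmoid over raw SVC margins gives a score between 0 and 1, not a calibrated probability.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data for illustration only; the real pipeline trains on TMDB descriptions.
texts = [
    "A soldier returns from war and struggles with trauma",
    "An AI becomes sentient and turns against its creators",
    "A musician finds love while on tour",
    "A war veteran writes songs about the battlefield",
    "A detective hunts a killer through a futuristic city",
    "Two rival singers fall in love",
]
genres = [
    ["Drama", "War"],
    ["Sci-Fi", "Thriller"],
    ["Music", "Romance"],
    ["Music", "War"],
    ["Sci-Fi", "Thriller"],
    ["Music", "Romance"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

# One binary LinearSVC per genre on top of TF-IDF features.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, Y)


def predict_genres(text, threshold=0.25):
    """Hypothetical helper: return (genre, score) pairs at or above the threshold."""
    margins = model.decision_function([text])[0]  # one signed margin per genre
    scores = 1.0 / (1.0 + np.exp(-margins))       # squash margins into (0, 1)
    ranked = sorted(zip(mlb.classes_, scores), key=lambda p: p[1], reverse=True)
    return [(g, float(s)) for g, s in ranked if s >= threshold]


print(predict_genres("A soldier struggles after the war", threshold=0.25))
```

Lowering `threshold` returns more genres (higher recall, lower precision); raising it returns fewer.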

---

## 📊 Performance Metrics

| Model | Precision_micro | Recall_micro | F1_macro | Result |
|------|----------------|--------------|----------|--------|
| Logistic Regression | 0.83 | 0.006 | ~0.03 | Almost no predictions |
| **Linear SVC + threshold 0.25** | 0.16 | **0.99** | **0.27** | Usable predictions |

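Interpretation:
- At threshold 0.25 the Linear SVC recovers almost every true genre (recall 0.99), but precision is only 0.16, so it also predicts many genres that do not apply
- Logistic Regression is precise (0.83) but assigns almost no genres (recall 0.006)
- The threshold lets *different applications choose their own precision/recall balance*

If this were powering **recommendations**, the threshold choice would matter.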
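
As a rough illustration of how the three metrics in the table respond to the threshold, the sketch below evaluates a small set of made-up scores with scikit-learn. The arrays are hypothetical and are not the project's data; only the metric definitions carry over.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground truth and scores for 4 movies x 3 genres (not project data).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
scores = np.array(
    [[0.9, 0.3, 0.6], [0.4, 0.8, 0.3], [0.7, 0.5, 0.3], [0.3, 0.4, 0.9]]
)


def evaluate(threshold):
    """Binarize the scores at `threshold` and compute the table's metrics."""
    y_pred = (scores >= threshold).astype(int)
    return {
        "precision_micro": precision_score(y_true, y_pred, average="micro", zero_division=0),
        "recall_micro": recall_score(y_true, y_pred, average="micro", zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }


for t in (0.25, 0.5, 0.75):
    print(t, evaluate(t))
```

On this toy data a low threshold predicts every genre for every movie, so recall is perfect while precision drops; a high threshold reverses the balance.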

---

## 🧪 Testing

This project includes:
- Unit tests for vectorization & data transformation
- Mocked API tests for dataset ingestion
- End‑to‑end pipeline test verifying artifacts & metrics

Tools used:
- `pytest`
- `monkeypatch`
- `tmp_path`
- GitHub Actions CI

This demonstrates **reliability in automation-focused ML environments**.
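
A mocked ingestion test in this style might look like the sketch below. The `fetch_movies` and `save_dataset` helpers, the URL, and the payload shape are all hypothetical stand-ins, not the project's real functions; only the `monkeypatch` and `tmp_path` pytest fixtures come from the tool list above.

```python
import json
import urllib.request
from pathlib import Path


def fetch_movies(url):
    """Hypothetical ingestion helper: download a JSON list of movies."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def save_dataset(movies, out_dir):
    """Hypothetical helper: write the movies to a JSON file and return its path."""
    out_path = Path(out_dir) / "movies.json"
    out_path.write_text(json.dumps(movies), encoding="utf-8")
    return out_path


class _FakeResponse:
    """Minimal stand-in for the object returned by urlopen."""

    def __init__(self, payload):
        self._payload = payload

    def read(self):
        return json.dumps(self._payload).encode("utf-8")

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


def test_ingestion_writes_dataset(monkeypatch, tmp_path):
    fake = [{"overview": "Soldier returns from war", "genres": ["Drama", "War"]}]
    # Replace the network call so the test never touches a real API.
    monkeypatch.setattr(urllib.request, "urlopen", lambda url: _FakeResponse(fake))

    movies = fetch_movies("https://example.invalid/movies")
    path = save_dataset(movies, tmp_path)  # tmp_path is a per-test temp directory

    assert json.loads(path.read_text(encoding="utf-8")) == fake
```

Running `pytest` on a file containing this code would collect `test_ingestion_writes_dataset` and supply both fixtures automatically.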

---

## 🖥 Demo & Integration

| Component | Link |
|----------|------|
| 🔥 Live App (HF Space) | [link](https://arjun-varma-tmdb-genre-demo.hf.space/?__theme=system&deep_link=uuDed8RzLJI) |
| 📁 GitHub repo | [link](https://github.com/ArjunXvarma/Serverless-ML-pipeline.git) |

The model provides:
- ⭐ Ranked genre probabilities
- ⭐ Adjustable confidence threshold
- ⭐ Real‑time inference

---

## 🚀 Future Enhancements

| Idea | Value |
|-----|------|
| Compare vs MiniLM Transformer | Benchmark credibility |
| Add FastAPI inference service | Deployable microservice |
| Visualize confidence & confusion | Explainable AI |

---

## ✍ Author

**Arjun Varma**  
Machine Learning Engineer & Systems Developer  
Designed for real-world ML infrastructure readiness.

---