---
license: mit
library_name: sklearn
tags:
- multi-label-classification
- movie-genre-classification
- tfidf
- svc
- huggingface
- mlops
- github-actions
- serverless
- pipeline
- tmdb
- sklearn
datasets:
- tmdb
model_name: TMDB Multi-Label Genre Classifier
author: Arjun Varma
language:
- en
pretty_name: TMDB Movie Genre Classifier
task:
- text-classification
- multi-label-classification
---

# 🎬 TMDB Multi-Label Movie Genre Classifier  
_Serverless Machine Learning Pipeline — TF-IDF + Linear SVC — Fully Automated & Deployed_

---

## Summary

This project demonstrates the ability to **design, automate, and deploy a real-world Machine Learning system** without relying on paid cloud services.

It showcases strong understanding and application of:
- **MLOps & CI/CD**
- **Automated retraining & scheduled jobs**
- **Model deployment & UI interface**
- **Testing, documentation, reproducibility**

The model predicts **multiple genres** for a movie from its plot description, similar to how streaming platforms tag content for recommendations.

➡ Live Demo: https://huggingface.co/spaces/arjun-varma/tmdb-genre-classifier  
➡ Model Hub: https://huggingface.co/arjun-varma/tmdb-genre-classifier  

---

## 🧠 Problem — Why Multi‑Label Classification?

Movie genres are **not mutually exclusive**:

| Plot Summary | Correct Genres |
|-------------|----------------|
| Soldier returns from war, struggling with trauma | Drama, War |
| AI becomes sentient and turns against creators | Sci‑Fi, Thriller |
| A musician finds love on tour | Music, Romance |

Single‑label classifiers fail here.  
Multi‑label learning predicts **all genres that simultaneously apply**.

This creates challenges:
- Soft labels
- Ambiguity
- Genre co‑occurrence patterns
- Long‑tail imbalance (Documentary vs Thriller vs Music)
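A multi‑label target like the one in the table above is commonly stored as a multi‑hot matrix: one row per movie, one column per genre. The sketch below is illustrative only; `to_multi_hot` is a hypothetical helper, not a function from this repository, and the three rows are the toy examples from the table.

```python
def to_multi_hot(genre_lists):
    """Turn lists of genre names into (sorted class names, multi-hot rows)."""
    classes = sorted({g for genres in genre_lists for g in genres})
    index = {g: i for i, g in enumerate(classes)}
    rows = []
    for genres in genre_lists:
        row = [0] * len(classes)
        for g in genres:
            row[index[g]] = 1  # a movie may switch on several columns at once
        rows.append(row)
    return classes, rows


# Toy labels taken from the table above (assumed spelling of genre names).
labels = [["Drama", "War"], ["Sci-Fi", "Thriller"], ["Music", "Romance"]]
classes, Y = to_multi_hot(labels)

# Per-genre counts make label imbalance visible on a real dataset.
counts = {g: sum(row[i] for row in Y) for i, g in enumerate(classes)}
print(classes)
print(Y)
print(counts)
```

A single‑label classifier would have to pick exactly one column per row, which is why it cannot represent these targets.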

---

## 🧱 Architecture — Serverless ML Pipeline

![pipeline-architecture](pipeline-architecture.png)

No AWS SageMaker, no GCP Vertex AI.  
**Infrastructure cost = $0**

---

## ⚙️ Model — Why TF‑IDF + Linear SVC vs Transformers?

| Choice | Reason |
|--------|-------|
| Transformers | Expensive & slow for nightly retraining |
| Neural Networks | Need GPUs / infra |
| Logistic Regression | High precision, low recall |
| **Linear SVC + TF‑IDF** | Fast, scalable, interpretable 👈 Best for pipeline |

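The biggest improvement came from switching classifiers:
- Logistic Regression predicted almost nothing, staying “safe” by rarely assigning any genre
- Linear SVC learned margin-based decision boundaries, which gave better multi‑genre recall
- Applying a sigmoid to the SVC decision scores and thresholding the result gives a configurable precision/recall trade‑off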
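The sigmoid‑plus‑threshold step can be sketched with scikit‑learn as follows. This is a minimal illustration under assumptions, not the repository's training code: the one‑vs‑rest wrapper, the default hyperparameters, the toy training texts, and the `predict_genres` helper are all chosen for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data for illustration only; the real pipeline trains on TMDB overviews.
overviews = [
    "A soldier returns from war and struggles with trauma.",
    "An AI becomes sentient and turns against its creators.",
    "A musician finds love while on tour.",
    "A detective hunts a killer through a futuristic city.",
]
genres = [["Drama", "War"], ["Sci-Fi", "Thriller"],
          ["Music", "Romance"], ["Sci-Fi", "Thriller"]]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(genres)  # multi-hot matrix, one column per genre

# One binary LinearSVC per genre on top of shared TF-IDF features (assumed setup).
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(overviews, Y)


def predict_genres(texts, threshold=0.25):
    """Return, per text, (genre, score) pairs at or above the threshold, highest first."""
    margins = model.decision_function(texts)  # signed margins, shape (n_texts, n_genres)
    scores = 1.0 / (1.0 + np.exp(-margins))   # sigmoid maps margins into (0, 1)
    results = []
    for row in scores:
        ranked = sorted(zip(binarizer.classes_, row),
                        key=lambda pair: pair[1], reverse=True)
        results.append([(g, float(s)) for g, s in ranked if s >= threshold])
    return results


print(predict_genres(["A robot falls in love with a singer."], threshold=0.25))
```

The sigmoid outputs here are uncalibrated scores rather than true probabilities. Since the sigmoid of 0 is 0.5, a threshold of 0.25 accepts genres whose margin is as low as about -1.1, which favors recall over precision.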

---

## 📊 Performance Metrics

| Model | Precision (micro) | Recall (micro) | F1 (macro) | Result |
|------|----------------|--------------|----------|--------|
| Logistic Regression | 0.83 | 0.006 | ~0.03 | Almost no predictions |
| **Linear SVC + threshold 0.25** | 0.16 | **0.99** | **0.27** | Usable predictions |

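Interpretation:
- Recall of 0.99 means the model almost never misses a true genre at threshold 0.25, but precision of 0.16 means most of the genres it predicts at that threshold are wrong
- The threshold lets *different applications choose their own precision/recall balance*

If this were powering **recommendations**, the threshold choice would matter.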
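One way to choose a threshold is to sweep several values and compare the same metrics reported in the table. The sketch below assumes scikit‑learn's metric functions; `sweep_thresholds` is a hypothetical helper, and the labels and scores are made‑up toy values, not results from this model.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score


def sweep_thresholds(y_true, scores, thresholds):
    """Compute micro precision/recall and macro F1 at each threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    rows = []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)  # binarize scores at this threshold
        rows.append({
            "threshold": t,
            "precision_micro": precision_score(y_true, y_pred, average="micro", zero_division=0),
            "recall_micro": recall_score(y_true, y_pred, average="micro", zero_division=0),
            "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
        })
    return rows


# Toy multi-hot labels and sigmoid scores for 4 movies x 3 genres (illustrative only).
toy_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
toy_scores = [[0.9, 0.3, 0.6], [0.2, 0.8, 0.3], [0.7, 0.4, 0.1], [0.3, 0.2, 0.55]]

for row in sweep_thresholds(toy_true, toy_scores, [0.25, 0.5, 0.75]):
    print(row)
```

On the toy data, raising the threshold lowers recall while precision rises or holds, the same trade‑off the table shows between the two models.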

---

## 🧪 Testing

This project includes:
- Unit tests for vectorization & data transformation
- Mocked API tests for dataset ingestion
- End‑to‑end pipeline test verifying artifacts & metrics

Tools used:
- `pytest`
- `monkeypatch`
- `tmp_path`
- GitHub CI

This demonstrates **reliability in automation-focused ML environments**.
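A mocked ingestion test in this style might look like the sketch below. `TMDBClient` and `ingest` are hypothetical stand‑ins written for this example, not the repository's actual classes or functions; only the `monkeypatch` and `tmp_path` fixtures are real pytest features.

```python
import json
from pathlib import Path


class TMDBClient:
    """Hypothetical API wrapper; the real method would call the TMDB HTTP API."""

    def fetch_page(self, page):
        raise RuntimeError("network access is not allowed in tests")


def ingest(client, out_dir, pages=1):
    """Fetch `pages` pages of movies and write them to out_dir/movies.json."""
    rows = []
    for page in range(1, pages + 1):
        rows.extend(client.fetch_page(page))
    out_path = Path(out_dir) / "movies.json"
    out_path.write_text(json.dumps(rows))
    return out_path


def test_ingest_writes_fetched_movies(monkeypatch, tmp_path):
    fake_rows = [{"overview": "A musician finds love on tour.",
                  "genres": ["Music", "Romance"]}]
    # Replace the network call with a canned response for every page.
    monkeypatch.setattr(TMDBClient, "fetch_page", lambda self, page: fake_rows)

    out_path = ingest(TMDBClient(), tmp_path, pages=2)

    # tmp_path is a fresh per-test directory, so artifacts never leak between tests.
    assert out_path.exists()
    assert json.loads(out_path.read_text()) == fake_rows * 2
```

Saved in a file whose name starts with `test_`, this runs with plain `pytest`; no network access or API key is needed because the fetch method is patched.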

---

## 🖥 Demo & Integration

| Component | Link |
|----------|------|
| 🔥 Live App (HF Space) | [link](https://arjun-varma-tmdb-genre-demo.hf.space/?__theme=system&deep_link=uuDed8RzLJI) |
| 📁 GitHub repo | [link](https://github.com/ArjunXvarma/Serverless-ML-pipeline.git) |

The model provides:
- ⭐ Ranked genre probabilities
- ⭐ Adjustable confidence threshold
- ⭐ Real‑time inference

---

## 🚀 Future Enhancements

| Idea | Value |
|-----|------|
| Compare vs MiniLM Transformer | Benchmark credibility |
| Add FastAPI inference service | Deployable microservice |
| Visualize confidence & confusion | Explainable AI |

---

## ✍ Author

**Arjun Varma**  
Machine Learning Engineer & Systems Developer  
Designed for real-world ML infrastructure readiness.

---