---
license: mit
library_name: sklearn
tags:
- multi-label-classification
- movie-genre-classification
- tfidf
- svc
- huggingface
- mlops
- github-actions
- serverless
- pipeline
- tmdb
- sklearn
datasets:
- tmdb
model_name: TMDB Multi-Label Genre Classifier
author: Arjun Varma
language:
- en
pretty_name: TMDB Movie Genre Classifier
task:
- text-classification
- multi-label-classification
---

# 🎬 TMDB Multi-Label Movie Genre Classifier

_Serverless Machine Learning Pipeline · TF-IDF + Linear SVC · Fully Automated & Deployed_

---

## Summary

This project demonstrates the ability to **design, automate, and deploy a real-world Machine Learning system** without relying on paid cloud services.

It showcases strong understanding and application of:

- **MLOps & CI/CD**
- **Automated retraining & scheduled jobs**
- **Model deployment & UI interface**
- **Testing, documentation, reproducibility**

The model predicts **multiple genres** for a movie based on its description, similar to how streaming platforms tag content for recommendations.

➡ Live Demo: https://huggingface.co/spaces/arjun-varma/tmdb-genre-classifier

➡ Model Hub: https://huggingface.co/arjun-varma/tmdb-genre-classifier

---

## 🧠 Problem: Why Multi‑Label Classification?

Movies are **not mutually exclusive**:

| Plot Summary | Correct Genres |
|-------------|----------------|
| Soldier returns from war, struggling with trauma | Drama, War |
| AI becomes sentient and turns against creators | Sci‑Fi, Thriller |
| A musician finds love on tour | Music, Romance |

Single‑label classifiers fail here.
Multi‑label learning predicts **all genres that simultaneously apply**.

This creates challenges:

- Soft labels
- Ambiguity
- Genre co‑occurrence patterns
- Long‑tail imbalance (Documentary vs Thriller vs Music)

---

## 🧱 Architecture: Serverless ML Pipeline

![pipeline-architecture](pipeline-architecture.png)

No AWS SageMaker, no GCP Vertex AI.
**Infrastructure cost = $0**

---

## ⚙️ Model: Why TF‑IDF + Linear SVC vs Transformers?

| Choice | Reason |
|--------|-------|
| Transformers | Expensive & slow for nightly retraining |
| Neural Networks | Need GPUs / infra |
| Logistic Regression | High precision, low recall |
| **Linear SVC + TF‑IDF** | Fast, scalable, interpretable 👈 Best for pipeline |

The biggest improvement:

- Logistic Regression predicted almost no genres, playing it "safe" at the cost of recall
- Linear SVC learned boundary margins → better multi‑genre recall
- Passing the SVC scores through a sigmoid and applying a threshold → configurable precision/recall trade‑off

---

## 📊 Performance Metrics

| Model | Precision (micro) | Recall (micro) | F1 (macro) | Result |
|------|----------------|--------------|----------|--------|
| Logistic Regression | 0.83 | 0.006 | ~0.03 | Almost no predictions |
| **Linear SVC + threshold 0.25** | 0.16 | **0.99** | **0.27** | Usable predictions |

Interpretation:

- High recall means the model rarely misses a genre that applies; the low precision (0.16) means it also predicts many genres that do not apply
- The threshold lets *different applications choose their own precision/recall balance*

If this were powering **recommendations**, the choice of threshold would matter: raising it typically trades recall for precision.

---

## 🧪 Testing

This project includes:

- Unit tests for vectorization & data transformation
- Mocked API tests for dataset ingestion
- End‑to‑end pipeline test verifying artifacts & metrics

Tools used:

- `pytest`
- `monkeypatch`
- `tmp_path`
- GitHub CI

This demonstrates **reliability in automation-focused ML environments**.

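
As an illustration of both the sigmoid-plus-threshold step described in the model section and the plain-assert test style that `pytest` collects, here is a minimal sketch. The helper names `sigmoid` and `rank_genres`, the example scores, and the test itself are hypothetical and are not taken from the repository. The sketch assumes one raw decision score per genre (for example, one row of `LinearSVC.decision_function` output paired with genre names). Note that a sigmoid over raw SVC scores gives a confidence-like value in (0, 1), not a calibrated probability.

```python
import math


def sigmoid(score: float) -> float:
    """Squash a raw decision score into (0, 1) in a numerically stable way."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


def rank_genres(scores: dict[str, float], threshold: float = 0.25) -> list[tuple[str, float]]:
    """Return (genre, confidence) pairs at or above `threshold`, highest first.

    `scores` is a hypothetical mapping from genre name to raw decision score.
    """
    squashed = {genre: sigmoid(raw) for genre, raw in scores.items()}
    kept = [(genre, conf) for genre, conf in squashed.items() if conf >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def test_lower_threshold_keeps_more_genres():
    # Made-up decision scores for a single plot summary.
    scores = {"Drama": 1.2, "War": 0.1, "Comedy": -2.0}

    strict = rank_genres(scores, threshold=0.6)
    loose = rank_genres(scores, threshold=0.25)

    # A strict threshold keeps only the most confident genre.
    assert [genre for genre, _ in strict] == ["Drama"]
    # A looser threshold adds "War" but still drops "Comedy".
    assert [genre for genre, _ in loose] == ["Drama", "War"]
```

Saved in a `test_*.py` file, the test function would be picked up by `pytest` automatically. Lowering the threshold keeps more genres per movie, which raises recall and usually lowers precision.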

---

## 🖥 Demo & Integration

| Component | Link |
|----------|------|
| 🔥 Live App (HF Space) | [link](https://arjun-varma-tmdb-genre-demo.hf.space/?__theme=system&deep_link=uuDed8RzLJI) |
| 📁 GitHub repo | [link](https://github.com/ArjunXvarma/Serverless-ML-pipeline.git) |

The model provides:

- ⭐ Ranked genre probabilities
- ⭐ Adjustable confidence threshold
- ⭐ Real‑time inference

---

## 🚀 Future Enhancements

| Idea | Value |
|-----|------|
| Compare vs MiniLM Transformer | Benchmark credibility |
| Add FastAPI inference service | Deployable microservice |
| Visualize confidence & confusion | Explainable AI |

---

## ✍ Author

**Arjun Varma**

Machine Learning Engineer & Systems Developer

Designed for real-world ML infrastructure readiness.

---