---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- action-recognition
- human-action-classification
- image-classification
- computer-vision
- pose-estimation
- mediapipe
- stanford40
- resnet
- mobilenet
datasets:
- stanford40
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: image-classification
widget:
- src: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/person_cooking.jpg
  example_title: "Cooking"
- src: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/person_jumping.jpg
  example_title: "Jumping"
model-index:
- name: human-action-classification
  results:
  - task:
      type: image-classification
      name: Image Classification
    dataset:
      name: Stanford 40 Actions
      type: stanford40
    metrics:
    - type: accuracy
      value: 86.4
      name: Accuracy
      verified: false
    - type: f1
      value: 0.8618
      name: Macro F1-Score
      verified: false
---

# Human Action Classification v2.0

State-of-the-art human action recognition model trained on the Stanford 40 Actions dataset.

GitHub project: [human-action-classification](https://github.com/dronefreak/human-action-classification)

![Demo](looking_through_a_telescope.jpg)

## Model Description

This model performs real-time human action classification from images, recognizing 40 different human activities. It combines a ResNet34 backbone with optional MediaPipe pose estimation for enhanced accuracy.

- **Developed by:** Saumya Kumaar Saksena ([@dronefreak](https://github.com/dronefreak))
- **Model type:** Image Classification (Action Recognition)
- **Language(s):** English (action labels)
- **Finetuned from:** ImageNet-pretrained ResNet34

## Key Features

- 🎯 **86% accuracy** on the Stanford 40 Actions test set
- ⚡ **Real-time inference** (~25ms per image on a GTX 1050 Ti)
- 🎨 **Pose-aware**: optional MediaPipe integration
- 📦 **Easy to use** with a simple Python API
- 🔧 **Production-ready** with comprehensive evaluation metrics

## Model Variants

All models were trained on the Stanford 40 Actions dataset:

| Model | Accuracy | Macro F1 | Parameters | Size | Inference Time* |
|-------|----------|----------|------------|------|-----------------|
| **ResNet50** | **88.5%** | **0.8842** | 23.5M | 94MB | ~30ms |
| **ResNet34** (this model) | **86.4%** | **0.8618** | 21.3M | 85MB | ~25ms |
| ResNet18 | 82.3% | 0.8178 | 11.2M | 45MB | ~18ms |
| MobileNet V3 Large | 82.1% | 0.8169 | 5.4M | 20MB | ~15ms |
| ViT Base | 76.8% | 0.7650 | 86M | 330MB | ~45ms |
| MobileNet V3 Small | 74.35% | 0.7350 | 2.5M | 10MB | ~10ms |

\*Single image on an NVIDIA GTX 1050 Ti

### Detailed Performance Comparison

| Model | Accuracy (%) | Macro Precision | Macro Recall | Macro F1 | Weighted F1 |
|-------|--------------|-----------------|--------------|----------|-------------|
| ResNet50 | 88.5 | 0.8874 | 0.8850 | 0.8842 | 0.8842 |
| **ResNet34** | **86.4** | **0.8686** | **0.8640** | **0.8618** | **0.8618** |
| ResNet18 | 82.3 | 0.8211 | 0.8230 | 0.8178 | 0.8178 |
| MobileNet V3 Large | 82.1 | 0.8216 | 0.8210 | 0.8169 | 0.8169 |
| ViT Base Patch16 | 76.8 | 0.7774 | 0.7680 | 0.7650 | 0.7650 |
| MobileNet V3 Small | 74.35 | 0.7382 | 0.7435 | 0.7350 | 0.7350 |

**Trade-offs:**

- **ResNet50**: Best accuracy, but slower and larger
- **ResNet34**: Optimal balance of accuracy and speed ⭐
- **MobileNet V3 Large**: Best option for mobile/edge deployment
- **MobileNet V3 Small**: Fastest inference for resource-constrained devices
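If you would rather load the ResNet34 variant directly with torchvision than go through the `hac` package, a minimal sketch follows. It assumes the checkpoint (here the `resnet34_best.pth` file referenced in the Evaluation section below) stores a plain `state_dict` for a ResNet34 with a 40-class head; adjust the loading if your checkpoint is wrapped differently.

```python
import torch
from torchvision import models

# ResNet34 backbone with a 40-way head for the Stanford 40 classes.
model = models.resnet34(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 40)

# Load fine-tuned weights. The filename is illustrative; point it at the
# checkpoint you downloaded or trained yourself. This assumes the file is a
# plain state_dict; unwrap it first if it is stored inside a larger dict.
state_dict = torch.load("resnet34_best.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```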
## Supported Actions (40 Classes)

<details>
<summary>Click to expand full list</summary>

- applauding
- blowing_bubbles
- brushing_teeth
- cleaning_the_floor
- climbing
- cooking
- cutting_trees
- cutting_vegetables
- drinking
- feeding_a_horse
- fishing
- fixing_a_bike
- fixing_a_car
- gardening
- holding_an_umbrella
- jumping
- looking_through_a_microscope
- looking_through_a_telescope
- playing_guitar
- playing_violin
- pouring_liquid
- pushing_a_cart
- reading
- phoning
- riding_a_bike
- riding_a_horse
- rowing_a_boat
- running
- shooting_an_arrow
- smoking
- taking_photos
- texting_message
- throwing_frisby
- using_a_computer
- walking_the_dog
- washing_dishes
- watching_TV
- waving_hands
- writing_on_a_board
- writing_on_a_book

</details>
## Quick Start

### Installation

```bash
pip install git+https://github.com/dronefreak/human-action-classification.git
```

### Basic Usage

```python
from hac import ActionPredictor

# Initialize predictor
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    device='cuda'
)

# Predict on image
result = predictor.predict_image('photo.jpg', top_k=3)

# Print results
print(f"Action: {result['action']['top_class']}")
print(f"Confidence: {result['action']['top_confidence']:.2%}")

# Top 3 predictions
for pred in result['action']['predictions']:
    print(f"  {pred['class']}: {pred['confidence']:.2%}")
```

### With Pose Estimation

```python
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    use_pose_estimation=True,  # Enable MediaPipe
    device='cuda'
)

result = predictor.predict_image('photo.jpg', return_pose=True)
print(f"Detected pose: {result['pose']['class']}")
print(f"Action: {result['action']['top_class']}")
```

### Batch Prediction

```python
from pathlib import Path

image_paths = list(Path('images/').glob('*.jpg'))
results = predictor.predict_batch(image_paths, batch_size=32)

for img_path, result in zip(image_paths, results):
    print(f"{img_path.name}: {result['action']['top_class']}")
```

## Performance Metrics

Evaluated on the Stanford 40 Actions test set (5,532 images):

| Metric | Score |
|--------|-------|
| **Accuracy** | **86.4%** |
| Macro F1-Score | 0.8618 |
| Weighted F1-Score | 0.8618 |
| Macro Precision | 0.8686 |
| Macro Recall | 0.8640 |

### Top Performing Classes

| Class | F1-Score |
|-------|----------|
| Applauding | 0.935 |
| Jumping | 0.925 |
| Running | 0.918 |
| Waving Hands | 0.912 |
| Drinking | 0.905 |

### Confusion Analysis

Most commonly confused actions:

- Cooking ↔ Washing Dishes (similar kitchen setting)
- Reading ↔ Using Computer (similar seated poses)
- Fixing Bike ↔ Fixing Car (similar repair actions)

Full metrics are available in [metrics.json](metrics.json).

## Training Details

### Training Data

- **Dataset:** Stanford 40 Actions
- **Training split:** ~4,000 images
- **Test split:** ~5,532 images
- **Classes:** 40 human action categories
- **Image resolution:** 224×224 (resized)

Note that the official train/test split is somewhat unconventional (the test set is larger than the training set), which is why I created a custom 80/20 train/test split, as is standard machine learning practice.

### Training Procedure

#### Preprocessing

```python
from torchvision import transforms

# Training augmentation
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
```

#### Training Hyperparameters

- **Backbone:** ResNet34 (ImageNet pretrained)
- **Optimizer:** AdamW
- **Learning rate:** 1e-3 → 1e-5 (cosine decay)
- **Weight decay:** 1e-3
- **Batch size:** 32
- **Epochs:** 200
- **Augmentation:** Mixup (α=0.4)
- **Scheduler:** CosineAnnealingLR

A minimal code sketch of this setup is shown at the end of this section.

#### Training Hardware

- **GPU:** NVIDIA RTX 4070 Super (12GB)
- **Training time:** ~0.5 hours
- **Framework:** PyTorch 2.0+

This training setup reduced overfitting: train/test accuracy went from 99% / 62% to 82% / 86%.
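For reference, here is a minimal sketch of the training setup described under Training Hyperparameters (AdamW, cosine decay from 1e-3 to 1e-5 over 200 epochs, mixup with α=0.4). It illustrates the listed hyperparameters rather than reproducing the project's actual training script; `model` and `train_loader` are assumed to be defined elsewhere (e.g., a ResNet34 with a 40-class head and a DataLoader using the augmentation above).

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Assumes `model` (ResNet34, 40-class head) and `train_loader` already exist.
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-5)
criterion = torch.nn.CrossEntropyLoss()

def mixup(images, labels, alpha=0.4):
    """Blend random pairs of images; return both label sets and the mix weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, labels, labels[perm], lam

for epoch in range(200):
    model.train()
    for images, labels in train_loader:
        mixed, y_a, y_b, lam = mixup(images, labels, alpha=0.4)
        logits = model(mixed)
        # Mixup loss: weighted combination of the losses for both label sets.
        loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Mixup blends pairs of training images and their labels, which is one common way to shrink the train/test gap noted above.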
## Evaluation

```python
from hac.evaluation import evaluate_model

# Evaluate on test set
metrics = evaluate_model(
    checkpoint='resnet34_best.pth',
    data_dir='stanford40/',
    split='test'
)

print(f"Accuracy: {metrics['accuracy']:.2%}")
print(f"F1-Score: {metrics['f1_macro']:.4f}")
```

## Limitations

- Trained on Stanford 40, which has limited diversity
- Best performance on indoor/outdoor daily activities
- May struggle with unusual camera angles or occlusions
- Requires a clear view of the person performing the action
- Not suitable for fine-grained action recognition (e.g., different sports moves)

## Bias and Fairness

The model inherits biases from the Stanford 40 dataset:

- Limited demographic diversity
- Western-centric activities
- Imbalanced class distribution

Users should evaluate performance on their specific use case.

## Citation

```bibtex
@software{saksena2025hac,
  author  = {Saksena, Saumya Kumaar},
  title   = {Human Action Classification v2.0},
  year    = {2025},
  url     = {https://github.com/dronefreak/human-action-classification},
  version = {2.0}
}
```

## Model Card Authors

Saumya Kumaar Saksena

## Model Card Contact

- GitHub: [@dronefreak](https://github.com/dronefreak)
- Repository: [human-action-classification](https://github.com/dronefreak/human-action-classification)

## Additional Resources

- [GitHub Repository](https://github.com/dronefreak/human-action-classification)
- [Demo Notebook](https://github.com/dronefreak/human-action-classification/blob/main/notebooks/demo.ipynb)
- [Training Code](https://github.com/dronefreak/human-action-classification/blob/main/src/hac/training/train.py)
- [Evaluation Metrics](metrics.json)

## License

Apache License 2.0 - Free for research and commercial use. See [LICENSE](LICENSE) for full details.