---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- action-recognition
- human-action-classification
- image-classification
- computer-vision
- pose-estimation
- mediapipe
- stanford40
- resnet
- mobilenet
datasets:
- stanford40
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: image-classification
widget:
- src: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/person_cooking.jpg
  example_title: "Cooking"
- src: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/person_jumping.jpg
  example_title: "Jumping"
model-index:
- name: human-action-classification
  results:
  - task:
      type: image-classification
      name: Image Classification
    dataset:
      name: Stanford 40 Actions
      type: stanford40
    metrics:
    - type: accuracy
      value: 86.4
      name: Accuracy
      verified: false
    - type: f1
      value: 0.8618
      name: Macro F1-Score
      verified: false
---

# Human Action Classification v2.0

State-of-the-art human action recognition model trained on the Stanford 40 Actions dataset.

GitHub project: [human-action-classification](https://github.com/dronefreak/human-action-classification)

![Demo](looking_through_a_telescope.jpg)

## Model Description

This model performs real-time human action classification from images, recognizing 40 different human activities. It combines a ResNet34 backbone with optional MediaPipe pose estimation for enhanced accuracy.

- **Developed by:** Saumya Kumaar Saksena ([@dronefreak](https://github.com/dronefreak))
- **Model type:** Image Classification (Action Recognition)
- **Language(s):** English (action labels)
- **Finetuned from:** ImageNet-pretrained ResNet34

## Key Features

- 🎯 **86% accuracy** on the Stanford 40 Actions test set
- ⚡ **Real-time inference** (~25ms per image on a GTX 1050 Ti)
- 🎨 **Pose-aware**: optional MediaPipe integration
- 📦 **Easy to use** with a simple Python API
- 🔧 **Production-ready** with comprehensive evaluation metrics

## Model Variants

All models were trained on the Stanford 40 Actions dataset:

| Model | Accuracy | Macro F1 | Parameters | Size | Inference Time* |
|-------|----------|----------|------------|------|-----------------|
| **ResNet50** | **88.5%** | **0.8842** | 23.5M | 94MB | ~30ms |
| **ResNet34** (this model) | **86.4%** | **0.8618** | 21.3M | 85MB | ~25ms |
| ResNet18 | 82.3% | 0.8178 | 11.2M | 45MB | ~18ms |
| MobileNet V3 Large | 82.1% | 0.8169 | 5.4M | 20MB | ~15ms |
| ViT Base | 76.8% | 0.7650 | 86M | 330MB | ~45ms |
| MobileNet V3 Small | 74.35% | 0.7350 | 2.5M | 10MB | ~10ms |

\*Single image on an NVIDIA GTX 1050 Ti

### Detailed Performance Comparison

| Model | Accuracy (%) | Macro Precision | Macro Recall | Macro F1 | Weighted F1 |
|-------|--------------|-----------------|--------------|----------|-------------|
| ResNet50 | 88.5 | 0.8874 | 0.8850 | 0.8842 | 0.8842 |
| **ResNet34** | **86.4** | **0.8686** | **0.8640** | **0.8618** | **0.8618** |
| ResNet18 | 82.3 | 0.8211 | 0.8230 | 0.8178 | 0.8178 |
| MobileNet V3 Large | 82.1 | 0.8216 | 0.8210 | 0.8169 | 0.8169 |
| ViT Base Patch16 | 76.8 | 0.7774 | 0.7680 | 0.7650 | 0.7650 |
| MobileNet V3 Small | 74.35 | 0.7382 | 0.7435 | 0.7350 | 0.7350 |

**Trade-offs:**

- **ResNet50**: Best accuracy, but slower and larger
- **ResNet34**: Optimal balance of accuracy and speed ⭐
- **MobileNet V3 Large**: Best option for mobile/edge deployment
- **MobileNet V3 Small**: Fastest inference for resource-constrained devices
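If you would rather load the ResNet34 variant directly with torchvision than go through the `hac` package, a minimal sketch follows. It assumes the checkpoint (here the `resnet34_best.pth` file referenced in the Evaluation section below) stores a plain `state_dict` for a ResNet34 with a 40-class head; adjust the loading if your checkpoint is wrapped differently.

```python
import torch
from torchvision import models

# ResNet34 backbone with a 40-way head for the Stanford 40 classes.
model = models.resnet34(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 40)

# Load fine-tuned weights. The filename is illustrative; point it at the
# checkpoint you downloaded or trained yourself. This assumes the file is a
# plain state_dict; unwrap it first if it is stored inside a larger dict.
state_dict = torch.load("resnet34_best.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```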
## Supported Actions (40 Classes)

<details>
<summary>Click to expand full list</summary>

- applauding
- blowing_bubbles
- brushing_teeth
- cleaning_the_floor
- climbing
- cooking
- cutting_trees
- cutting_vegetables
- drinking
- feeding_a_horse
- fishing
- fixing_a_bike
- fixing_a_car
- gardening
- holding_an_umbrella
- jumping
- looking_through_a_microscope
- looking_through_a_telescope
- playing_guitar
- playing_violin
- pouring_liquid
- pushing_a_cart
- reading
- phoning
- riding_a_bike
- riding_a_horse
- rowing_a_boat
- running
- shooting_an_arrow
- smoking
- taking_photos
- texting_message
- throwing_frisby
- using_a_computer
- walking_the_dog
- washing_dishes
- watching_TV
- waving_hands
- writing_on_a_board
- writing_on_a_book

</details>
## Quick Start

### Installation

```bash
pip install git+https://github.com/dronefreak/human-action-classification.git
```

### Basic Usage

```python
from hac import ActionPredictor

# Initialize predictor
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    device='cuda'
)

# Predict on image
result = predictor.predict_image('photo.jpg', top_k=3)

# Print results
print(f"Action: {result['action']['top_class']}")
print(f"Confidence: {result['action']['top_confidence']:.2%}")

# Top 3 predictions
for pred in result['action']['predictions']:
    print(f"  {pred['class']}: {pred['confidence']:.2%}")
```

### With Pose Estimation

```python
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    use_pose_estimation=True,  # Enable MediaPipe
    device='cuda'
)

result = predictor.predict_image('photo.jpg', return_pose=True)
print(f"Detected pose: {result['pose']['class']}")
print(f"Action: {result['action']['top_class']}")
```

### Batch Prediction

```python
from pathlib import Path

image_paths = list(Path('images/').glob('*.jpg'))
results = predictor.predict_batch(image_paths, batch_size=32)

for img_path, result in zip(image_paths, results):
    print(f"{img_path.name}: {result['action']['top_class']}")
```

## Performance Metrics

Evaluated on the Stanford 40 Actions test set (5,532 images):

| Metric | Score |
|--------|-------|
| **Accuracy** | **86.4%** |
| Macro F1-Score | 0.8618 |
| Weighted F1-Score | 0.8618 |
| Macro Precision | 0.8686 |
| Macro Recall | 0.8640 |

### Top Performing Classes

| Class | F1-Score |
|-------|----------|
| Applauding | 0.935 |
| Jumping | 0.925 |
| Running | 0.918 |
| Waving Hands | 0.912 |
| Drinking | 0.905 |

### Confusion Analysis

Most commonly confused actions:

- Cooking ↔ Washing Dishes (similar kitchen setting)
- Reading ↔ Using Computer (similar seated poses)
- Fixing Bike ↔ Fixing Car (similar repair actions)

Full metrics are available in [metrics.json](metrics.json).

## Training Details

### Training Data

- **Dataset:** Stanford 40 Actions
- **Training split:** ~4,000 images
- **Test split:** ~5,532 images
- **Classes:** 40 human action categories
- **Image resolution:** 224×224 (resized)

Note that the official train/test split is somewhat unconventional (the test set is larger than the training set), which is why I created a custom 80/20 train/test split, as is standard machine learning practice.

### Training Procedure

#### Preprocessing

```python
from torchvision import transforms

# Training augmentation
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
```

#### Training Hyperparameters

- **Backbone:** ResNet34 (ImageNet pretrained)
- **Optimizer:** AdamW
- **Learning rate:** 1e-3 → 1e-5 (cosine decay)
- **Weight decay:** 1e-3
- **Batch size:** 32
- **Epochs:** 200
- **Augmentation:** Mixup (α=0.4)
- **Scheduler:** CosineAnnealingLR

A minimal code sketch of this setup is shown at the end of this section.

#### Training Hardware

- **GPU:** NVIDIA RTX 4070 Super (12GB)
- **Training time:** ~0.5 hours
- **Framework:** PyTorch 2.0+

This training setup reduced overfitting: train/test accuracy went from 99% / 62% to 82% / 86%.
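For reference, here is a minimal sketch of the training setup described under Training Hyperparameters (AdamW, cosine decay from 1e-3 to 1e-5 over 200 epochs, mixup with α=0.4). It illustrates the listed hyperparameters rather than reproducing the project's actual training script; `model` and `train_loader` are assumed to be defined elsewhere (e.g., a ResNet34 with a 40-class head and a DataLoader using the augmentation above).

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Assumes `model` (ResNet34, 40-class head) and `train_loader` already exist.
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-5)
criterion = torch.nn.CrossEntropyLoss()

def mixup(images, labels, alpha=0.4):
    """Blend random pairs of images; return both label sets and the mix weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, labels, labels[perm], lam

for epoch in range(200):
    model.train()
    for images, labels in train_loader:
        mixed, y_a, y_b, lam = mixup(images, labels, alpha=0.4)
        logits = model(mixed)
        # Mixup loss: weighted combination of the losses for both label sets.
        loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Mixup blends pairs of training images and their labels, which is one common way to shrink the train/test gap noted above.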
## Evaluation

```python
from hac.evaluation import evaluate_model

# Evaluate on test set
metrics = evaluate_model(
    checkpoint='resnet34_best.pth',
    data_dir='stanford40/',
    split='test'
)

print(f"Accuracy: {metrics['accuracy']:.2%}")
print(f"F1-Score: {metrics['f1_macro']:.4f}")
```

## Limitations

- Trained on Stanford 40, which has limited diversity
- Best performance on indoor/outdoor daily activities
- May struggle with unusual camera angles or occlusions
- Requires a clear view of the person performing the action
- Not suitable for fine-grained action recognition (e.g., different sports moves)

## Bias and Fairness

The model inherits biases from the Stanford 40 dataset:

- Limited demographic diversity
- Western-centric activities
- Imbalanced class distribution

Users should evaluate performance on their specific use case.

## Citation

```bibtex
@software{saksena2025hac,
  author  = {Saksena, Saumya Kumaar},
  title   = {Human Action Classification v2.0},
  year    = {2025},
  url     = {https://github.com/dronefreak/human-action-classification},
  version = {2.0}
}
```

## Model Card Authors

Saumya Kumaar Saksena

## Model Card Contact

- GitHub: [@dronefreak](https://github.com/dronefreak)
- Repository: [human-action-classification](https://github.com/dronefreak/human-action-classification)

## Additional Resources

- [GitHub Repository](https://github.com/dronefreak/human-action-classification)
- [Demo Notebook](https://github.com/dronefreak/human-action-classification/blob/main/notebooks/demo.ipynb)
- [Training Code](https://github.com/dronefreak/human-action-classification/blob/main/src/hac/training/train.py)
- [Evaluation Metrics](metrics.json)

## License

Apache License 2.0 - Free for research and commercial use. See [LICENSE](LICENSE) for full details.