---
license: mit
pipeline_tag: robotics
---

# **SFHand – Official Checkpoint**

This repository provides the **official pretrained checkpoint** for **SFHand**, a streaming framework for **language-guided 3D hand forecasting and embodied manipulation**, as introduced in the paper [SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting](https://huggingface.co/papers/2511.18127).

---

## 🔗 Project Links

- **Paper:** [arXiv:2511.18127](https://arxiv.org/abs/2511.18127)
- **GitHub:** [ut-vision/SFHand](https://github.com/ut-vision/SFHand)
- **Dataset:** [EgoHaFL](https://huggingface.co/datasets/ut-vision/EgoHaFL)

---

## 📝 Introduction

SFHand is the first streaming architecture for language-guided 3D hand forecasting. It autoregressively predicts future hand dynamics from a continuous egocentric video stream and a text instruction, and outputs the hand type, 2D bounding boxes, 3D poses, and 3D trajectories.

Key features include:
- **Streaming Framework:** Autoregressive multi-modal hand forecasting.
- **ROI-Enhanced Memory:** Captures temporal hand awareness while focusing on salient regions.
- **Embodied Ready:** Representations transfer effectively to downstream manipulation tasks.
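
To make the streaming, autoregressive idea concrete, here is a minimal Python sketch. It is not the SFHand implementation: `HandForecast`, `stream_forecasts`, `dummy_predict`, and all field names are hypothetical, and the bounded `deque` of past forecasts is only a simplified stand-in for the ROI-enhanced memory described above. The sketch shows one loop that consumes frames as they arrive, conditions each prediction on a fixed-size history, and emits the four output types listed in the introduction.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class HandForecast:
    """Hypothetical container for one forecast step (field names are assumptions)."""
    hand_type: str                               # e.g. "left" or "right"
    bbox_2d: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    pose_3d: List[Point3D]                       # one 3D point per hand joint
    trajectory_3d: List[Point3D]                 # predicted future 3D positions


def stream_forecasts(
    frames: Iterable[object],
    instruction: str,
    predict: Callable[[object, str, List[HandForecast]], HandForecast],
    memory_size: int = 8,
) -> Iterator[HandForecast]:
    """Yield one forecast per incoming frame, conditioned on a bounded history."""
    # Bounded memory: the oldest entry is dropped once memory_size is reached.
    memory: Deque[HandForecast] = deque(maxlen=memory_size)
    for frame in frames:
        forecast = predict(frame, instruction, list(memory))
        memory.append(forecast)  # autoregressive: outputs feed later steps
        yield forecast


def dummy_predict(frame: object, instruction: str,
                  memory: List[HandForecast]) -> HandForecast:
    """Placeholder predictor that only encodes how much history it received."""
    step = float(len(memory))
    return HandForecast("right", (0.0, 0.0, 1.0, 1.0),
                        [(0.0, 0.0, 0.0)], [(step, 0.0, 0.0)])


forecasts = list(stream_forecasts(range(5), "pick up the cup",
                                  dummy_predict, memory_size=3))
history_sizes = [f.trajectory_3d[0][0] for f in forecasts]
print(history_sizes)
```

Because the memory is capped at `memory_size`, the per-frame cost stays constant however long the video stream runs, which is the main reason for choosing a bounded buffer in this sketch.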

---

## 🚀 Evaluation and Visualization

To evaluate the model and generate visualizations with this checkpoint, clone the [official repository](https://github.com/ut-vision/SFHand) and run the following command from its root directory:

```bash
python main.py --config_file configs/config/clip_base_eval.yml --eval --vis
```

Output visualizations will be saved to the `./render_results/` directory.
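
As a quick check after a run, the following Python sketch lists whatever files ended up under `./render_results/`. The helper name `list_render_outputs` is hypothetical and not part of the SFHand repository, and no assumption is made about the file types the evaluation script writes.

```python
from pathlib import Path
from typing import List


def list_render_outputs(root: str = "render_results") -> List[str]:
    """Return sorted relative paths of all files under root (empty if missing)."""
    base = Path(root)
    if not base.is_dir():
        # The directory does not exist yet, e.g. the evaluation has not run.
        return []
    return sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    for name in list_render_outputs():
        print(name)
```

An empty result suggests that the evaluation was run without `--vis` or from a different working directory.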

---

## 📚 Citation

If you use this model or find SFHand helpful in your research, please cite:

```bibtex
@article{liu2025sfhand,
  title={SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation},
  author={Liu, Ruicong and Huang, Yifei and Ouyang, Liangyang and Kang, Caixin and Sato, Yoichi},
  journal={arXiv preprint arXiv:2511.18127},
  year={2025}
}
```