license: mit
pipeline_tag: robotics
SFHand Official Checkpoint
This repository provides the official pretrained checkpoint for SFHand, a streaming framework for language-guided 3D hand forecasting and embodied manipulation, introduced in the paper "SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting".
Project Links
- Paper: arXiv:2511.18127
- GitHub: ut-vision/SFHand
- Dataset: EgoHaFL
Introduction
SFHand is the first streaming architecture for language-guided 3D hand forecasting. It autoregressively predicts future hand dynamics from continuous egocentric video and text instructions, outputting hand type, 2D bounding boxes, 3D poses, and 3D trajectories.
Key features include:
- Streaming Framework: Autoregressive multi-modal hand forecasting.
- ROI-Enhanced Memory: Captures temporal hand awareness while focusing on salient regions.
- Embodied Ready: Representations transfer effectively to downstream manipulation tasks.
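The streaming, autoregressive behavior described above can be sketched as follows. This is a minimal illustration, not the SFHand implementation: `HandForecast`, `stream_forecasts`, `predict_step`, and `memory_size` are hypothetical names, the bounding-box layout is an assumption, and a plain bounded queue of past forecasts stands in for the ROI-enhanced memory.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class HandForecast:
    """Hypothetical container for the four outputs named in the introduction."""
    hand_type: str                              # e.g. "left" or "right"
    bbox_2d: Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
    pose_3d: List[Point3D]                      # one 3D position per hand joint
    trajectory_3d: List[Point3D]                # predicted future 3D positions


def stream_forecasts(
    frames: Iterable[object],
    instruction: str,
    predict_step: Callable[[object, str, List[HandForecast]], HandForecast],
    memory_size: int = 8,
) -> Iterator[HandForecast]:
    """Yield one forecast per incoming frame, feeding past forecasts back in.

    The bounded deque is a simplified stand-in for a temporal memory: once it
    is full, the oldest forecast is dropped when a new one is appended.
    """
    memory: Deque[HandForecast] = deque(maxlen=memory_size)
    for frame in frames:
        forecast = predict_step(frame, instruction, list(memory))
        memory.append(forecast)  # autoregressive feedback for the next step
        yield forecast


def dummy_step(frame: object, instruction: str, memory: List[HandForecast]) -> HandForecast:
    """Stand-in predictor (not a real model): offsets the hand by the memory length."""
    offset = float(len(memory))
    return HandForecast(
        hand_type="right",
        bbox_2d=(offset, 0.0, offset + 1.0, 1.0),
        pose_3d=[(offset, 0.0, 0.0)],
        trajectory_3d=[(offset + 1.0, 0.0, 0.0)],
    )


if __name__ == "__main__":
    for step in stream_forecasts(range(5), "pick up the cup", dummy_step, memory_size=3):
        print(step.hand_type, step.bbox_2d)
```

Because the loop is a generator, forecasts are produced as frames arrive rather than after the whole video is available, which is the property the word "streaming" refers to here.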
Evaluation and Visualization
To evaluate the model and generate visualizations with this checkpoint, run the following command from the official repository:
python main.py --config_file configs/config/clip_base_eval.yml --eval --vis
Output visualizations will be saved to the ./render_results/ directory.
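For scripted runs, the same command can be launched from Python and the output directory inspected afterwards. The sketch below is an assumption-laden convenience wrapper, not part of the official repository: `build_eval_args` and `list_render_outputs` are hypothetical helper names, and no particular file format is assumed for the rendered results.

```python
import shlex
from pathlib import Path
from typing import List

# The evaluation command exactly as given in this model card.
EVAL_COMMAND = "python main.py --config_file configs/config/clip_base_eval.yml --eval --vis"


def build_eval_args(command: str = EVAL_COMMAND) -> List[str]:
    """Split the command string into an argument list suitable for subprocess.run."""
    return shlex.split(command)


def list_render_outputs(render_dir: str = "./render_results") -> List[Path]:
    """Return every file under the render directory, sorted; empty if it does not exist."""
    root = Path(render_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    print(build_eval_args())
    for path in list_render_outputs():
        print(path)
```

To actually launch the evaluation, pass the argument list to `subprocess.run(build_eval_args(), check=True)` from a checkout of the official repository, then call `list_render_outputs()` to see what was written.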
Citation
If you use this model or find SFHand helpful in your research, please cite:
@article{liu2025sfhand,
title={SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation},
author={Liu, Ruicong and Huang, Yifei and Ouyang, Liangyang and Kang, Caixin and Sato, Yoichi},
journal={arXiv preprint arXiv:2511.18127},
year={2025}
}