HMotionGPT

Aligning Hand Motions and Natural Language for Activity Understanding with Smart Rings

Yang Gao, Dong She, Wolin Liang, Chiyue Wang, Yingjing Xiao, Xianrong Yao, Cong Liu, ZhiChao Huang, Zhanpeng Jin†

†Corresponding author: zjin@scut.edu.cn

Abstract

HMotionGPT is a multimodal framework that aligns smart-ring IMU signals with natural language for hand-centric activity understanding. The model projects motion representations into the input space of a language model, enabling classification, captioning, and instruction-following activity analysis. This open-source release includes the core two-stage training pipeline, configurable language-model backbones, and minimal smoke-test assets for reproducible experimentation.

System Overview

HMotionGPT connects wearable smart rings, backend IMU processing, and language-guided activity understanding in one pipeline. The system is designed for hand-object interaction analysis, turning raw inertial streams into interpretable text outputs that can describe actions, classify activities, and support downstream interaction understanding tasks.

System overview of HMotionGPT

Model Architecture

The architecture first converts IMU sequences into motion representations, then aligns those representations with the language model through an IMU projector. This alignment enables the model to reason over hand motions with language-supervised objectives while preserving the temporal structure of wearable sensor signals.
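To make the alignment step concrete, here is a minimal sketch of an IMU projector: a small MLP that maps per-timestep IMU features into the language model's embedding width while keeping the time axis intact. All dimensions, names, and the two-layer MLP form are illustrative assumptions, not the release's actual configuration.

```python
import numpy as np

# Hypothetical dimensions -- the release's own configs define the real values.
IMU_FEAT_DIM = 6      # e.g., 3-axis accelerometer + 3-axis gyroscope
HIDDEN_DIM = 256
LM_EMBED_DIM = 1024   # embedding width of the language-model backbone

rng = np.random.default_rng(0)

class IMUProjector:
    """Two-layer MLP that maps per-timestep IMU features into the
    language model's embedding space, preserving temporal structure."""

    def __init__(self):
        self.w1 = rng.standard_normal((IMU_FEAT_DIM, HIDDEN_DIM)) * 0.02
        self.b1 = np.zeros(HIDDEN_DIM)
        self.w2 = rng.standard_normal((HIDDEN_DIM, LM_EMBED_DIM)) * 0.02
        self.b2 = np.zeros(LM_EMBED_DIM)

    def __call__(self, imu_seq):
        # imu_seq: (T, IMU_FEAT_DIM) -> motion tokens of shape (T, LM_EMBED_DIM)
        h = np.maximum(imu_seq @ self.w1 + self.b1, 0.0)  # ReLU
        return h @ self.w2 + self.b2

projector = IMUProjector()
tokens = projector(rng.standard_normal((128, IMU_FEAT_DIM)))
print(tokens.shape)  # (128, 1024)
```

Because the projector operates per timestep, each IMU frame becomes one "motion token" the language model can attend over alongside text tokens.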

Overall architecture of HMotionGPT

Two-Stage Training Pipeline

Stage 1 learns the IMU-to-language alignment by freezing the base language model and optimizing the projector. Stage 2 performs supervised fine-tuning with the aligned projector so the final model can follow task instructions and generate stronger task-specific outputs.
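The freeze/unfreeze schedule above can be sketched as a simple parameter-group toggle. The module names below are hypothetical, and the assumption that Stage 2 unfreezes the language model (rather than keeping it frozen) is ours; the release's training scripts define the actual setup.

```python
# Hypothetical parameter groups; names are illustrative only.
params = {
    "lm.embed": {"trainable": False},
    "lm.blocks": {"trainable": False},
    "projector.w1": {"trainable": False},
    "projector.w2": {"trainable": False},
}

def configure_stage(params, stage):
    """Stage 1: freeze the language model, train only the projector.
    Stage 2: unfreeze everything for supervised fine-tuning
    (an assumption -- the released scripts may keep parts frozen)."""
    for name, p in params.items():
        if stage == 1:
            p["trainable"] = name.startswith("projector.")
        else:
            p["trainable"] = True

configure_stage(params, stage=1)
stage1_trainable = [n for n, p in params.items() if p["trainable"]]
print(stage1_trainable)  # ['projector.w1', 'projector.w2']

configure_stage(params, stage=2)
print(all(p["trainable"] for p in params.values()))  # True
```

In a framework like PyTorch the same toggle is typically expressed by setting `requires_grad` on each parameter before building the optimizer for that stage.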

Two-stage training pipeline of HMotionGPT

Citation

Coming soon.