Learning Structured Representation of Human Motion for Understanding, Prediction, and Synthesis

Learning structured representations of human motion is pivotal for advancing the fields of video understanding, prediction, and synthesis. By capturing intrinsic spatiotemporal patterns, structured representations enable machines to interpret complex human behaviors effectively, anticipate future actions accurately, and generate realistic motion sequences. These representations integrate hierarchical and relational dynamics, facilitating more robust generalization across diverse activities and contexts. Consequently, structured motion modeling not only enhances human-computer interaction but also holds promise for applications such as robotics, virtual reality, and assistive technologies.

We introduce H-MoRe [1], a pipeline for learning precise, human-centric motion representations. Our approach dynamically preserves essential human motion features while filtering out background noise. H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. The resulting representation captures human motion in fine detail and transfers readily to a range of action-based applications, including action recognition, gait recognition, and human motion generation.
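To make the self-supervised idea concrete, the sketch below shows one common way a human-focused motion field can be learned from raw frame pairs without labels: predict per-pixel motion, warp the next frame back onto the current one, and penalize the photometric error only inside the human region. This is a minimal illustrative sketch, not the H-MoRe implementation; the tiny encoder, the placeholder body mask, and the warping loss are all assumptions.

```python
# Hypothetical sketch of self-supervised motion-field learning.
# NOT the H-MoRe implementation: the encoder, the placeholder body
# mask, and the photometric warping loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEncoder(nn.Module):
    """Predicts a per-pixel (dx, dy) motion field from a pair of RGB frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, frame_t, frame_t1):
        # Concatenate the two frames along the channel axis.
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

def backward_warp(frame, flow):
    """Sample `frame` at locations displaced by `flow` of shape (B, 2, H, W)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype),
        torch.arange(w, dtype=frame.dtype),
        indexing="ij",
    )
    base = torch.stack([xs, ys], dim=-1).expand(b, -1, -1, -1)  # (B, H, W, 2)
    grid = base + flow.permute(0, 2, 3, 1)
    # grid_sample expects coordinates normalized to [-1, 1].
    gx = 2.0 * grid[..., 0] / (w - 1) - 1.0
    gy = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack([gx, gy], dim=-1), align_corners=True)

encoder = MotionEncoder()
frame_t, frame_t1 = torch.rand(2, 4, 3, 64, 64)  # toy batch of frame pairs
human_mask = torch.ones(4, 1, 64, 64)            # stand-in for a real body mask

# Self-supervision: the predicted motion should warp frame t+1 back onto
# frame t; the mask keeps the loss on the human, not the background.
flow = encoder(frame_t, frame_t1)
loss = (human_mask * (backward_warp(frame_t1, flow) - frame_t).abs()).mean()
loss.backward()
```

The warping loss is a standard self-supervised signal from optical-flow learning; restricting it with a body mask is what makes the learned motion human-centric rather than scene-wide, which is the spirit of filtering out background noise described above.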



    Publications
    1. Zhanbo Huang, Xiaoming Liu, and Yu Kong. Learning Human-centric Motion Representation for Action Analysis. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025 (Highlight). Project Webpage
    2. Junwen Chen, Gaurav Mittal, Ye Yu, Yu Kong, and Mei Chen. GateHUB: Gated History Unit with Background Suppression for Online Action Detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19893-19902, 2022.
    3. Yu Kong and Yun Fu. Human Action Recognition and Prediction: A Survey. International Journal of Computer Vision (IJCV), vol. 130, no. 5, pp. 1366-1401, 2022.
    4. Yu Kong, Zhiqiang Tao, and Yun Fu. Adversarial Action Prediction Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 42, no. 3, pp. 539-553, 2020.
    5. Junwen Chen, Wentao Bao, and Yu Kong. Group Activity Prediction with Sequential Relational Anticipation Model. European Conference on Computer Vision (ECCV), pp. 581-597, 2020.
    6. Yu Kong, Shangqian Gao, Bin Sun, and Yun Fu. Action Prediction from Videos via Memorizing Hard-to-Predict Samples. AAAI Conference on Artificial Intelligence (AAAI), pp. 7000-7007, 2018.
    7. Yu Kong, Zhiqiang Tao, and Yun Fu. Deep Sequential Context Networks for Action Prediction. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1473-1481, 2017.


