Computer science
Action recognition
Parsing
Margin (machine learning)
Artificial intelligence
Machine learning
Graph
Domain knowledge
Convolutional neural network
Knowledge graph
Domain (mathematical analysis)
Language model
Pattern recognition (psychology)
Natural language processing
Theoretical computer science
Mathematical analysis
Class (philosophy)
Mathematics
Identification
DOI: 10.1109/icassp48485.2024.10445852
Abstract
Transferring visual language models (VLMs) from the image domain to the video domain has recently yielded great success on human action recognition tasks. However, standard recognition paradigms overlook fine-grained action parsing knowledge that could enhance recognition accuracy. In this paper, we propose a novel method that leverages both coarse-grained and fine-grained knowledge to recognize human actions in videos. Our method consists of a video-language graph convolutional network that integrates and fuses multi-modal knowledge in a progressive manner. We evaluate our method on Kinetics-TPS, a large-scale action parsing dataset, and demonstrate that it outperforms state-of-the-art methods by a significant margin. Moreover, our method achieves better results than existing methods with less training data and a competitive computational cost, showing the effectiveness and efficiency of using fine-grained knowledge for human video action recognition.
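The abstract only describes the video-language graph convolutional network at a high level. As a rough illustration of what fusing video and language nodes with a graph convolution can look like, here is a minimal PyTorch sketch. The class name, feature dimensions, graph construction, and mean-aggregation rule are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch, assuming pre-extracted CLIP-style features for video clips and
# text (action / body-part) labels. All names and the fusion rule are hypothetical.
import torch
import torch.nn as nn


class VideoLanguageGCNLayer(nn.Module):
    """One graph-convolution step over a joint graph of video and text nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, dim) stacked video + language node features
        # adj:        (N, N) adjacency matrix (1 where two nodes are connected)
        adj = adj + torch.eye(adj.size(0), device=adj.device)  # add self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)      # node degrees
        msg = (adj / deg) @ node_feats                          # mean-aggregate neighbors
        return torch.relu(self.proj(msg))                       # transform + nonlinearity


# Toy usage: fuse 4 video-clip nodes with 3 label nodes in a shared 512-d space.
video_feats = torch.randn(4, 512)
text_feats = torch.randn(3, 512)
nodes = torch.cat([video_feats, text_feats], dim=0)
adj = torch.ones(7, 7)                 # fully connected toy graph
layer = VideoLanguageGCNLayer(512)
fused = layer(nodes, adj)              # (7, 512) fused multi-modal node features
```

Stacking several such layers and reading out the video-node features for classification is one way the "progressive" multi-modal fusion described above could be realized; the paper's actual graph topology and fusion schedule are not specified in this abstract.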