Bridging (networking)
Boosting (machine learning)
Computer science
Action recognition
Artificial intelligence
Computer vision
Pattern recognition (psychology)
Natural language processing
Computer security
Class (philosophy)
Authors
Zhaoqilin Yang,Gaoyun An,Zhenxing Zheng,Shan Cao,Qiuqi Ruan
Identifier
DOI: 10.1109/tcsvt.2024.3390133
Abstract
The Contrastive Language-Image Pre-training (CLIP) model achieves strong generalization through contrastive learning over a large number of text-image pairs. However, when it is transferred to action recognition, two problems remain to be solved: 1) how to guide the model to focus more on human-body-related regions so that actions and text are better aligned, and 2) how to make the model strengthen itself in a targeted manner to handle hard-to-classify categories. To solve these problems, a Guided alignment and adaptive Boosting CLIP (GBC) is proposed, which employs visual prior knowledge and benefits from both feature and decision aggregation in a boosting manner. During early training, visual prior knowledge related to the human body is adopted, enabling the model to better align human actions with category text and remain robust to distribution shift. At the later stage of training, the CLIP encoder is frozen, and multiple downstream feature and decision aggregation modules are sequentially generated and trained. In this way, the model boosts performance from different perspectives, in a boosting manner, at a linearly increasing cost. Moreover, a class-adaptive re-weighting strategy is proposed to make the model focus more on optimizing categories that are difficult to classify. The effectiveness of our model is validated on six action recognition datasets (Kinetics-600, Kinetics-400, Jester, HMDB-51, UCF-101, and Mini-Kinetics-200) in both fully supervised and zero-shot settings. Our model achieves superior results compared to state-of-the-art methods on all datasets.
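The abstract only states that hard-to-classify categories receive larger optimization weight; the exact formulation is not given. Below is a minimal PyTorch sketch of one plausible reading, assuming per-class difficulty is tracked as an exponential moving average of per-class accuracy and used to up-weight the cross-entropy loss of harder classes. The class name `ClassAdaptiveReweighting`, the `momentum` EMA factor, and the `gamma` exponent are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

class ClassAdaptiveReweighting(nn.Module):
    """Hypothetical sketch of a class-adaptive re-weighting loss.

    Tracks a running accuracy per class and gives lower-accuracy
    (harder) classes a larger weight in the cross-entropy loss.
    The EMA/exponent details are assumptions, not the paper's exact
    formulation.
    """

    def __init__(self, num_classes: int, momentum: float = 0.9, gamma: float = 1.0):
        super().__init__()
        self.momentum = momentum
        self.gamma = gamma
        # Running per-class accuracy estimate; 0.5 = uninformative start.
        self.register_buffer("class_acc", torch.full((num_classes,), 0.5))

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            preds = logits.argmax(dim=1)
            # Update the running accuracy for each class seen in the batch.
            for c in targets.unique():
                mask = targets == c
                acc_c = (preds[mask] == c).float().mean()
                self.class_acc[c] = (self.momentum * self.class_acc[c]
                                     + (1.0 - self.momentum) * acc_c)
            # Harder classes (lower running accuracy) get larger weights;
            # normalize so the average weight in the batch stays near 1.
            weights = (1.0 - self.class_acc[targets]).pow(self.gamma)
            weights = weights / weights.mean().clamp_min(1e-8)
        per_sample = nn.functional.cross_entropy(logits, targets, reduction="none")
        return (weights * per_sample).mean()
```

In this sketch the weights are computed under `torch.no_grad()`, so re-weighting shapes the gradient of the cross-entropy term without itself being differentiated through, which is the usual design choice for difficulty-based loss weighting.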