Computer science
Encoder
Term (time)
Encoding
Artificial intelligence
Action (physics)
Semantics (computer science)
Pattern recognition (psychology)
Natural language processing
Quantum mechanics
Physics
Biochemistry
Chemistry
Gene
Operating system
Programming language
Authors
Jiaming Zhou, Kun-Yu Lin, Yu-Kun Qiu, Wei-Shi Zheng
Identifier
DOI:10.1109/tmm.2023.3302471
Abstract
The long-term action in an untrimmed video generally contains multiple sub-actions, among which various semantic patterns exist (e.g., the co-occurrence or sequentiality between sub-actions). These semantic patterns are temporally coarse and correlated with multiple local contexts, which encode the local temporal evolution of visual elements (e.g., hands, objects) in videos. Together, the local contexts and semantic patterns form the inherent fine-to-coarse temporal structure of long-term actions, which existing works neglect. Accordingly, in this work we propose TwinFormer, which exploits a novel fine-to-coarse temporal modeling manner to uncover the temporal structure of long-term actions. The proposed TwinFormer consists of a pair of twin encoders with the same structural design, namely the Local-context Encoder and the Semantic-pattern Encoder, together with a Temporal-bridged Attention that bridges the two twin encoders. The Local-context Encoder models the local contexts in the long-term action, the Temporal-bridged Attention correlates the local contexts with semantic patterns, and the Semantic-pattern Encoder reveals the temporal evolution of semantic patterns. Experimental results on three benchmarks demonstrate the effectiveness of the proposed model.
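The abstract does not give the exact formulation of the Temporal-bridged Attention, but the described bridging of coarse semantic-pattern tokens with fine local-context features can be illustrated as a cross-attention step. The sketch below is a hypothetical simplification, not the paper's actual design: the function name, shapes, and single-head scaled dot-product form are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bridged_attention(semantic_tokens, local_context):
    """Hypothetical sketch of a cross-attention bridge: coarse
    semantic-pattern tokens (queries) attend over fine local-context
    features (keys/values), so each semantic token aggregates the
    local contexts it correlates with.

    semantic_tokens: (S, d) array of coarse-level tokens
    local_context:   (T, d) array of fine-level features
    returns:         (S, d) array of bridged semantic features
    """
    d = semantic_tokens.shape[-1]
    scores = semantic_tokens @ local_context.T / np.sqrt(d)  # (S, T)
    weights = softmax(scores, axis=-1)                       # rows sum to 1
    return weights @ local_context                           # (S, d)
```

In this reading, the Local-context Encoder would produce `local_context`, the bridge injects that fine-grained evidence into the semantic tokens, and the Semantic-pattern Encoder then models the temporal evolution of the bridged tokens.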