计算机科学
人工智能
自然语言处理
集合(抽象数据类型)
模式
过程(计算)
机器学习
语义学(计算机科学)
模态(人机交互)
人机交互
社会科学
社会学
程序设计语言
操作系统
作者
Siying Wu,Xueyang Fu,Feng Wu,Zheng-Jun Zha
标识
DOI:10.1109/tmm.2024.3358112
摘要
Vision-and-Language Navigation (VLN) requires that an agent can comprehensively understand the given instructions and the immediate visual information obtained from the environment, so as to make correct actions to achieve the navigation goal. Therefore, semantic alignment across modalities is crucial for the agent understanding its own state during the navigation process. However, the potential of semantic alignment has not been systematically explored in current studies, which limits the further improvement of navigation performance. To address this issue, we propose a new Latent Semantic Alignment Learning method to develop the semantically aligned relationships contained in the environment. Specifically, we introduce three novel pre-training tasks: Trajectory-conditioned Masked Fragment Modeling, Action Prediction of Masked Observation, and Hierarchical Triple Contrastive Learning. The first two tasks are used to reason about cross-modal dependencies, while the third one is able to learn semantically consistent representations across modalities. In this way, the Latent Semantic Alignment Learning method establishes a consistent perception of the environment and makes the agent's actions easier to explain. Experiments on common benchmarks verify the effectiveness of our proposed methods. For example, we improve the Success Rate by 1.6% on the R2R validation unseen set and 4.3% on the R4R validation unseen set over the baseline model.
科研通智能强力驱动
Strongly Powered by AbleSci AI