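Localization
Computer science
Granularity
Modal verb
Graph
Exploitation
Feature (linguistics)
Hierarchy
Natural language processing
Artificial intelligence
Information retrieval
Pattern recognition (psychology)
Machine learning
Theoretical computer science
Operating system
Philosophy
Economy
Linguistics
Computer security
Chemistry
Polymer chemistry
Market economy
Identifier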
DOI:10.1109/jiot.2024.3390943
Abstract
In recent years, scene text spotting approaches have evolved into multi-modal frameworks. Although previous studies have highlighted the importance of the intrinsic synergy between visual and linguistic features, recent multi-modal methods typically adopt an implicit fusion strategy with single-granularity features, which cannot fully exploit the prior contextual relationships embedded in visual and semantic information. We argue that directly integrating visual and semantic features is sub-optimal because the multi-granularity structure of scene text images differs substantially from that of natural images. To address this, we introduce a novel model called the Multi-Granularity Visual Semantic Interactive Fusion Network (MGN-Net), which comprises a Visual Semantic Multi-Granularity Feature Extraction Network (VSMN) and a Multi-Granularity Graph Fusion Learning Network (MGFN). The VSMN adaptively extracts multi-granularity visual and semantic features from the text image, thereby enriching the textual contextual relations. In the MGFN, a cross-modal and cross-hierarchy graph is constructed to align features from different modalities for deep intra- and inter-fusion. This approach also alleviates the inflexibility of the sequential structure when dealing with images of irregularly curved objects. Furthermore, the cross-hierarchy semantic features are designed to facilitate the training of MGN-Net. Experimental results demonstrate that our model significantly outperforms previous state-of-the-art models. The code will be released in MGN-Net.