Computer science
Transformer
Encoder
Preprocessor
Artificial intelligence
Speech recognition
Computer engineering
Pattern recognition (psychology)
Voltage
Engineering
Operating system
Electrical engineering
Authors
Hui Zhang, Guiyang Luo, Jian Kang, Shan Huang, Xiao Wang, Fei-Yue Wang
Identifier
DOI: 10.1109/tnnls.2023.3239696
Abstract
Recent years have witnessed the growing popularity of connectionist temporal classification (CTC) and attention mechanisms in scene text recognition (STR). CTC-based methods are fast and impose little computational burden, but they are less effective than attention-based methods. To retain both computational efficiency and effectiveness, we propose the global-local attention-augmented light Transformer (GLaLT), which adopts a Transformer-based encoder-decoder structure to orchestrate the CTC and attention mechanisms. The encoder integrates a self-attention module with a convolution module to augment attention: the self-attention module captures long-term global dependencies, while the convolution module focuses on local context modeling. The decoder consists of two parallel modules: a Transformer-decoder-based attention module and a CTC module. The attention module is removed in the testing phase and guides the CTC module to extract robust features during training. Extensive experiments on standard benchmarks demonstrate that GLaLT achieves state-of-the-art performance on both regular and irregular STR. In terms of tradeoffs, GLaLT sits at or near the frontier that simultaneously maximizes speed, accuracy, and computational efficiency.
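A minimal PyTorch sketch of the structure the abstract describes, under stated assumptions: the module names (GlobalLocalEncoderBlock, HybridCTCAttentionHead), dimensions, and hyperparameters are illustrative and not taken from the paper. It shows an encoder block that pairs self-attention (global dependencies) with a depthwise convolution (local context), and a decoder with a CTC branch running in parallel with a Transformer-decoder attention branch that is used only during training.

```python
import torch
import torch.nn as nn

class GlobalLocalEncoderBlock(nn.Module):
    """Hypothetical encoder block: self-attention for global context,
    depthwise convolution for local context. Names and sizes are assumptions."""
    def __init__(self, d_model=256, n_heads=4, kernel_size=7):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.conv_norm = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, seq, d_model)
        attn_out, _ = self.self_attn(x, x, x)     # long-range global dependencies
        x = self.attn_norm(x + attn_out)
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local context
        return self.conv_norm(x + conv_out)

class HybridCTCAttentionHead(nn.Module):
    """Parallel decoder: a CTC branch (kept at test time) and a
    Transformer-decoder attention branch (training-only guidance)."""
    def __init__(self, d_model=256, vocab_size=100, n_heads=4, n_layers=2):
        super().__init__()
        self.ctc_proj = nn.Linear(d_model, vocab_size)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.attn_decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.attn_proj = nn.Linear(d_model, vocab_size)
        self.char_embed = nn.Embedding(vocab_size, d_model)

    def forward(self, memory, targets=None):      # memory: encoder output
        ctc_logits = self.ctc_proj(memory)        # used at train and test time
        if targets is None:                       # inference: CTC branch only
            return ctc_logits, None
        # Training: the attention branch decodes the targets against the shared
        # encoder features (a causal mask would normally be applied; omitted here).
        tgt = self.char_embed(targets)
        attn_logits = self.attn_proj(self.attn_decoder(tgt, memory))
        return ctc_logits, attn_logits
```

In such a setup, training would typically combine a CTC loss on ctc_logits with a cross-entropy loss on attn_logits, so the attention branch shapes the shared encoder features; at test time only the CTC branch is evaluated, which is how this kind of design keeps inference lightweight.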