Closed captioning
Computer science
Diffusion
Image (mathematics)
Artificial intelligence
Computer vision
Natural language processing
Thermodynamics
Physics
Authors
Bing Liu, Wenjie Yang, Mingming Liu, Hao Liu, Yong Zhou, Peng Liu
Abstract
Current diffusion model-based image captioning methods generally focus on generating descriptions in a non-autoregressive manner. However, it is non-trivial for such generative models to control the generation of discrete words while balancing diversity and accuracy. Inspired by the success of continuous diffusion models in image captioning, we introduce Part-of-Speech (POS) information and classifier-free guidance into the diffusion model and propose a novel controllable image captioning model, POS-Conditional Diffusion Networks (POSCD-Net), which consists of a Diffusion-based POS Generator (DPG) and a Diffusion-based Caption Generator (DCG). The DPG produces diverse syntactic structures for each input image. These diverse POS sequences then serve as control signals for the DCG, which generates the output sentences through a conditional diffusion process. Within the DCG, a syntactic control module (SCM) progressively strengthens the alignment between words and their corresponding POS tags in a cascaded manner. Furthermore, to improve the controllability of POSCD-Net, classifier-free guidance with learnable parameters is exploited to jointly optimize both the DPG and the DCG in a non-autoregressive manner. Extensive experiments on the MSCOCO dataset demonstrate that our proposed method outperforms state-of-the-art non-autoregressive counterparts and achieves promising performance compared with autoregressive models.
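The abstract names classifier-free guidance with learnable parameters as the mechanism for controllability. Below is a minimal PyTorch sketch of that general idea, assuming the standard classifier-free guidance formulation; the names `CFGDenoiser`, `denoiser`, `null_cond`, and the learnable scale `w` are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CFGDenoiser(nn.Module):
    """Sketch of classifier-free guidance with a learnable guidance
    scale, conditioned on a POS control signal. Hypothetical names;
    not the authors' code."""

    def __init__(self, denoiser: nn.Module, cond_dim: int):
        super().__init__()
        self.denoiser = denoiser
        # Learned "null" condition, used when the POS signal is dropped.
        self.null_cond = nn.Parameter(torch.zeros(cond_dim))
        # Learnable guidance scale (a fixed hyperparameter in vanilla CFG).
        self.w = nn.Parameter(torch.tensor(1.0))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor,
                pos_cond: torch.Tensor) -> torch.Tensor:
        # Conditional and unconditional denoising passes.
        eps_cond = self.denoiser(x_t, t, pos_cond)
        eps_uncond = self.denoiser(x_t, t, self.null_cond.expand_as(pos_cond))
        # Standard CFG combination: push the prediction toward the
        # POS-conditioned direction by the (learned) scale w.
        return eps_uncond + self.w * (eps_cond - eps_uncond)
```

Making `w` a parameter, as the abstract suggests, lets the model learn how strongly the POS condition should steer denoising rather than fixing the guidance scale by hand.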