Keywords: computer science; granularity; sentence; artificial intelligence; natural language processing; semantic gap; bridging; generative grammar; word; generative adversarial network; image; information retrieval; image retrieval; philosophy; operating system; linguistics; computer network
Authors
Dehu Jin, Qi Yu, Yu Lan, Meng Qi
Identifier
DOI: 10.1016/j.knosys.2024.111795
Abstract
Text-to-image generation is a challenging task that aims to generate visually realistic images that are semantically consistent with a given text. Existing methods mainly exploit the global semantic information of a single sentence while ignoring fine-grained semantic information such as aspects and words, which are critical for bridging the semantic gap in text-to-image generation. We propose a Multi-granularity Text (Sentence-level, Aspect-level, and Word-level) Fusion Generative Adversarial Network (SAW-GAN), which comprehensively represents textual information at multiple granularities. To fuse multi-granularity information effectively, we design a Double-granularity-text Fusion Module (DFM), which fuses sentence and aspect information through parallel affine transformations, and a Triple-granularity-text Fusion Module (TFM), which fuses sentence, aspect, and word information via a novel Coordinate Attention Module (CAM) that can precisely locate the visual regions associated with each aspect and word. Furthermore, we use CLIP (Contrastive Language-Image Pre-training) to provide visual information that bridges the semantic gap and improves the model's generalization ability. Our results show significant performance improvements over state-of-the-art Conditional Generative Adversarial Network (CGAN) methods on the CUB (FID from 13.91 to 10.45) and COCO (FID from 14.60 to 11.17) datasets, with photorealistic images showing richer detail and better text-image consistency.
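The abstract does not give implementation details, but the DFM idea it describes, conditioning image features on sentence-level and aspect-level text embeddings through parallel affine branches, can be sketched. The following PyTorch snippet is a minimal illustration under assumptions of my own: the module names (Affine, DoubleGranularityFusion), the two-layer MLPs that predict the scale and shift, and the averaging of the two branches are hypothetical choices in the spirit of DF-GAN-style conditional affine transformations, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Predict channel-wise scale (gamma) and shift (beta) from a text
    vector and apply them to an image feature map. Hypothetical sketch;
    the paper's exact predictor architecture is not given in the abstract."""
    def __init__(self, text_dim: int, channels: int):
        super().__init__()
        self.gamma = nn.Sequential(
            nn.Linear(text_dim, channels), nn.ReLU(), nn.Linear(channels, channels))
        self.beta = nn.Sequential(
            nn.Linear(text_dim, channels), nn.ReLU(), nn.Linear(channels, channels))

    def forward(self, h: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # h: (B, C, H, W) image features; t: (B, text_dim) text embedding
        g = self.gamma(t).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        b = self.beta(t).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return g * h + b

class DoubleGranularityFusion(nn.Module):
    """DFM-style block: sentence and aspect embeddings condition the
    feature map through parallel affine branches. Averaging the two
    branch outputs is an assumption made for this sketch."""
    def __init__(self, sent_dim: int, aspect_dim: int, channels: int):
        super().__init__()
        self.sent_affine = Affine(sent_dim, channels)
        self.aspect_affine = Affine(aspect_dim, channels)

    def forward(self, h, sent_emb, aspect_emb):
        return 0.5 * (self.sent_affine(h, sent_emb)
                      + self.aspect_affine(h, aspect_emb))

# Usage with toy shapes: a batch of 4 feature maps conditioned on
# 256-dimensional sentence and aspect embeddings.
h = torch.randn(4, 64, 16, 16)
sent, aspect = torch.randn(4, 256), torch.randn(4, 256)
fused = DoubleGranularityFusion(256, 256, 64)(h, sent, aspect)
print(fused.shape)  # torch.Size([4, 64, 16, 16])
```

Running the two branches in parallel, rather than stacking sentence and aspect conditioning sequentially, lets each granularity modulate the same feature map independently before the results are combined.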