Image captioning models built on CLIP visual features have developed rapidly and achieved remarkable results. However, existing models still struggle to produce descriptive and discriminative captions, because they insufficiently exploit fine-grained visual cues and fail to model the complex vision-language alignment. To address these limitations, we propose the Ranking Diffusion Transformer (RDT), which integrates a Ranking Visual Encoder (RVE) and a Ranking Loss (RL) for fine-grained image captioning. The RVE introduces a novel ranking attention mechanism that effectively mines diverse and discriminative visual information from CLIP features. Meanwhile, the RL leverages the quality ranking of generated captions as a global semantic supervisory signal, thereby enhancing the diffusion process and strengthening vision-language semantic alignment. We show that, by coupling the RVE and RL within the proposed RDT and by gradually adding and removing noise during the diffusion process, more discriminative visual features are learned and precisely aligned with the language features. Experimental results on popular benchmark datasets demonstrate that the proposed RDT surpasses existing state-of-the-art image captioning models. The code is publicly available at: https://github.com/junwan2014/RDT.
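The abstract does not give the concrete form of the Ranking Loss; the snippet below is only a minimal sketch, assuming a standard margin-based pairwise ranking objective over model scores for captions ordered by their measured quality (e.g., CIDEr against references). The function name `ranking_loss`, the margin value, and the toy scores are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ranking_loss(scores: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Pairwise margin ranking loss (illustrative sketch).

    `scores` holds the model's scores for a batch of generated captions,
    already sorted from highest to lowest caption quality. The loss pushes
    each higher-ranked caption to score at least `margin` above the next
    lower-ranked one, turning the quality ranking into a supervisory signal.
    """
    better = scores[:-1]  # scores of higher-quality captions
    worse = scores[1:]    # scores of the adjacent lower-quality captions
    return F.relu(margin - (better - worse)).mean()

# Toy usage: three generated captions ranked best -> worst by caption quality.
scores = torch.tensor([0.9, 0.5, 0.4], requires_grad=True)
loss = ranking_loss(scores)
loss.backward()
print(loss.item())
```

In practice such a ranking term would be added to the diffusion denoising objective, so that the global quality ordering of generated captions shapes the same model that is trained to remove noise step by step.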