Although text-guided infrared-visible image fusion helps improve content understanding under extreme illumination, existing methods usually ignore the semantic differences between textual and visual features, resulting in limited improvement. To address this challenge, we propose a Text-Guided Semantic Alignment Network, termed TSANet, for extreme-illumination infrared-visible image fusion. The network follows an encoder-decoder structure, with two image encoders, two text encoders, and one decoder. It uses a Semantic Alignment and Fusion (SAF) block to bridge the two image encoders at each layer. Specifically, the SAF block consists of two parallel Semantic Alignment (SA) modules, corresponding to the infrared and visible modalities, respectively, and a Spatial-Frequency Interaction (SFI) module. The SA module aligns the visual features from the image encoder with the corresponding textual features from the text encoder, guiding the network to focus on key semantic regions of the infrared and visible images. The SFI module aggregates the spatial and frequency information extracted from the modality-aligned features of the two SA modules for complementary representation learning. The network progressively complements the two image modalities by connecting the SAF blocks from top to bottom, and finally produces a visually pleasing fusion result by feeding the output of the last block into the decoder. Recognizing that existing datasets lack illumination diversity, we contribute a new dataset specifically designed for extreme-illumination image fusion. Extensive experiments demonstrate the effectiveness and superiority of TSANet over seven state-of-the-art methods. The source code and dataset are available at https://github.com/WentaoLi-CV/TSANet.
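
To make the block structure described above concrete, the following is a minimal sketch of how an SAF block (two parallel SA modules followed by one SFI module) might be organized. The cross-attention-based alignment, the FFT-based frequency branch, the tensor shapes, and all module names are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch only: assumed shapes and operators, not the authors' implementation.
import torch
import torch.nn as nn


class SemanticAlignment(nn.Module):
    """Aligns one modality's visual tokens with its textual tokens.
    Cross-attention is an assumed choice of alignment operator."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, C) flattened image tokens; text: (B, T, C) text tokens
        aligned, _ = self.attn(query=visual, key=text, value=text)
        return self.norm(visual + aligned)


class SpatialFrequencyInteraction(nn.Module):
    """Aggregates spatial and frequency information from the two aligned features.
    The frequency branch is sketched with an FFT over the token dimension."""

    def __init__(self, dim: int):
        super().__init__()
        self.spatial = nn.Linear(2 * dim, dim)
        self.freq = nn.Linear(2 * dim, dim)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        cat = torch.cat([ir, vis], dim=-1)          # (B, N, 2C)
        spatial = self.spatial(cat)                 # spatial mixing branch
        freq = torch.fft.ifft(
            self.freq(torch.fft.fft(cat, dim=1).real), dim=1
        ).real                                      # frequency mixing branch
        return self.out(torch.cat([spatial, freq], dim=-1))


class SAFBlock(nn.Module):
    """Two parallel SA modules (infrared / visible) followed by one SFI module."""

    def __init__(self, dim: int):
        super().__init__()
        self.sa_ir = SemanticAlignment(dim)
        self.sa_vis = SemanticAlignment(dim)
        self.sfi = SpatialFrequencyInteraction(dim)

    def forward(self, ir_feat, vis_feat, ir_text, vis_text):
        return self.sfi(self.sa_ir(ir_feat, ir_text),
                        self.sa_vis(vis_feat, vis_text))


if __name__ == "__main__":
    block = SAFBlock(dim=64)
    ir = torch.randn(2, 256, 64)    # infrared image tokens
    vis = torch.randn(2, 256, 64)   # visible image tokens
    txt = torch.randn(2, 16, 64)    # textual tokens per modality
    print(block(ir, vis, txt, txt).shape)  # torch.Size([2, 256, 64])
```

In the full network, one such block would sit at each encoder layer, and its output would be passed to the next SAF block and ultimately to the decoder.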