Multi-modal hashing aims to succinctly encode heterogeneous modalities into binary hash codes, enabling efficient multimedia retrieval with low storage cost and high retrieval speed. Despite the notable achievements of existing methods, they still face three crucial challenges: 1) the heterogeneous modality gap is bridged only through coarse, global feature-level alignment and fusion; 2) feature-level hash code learning erodes bit independence, limiting the semantic representation capacity of the resulting hash codes; and 3) pairwise semantic preservation strategies based on binary labels cannot capture the intricate fine-grained semantic correlations within multi-modal data. To address these challenges, this paper introduces the Dynamic Bit-wise Semantic Transformer Hashing (DBSTH) framework. At its core, DBSTH treats each hash bit as a distinct semantic concept, enabling concept-level alignment of heterogeneous modalities that safeguards bit independence and strengthens representation capacity. Specifically, we devise a dynamic unit fusion strategy that adaptively combines local multi-modal information units to learn bit-wise semantic concepts. We then employ a transformer encoder to refine these concepts by uncovering latent correlations among them. Finally, we perform multi-modal alignment and fusion at the fine-grained concept level, independently encoding each concept into its corresponding hash bit. To provide stronger guidance for concept learning, we introduce a label prototype learning mechanism that learns prototype embeddings for all categories by exploiting label co-occurrence priors; this mechanism captures fine-grained explicit semantic correlations and generates supervising hash codes. Additionally, to improve the robustness of the hashing model against noisy multi-modal data, we introduce a masked concept learning strategy that yields resilient semantic concepts. Extensive experiments on three widely used multi-modal retrieval benchmarks demonstrate the superiority of our method in conventional, noisy, and open-set retrieval scenarios.
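
To make the bit-wise concept encoding concrete, the following is a minimal PyTorch sketch of the pipeline described above. The framework choice, module names, dimensions, and the attention-based realization of dynamic unit fusion are all our assumptions for illustration, not the authors' released implementation: learnable per-bit concept queries attend over local multi-modal units (dynamic unit fusion), a transformer encoder models cross-concept correlations, and each refined concept is projected independently to a single bit.

```python
# Illustrative sketch only; all names and shapes are hypothetical.
import torch
import torch.nn as nn

class BitwiseConceptHasher(nn.Module):
    def __init__(self, unit_dim=512, num_bits=64, num_heads=8, depth=2):
        super().__init__()
        # One learnable query per hash bit: each query gathers the local
        # multi-modal units relevant to its semantic concept (dynamic fusion).
        self.concept_queries = nn.Parameter(torch.randn(num_bits, unit_dim))
        self.fusion_attn = nn.MultiheadAttention(unit_dim, num_heads,
                                                 batch_first=True)
        # Transformer encoder refines concepts by modeling correlations
        # among the distinct concepts.
        layer = nn.TransformerEncoderLayer(unit_dim, num_heads, batch_first=True)
        self.concept_encoder = nn.TransformerEncoder(layer, depth)
        # Independent per-concept projection: one scalar per concept -> one
        # hash bit, preserving bit independence.
        self.bit_heads = nn.Parameter(torch.randn(num_bits, unit_dim))

    def forward(self, units):
        # units: (batch, num_units, unit_dim) local units from all modalities
        b = units.size(0)
        queries = self.concept_queries.unsqueeze(0).expand(b, -1, -1)
        concepts, _ = self.fusion_attn(queries, units, units)  # (b, bits, dim)
        concepts = self.concept_encoder(concepts)
        logits = (concepts * self.bit_heads).sum(-1)           # (b, bits)
        return torch.tanh(logits)  # relaxed codes; apply sign() at inference

# Usage: fuse concatenated image/text units into 64 relaxed bits, binarize.
model = BitwiseConceptHasher()
units = torch.randn(4, 10, 512)   # e.g., 10 local units per sample
codes = torch.sign(model(units))  # (4, 64) binary codes in {-1, +1}
# Masked concept learning (robustness to noisy inputs) could be emulated
# here by randomly zeroing a subset of `units` during training.
```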
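Similarly, a hedged sketch of the label prototype learning mechanism: we assume a GCN-style propagation of category prototypes over a row-normalized label co-occurrence matrix, with a sample's supervising code pooled from the prototypes of its active labels; the paper's exact formulation may differ.

```python
# Illustrative sketch only; the propagation scheme is an assumption.
import torch
import torch.nn as nn

class LabelPrototypeLearner(nn.Module):
    def __init__(self, num_classes, num_bits, cooccur):
        super().__init__()
        # cooccur: (num_classes, num_classes) row-normalized co-occurrence prior
        self.register_buffer("adj", cooccur)
        self.prototypes = nn.Parameter(torch.randn(num_classes, num_bits))
        self.propagate = nn.Linear(num_bits, num_bits)

    def forward(self, labels):
        # labels: (batch, num_classes) multi-hot annotations, float tensor
        # Inject co-occurrence structure into the category prototypes.
        protos = torch.tanh(self.propagate(self.adj @ self.prototypes))
        # Average the prototypes of each sample's active labels.
        pooled = labels @ protos / labels.sum(-1, keepdim=True).clamp(min=1)
        return torch.tanh(pooled)  # relaxed supervising codes; sign() for targets
```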