Multimodal Large Language Models (MLLMs) have demonstrated significant potential across various multimodal tasks, including retrieval, summarization, and reasoning. However, understanding and precisely retrieving specific moments from a video remains a substantial challenge for MLLMs, as it requires fine-grained spatial and temporal understanding of the video. To address this, we propose the Caption Assisted MLLM from Coarse to finE (CALCE), a novel two-stage framework for enhanced moment retrieval. In the first stage, captions extracted from the audio assist the MLLM, providing a robust foundation for precise moment retrieval. To manage the memory consumption introduced by this additional data, a clustering algorithm is applied to the sparsely sampled video frames, categorizing them into key frames and non-key frames. The second stage recalls missed moments and refines moment boundaries by adopting a higher sampling rate. In this process, predictions from the first stage cast votes for the densely sampled frames correlated with them, filtering out less relevant frames. By repeating the first-stage procedure on these selected frames, CALCE progressively retrieves video moments from coarse to fine. Experiments on QVHighlights and Charades-STA demonstrate the effectiveness of CALCE, which outperforms existing state-of-the-art methods. The code is available at https://github.com/tjhd1475/CALCE.
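
As a rough illustration of the second-stage voting step described above, the following Python sketch shows how coarse predictions from the first stage might cast confidence-weighted votes for densely sampled frames near their predicted spans, dropping frames with insufficient support. The function and parameter names (`vote_dense_frames`, `margin`, `min_votes`) and the thresholds are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Moment:
    start: float  # predicted span start (seconds); illustrative
    end: float    # predicted span end (seconds)
    score: float  # confidence of the stage-one prediction


def vote_dense_frames(coarse_moments, dense_timestamps, margin=1.0, min_votes=1.0):
    """Hypothetical stage-two filter: each coarse moment votes, weighted by its
    confidence, for densely sampled frames falling within (or near) its span;
    frames whose accumulated votes fall below `min_votes` are discarded.
    All names and defaults here are assumptions for illustration only."""
    votes = [0.0] * len(dense_timestamps)
    for m in coarse_moments:
        # Widen the span slightly so borderline frames can still be recalled.
        lo, hi = m.start - margin, m.end + margin
        for i, t in enumerate(dense_timestamps):
            if lo <= t <= hi:
                votes[i] += m.score
    # Keep only frames with enough accumulated support.
    return [t for t, v in zip(dense_timestamps, votes) if v >= min_votes]


if __name__ == "__main__":
    coarse = [Moment(4.0, 9.0, 0.8), Moment(30.0, 34.0, 0.6)]
    dense = [i * 0.5 for i in range(80)]  # 0.5 s sampling over a 40 s clip
    kept = vote_dense_frames(coarse, dense, margin=1.0, min_votes=0.5)
    print(f"kept {len(kept)} of {len(dense)} dense frames")
```

Under this reading, the surviving dense frames would then be fed back through the first-stage retrieval step to refine the moment boundaries, realizing the coarse-to-fine loop the abstract describes.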