Automated sleep staging is crucial for assessing sleep quality and diagnosing sleep-related diseases. Single-channel EEG has attracted significant attention due to its portability and accessibility. Most existing automated sleep staging methods emphasize temporal information while neglecting spectral information, the relationships between contextual features of sleep stages, and the transition rules between sleep stages. To address these limitations, this paper proposes an attention-based two-stage temporal-spectral fusion model (BiTS-SleepNet). The BiTS-SleepNet stage 1 network consists of a dual-stream temporal-spectral feature extraction branch and a temporal-spectral feature fusion module based on the cross-attention mechanism. These blocks are designed to autonomously extract and integrate the temporal and spectral features of EEG signals, leveraging the fused temporal-spectral information to discriminate between sleep stages. The BiTS-SleepNet stage 2 network includes a feature context learning module (FCLM) based on a Bi-GRU and a transition rules learning module (TRLM) based on a Conditional Random Field (CRF). The FCLM refines the preliminary sleep stage predictions of the stage 1 network by learning dependencies among the features of multiple adjacent stages, and the TRLM further exploits transition rules between sleep stages to optimize the overall predictions. We evaluated BiTS-SleepNet on three public datasets: Sleep-EDF-20, Sleep-EDF-78, and SHHS, achieving accuracies of 88.50%, 85.09%, and 87.01%, respectively. The experimental results demonstrate that BiTS-SleepNet achieves competitive performance in comparison to recently published methods, highlighting its promise for practical applications.
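To make the two-stage pipeline concrete, the sketch below illustrates one plausible reading of the components named in the abstract: cross-attention fusion of temporal and spectral feature sequences, a Bi-GRU producing per-epoch stage scores, and linear-chain Viterbi decoding as a stand-in for CRF inference over transition rules. It is a minimal, hypothetical PyTorch sketch; all module names, dimensions, and the assumption of five sleep stages are illustrative and not taken from the paper's actual implementation.

```python
# Hypothetical sketch of the BiTS-SleepNet-style two-stage design; shapes,
# names, and the 5-stage assumption are illustrative only.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse temporal and spectral feature sequences via cross-attention."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, temporal: torch.Tensor, spectral: torch.Tensor) -> torch.Tensor:
        # Temporal features query the spectral features (one possible pairing).
        fused, _ = self.attn(query=temporal, key=spectral, value=spectral)
        return self.norm(temporal + fused)  # residual connection


class ContextEmitter(nn.Module):
    """Bi-GRU context learning over adjacent epochs, emitting per-stage scores."""

    def __init__(self, dim: int, n_stages: int = 5):
        super().__init__()
        self.bigru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(dim, n_stages)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        ctx, _ = self.bigru(feats)  # (batch, epochs, dim)
        return self.emit(ctx)       # (batch, epochs, n_stages) emission scores


def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor) -> list:
    """Linear-chain Viterbi: best stage sequence under emission scores plus
    a learned stage-to-stage transition matrix (the decoding step of a CRF)."""
    T, n = emissions.shape
    score = emissions[0]
    backptr = []
    for t in range(1, T):
        # total[i, j] = best score ending in stage i at t-1, then moving to j
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, idx = total.max(dim=0)
        backptr.append(idx)
    best = [int(score.argmax())]
    for idx in reversed(backptr):
        best.append(int(idx[best[-1]]))
    return best[::-1]


if __name__ == "__main__":
    dim, epochs, n_stages = 64, 10, 5
    temporal = torch.randn(1, epochs, dim)  # stand-in for temporal-stream features
    spectral = torch.randn(1, epochs, dim)  # stand-in for spectral-stream features
    fused = CrossAttentionFusion(dim)(temporal, spectral)
    emissions = ContextEmitter(dim, n_stages)(fused)
    transitions = torch.zeros(n_stages, n_stages)  # learned jointly in practice
    print(viterbi_decode(emissions[0].detach(), transitions))
```

The CRF's appeal here is that transition scores can penalize implausible stage jumps (e.g., direct wake-to-REM transitions), so the decoded sequence respects sleep-stage continuity rather than scoring each epoch in isolation.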