Accurate automatic segmentation of surgical instruments is crucial in robot-assisted surgery, yet analyzing surgical videos remains challenging due to rapid instrument motion, high inter-category similarity, and frequent occlusions. Current surgical instrument segmentation models struggle to capture both inter-frame variations and intra-frame details in complex surgical scenarios. The Segment Anything Model (SAM) has shown significant potential across a range of segmentation tasks, but it does not fully address the unique challenges posed by surgical videos. To tackle these issues, we propose TD-SAM, a Temporal and Distance-guided SAM, for accurate surgical instrument segmentation. Specifically, we introduce a dynamic cross-frame attention module that captures temporal information across frames, enabling the model to track surgical instruments and their surroundings as they change and thereby improving segmentation accuracy. In addition, we present a distance-guided instance refinement module that sharpens the model's ability to distinguish visually similar instrument categories, mitigating the resulting class ambiguity. Extensive experiments on the EndoVis18 and EndoVis17 datasets show that TD-SAM outperforms existing methods, achieving state-of-the-art performance without requiring any prompts.
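
To make the two module names above concrete, the following is a minimal PyTorch sketch of one plausible reading: cross-frame attention that queries the previous frame from the current one, and a refinement head that scores features by distance to learned class prototypes. All class names, shapes, and design choices here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class CrossFrameAttention(nn.Module):
    """Sketch of cross-frame attention (assumed design): queries come from
    the current frame's features, keys/values from the previous frame, so
    the output mixes temporal context into the current frame."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, curr: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        # curr, prev: (batch, tokens, dim) patch embeddings of two frames.
        fused, _ = self.attn(query=curr, key=prev, value=prev)
        # Residual connection preserves the current frame's spatial detail.
        return self.norm(curr + fused)


class DistanceGuidedRefinement(nn.Module):
    """Hypothetical distance-guided refinement: per-class prototypes are
    learned, and the negative Euclidean distance to each prototype serves
    as a class logit, pushing similar-looking categories apart."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, dim) -> logits: (batch, tokens, num_classes)
        protos = self.prototypes.unsqueeze(0).expand(feats.size(0), -1, -1)
        return -torch.cdist(feats, protos)


if __name__ == "__main__":
    b, t, d, c = 2, 196, 256, 7  # batch, tokens, embed dim, classes (arbitrary)
    curr, prev = torch.randn(b, t, d), torch.randn(b, t, d)
    fused = CrossFrameAttention(d)(curr, prev)       # (2, 196, 256)
    logits = DistanceGuidedRefinement(d, c)(fused)   # (2, 196, 7)
```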