Metal gears are an essential component of many important mechanical parts, and their quality directly affects the overall performance and longevity of automated systems. Detecting defects on the gear end face allows potential problems to be identified and resolved promptly, improving overall product quality. In metal gear end face defect detection, the end face structure is inhomogeneous and the defects are multi-scale and small, so current detection methods achieve poor accuracy. To address these issues, we propose SF-YOLO, a metal gear end face defect detection method optimized with an evolutionary algorithm, for the automatic detection of gear end face defects. First, we propose an image extraction method based on visually salient regions that eliminates interference from invalid features in non-processing regions and reduces image complexity. Then, the feature pyramid in the neck network is replaced by a weighted bidirectional feature pyramid network to enhance the model's multi-scale adaptability and improve the speed and efficiency of feature fusion. Next, the convolutional block attention module (CBAM), which applies both spatial and channel attention, is combined with the C3 module to form the CBAM-C3 attention module, improving the model's attention to small-sized defects. Finally, an improved sparrow algorithm is proposed to optimize the hyperparameters of the model's neural network and avoid the arbitrariness of manual parameter tuning. Experiments show that the SF-YOLO model achieves an average accuracy of 98.01% on a test set of metal gear end face defects, with an F1 value of 0.99 and an average detection time of 0.025 s per image.
Compared with other deep learning models, the proposed SF-YOLO model improves both the accuracy and the efficiency of gear end face defect detection, and it detects small and multi-scale metal gear end face defects efficiently enough to meet enterprises' real-time online inspection needs.
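
To make the idea of weighted feature fusion concrete, the following is a minimal sketch of the fast normalized fusion rule commonly used in weighted bidirectional feature pyramid networks. The abstract does not state which fusion rule SF-YOLO uses, so this is an assumption for illustration only. The function name `fast_normalized_fusion`, the `eps` value, and the representation of feature maps as nested lists are all hypothetical. In a real network the weights would be learnable parameters and the feature maps would be tensors.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped 2D feature maps using non-negative, normalized weights.

    Illustrative sketch only: `features` is a list of 2D lists with identical
    shapes, and `weights` holds one scalar per feature map. Neither the names
    nor the data layout come from the paper.
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature map")

    # Clamp weights at zero (a ReLU), so no input contributes negatively.
    clamped = [max(0.0, w) for w in weights]

    # Normalize by the sum of the weights; eps avoids division by zero.
    total = sum(clamped) + eps
    normalized = [w / total for w in clamped]

    rows = len(features[0])
    cols = len(features[0][0])

    # Weighted sum of the input maps, element by element.
    return [
        [
            sum(n * f[r][c] for n, f in zip(normalized, features))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


if __name__ == "__main__":
    shallow = [[1.0, 2.0]]  # hypothetical high-resolution feature map
    deep = [[3.0, 4.0]]     # hypothetical upsampled low-resolution feature map
    print(fast_normalized_fusion([shallow, deep], [1.0, 1.0]))
```

Because the weights are normalized by a simple sum rather than a softmax, the fusion involves no exponentials, which is one reason this style of fusion is considered inexpensive at inference time.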