Abstract

Automating strawberry harvesting is challenging because the fruit is small, the growing environment is complex, and leaves and other objects frequently occlude the fruit. Existing vision systems for agricultural robots often struggle to accurately detect strawberry positions and key picking points under these conditions, which limits their effectiveness in real-world applications. To address these issues, this study proposes an improved vision model, YOLOv11-SKP, tailored for precise strawberry localization and key point detection in greenhouse environments. The model integrates a bidirectional feature pyramid network (BiFPN) for robust multi-scale feature fusion, an SPPF-LSKA attention module to enhance the perception of fine details and contextual information, and a novel LADH_pose prediction head that improves key point detection accuracy. Experiments on field-collected datasets show that YOLOv11-SKP outperforms the baseline YOLOv11s-Pose, achieving a 3.6% increase in precision and a 3.2% gain in recall for key point detection while maintaining an inference speed of 166 FPS. These results make the model well suited for deployment in real-time strawberry picking robots, with the potential to improve harvesting efficiency, reduce labor costs, and accelerate the adoption of intelligent agricultural systems.
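
For readers unfamiliar with BiFPN-style fusion, the following is a minimal sketch of the "fast normalized fusion" rule from the original BiFPN formulation, in which each input feature map receives a non-negative weight that is normalized by the sum of all weights. It is illustrative only: the abstract does not state which fusion variant YOLOv11-SKP uses, the function name `fast_normalized_fusion` is hypothetical, the features are assumed to be already resized to a common shape and flattened to plain lists, and the weights (learned during training in a real network) are fixed here.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-length feature vectors with non-negative, normalized weights.

    Hypothetical helper for illustration; not the authors' implementation.
    features: list of equal-length lists of floats (already resized to one scale)
    weights:  one scalar per feature (learned in a real network, fixed here)
    """
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per feature map")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all feature maps must have the same length")

    # ReLU keeps every weight non-negative, so the normalized weights lie in [0, 1].
    w = [max(0.0, x) for x in weights]
    # eps avoids division by zero when all weights are clipped to zero.
    total = sum(w) + eps

    # Weighted sum of the inputs, element by element, divided by the weight total.
    return [
        sum(wi * f[k] for wi, f in zip(w, features)) / total
        for k in range(length)
    ]


if __name__ == "__main__":
    top_down = [1.0, 2.0]   # assumed: feature from the coarser level, upsampled
    lateral = [3.0, 4.0]    # assumed: feature from the current level
    print(fast_normalized_fusion([top_down, lateral], [1.0, 1.0]))
```

With equal weights the result is close to the element-wise mean of the inputs. A weight that becomes negative is clipped to zero, which removes that input from the fusion.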