The rise of the low-altitude economy underscores the critical need for autonomous perception in Unmanned Aerial Vehicles (UAVs), particularly in complex environments such as urban ports. However, existing object detection models often struggle with the land–sea mixed scenes, extreme scale variations, and dense object distributions characteristic of a UAV's aerial perspective. To address these challenges, we propose AUP-DETR, a novel end-to-end object detection framework for UAVs. Built upon an efficient DETR architecture, the framework features an innovative Fusion with Streamlined Hybrid Core (Fusion-SHC) module, which fuses low-level spatial details with high-level semantics to strengthen the representation of small aerial objects. In addition, a Synergistic Spatial Context Fusion (SSCF) module adaptively integrates multi-scale features into rich, unified representations for the detection head, while the proposed Spatial Agent Transformer (SAT) efficiently models global context and long-range dependencies to distinguish heterogeneous objects in complex scenes. To advance research in this area, we construct the Urban Coastal Aerial Detection (UCA-Det) dataset, designed specifically for urban port environments. Extensive experiments on UCA-Det show that AUP-DETR outperforms the YOLO series and other advanced DETR-based models, achieving an mAP50 of 69.68%, a 4.41% improvement over the baseline. Further experiments on the public VisDrone dataset confirm its strong generalization capability and efficiency. This work delivers a robust solution and a new dataset for precise UAV perception in low-altitude economy scenarios.