Passive acoustic monitoring (PAM) is emerging as a valuable tool for assessing fish populations in natural habitats. This study compares two deep learning–based frameworks for detecting, classifying, and counting the sounds produced by soniferous fish in the Tagus Estuary, Portugal: (1) a multi-label segmentation-based classification system (SegClas) combining convolutional neural networks and long short-term memory networks, and (2) an object detection approach (ObjDet) using a You Only Look Once (YOLO)-based model. The target species are the Lusitanian toadfish (Halobatrachus didactylus), the meagre (Argyrosomus regius), and the weakfish (Cynoscion regalis), whose overlapping vocalization patterns pose classification challenges. Results show that both methods achieve high accuracy (over 96%) and F1 scores above 87% for species-level sound identification, demonstrating their effectiveness under varied noise conditions. ObjDet generally offers slightly higher classification performance (F1 scores up to 92%) and annotates each individual vocalization, enabling more precise counting; however, it requires bounding-box annotations and incurs a higher computational cost (inference time of ca. 1.95 s per hour of recording). In contrast, SegClas relies only on segment-level labels and provides faster inference (ca. 1.46 s per hour). We also compare the two counting strategies, each of which offers distinct advantages for different ecological and operational needs. Our results highlight the potential of deep learning–based PAM for fish population assessment.
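To make the SegClas architecture concrete, the sketch below shows one plausible multi-label CNN + LSTM segment classifier in PyTorch. It is not the authors' model: the log-mel spectrogram input, layer sizes, and hidden dimensions are assumptions; only the three-species multi-label output follows from the abstract.

```python
# Minimal sketch of a multi-label CNN + LSTM segment classifier (SegClas-style).
# Assumptions: log-mel spectrogram input of shape (batch, 1, n_mels, n_frames)
# and three target species; all layer sizes are illustrative, not the authors'.
import torch
import torch.nn as nn

class SegClasSketch(nn.Module):
    def __init__(self, n_mels: int = 64, n_species: int = 3):
        super().__init__()
        # CNN front end: extracts local spectro-temporal features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 32 * (n_mels // 4)  # channels x pooled mel bins
        # Bidirectional LSTM models longer-range temporal structure across frames.
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True, bidirectional=True)
        # Multi-label head: one logit per species, so temporally overlapping
        # vocalizations of different species can be flagged simultaneously.
        self.head = nn.Linear(2 * 64, n_species)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames)
        z = self.cnn(x)                       # (batch, 32, n_mels/4, n_frames/4)
        z = z.permute(0, 3, 1, 2).flatten(2)  # (batch, time, feat_dim)
        out, _ = self.lstm(z)
        return self.head(out[:, -1])          # logits per species for the segment

model = SegClasSketch()
segment = torch.randn(2, 1, 64, 128)          # two dummy spectrogram segments
probs = torch.sigmoid(model(segment))         # per-species presence probabilities
```

Training such a model would use a binary cross-entropy loss per species (e.g. nn.BCEWithLogitsLoss), with a per-class threshold on the sigmoid outputs at inference; counting then aggregates positive segments, whereas the ObjDet approach counts individual detected bounding boxes directly.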