Efficient behavior is supported by humans’ ability to rapidly recognize acoustically distinct sounds as members of a common category. Within the auditory cortex, critical unanswered questions remain regarding the organization and dynamics of sound categorization. We performed intracerebral recordings during epilepsy surgery evaluation as 20 patient-participants listened to natural sounds. We then built encoding models to predict neural responses using sound representations extracted from different layers within a deep neural network (DNN) pretrained to categorize sounds from acoustics. This approach yielded accurate models of neural responses throughout the auditory cortex. The complexity of a cortical site’s representation (measured by the depth of the DNN layer that produced the best-performing model) was closely related to its anatomical location, with shallow, middle, and deep layers associated with core (primary auditory cortex), lateral belt, and parabelt regions, respectively. Smoothly varying gradients of representational complexity existed within these regions, with complexity increasing along a posteromedial-to-anterolateral direction in core and lateral belt and along posterior-to-anterior and dorsal-to-ventral dimensions in parabelt. We then characterized the time (relative to sound onset) at which feature representations emerged; this measure of temporal dynamics increased across the auditory hierarchy. Finally, we found separable effects of region and temporal dynamics on representational complexity: sites that took longer to begin encoding stimulus features had higher representational complexity independent of region, and downstream regions encoded more complex features independent of temporal dynamics. These findings suggest that hierarchies of timescales and representational complexity constitute a functional organizing principle of the auditory processing stream, underlying our ability to rapidly categorize sounds.
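
As a concrete illustration of the layer-wise encoding approach, the sketch below fits one cross-validated ridge encoding model per DNN layer for each recording site and takes the depth of the best-predicting layer as that site’s representational complexity. This is a minimal sketch on random placeholder data: the array sizes, the choice of ridge regression, and the time-averaged features are all assumptions, not the study’s actual pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

n_sounds, n_sites, n_layers = 165, 8, 6      # hypothetical sizes
rng = np.random.default_rng(0)

# layer_feats[l]: (n_sounds, n_units_l) activations from DNN layer l,
# e.g. time-averaged unit responses to each natural sound (placeholder data).
layer_feats = [rng.standard_normal((n_sounds, 32 * (l + 1))) for l in range(n_layers)]
# responses: (n_sounds, n_sites) neural response magnitude at each cortical site.
responses = rng.standard_normal((n_sounds, n_sites))

def encoding_accuracy(X, y):
    """Cross-validated prediction accuracy (Pearson r) of a ridge encoding model."""
    model = RidgeCV(alphas=np.logspace(-2, 4, 13))
    y_hat = cross_val_predict(model, X, y, cv=5)
    return np.corrcoef(y_hat, y)[0, 1]

# Representational complexity of a site = depth of the layer whose features
# best predict that site's responses.
complexity = np.empty(n_sites, dtype=int)
for s in range(n_sites):
    accs = [encoding_accuracy(X, responses[:, s]) for X in layer_feats]
    complexity[s] = int(np.argmax(accs))
```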
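The temporal-dynamics measure can be sketched as an onset-detection step: compute time-resolved encoding accuracy in short windows aligned to sound onset, then take the first time it reliably exceeds a threshold. The window spacing, threshold, and consecutive-window criterion below are hypothetical choices for illustration, not the study’s parameters.

```python
import numpy as np

def encoding_onset(acc, times, threshold=0.1, n_consecutive=3):
    """First time (s, relative to sound onset) at which time-resolved
    encoding accuracy stays above `threshold` for `n_consecutive` windows."""
    above = acc > threshold
    for i in range(len(above) - n_consecutive + 1):
        if above[i:i + n_consecutive].all():
            return times[i]
    return np.nan  # this site never reliably encodes the features

# Toy example: accuracy evaluated in 20 ms steps from -0.1 to 0.5 s,
# rising above chance roughly 120 ms after sound onset.
times = np.arange(-0.1, 0.5, 0.02)
acc = np.where(times > 0.12, 0.3, 0.0) + np.random.default_rng(0).normal(0, 0.02, times.size)
print(f"encoding onset ~ {encoding_onset(acc, times):.2f} s")
```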
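Finally, the claim of separable effects can be illustrated with a linear model that includes both predictors at once; below is a minimal sketch on simulated data, assuming an ordinary least-squares formulation (the study’s actual statistical test is not specified here, and all effect sizes are invented).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_sites = 300
df = pd.DataFrame({
    "onset": rng.uniform(0.05, 0.40, n_sites),                      # encoding onset (s)
    "region": rng.choice(["core", "lateral_belt", "parabelt"], n_sites),
})
# Simulate complexity with independent contributions of onset latency and
# region, mirroring the separable effects described above.
region_effect = df["region"].map({"core": 0.0, "lateral_belt": 3.0, "parabelt": 6.0})
df["complexity"] = 10 * df["onset"] + region_effect + rng.normal(0, 1, n_sites)

# If both the onset term and the region terms remain significant when fit
# jointly, their effects on complexity are separable.
fit = smf.ols("complexity ~ onset + C(region)", data=df).fit()
print(fit.summary())
```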