Depression is one of the world's most prevalent mental illnesses, yet it is difficult to detect because it affects different people in different ways. Recently, linguistic features extracted from transcribed text have been widely explored for depression detection because they carry a variety of cues about psychological activity. However, detection performance remains limited for two reasons: 1) the dialogue structure is ignored, which causes the Inconsistent Context problem; and 2) Imbalanced Regression arises from the long-tailed label distribution of depression datasets. To address these issues, we investigate the relationship between local topics and global context in interview transcripts, and bridge the gap between depression symptoms and depression severity. In particular, we propose the Conditional Variational Topic-enriched Auto-Encoder (CVTAE), which captures spatial features from local topics via variational inference and temporal features from the global context with an attention mechanism. In addition, we apply re-weighting strategies that assign different weights to depression labels according to their values. Extensive experiments on the English DAIC-WOZ dataset and on NCUDID, a self-constructed Chinese database, demonstrate the effectiveness and robustness of CVTAE, while a comprehensive ablation study and a case study show its interpretability.
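To make the re-weighting idea concrete, the following is a minimal sketch of one common way to weight labels in an imbalanced regression setting. The abstract does not state which re-weighting scheme is used, so everything here is an assumption for illustration: the inverse-frequency rule, the `power` parameter, and the function names `inverse_frequency_weights` and `weighted_mse` are hypothetical and not taken from CVTAE.

```python
from collections import Counter


def inverse_frequency_weights(labels, power=1.0):
    """Return one weight per sample, proportional to 1 / count(label) ** power.

    Assumption: labels are discrete severity scores, so exact-value counts
    are meaningful. power=1.0 gives plain inverse frequency; power=0.5 gives
    the milder square-root variant. Weights are rescaled to average 1.
    """
    counts = Counter(labels)
    raw = [1.0 / (counts[y] ** power) for y in labels]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def weighted_mse(predictions, targets, weights):
    """Mean squared error in which each sample's error is scaled by its weight."""
    total = sum(w * (p - t) ** 2 for p, t, w in zip(predictions, targets, weights))
    return total / sum(weights)


if __name__ == "__main__":
    # Toy long-tailed label set: four low scores and one rare high score.
    scores = [0, 0, 0, 0, 20]
    weights = inverse_frequency_weights(scores)
    preds = [1, 1, 1, 1, 10]
    print(weights)
    print(weighted_mse(preds, scores, weights))
```

Under this rule a rare high-severity label contributes more to the loss than a frequent low-severity one, so errors on the tail of the distribution are not drowned out by the majority of low scores.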