Echocardiography video segmentation is critical for cardiovascular disease diagnosis, yet it still suffers from the challenge of dual-level bias: frame-level bias in the temporal dimension and object-level bias in the spatial dimension of echocardiography videos. In this paper, we propose a spatial-temporal consistency (STC) model based on semi-supervised learning for echocardiography video segmentation. STC aligns and fuses inter-frame and inter-object context-aware feature representations. First, STC introduces a temporal context-aware module that focuses on motion differences between frames; this module captures temporal correlations via inter-frame attention to fuse salient temporal semantic information. Second, STC proposes a multi-object semantic adaptation (MSA) module that not only adaptively calibrates frame-level and object-level features but also fuses these features at different layers. Finally, STC imposes a spatial-temporal consistency constraint to reduce prediction discrepancy among multiple MSA modules, thereby achieving low-entropy predictions. Extensive experiments demonstrate that STC achieves state-of-the-art performance on echocardiography video segmentation.
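To make the inter-frame attention concrete, the following is a minimal PyTorch sketch, assuming per-frame CNN features of shape (B, C, H, W); the single-head design and the module name `InterFrameAttention` are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class InterFrameAttention(nn.Module):
    """Fuse temporal context from a reference frame into the current frame."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)  # query from current frame
        self.k = nn.Conv2d(channels, channels, 1)  # key from reference frame
        self.v = nn.Conv2d(channels, channels, 1)  # value from reference frame
        self.scale = channels ** -0.5

    def forward(self, cur: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        B, C, H, W = cur.shape
        q = self.q(cur).flatten(2).transpose(1, 2)        # (B, HW, C)
        k = self.k(ref).flatten(2)                        # (B, C, HW)
        v = self.v(ref).flatten(2).transpose(1, 2)        # (B, HW, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return cur + out  # residual fusion of temporal semantic information

# usage: attend from current-frame features to previous-frame features
if __name__ == "__main__":
    cur = torch.randn(2, 64, 28, 28)
    prev = torch.randn(2, 64, 28, 28)
    fused = InterFrameAttention(64)(cur, prev)
    print(fused.shape)  # torch.Size([2, 64, 28, 28])
```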
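The adaptive calibration and fusion of frame-level and object-level features in the MSA module could, for instance, be realized with a channel-wise gate. The squeeze-and-excite-style gating below is an assumption for illustration, not the paper's design; it assumes the two feature maps share the same shape.

```python
import torch
import torch.nn as nn

class CalibratedFusion(nn.Module):
    """Gate-weighted fusion of a frame-level and an object-level feature map."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # gate computed from both streams via global pooling (an assumption)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, frame_feat: torch.Tensor, obj_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([frame_feat, obj_feat], dim=1))  # (B, C, 1, 1)
        # per-channel calibration: g decides each stream's contribution
        return g * frame_feat + (1.0 - g) * obj_feat
```

One such fusion block could be attached at each decoder layer, matching the abstract's statement that features are fused at different layers.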
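The spatial-temporal consistency constraint can be sketched as a discrepancy penalty among the predictions of several MSA-equipped stages, plus an entropy term that encourages low-entropy output. The specific loss forms and the weighting `ent_weight` here are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_list, ent_weight: float = 0.1) -> torch.Tensor:
    """Penalize disagreement among stage predictions and high-entropy output."""
    probs = [F.softmax(l, dim=1) for l in logits_list]  # each (B, K, H, W)
    mean_p = torch.stack(probs).mean(dim=0)
    # mean squared discrepancy of each prediction from the ensemble mean
    cons = sum(F.mse_loss(p, mean_p) for p in probs) / len(probs)
    # entropy of the mean prediction, minimized for confident segmentation
    ent = -(mean_p * torch.log(mean_p + 1e-8)).sum(dim=1).mean()
    return cons + ent_weight * ent

# usage with three stage predictions for a 4-class segmentation task
if __name__ == "__main__":
    preds = [torch.randn(2, 4, 112, 112) for _ in range(3)]
    print(consistency_loss(preds).item())
```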