Medical image segmentation plays a crucial role in healthcare applications, enabling accurate diagnosis, treatment planning, and disease monitoring. Convolutional Neural Networks (CNNs) have demonstrated strong performance in this domain due to their ability to learn complex patterns from raw data. In recent years, Vision Transformers (ViTs) have gained significant attention as an effective approach to a range of image analysis problems. However, ViTs lack the image-specific inductive biases built into CNNs, such as translation invariance, which can limit their performance. To address this, Hybrid Vision Transformers (HVTs) have been introduced, combining CNN and Transformer layers to analyze features at both local and global scales. Motivated by the success of ViTs and HVTs, this paper reviews recent advancements in these architectures for medical image segmentation. We classify approaches by architectural design and review state-of-the-art models for different imaging modalities, analyzing their limitations and potential solutions. Additionally, we highlight key challenges, discuss current trends, and propose future research directions in the field. This review aims to provide valuable insights for researchers and professionals working on ViT-based medical image segmentation.