Background: Commercially available large language models (LLMs) have demonstrated impressive capabilities in processing vast datasets and generating coherent narratives. However, their lack of domain-specific knowledge limits their reliability in clinical applications. This study aimed to develop and evaluate BariatricSurgeryGPT, a fine-tuned LLM tailored specifically to bariatric surgery, to provide more accurate and clinically relevant responses to bariatric surgery-related questions.

Methods: We obtained 8764 bariatric surgery research abstracts published between January 1, 2020, and January 1, 2024, from PubMed. These abstracts were preprocessed and tokenized to fine-tune a pre-trained GPT-2 model using the PyTorch and HuggingFace frameworks. Model performance was evaluated using BLEU, METEOR, and ROUGE-1 scores on 20 clinically relevant bariatric surgery questions, each tested across nine temperature settings (0.1–0.9) for both the fine-tuned and baseline GPT-2 models, yielding 360 evaluation instances in total (illustrative sketches of the fine-tuning and evaluation steps follow the abstract).

Results: BariatricSurgeryGPT demonstrated consistent improvements over the baseline GPT-2 model across all metrics. The fine-tuned model achieved a BLEU score of 0.165 (vs 0.147 for baseline, a 12.8% improvement), a METEOR score of 0.633 (vs 0.585, an 8.2% improvement), and a ROUGE-1 score of 0.267 (vs 0.243, a 9.7% improvement). These gains indicate enhanced precision, recall, and semantic relevance in generating bariatric surgery-specific content.

Conclusion: BariatricSurgeryGPT represents the first domain-specific LLM for bariatric surgery and demonstrates the feasibility of developing specialty-specific AI tools with improved accuracy for clinical applications. Such specialty-specific models could enhance surgical education through interactive learning tools, improve patient communication via personalized educational materials, and support clinical decision-making by providing evidence-based information synthesis.
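
The abstract does not report the exact preprocessing pipeline or training hyperparameters, so the following is only a minimal sketch of how a pre-trained GPT-2 model can be fine-tuned on a corpus of abstracts with the HuggingFace Transformers and Datasets libraries. The file name abstracts.txt, the 512-token truncation length, the output directory bariatric-gpt2, and all hyperparameters are illustrative assumptions, not values taken from the study.

```python
# Minimal sketch of causal-LM fine-tuning with HuggingFace Transformers.
# Assumptions (not from the paper): abstracts stored one per line in
# "abstracts.txt"; all hyperparameters below are illustrative defaults.
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 ships without a pad token
model = GPT2LMHeadModel.from_pretrained("gpt2")

dataset = load_dataset("text", data_files={"train": "abstracts.txt"})

def tokenize(batch):
    # Truncate each abstract to 512 tokens (assumed limit, not from the paper).
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False selects the causal language-modeling objective used by GPT-2.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="bariatric-gpt2",
    num_train_epochs=3,                # illustrative
    per_device_train_batch_size=4,     # illustrative
    learning_rate=5e-5,                # illustrative
    save_strategy="epoch",
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
trainer.save_model("bariatric-gpt2")
tokenizer.save_pretrained("bariatric-gpt2")
```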
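
The paper likewise does not name the scoring libraries, so this sketch computes BLEU and METEOR with NLTK and ROUGE-1 with the rouge-score package, and sweeps the nine temperature settings via HuggingFace generate. The example question, the reference answer, and the checkpoint directory bariatric-gpt2 carried over from the sketch above are hypothetical placeholders.

```python
# Minimal sketch of the scoring and temperature sweep; the libraries and the
# reference answer below are assumptions, not details reported in the paper.
import nltk
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

nltk.download("wordnet", quiet=True)    # METEOR needs WordNet
nltk.download("omw-1.4", quiet=True)

tokenizer = GPT2TokenizerFast.from_pretrained("bariatric-gpt2")  # fine-tuned checkpoint
model = GPT2LMHeadModel.from_pretrained("bariatric-gpt2")

def score_answer(reference: str, candidate: str) -> dict:
    """Return BLEU, METEOR, and ROUGE-1 F1 for one generated answer."""
    ref_tok, cand_tok = reference.split(), candidate.split()
    bleu = sentence_bleu([ref_tok], cand_tok,
                         smoothing_function=SmoothingFunction().method1)
    meteor = meteor_score([ref_tok], cand_tok)
    rouge1 = rouge_scorer.RougeScorer(["rouge1"]).score(
        reference, candidate)["rouge1"].fmeasure
    return {"BLEU": bleu, "METEOR": meteor, "ROUGE-1": rouge1}

question = "What are the indications for sleeve gastrectomy?"   # illustrative question
reference = "Sleeve gastrectomy is indicated for patients with severe obesity."  # hypothetical reference

inputs = tokenizer(question, return_tensors="pt")
for t in [round(0.1 * i, 1) for i in range(1, 10)]:              # temperatures 0.1-0.9
    output = model.generate(**inputs, do_sample=True, temperature=t,
                            max_new_tokens=128,
                            pad_token_id=tokenizer.eos_token_id)
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"T={t}:", score_answer(reference, answer))
```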