ABSTRACT

Aim: To evaluate the capability of large language models to generate nursing diagnoses based on NANDA‐I Taxonomy II and to assess their performance across domains and overall.

Background: Large language models are emerging tools in nursing, showing potential to aid in diagnosis generation and education. However, their accuracy and applicability in clinical and educational settings remain underexplored.

Methods: This cross‐sectional comparative study used 10 realistic patient scenarios based on NANDA‐I Taxonomy II, covering 12 domains. Four large language models were prompted to generate nursing diagnoses for each scenario. Five nursing experts assessed the responses for accuracy and alignment with NANDA‐I Taxonomy II in a single‐blind evaluation process.

Results: The four models performed similarly across domains and overall, with Claude attaining the highest overall performance score. Expert evaluations indicated moderate interrater reliability.

Discussion: Small variations between models and occasional omissions suggest that expert review is still required before clinical use.

Conclusions: Large language models are not yet sufficiently reliable for independent use in clinical settings and nursing education. Their application as supportive tools should be approached cautiously. Moreover, the development of specialized models designed to address the unique demands of the nursing field would be advantageous.

Implications for nursing: When large language models are used in nursing practice, their limitations should be considered, and the outputs they produce should be verified by nurses.

Implications for nursing policy: Ensuring the safe integration of artificial intelligence tools into nursing requires robust regulatory policies to safeguard patient safety, effective systems to monitor model performance, and comprehensive guidelines and training programs.