As businesses increasingly rely on large language models (LLMs) for tasks such as customer service and information retrieval, ensuring the accuracy of their responses is a critical challenge. Traditional verification is costly, slow, and often requires scarce domain experts. We introduce the automated quality evaluation based on textual responses (AQER) framework, a novel, cost-effective method for assessing the correctness of free-text answers from both LLMs and human workers without requiring preexisting correct answers. AQER works by aggregating multiple responses to the same question, leveraging the wisdom of the crowd to construct a reliable synthetic correct answer, and then applying an iterative procedure that accounts for response quality cues. AQER achieves state-of-the-art performance compared with existing automated response evaluation baselines. For managers, AQER offers a scalable, data-driven method to (i) evaluate and select the best-performing LLMs for specific organizational needs and use cases, (ii) continuously monitor artificial intelligence (AI) performance to ensure reliability and accountability across different model versions, and (iii) manage the quality of the crowd workers essential for high-quality AI training and validation. AQER thus provides a robust mechanism for improving model performance and mitigating the significant financial and reputational risks associated with deploying untrustworthy generative AI technologies.
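
The abstract describes AQER only at a high level; as a rough illustration of the kind of procedure it outlines (building a synthetic consensus answer from multiple free-text responses, then iteratively reweighting responders by quality cues), the following Python sketch uses a bag-of-words cosine similarity and an agreement-based weighting loop. The function names, the similarity measure, and the update rule are all illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch only: the abstract does not specify AQER's exact algorithm.
# This toy example assumes (i) a pairwise text-similarity function and (ii) an
# iterative, agreement-based reweighting loop in the spirit of wisdom-of-the-crowd
# truth discovery. All names and update rules are hypothetical.

from collections import Counter
import math


def similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (stand-in for a semantic metric)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def aggregate(responses: list[str], n_iter: int = 10) -> tuple[str, list[float]]:
    """Pick a consensus answer and responder weights via iterative agreement scoring.

    1. Start with uniform weights over responders.
    2. Score each response by its weighted similarity to the other responses;
       treat the top-scoring response as the synthetic "correct" answer.
    3. Reweight each responder by its similarity to that synthetic answer
       (a crude quality cue), then repeat.
    """
    n = len(responses)
    weights = [1.0 / n] * n
    consensus = responses[0]
    for _ in range(n_iter):
        # Step 2: weighted agreement of each response with the rest of the pool.
        scores = [
            sum(weights[j] * similarity(responses[i], responses[j])
                for j in range(n) if j != i)
            for i in range(n)
        ]
        consensus = responses[max(range(n), key=lambda i: scores[i])]
        # Step 3: reweight responders by closeness to the current consensus.
        raw = [similarity(r, consensus) + 1e-9 for r in responses]
        total = sum(raw)
        weights = [w / total for w in raw]
    return consensus, weights


if __name__ == "__main__":
    answers = [
        "The capital of Australia is Canberra.",
        "Canberra is the capital city of Australia.",
        "The capital of Australia is Sydney.",
    ]
    ref, w = aggregate(answers)
    print("Synthetic reference answer:", ref)
    print("Responder weights:", [round(x, 3) for x in w])
```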