Computer science
Code generation
Coding (set theory)
Programming language
Natural language processing
Artificial intelligence
Computer security
Set (abstract data type)
Key (lock)
Authors
Mihai Nadăş, Laura Dioşan, Andreea Tomescu
Source
Journal: IEEE Access
[Institute of Electrical and Electronics Engineers]
Date: 2025-01-01
Volume/pages: 13: 134615-134633
Citations: 8
Identifiers
DOI: 10.1109/access.2025.3589503
Abstract
This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly in scenarios where labeled data is scarce, expensive, or sensitive. This paper surveys recent advances in leveraging LLMs to create synthetic text and code, highlighting key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We examine how these methods can enrich low-resource tasks (e.g., classification, question answering) and facilitate code-centric applications (e.g., instruction tuning, code translation, bug repair) through automated verification of functional correctness. Alongside potential benefits - cost-effectiveness, broad coverage, and controllable diversity - we discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification. Proposed mitigation strategies range from filtering and weighting synthetic outputs to reinforcement learning with execution feedback in code domains. We conclude by outlining open research directions, such as automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, underscoring the growing importance of LLM-generated synthetic data in accelerating AI development while emphasizing ethical and quality safeguards.
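One concrete mechanism the abstract mentions for code-centric synthetic data is filtering generated samples by automated verification of functional correctness: a candidate solution is kept only if it passes its accompanying unit test. The sketch below illustrates that idea in miniature; it is not taken from the paper. The LLM call is stubbed out (`CANDIDATES` stands in for prompt-based model outputs), and all names are illustrative. A production pipeline would also sandbox execution rather than use bare `exec`.

```python
def passes_tests(code: str, test: str) -> bool:
    """Run a candidate solution against its unit test in a shared namespace;
    keep the sample only if the test's assertions hold."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # define the candidate function
        exec(test, namespace)  # run the assertions against it
        return True
    except Exception:
        return False


# Stand-ins for LLM-generated (solution, test) pairs. In practice these
# would come from prompt-based generation over task descriptions.
CANDIDATES = [
    # functionally correct sample: kept
    ("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"),
    # buggy sample: filtered out by execution feedback
    ("def add(a, b):\n    return a - b", "assert add(2, 3) == 5"),
]


def filter_synthetic(candidates):
    """Retain only candidates whose tests pass (execution-based filtering)."""
    return [(c, t) for c, t in candidates if passes_tests(c, t)]


if __name__ == "__main__":
    kept = filter_synthetic(CANDIDATES)
    print(f"kept {len(kept)} of {len(CANDIDATES)} samples")
```

The same pass/fail signal can serve as a reward in the reinforcement-learning-with-execution-feedback setups the survey discusses, rather than only as a filter.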