AEON: a method for automatic evaluation of NLP test cases

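Keywords: Computer science, Artificial intelligence, Natural language processing, Test (biology), Biology, Paleontology
Authors
Jen-tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su, Michael R. Lyu
Identifier
DOI:10.1145/3533767.3534394
Abstract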

Because constructing test oracles manually is labor-intensive, various automated testing techniques have been proposed to enhance the reliability of Natural Language Processing (NLP) software. In theory, these techniques mutate an existing test case (e.g., a sentence with its label) and assume that the generated one preserves an equivalent or similar semantic meaning and, thus, the same label. In practice, however, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., they contain grammar errors), which leads to a high false alarm rate. Our evaluation study finds that 44% of the test cases generated by state-of-the-art (SOTA) approaches are false alarms. These test cases require extensive manual checking, and instead of improving NLP software, they can even degrade it when used in model training. To address this problem, we propose AEON for Automatic Evaluation Of NLP test cases. For each generated test case, it outputs scores based on semantic similarity and language naturalness. We employ AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks. The results show that AEON aligns best with human judgment. In particular, AEON achieves the best average precision in detecting semantically inconsistent test cases, outperforming the best baseline metric by 10%. AEON also has the highest average precision in finding unnatural test cases, surpassing the baselines by more than 15%. Moreover, model training with test cases prioritized by AEON leads to models that are more accurate and robust, demonstrating AEON's potential for improving NLP software.
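To make the reported metric concrete, the following is a minimal sketch of how average precision could be computed when a per-test-case score is used to flag semantically inconsistent test cases. It is not taken from the AEON implementation: the function name `average_precision`, the toy scores and labels, and the convention that a lower score means "more likely inconsistent" are all assumptions made for illustration.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], is_inconsistent: Sequence[bool]) -> float:
    """Average precision for detecting inconsistent test cases.

    Assumption (illustrative only): a LOWER score means the test case is more
    likely to be inconsistent, so cases are ranked by ascending score.
    """
    if len(scores) != len(is_inconsistent):
        raise ValueError("scores and labels must have the same length")

    total_positives = sum(is_inconsistent)
    if total_positives == 0:
        return 0.0  # no inconsistent cases to detect

    # Rank test cases from most to least suspicious (lowest score first).
    order = sorted(range(len(scores)), key=lambda i: scores[i])

    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if is_inconsistent[idx]:
            hits += 1
            # Precision at the rank of each true inconsistent case.
            precision_sum += hits / rank
    return precision_sum / total_positives


# Hypothetical similarity scores and human judgments for five generated test cases.
toy_scores = [0.91, 0.15, 0.78, 0.32, 0.55]
toy_labels = [False, True, False, True, False]
ap = average_precision(toy_scores, toy_labels)
```

A metric that ranks every human-flagged inconsistent case ahead of all consistent ones reaches an average precision of 1; the value drops as consistent cases are ranked among the suspicious ones.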
