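Counterfactual thinking
Benchmark (surveying)
Set (abstract data type)
Computer science
Data set
Baseline (sea)
Artificial intelligence
Machine learning
Experimental setup
Test (biology)
Experimental data
Data mining
Psychology
Mathematics
Statistics
Programming language
Geography
Oceanography
Paleontology
Geology
Biology
Social psychology
Geodesy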
Authors
Jörg Frohberg, Frank Binder
Source
Journal: Cornell University - arXiv
Date: 2021-12-22
Citations: 9
Identifier
DOI: 10.48550/arxiv.2112.11941
Abstract
We introduce the CRASS (counterfactual reasoning assessment) data set and benchmark utilizing questionized counterfactual conditionals as a novel and powerful tool to evaluate large language models. We present the data set design and benchmark that supports scoring against a crowd-validated human baseline. We test six state-of-the-art models against our benchmark. Our results show that it poses a valid challenge for these models and opens up considerable room for their improvement.