HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

Tags: Coding (social science), Visual reasoning, Computer science, Predictive coding, Cognitive psychology, Cognitive science, Psychology, Artificial intelligence, Mathematics, Statistics
Authors
Fengji Zhang, Lisa Y. Wu, Hui-Yu Bai, Guancheng Lin, Xiao Li, Xiao Yu, Yue Wang, Bei Chen, Jacky Keung
Source
Venue: arXiv (Cornell University)
Identifier
DOI: 10.48550/arXiv.2410.12381
Abstract

Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs -- core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains a notable lack of coding benchmarks that rigorously assess these models, particularly in tasks that emphasize visual reasoning. To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs' visual understanding and reasoning capabilities through code generation. HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problems, with visual elements redrawn to ensure distinction from the source, preventing potential data leakage. LMMs are required to complete the code solution based on the provided visual context and a predefined Python function signature outlining the task requirements. Every task is equipped with meticulously handcrafted test cases to ensure a thorough and reliable evaluation of model-generated solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1. Ablation studies further reveal the limitations of current LMMs in vision reasoning and coding capabilities. These results underscore key areas for future research to enhance LMMs' capabilities. We have open-sourced our code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
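
Each task, per the abstract, pairs an image with a predefined Python function signature that the model must complete; the generated code is then executed against handcrafted test cases. As a hypothetical sketch of that prompt format (the function name, signature, and docstring below are invented for illustration and are not taken from the benchmark):

```python
from typing import List

def count_enclosed_regions(grid: List[List[int]]) -> int:
    """Count the regions fully enclosed by walls in the maze shown in
    the accompanying image (1 = wall, 0 = open cell)."""
    ...  # the model must infer the algorithm from the image and complete this body
```

The reported pass@1 and pass@10 scores of this kind are commonly computed with the unbiased pass@k estimator introduced alongside the original HumanEval benchmark (Chen et al., 2021). A minimal sketch, assuming HumanEval-V follows that convention (the example numbers are illustrative, not from the paper):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: the probability that at least one of k samples,
    drawn without replacement from n generations of which c pass all
    test cases, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative only: 20 generations for one task, 3 of them passing -> pass@10
print(round(pass_at_k(20, 3, 10), 3))  # 0.895
```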