Boosting Multi-Modal Large Language Model With Enhanced Visual Features

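Keywords: Computer science; Boosting (machine learning); Leverage (statistics); Artificial intelligence; Feature extraction; Machine learning; Modality; Language model; Visualization; Feature (linguistics); Visual language; Key (lock); Language understanding; Natural language processing; Semantics (computer science); Human-computer interaction; Visual reasoning; Deep learning; Computational model; Multimodal learning; Exploit; Task analysis; Visual perception
Authors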
Yiwei Ma,Weihuang Lin,Zhibin Wang,Jiayi Ji,Xiaoshuai Sun,Chia-Wen Lin,Rongrong Ji
Source
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence [IEEE Computer Society]
Volume (issue): 48 (4): 4524-4538
Identifier
DOI:10.1109/tpami.2025.3644851
Abstract

Recent advancements in computer vision (CV) and large language models (LLMs) have spurred significant interest in multi-modal large language models (MLLMs), which aim to integrate visual and textual modalities for enhanced understanding and generation tasks. While much of the existing research focuses on optimizing projectors and LLMs to improve MLLM performance, a critical question remains underexplored: Has the full potential of visual features in MLLMs been realized? To address this question, we identify two key limitations in current MLLM architectures and propose vMLLM, a vision-enhanced MLLM designed to fully leverage the capabilities of visual features. vMLLM introduces two novel components: the Multi-level Aggregation Module (MAM) and the Intra- and inter-modal Enhancement Module (IEM). The MAM aggregates multi-layer features from the vision encoder, capturing both high-level semantic information and low-level spatial details, thereby enriching the visual representation. The IEM enhances visual features through intra- and inter-modal interactions, effectively suppressing irrelevant information while amplifying task-relevant features, leading to more robust multimodal understanding. We conduct extensive experiments on multiple benchmarks, evaluating vMLLM across diverse settings, including different vision encoders, training dataset scales, and varying sizes of LLMs. Our results demonstrate that vMLLM consistently achieves significant performance improvements, validating its effectiveness in harnessing the potential of visual features. These findings highlight the importance of optimizing visual feature extraction and interaction mechanisms in MLLMs, paving the way for more advanced multimodal AI systems.
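The abstract names two ideas without giving their formulation: combining features from several vision-encoder layers, and using text to emphasize relevant visual tokens. The sketch below is only an illustration of those two general ideas, not the paper's MAM or IEM. Everything in it is an assumption: the function names `aggregate_layers` and `text_guided_reweight`, the softmax-weighted sum over layers, and the dot-product scoring against a single text vector are all hypothetical choices made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_feats, layer_logits):
    """Blend per-layer token features with softmax weights (assumed scheme).

    layer_feats: list over layers; each layer is a list of tokens;
                 each token is a list of floats of equal length.
    layer_logits: one hypothetical scalar per layer; in a trained model
                  these would be learned parameters.
    """
    weights = softmax(layer_logits)
    num_tokens = len(layer_feats[0])
    dim = len(layer_feats[0][0])
    return [
        [
            sum(w * layer[t][d] for w, layer in zip(weights, layer_feats))
            for d in range(dim)
        ]
        for t in range(num_tokens)
    ]


def text_guided_reweight(visual_tokens, text_vec):
    """Scale each visual token by its softmax-normalized dot product
    with a text vector (assumed stand-in for inter-modal interaction)."""
    scores = [sum(v * q for v, q in zip(tok, text_vec)) for tok in visual_tokens]
    attn = softmax(scores)
    return [[a * v for v in tok] for a, tok in zip(attn, visual_tokens)]


# Toy inputs: two encoder layers, two tokens, two feature dimensions.
shallow = [[1.0, 0.0], [0.0, 1.0]]
deep = [[3.0, 0.0], [0.0, 3.0]]

fused = aggregate_layers([shallow, deep], [0.0, 0.0])  # equal layer weights
enhanced = text_guided_reweight(fused, [1.0, 0.0])     # text favors dimension 0
```

With equal layer logits, `fused` is the plain average of the two layers. The text vector aligns with the first token, so that token keeps most of its magnitude in `enhanced` while the second is scaled down, which is the suppress-and-amplify behavior the abstract describes in general terms.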