Learning multi-modal representations by watching hundreds of surgical video lectures

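Computer science; Modal verb; Artificial intelligence; Computer vision; Multimedia; Human-computer interaction; Chemistry; Polymer chemistry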
Authors
Kun Yuan,Vinkle Srivastav,Tong Yu,Joël L. Lavanchy,Jacques Marescaux,Pietro Mascagni,Nassir Navab,Nicolas Padoy
Source
Journal: Medical Image Analysis [Elsevier BV]
Volume/issue: 105: 103644-103644 Cited by: 13
Identifier
DOI:10.1016/j.media.2025.103644
Abstract

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively demonstrate the representational capability of the learned joint latent space, we introduce several vision-and-language surgical tasks and evaluate various vision-only tasks specific to surgery, e.g., surgical tool, phase, and triplet recognition. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications. 
The code is available at https://github.com/CAMMA-public/SurgVLP.
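The abstract describes aligning each video clip embedding with several corresponding text embeddings in a joint latent space. The sketch below illustrates that general idea with a symmetric InfoNCE (CLIP-style) loss averaged over multiple text views, for example one view per speech recognition system. It is an assumption-laden illustration, not the SurgVLP objective itself: the function names, the temperature of 0.07, and the equal weighting of views are all hypothetical choices made here, and the official implementation is in the repository linked above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    loss = 0.0
    for i, query in enumerate(queries):
        logits = [dot(query, key) / temperature for key in keys]
        # Numerically stable log-sum-exp over the similarity logits.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss += log_denominator - logits[i]
    return loss / len(queries)


def multi_text_contrastive_loss(video_embs, text_views, temperature=0.07):
    """Average a symmetric InfoNCE loss over several text views per clip.

    video_embs: list of N clip embeddings.
    text_views: list of V views; each view is a list of N text embeddings,
                where entry i describes clip i (hypothetical layout).
    """
    videos = [l2_normalize(v) for v in video_embs]
    total = 0.0
    for view in text_views:
        texts = [l2_normalize(t) for t in view]
        video_to_text = info_nce(videos, texts, temperature)
        text_to_video = info_nce(texts, videos, temperature)
        total += 0.5 * (video_to_text + text_to_video)
    return total / len(text_views)


# Toy usage: two clips, two text views per clip (made-up 2-D embeddings).
clips = [[1.0, 0.0], [0.0, 1.0]]
views = [
    [[1.0, 0.1], [0.1, 1.0]],
    [[0.9, 0.0], [0.0, 2.0]],
]
print(multi_text_contrastive_loss(clips, views))
```

With embeddings like these, correctly paired clips and texts yield a much smaller loss than shuffled pairs, which is the signal that pulls matching clip and text embeddings together during pre-training.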