Learning multi-modal representations by watching hundreds of surgical video lectures

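Computer science; Modal verb; Artificial intelligence; Computer vision; Multimedia; Human-computer interaction; Chemistry; Polymer chemistry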

Authors

Kun Yuan, Vinkle Srivastav, Tong Yu, Joël L. Lavanchy, Jacques Marescaux, Pietro Mascagni, Nassir Navab, Nicolas Padoy

Source

Journal: Medical Image Analysis [Elsevier BV]
Date: 2025-06-04 Volume/issue: 105: 103644-103644 Cited by: 13

Links

nih.gov, hal.science, doi.org

Identifiers

DOI: 10.1016/j.media.2025.103644

Abstract

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively demonstrate the representational capability of the learned joint latent space, we introduce several vision-and-language surgical tasks and evaluate various vision-only tasks specific to surgery, e.g., surgical tool, phase, and triplet recognition. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications. 
The code is available at https://github.com/CAMMA-public/SurgVLP.
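
The abstract describes a contrastive objective that pulls each video clip embedding toward its corresponding multiple text embeddings in a joint latent space. The sketch below illustrates that general idea with a symmetric InfoNCE-style loss averaged over several text views of the same clips. It is not the authors' implementation: the function names, the temperature value of 0.07, the equal weighting of text views, and the toy 2-D embeddings are all assumptions made for illustration. See the repository linked above for the actual method.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [dot(query, key) / temperature for key in keys]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def multi_text_contrastive_loss(video_emb, text_views, temperature=0.07):
    """Average a symmetric video-text InfoNCE loss over several text views.

    video_emb:  list of N clip embeddings.
    text_views: list of views, each a list of N text embeddings, where
                text_views[k][i] is paired with video_emb[i] (for example,
                transcripts of the same clip from different ASR systems).
    Equal weighting of the views is an assumption of this sketch.
    """
    video = [l2_normalize(v) for v in video_emb]
    loss = 0.0
    for texts in text_views:
        text = [l2_normalize(t) for t in texts]
        video_to_text = info_nce(video, text, temperature)
        text_to_video = info_nce(text, video, temperature)
        loss += 0.5 * (video_to_text + text_to_video)
    return loss / len(text_views)


# Toy example: two clips, two text views per clip (hypothetical 2-D embeddings).
video = [[1.0, 0.0], [0.0, 1.0]]
view_a = [[1.0, 0.1], [0.1, 1.0]]
view_b = [[0.9, 0.0], [0.0, 0.8]]

aligned = multi_text_contrastive_loss(video, [view_a, view_b])
shuffled = multi_text_contrastive_loss(video, [view_a[::-1], view_b[::-1]])
print(aligned < shuffled)
```

In this toy setup the correctly paired embeddings give a lower loss than the same embeddings with the pairing reversed, which is the behavior a contrastive alignment objective is meant to reward.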
