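Computer science
Verb
Artificial intelligence
Semantics (computer science)
Task (project management)
Natural language processing
Cognitive neuroscience of visual object recognition
Chain code
Computer vision
Object (grammar)
Image (mathematics)
Programming language
Management
Economics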
Authors
Nan Xi, Jingjing Meng, Junsong Yuan
Identifier
DOI:10.1145/3581783.3611898
Abstract
Surgical triplet recognition aims to recognize surgical activities as triplets (i.e., <instrument, verb, target>), which provides fine-grained information essential for surgical scene understanding. Existing methods for surgical triplet recognition rely on compositional methods that recognize the instrument, verb, and target simultaneously. In contrast, our method, called chain-of-look prompting, casts the problem of surgical triplet recognition as visual prompt generation from large-scale vision-language (VL) models, and explicitly decomposes the task into a series of video reasoning processes. Chain-of-look prompting is inspired by: (1) the chain-of-thought prompting in natural language processing, which divides a problem into a sequence of intermediate reasoning steps; (2) the inter-dependency between motion and visual appearance in the human vision system. Since surgical activities are conveyed by the actions of physicians, we regard the verbs as the carrier of semantics in surgical endoscopic videos. Additionally, we utilize the BioMed large language model to calibrate the generated visual prompt features for surgical scenarios. Our approach captures the visual reasoning processes underlying surgical activities and achieves better performance compared to the state-of-the-art methods on the largest surgical triplet recognition dataset, CholecT50. The code is available at https://github.com/southnx/CoLSurgical.