Video Fire Recognition Using Zero-shot Vision-language Models Guided by a Task-aware Object Detector

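Computer science, Projectile, Task (project management), Detector, Computer vision, Object (grammar), Artificial intelligence, Zero (linguistics), Human-computer interaction, Linguistics, Telecommunications, Philosophy, Economics, Organic chemistry, Chemistry, Management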

Authors

Diego Gragnaniello, Antonio Greco, Carlo Sansone, Bruno Vento

Source

Journal: ACM Transactions on Multimedia Computing, Communications, and Applications [Association for Computing Machinery]
Date: 2025-03-03 Citations: 2

Identifiers

DOI: 10.1145/3721291

Abstract

Fire detection from images or videos has attracted growing interest in recent years because of the criticality of the application. Both reliable real-time detectors and efficient retrieval techniques, able to process large databases acquired by sensor networks, are needed. Although the reliability of artificial vision methods has improved in recent years, some issues remain open. In particular, methods in the literature often generalize poorly when employed in scenarios that differ from the training ones in framing distance, surrounding environment, or weather conditions. This can be addressed by considering contextual information and, more specifically, by using vision-language models capable of interpreting and describing the framed scene. In this work, we propose FIRE-TASTIC: FIre REcognition with Task-Aware Spatio-Temporal Image Captioning, a novel framework that uses object detectors in conjunction with vision-language models for fire detection and information retrieval. The localization capability of the former enables it to detect even tiny fire traces but exposes the system to false alarms. These are strongly reduced by the impressive zero-shot generalization capability of the latter, which can recognize and describe fire-like objects without prior fine-tuning. We also present a variant of the FIRE-TASTIC framework based on Visual Question Answering instead of Image Captioning, which allows one to customize the retrieved information with personalized questions. To integrate the high-level information provided by both neural networks, we propose a novel method to query the vision-language models using the temporal and spatial localization information provided by the object detector. The proposal can improve retrieval performance, as evidenced by experiments conducted on two recent fire detection datasets, which show the effectiveness and generalization capabilities of FIRE-TASTIC and that it surpasses the state of the art. Moreover, the vision-language model, which is unsuitable for video processing due to its high computational load, is executed only on suspicious frames, allowing for real-time processing. This makes FIRE-TASTIC suitable for both real-time processing and information retrieval on large datasets.
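
The gating idea described in the abstract (a lightweight detector on every frame, the vision-language model only on suspicious frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `detect` and `describe`, the `Alarm` record, and the keyword check on the caption are all hypothetical stand-ins chosen for illustration, and the abstract does not specify how the caption is turned into a decision.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), an assumed box format


@dataclass
class Alarm:
    frame_index: int
    box: Box
    caption: str


def gated_fire_recognition(
    frames: Iterable[object],
    detect: Callable[[object], List[Box]],   # hypothetical: cheap detector, run on every frame
    describe: Callable[[object, Box], str],  # hypothetical: costly VLM query, given frame and box
    keywords: Sequence[str] = ("fire", "flame"),
) -> Tuple[List[Alarm], int]:
    """Run the detector on all frames and the VLM only where the detector fires.

    Returns the confirmed alarms and the number of VLM calls made.
    """
    alarms: List[Alarm] = []
    vlm_calls = 0
    for index, frame in enumerate(frames):
        boxes = detect(frame)
        if not boxes:
            continue  # frame is not suspicious, so the VLM is skipped entirely
        for box in boxes:
            vlm_calls += 1
            caption = describe(frame, box)  # spatial localization guides the query
            # Assumption: a crude substring check stands in for whatever
            # decision rule is actually applied to the VLM output.
            if any(word in caption.lower() for word in keywords):
                alarms.append(Alarm(index, box, caption))
    return alarms, vlm_calls
```

With stub callables, a detector that flags two of four frames triggers only two VLM calls, and a caption that does not mention fire suppresses the corresponding false alarm. A substring check of this kind would misfire on captions such as "no fire" or "fireplace", so it should be read only as a placeholder.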
