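Computer science
Task (project management)
Modal verb
Artificial intelligence
Multi-task learning
Machine learning
Engineering
Systems engineering
Chemistry
Polymer chemistry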
Authors
Fabrice Wansi,Thabet Kacem
Identifier
DOI:10.1109/iwcmc65282.2025.11059710
Abstract
This paper addresses the critical need for automated plant disease detection and analysis, which is crucial for bolstering global food security and agricultural sustainability. We propose a Multi-modal Multi-Task Learning (3MTL) system that combines state-of-the-art Vision Transformer (ViT) and Bidirectional Encoder Representations from Transformers (BERT) models to perform joint visual and textual analysis. The system tackles three key tasks: image-based classification, caption and annotation generation, and question answering. It leverages multi-modal information to enhance the performance of machine learning tasks ranging from vision-language tasks to natural language processing tasks such as question answering. Using a late fusion approach, we merge the visual and language encoders to enhance performance across these tasks. Extensive experiments on a newly curated dataset of plant images paired with corresponding textual metadata show that our BERT-ViT architecture is slightly more accurate than a Contrastive Language-Image Pre-training (CLIP) model in most tasks. The ViT-based model achieves a classification accuracy of 93% for predicting plant name, health status, and disease name and causes, a BLEU score of 0.79 for generating annotations, and an accuracy of 89% for question-answering tasks. In contrast, the CLIP-based model achieves 92%, 78%, and 88% for the same tasks. Our findings underscore the potential of integrating visual and textual cues to enhance plant disease management, ultimately leading to scalable and intelligent agricultural solutions.
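The late fusion step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each encoder has already produced a fixed-length feature vector and that fusion is a plain concatenation followed by a linear classification head; the abstract does not specify the fusion operator or head design, so the function names, vector sizes, and weights below are hypothetical toy values, not the authors' implementation.

```python
from typing import List, Sequence


def late_fusion(image_features: Sequence[float],
                text_features: Sequence[float]) -> List[float]:
    """Concatenate feature vectors produced by two separate encoders.

    Assumption: fusion by concatenation. Other late fusion operators
    (averaging, gating, attention) are equally plausible.
    """
    return list(image_features) + list(text_features)


def linear_head(fused: Sequence[float],
                weights: Sequence[Sequence[float]],
                bias: Sequence[float]) -> List[float]:
    """Apply one linear layer; each weight row scores one output class."""
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused vector length")
    return [sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weights, bias)]


def predict(logits: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical stand-ins for the visual and language encoder outputs.
image_features = [0.5, -1.0]
text_features = [2.0]

fused = late_fusion(image_features, text_features)

# Hypothetical head with two classes over the 3-dimensional fused vector.
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 1.0]]
bias = [0.0, 0.5]

logits = linear_head(fused, weights, bias)
label = predict(logits)
```

In a multi-task setting, several such heads (for example one for classification and one for question answering) could share the same fused vector; that arrangement is likewise an assumption here rather than a detail given in the abstract.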