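Computer science, Artificial intelligence, Modality (human-computer interaction), Gesture, Mode, Robustness (evolution), Initialization, Pattern recognition (psychology), Classifier (UML), Gesture recognition, Speech recognition, Modal verb, Computer vision, Machine learning, Sociology, Chemistry, Polymer chemistry, Programming language, Gene, Biochemistry, Social science
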
Authors
Natalia Neverova, Christian Wolf, Graham W. Taylor, Florian Nebout

Identifier
DOI: 10.1109/tpami.2015.2461544

Abstract
We present a method for gesture detection and localisation based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at three temporal scales. Key to our technique is a training strategy which exploits: i) careful initialization of individual modalities; and ii) gradual fusion involving random dropping of separate channels (dubbed ModDrop) for learning cross-modality correlations while preserving uniqueness of each modality-specific representation. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams. Fusing multiple modalities at several spatial and temporal scales leads to a significant increase in recognition rates, allowing the model to compensate for errors of the individual classifiers as well as noise in the separate channels. Furthermore, the proposed ModDrop training technique ensures robustness of the classifier to missing signals in one or several channels, so that it produces meaningful predictions from any number of available modalities. In addition, we demonstrate the applicability of the proposed fusion scheme to modalities of arbitrary nature by experiments on the same dataset augmented with audio.
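
The "random dropping of separate channels" mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `moddrop`, the per-sample Bernoulli mask, the `drop_prob` parameter, and the choice to zero out inputs (rather than intermediate features) are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def moddrop(inputs, drop_prob, rng, training=True):
    """Randomly zero out whole modalities, independently for each sample.

    inputs    : dict mapping a modality name to an array of shape (batch, ...)
    drop_prob : probability of dropping a modality for a given sample
                (hypothetical parameter; the abstract gives no value)
    rng       : numpy.random.Generator used to draw the masks
    training  : when False, inputs are returned unchanged
    """
    if not training:
        return {name: x.copy() for name, x in inputs.items()}

    dropped = {}
    for name, x in inputs.items():
        # One keep/drop decision per sample for this modality.
        keep = rng.random(x.shape[0]) >= drop_prob
        # Reshape the mask so it broadcasts over all non-batch axes.
        mask = keep.reshape((-1,) + (1,) * (x.ndim - 1))
        dropped[name] = x * mask
    return dropped


# Example with two hypothetical modalities of different shapes.
rng = np.random.default_rng(0)
batch = {
    "video": np.ones((4, 8)),
    "audio": np.ones((4, 3, 2)),
}
noisy_batch = moddrop(batch, drop_prob=0.5, rng=rng)
```

Under these assumptions, a model trained on such masked batches sees samples in which any subset of modalities may be absent, which is one plausible way to obtain the robustness to missing channels that the abstract describes. At prediction time, an unavailable modality could then be supplied as an all-zero array.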
         
            
 
                 
                
                    