Computer science
Classifier (UML)
Artificial intelligence
Viseme
Speech recognition
Convolutional neural network
Pattern recognition (psychology)
Arabic
Deep learning
Speech processing
Acoustic model
Linguistics
Philosophy
Authors
Zamen Jabr, Sauleh Etemadi, Nasser Mozayani
Source
Journal: IEEE Access (Institute of Electrical and Electronics Engineers)
Date: 2024-01-01
Volume: 12, Pages: 111611-111626
Identifier
DOI: 10.1109/access.2024.3440646
Abstract
Two main challenges faced by deep learning systems are the amount of available data and the complexity of the model in terms of the number and type of layers and the number of training parameters. In this paper, we propose an end-to-end Arabic lip-reading system that can be trained on a limited dataset. It combines a visual model consisting of Convolutional Neural Networks (CNNs) with a temporal model built from Gated Recurrent Unit (GRU) layers, taking into account the balance between the size of the dataset and the number of model parameters. For this purpose, we created a limited Arabic dataset of 20 words uttered by 40 native Arabic speakers; we then exploited the redundant frames found in the video sequences to train an Arabic viseme classifier separately. This classifier was later used as a pre-trained visual model in our end-to-end system to extract spatial features from the videos, while the temporal model processed the context. Our proposed method is evaluated on 1) our dataset, on which we obtained an accuracy of 83.02%, and 2) the dataset of W. Dweik et al. [1], on which we improved on their reported result by ≈3%. In addition, we employed the viseme classifier for person identification based on viseme shape and obtained high accuracy.
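To make the described pipeline concrete, the following is a minimal PyTorch sketch of the general idea in the abstract: a per-frame CNN frontend (standing in for the pre-trained viseme classifier) feeding GRU layers and a classification head over the 20-word vocabulary. The layer counts, channel widths, feature dimensions, and the 64x64 grayscale mouth-crop input are illustrative assumptions, not the authors' published architecture or pre-training procedure.

```python
# Illustrative sketch only: CNN visual frontend per frame + GRU temporal model.
# All sizes below are assumptions for demonstration.
import torch
import torch.nn as nn


class VisemeFrontend(nn.Module):
    """Per-frame CNN that maps a grayscale mouth-region crop to a feature vector."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, feat_dim)

    def forward(self, x):  # x: (batch, 1, H, W)
        return self.proj(self.conv(x).flatten(1))


class LipReadingModel(nn.Module):
    """End-to-end word classifier: CNN frontend per frame, GRU over the sequence."""

    def __init__(self, num_words: int = 20, feat_dim: int = 256, hidden: int = 256):
        super().__init__()
        # In the paper's setup the frontend would be pre-trained as a viseme classifier.
        self.frontend = VisemeFrontend(feat_dim)
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_words)

    def forward(self, frames):  # frames: (batch, T, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.frontend(frames.flatten(0, 1)).view(b, t, -1)  # per-frame features
        _, h = self.gru(feats)           # h: (num_layers, batch, hidden)
        return self.head(h[-1])          # logits over the word vocabulary


if __name__ == "__main__":
    model = LipReadingModel()
    clip = torch.randn(2, 30, 1, 64, 64)  # 2 clips of 30 grayscale 64x64 mouth crops
    print(model(clip).shape)              # torch.Size([2, 20])
```

In the pre-training strategy the abstract describes, the frontend would first be trained on viseme labels extracted from the redundant video frames and then reused as a feature extractor; in this sketch that would amount to loading those weights into `frontend` before end-to-end training of the full model.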