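Agglutinative language
Telugu
Computer science
Hindi
Natural language processing
Language family
Artificial intelligence
Word (group theory)
Machine translation
Linguistics
Philosophy
Morpheme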
Authors
Santwana Chimalamarri, Dinkar Sitaram, Ashritha Jain
Source
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Date: 2020-06-21
Volume (issue): 19 (5): 1-15
Citations: 10
Abstract
Crosslingual word embeddings learned from multiple parallel corpora help in understanding the relationships between languages and in improving the prediction quality of machine translation. However, for low-resource languages with complex, agglutinative morphology, inducing good-quality crosslingual embeddings is challenging because of complex morphological forms and rare words. This holds even for languages that share a common linguistic structure. In this work, we show that performing a simple morphological segmentation of the corpora before generating crosslingual word embeddings for both roots and suffixes greatly improves prediction quality and captures semantic similarities more effectively. To demonstrate this, we chose two related languages, Telugu and Kannada, of the Dravidian language family. We also tested our method on Hindi, a widely spoken North Indian language belonging to the Indo-European language family, and observed encouraging results.
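
The segmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' segmenter: the suffix list, the `+` marker on suffix tokens, the `min_root` threshold, and the function names `segment` and `segment_corpus` are all assumptions made for illustration, and the romanized toy input makes no claim to linguistic accuracy. The idea shown is only that each word is split into a root token and separate suffix tokens before the corpus is passed to an embedding trainer, so that roots and suffixes each receive their own vectors.

```python
# Hypothetical, illustrative suffix inventory (romanized); not taken from the paper.
SUFFIXES = ["lu", "ni", "ki", "lo", "tho"]


def segment(word, suffixes=SUFFIXES, min_root=2):
    """Split a word into a root followed by '+'-marked suffix tokens.

    Suffixes are stripped greedily from the right, longest first, and a
    suffix is only removed if at least `min_root` characters remain.
    """
    ordered = sorted(suffixes, key=len, reverse=True)
    found = []
    stripped = True
    while stripped:
        stripped = False
        for suf in ordered:
            if word.endswith(suf) and len(word) - len(suf) >= min_root:
                found.insert(0, "+" + suf)  # keep suffixes in surface order
                word = word[: -len(suf)]
                stripped = True
                break
    return [word] + found


def segment_corpus(sentences):
    """Apply `segment` to every whitespace-separated word of every sentence."""
    return [[tok for w in s.split() for tok in segment(w)] for s in sentences]


if __name__ == "__main__":
    print(segment_corpus(["pustakamlo undi"]))
```

The token lists returned by `segment_corpus` have the shape that common embedding trainers accept (one list of tokens per sentence), so roots and suffixes end up as separate vocabulary entries. A rule list like this is only a stand-in; any morphological segmenter could fill the same slot.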