Combining Latent Semantic Analysis and Pre-trained Model for Vietnamese Text Readability Assessment

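Readability; Computer science; Natural language processing; Vietnamese; Artificial intelligence; Feature (linguistics); Task (project management); Semantics (computer science); Focus (optics); Benchmarking; Comprehension; Domain (mathematical analysis); Information retrieval; Linguistics; Economics; Marketing; Business; Management; Philosophy; Programming language; Mathematical analysis; Physics; Mathematics; Optics
Authors
Nam-Thuan Doan, Thi-Anh-Thi Le, An-Vinh Lương, Điền Đinh
Identifier
DOI:10.1145/3548636.3548643
Abstract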

Alongside the rapid development of text processing, readability assessment, the task of measuring how easy or difficult a text is to read, remains important and challenging. The task is well established in high-resource languages such as English, where NLP tools and corpora are plentiful, but low-resource languages, Vietnamese in particular, lack that advantage. Most previous studies of Vietnamese text readability assessment focus on shallow text characteristics and have yet to address deeper readability features. In this study, we propose a new way to construct semantic features for Vietnamese. We observe that the difficulty level of terms affects the difficulty level of the knowledge conveyed, which is strongly involved in text comprehension. Specifically, our approach is based on the difficulty distribution of the terms in a text, generated with the Latent Semantic Analysis (LSA) technique; it reduces the dependence on experts for annotating and discovering typical features in a narrow domain. The proposed feature can serve as a new, automatically extracted feature for Vietnamese text readability assessment. Furthermore, LSA is a statistical approach, which makes it more stable and feasible for low-resource languages. We also integrate PhoBERT, a pre-trained language model for Vietnamese, to generate bidirectional contextual word representations for long Vietnamese sequences as a semantic feature. In experiments on a Vietnamese readability dataset, our approach achieves promising performance against strong competitive baselines, with a best accuracy of up to 94.52% and a weighted F1 score of 94.09%.
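For readers unfamiliar with LSA, the following is a minimal sketch of the general technique only: a truncated SVD of a term-by-document count matrix, which places terms and documents in a shared low-dimensional space. It is not the authors' feature construction, which the abstract does not specify. The toy corpus, the `lsa` and `cosine` helpers, the syllable-level whitespace tokenisation, and the choice of `k = 2` are all assumptions made for illustration; a real Vietnamese pipeline would need proper word segmentation and a much larger corpus.

```python
import numpy as np

# Hypothetical toy corpus, tokenised on whitespace (syllables, not segmented words).
docs = [
    "con mèo ngồi trên ghế",
    "con chó ngồi trên sàn",
    "mô hình ngôn ngữ học sâu",
    "mô hình học sâu xử lý ngôn ngữ",
]
tokens = [d.split() for d in docs]
vocab = sorted({t for doc in tokens for t in doc})
index = {t: i for i, t in enumerate(vocab)}

# Term-by-document count matrix: rows are terms, columns are documents.
term_doc = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokens):
    for t in doc:
        term_doc[index[t], j] += 1


def lsa(matrix: np.ndarray, k: int):
    """Rank-k truncated SVD; returns (term vectors, document vectors)."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]    # one k-dimensional row per term
    doc_vecs = Vt[:k, :].T * s[:k]  # one k-dimensional row per document
    return term_vecs, doc_vecs


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


term_vecs, doc_vecs = lsa(term_doc, k=2)

# Documents that share vocabulary end up close together in the latent space.
sim_same_topic = cosine(doc_vecs[0], doc_vecs[1])
sim_diff_topic = cosine(doc_vecs[0], doc_vecs[2])
```

In this toy setting the first two documents share most of their tokens and the last two share theirs, so `sim_same_topic` comes out far larger than `sim_diff_topic`. Any per-term difficulty distribution would have to be derived on top of vectors like `term_vecs`, using level-annotated texts that this sketch does not include.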