Text Complexity of Chinese Elementary School Textbooks: Analysis of Text Linguistic Features Using Machine Learning Algorithms

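Keywords: Computer science; Artificial intelligence; Lexical diversity; Natural language processing; Linguistic sequence complexity; Linguistics; Cohesion (chemistry); Sentence; Computational linguistics; Vocabulary; Philosophy; Organic chemistry; Chemistry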
Authors
Miaomiao Liu, Yixun Li, Yongqiang Su, Hong Li
Source
Journal: Scientific Studies of Reading [Taylor & Francis]
Volume (issue): 28 (3): 235-255 Cited by: 1
Identifier
DOI:10.1080/10888438.2023.2244620
Abstract

Purpose: This study sought to 1) identify linguistic features important for Chinese text complexity with a theory-based and systematic approach, and 2) address how feature sets and algorithms affect the performance of Chinese text complexity models.

Method: Texts from Chinese language arts textbooks from Grades 1 to 6 (N = 1,478) in Mainland China were analyzed. The predictor variables were 265 linguistic features of texts: 154 lexical features and 111 sentence and discourse features. The outcome variable was the complexity level of texts; a one-semester scale was applied, giving 12 levels in total (two semesters per grade).

Results: Features in the categories of character and word frequency, character and word semantic features, lexical diversity, part-of-speech syntactic categories, and referential cohesion were found to be the most important. With the important features identified, we found that text complexity models with features at all levels outperformed those with features at only one level. Models using the two machine learning algorithms (Random Forest Regression and Support Vector Regression) outperformed those using Linear Regression.

Conclusion: This work clarifies important linguistic features for Chinese text complexity, and points to the necessity of considering features across levels and using machine learning algorithms in future text complexity research.

Acknowledgments: We thank Hailey Gibbs at the University of Maryland, College Park, for her kind help with proofreading.

Disclosure statement: No potential conflict of interest was reported by the author(s).

Notes
1. There are two scripts in the modern Chinese language: the Traditional Chinese script used in Hong Kong, Taiwan, and Macau, and the Simplified Chinese script mainly used in Mainland China. Although visually distinct, the two scripts carry the characteristics of the Chinese writing system in the same manner. Thus, we found it feasible to consider findings from both scripts in the context of text complexity research.
2. The regression models were used in our study under the consideration that the complexity levels of texts increase continuously throughout elementary school, without a clear boundary between two adjacent semester levels as claimed in Phani et al. (2019).
3. We acknowledge that using absolute accuracy to evaluate regression models may not be appropriate (François & Miltsakaki, 2012), and we decided to include absolute accuracy here only to compare our results with previous Chinese text complexity studies, some of which merely reported the absolute accuracy of their models (Sung et al., 2016; Tseng et al., 2019; Wu et al., 2020). We used a rounding method to convert continuous estimated values to categorical levels, e.g., an estimated value between 3.5 and 4.4 was considered a complexity level of 4, following previous practice (François & Miltsakaki, 2012).
4. We employed 5-fold cross-validation, and thus there were five data points for each evaluation index (e.g., R²) under each of the nine conditions (the combinations of three feature sets and three algorithms).
5. Both of our models would have achieved an absolute accuracy of .76 if we had used a two-grade-level scale like the existing models (.59–.64, Wu et al., 2020). Our models would have achieved an absolute accuracy of .49 (RFR) and .51 (SVR) if we had used a one-grade-level scale, as the existing models did (.44–.72, Sung et al., 2016; .49–.76, Tseng et al., 2019).

Additional information
Funding: This research was supported by grants from the Ministry of Education of the People's Republic of China [17YJA190009] to Hong Li. The writing of this paper was partially supported by a Seed Funding Grant at The Education University of Hong Kong [RG 37/2021-2022 R] to Yixun Li.
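The rounding step in Note 3 (continuous estimates converted to categorical levels before computing absolute accuracy) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `to_level` and `absolute_accuracy`, the half-up rounding rule, the clamping to the 12-level scale, and the sample numbers are all assumptions made for illustration.

```python
import math


def to_level(estimate, lowest=1, highest=12):
    """Round a continuous estimate half-up to the nearest level.

    Assumption: ties such as 3.5 round up (3.5 -> 4), and estimates outside
    the scale are clamped to the lowest or highest level. Python's built-in
    round() uses round-half-to-even, so math.floor(x + 0.5) is used instead.
    """
    level = math.floor(estimate + 0.5)
    return max(lowest, min(highest, level))


def absolute_accuracy(estimates, true_levels):
    """Share of texts whose rounded estimate equals the true level."""
    hits = sum(to_level(e) == t for e, t in zip(estimates, true_levels))
    return hits / len(true_levels)


# Made-up estimates and labels for illustration only (not data from the study).
estimates = [3.5, 4.4, 0.2, 12.9, 7.6]
true_levels = [4, 4, 1, 12, 7]

levels = [to_level(e) for e in estimates]
accuracy = absolute_accuracy(estimates, true_levels)
print(levels, accuracy)
```

In this sketch the estimates 3.5 and 4.4 both map to level 4, matching the example in Note 3, while 7.6 maps to level 8 and therefore counts as a miss against a true level of 7.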