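Character (mathematics)
Font
Computer science
Artificial intelligence
Pattern recognition (psychology)
Natural language processing
Mathematics
Geometry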
Authors
Zhenjiang Li,Weilan Wang,Yiqun Wang,Qianxue Zhang
Abstract
An offline character dataset of Tibetan historical documents in the Uchen font, THCU, is presented to facilitate research on Tibetan historical document recognition. THCU comprises two subsets: THCU-M and THCU-S. THCU-M is annotated manually on the original document images and contains 121,214 character samples in 238 character categories. THCU-S is a simulated subset whose samples are generated by combining character components. THCU-S is itself divided into four subsets, with 7,238, 2,908, 562, and 245 character categories and 5,000, 3,000, 600, and 600 samples per category, respectively. We also evaluate THCU with a CNN-based model to establish a baseline. The experiments show that adding the generated samples greatly improves the model's performance on the real data.
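To give a sense of scale, the subset sizes quoted in the abstract can be multiplied out. The sketch below is only arithmetic on those figures. It assumes every category in a THCU-S subset holds exactly the stated number of samples, and the subset labels (`subset_1` to `subset_4`) are hypothetical names, not identifiers from the dataset.

```python
# Figures taken from the abstract: (number of categories, samples per category).
# The subset labels are hypothetical; the abstract does not name the four subsets.
THCU_S_SUBSETS = {
    "subset_1": (7238, 5000),
    "subset_2": (2908, 3000),
    "subset_3": (562, 600),
    "subset_4": (245, 600),
}

# THCU-M figures from the abstract.
THCU_M_SAMPLES = 121214
THCU_M_CATEGORIES = 238


def subset_totals(subsets):
    """Return the implied sample count per subset, assuming a uniform
    number of samples in every category of that subset."""
    return {name: cats * per_cat for name, (cats, per_cat) in subsets.items()}


totals = subset_totals(THCU_S_SUBSETS)
grand_total = sum(totals.values())

# Mean samples per category in the manually annotated subset.
# The real per-category distribution is not given in the abstract.
thcu_m_mean = THCU_M_SAMPLES / THCU_M_CATEGORIES

if __name__ == "__main__":
    for name, count in totals.items():
        print(f"{name}: {count:,} samples")
    print(f"THCU-S total (under the uniform assumption): {grand_total:,}")
    print(f"THCU-M mean samples per category: {thcu_m_mean:.1f}")
```

Under that assumption, the simulated subset is several hundred times larger than the 121,214 manually annotated samples, which is consistent with the abstract's use of generated data to supplement the real data.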