Unstructured data
Computer science
Scalability
Information retrieval
Data science
Data mining
Big data
Natural language processing
Database
Authors
Lin William Cong,Tengyuan Liang,Xiao Zhang
Source
Journal: Social Science Research Network
[Social Science Electronic Publishing]
Date: 2018-01-01
Citations: 48
Abstract
We introduce a general framework for analyzing large-scale text-based data, combining the strengths of neural-network language processing and generative statistical modeling. Our methodology generates textual factors by (i) representing texts using vector word embedding, (ii) clustering words using locality-sensitive hashing, and (iii) identifying spanning vector clusters through topic modeling. Our data-driven approach captures complex linguistic structures while ensuring computational scalability and economic interpretability. We also discuss applications of textual factors in (i) prediction and inference, (ii) interpreting (non-text-based) models and variables, and (iii) constructing new text-based metrics and explanatory variables, with illustrations using topics in finance and economics such as macroeconomic forecasting and factor asset pricing.
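The three steps named in the abstract (word embeddings, locality-sensitive hashing, topic modeling within clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `lsh_clusters` and `textual_factor` are hypothetical, random-hyperplane hashing is one common LSH scheme chosen here for illustration, and a plain SVD of each cluster's document-term submatrix stands in for the topic-modeling step. The embeddings are assumed to be precomputed by any word-embedding model.

```python
import numpy as np


def lsh_clusters(embeddings, n_planes=4, seed=0):
    """Group word vectors by the sign pattern of random hyperplane projections.

    Assumption: random-hyperplane (sign) hashing is used as the LSH scheme.
    Words whose vectors point in similar directions tend to share a bucket.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], n_planes))
    bits = (embeddings @ planes > 0).astype(int)   # (n_words, n_planes) signature
    keys = bits @ (1 << np.arange(n_planes))       # pack each signature into one integer
    clusters = {}
    for word_idx, key in enumerate(keys):
        clusters.setdefault(int(key), []).append(word_idx)
    return list(clusters.values())


def textual_factor(doc_term, word_ids):
    """Return (loading, exposure) for one word cluster.

    Assumption: the leading right singular vector of the cluster's
    document-term submatrix serves as the topic; each document's exposure
    is its word counts projected onto that vector.
    """
    sub = np.asarray(doc_term)[:, word_ids].astype(float)
    _, _, vt = np.linalg.svd(sub, full_matrices=False)
    loading = vt[0]
    if loading.sum() < 0:          # fix the sign so loadings are mostly positive
        loading = -loading
    exposure = sub @ loading       # one factor value per document
    return loading, exposure


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    toy_embeddings = rng.standard_normal((8, 5))   # 8 words, 5-dim vectors (toy data)
    toy_counts = rng.integers(0, 4, size=(6, 8))   # 6 documents x 8 words (toy data)
    for ids in lsh_clusters(toy_embeddings, n_planes=2):
        loading, exposure = textual_factor(toy_counts, ids)
        print(ids, np.round(exposure, 2))
```

The per-document exposures are what could then be used as regressors or predictors in the downstream applications the abstract lists; the number of hyperplanes controls how fine the word clusters are.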