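Outlier
Computer science
Imputation (statistics)
Data preprocessing
Missing data
Data mining
Data type
Preprocessor
Inference
Machine learning
Data modeling
Artificial intelligence
Database
Programming language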
Authors
Hiroko Nagashima,Yuka Kato
Source
Journal: International Conference on Big Data
Date: 2020-12-10
Citations: 1
Identifier
DOI:10.1109/bigdata50022.2020.9377818
Abstract
Recent years have seen an increase in the use of data acquired by sensors and wearable devices. However, depending on the type of sensor or wearable device, the data may be irregular, with missing values, outliers, and different units of measurement. Feeding such data directly into a machine-learning model would not produce correct results, so analysts must preprocess the data before analysis to obtain accurate results. In particular, sensor data may contain more outliers and missing data than data acquired by other means because of network congestion and the limited life of sensor batteries. To perform such preprocessing efficiently, we previously proposed APREP-S (automatic preprocessing of sensor data), which uses Bayesian inference based on programming by example. APREP-S defines one model for each imputation method, and its workflow selects models based on the features of the imputation area. Therefore, the APREP-S model must be regenerated when data with a different periodicity are used. In other words, depending on whether the data are affected by weekdays or weekends, weather conditions, seasons, etc., the imputation model has to be generated to take these features into account. In this study, we enhanced the method for selecting the optimal imputation model in APREP-S, allowing multiple models to be defined for each input method. We evaluated APREP-S by the mean squared error on two types of data: 1) human activity data as short-term periodic data, and 2) temperature and humidity data as long-term periodic data. As a result, we concluded that APREP-S is an efficient imputation method.
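The abstract evaluates imputation by mean squared error. A minimal Python sketch of that general idea follows: hide known values, fill them with several candidate imputers, and keep the candidate with the lowest error. This is not APREP-S itself. The Bayesian, programming-by-example model selection described in the abstract is not reproduced here, and the three imputers, the function names, and the synthetic 24-step periodic signal are all assumptions chosen for illustration.

```python
import math

def mean_impute(series):
    """Fill gaps (None) with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]

def forward_fill(series):
    """Fill gaps with the last observed value (leading gaps take the first one)."""
    last = next(v for v in series if v is not None)
    filled = []
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def linear_interpolate(series):
    """Fill interior gaps linearly; edge gaps take the nearest observed value."""
    known = [i for i, v in enumerate(series) if v is not None]
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled

def mse(truth, estimate, positions):
    """Mean squared error restricted to the artificially hidden positions."""
    return sum((truth[i] - estimate[i]) ** 2 for i in positions) / len(positions)

def select_imputer(truth, hidden_positions, candidates):
    """Hide the given positions, impute with each candidate, return (best, scores)."""
    masked = [None if i in hidden_positions else v for i, v in enumerate(truth)]
    scores = {name: mse(truth, fn(masked), hidden_positions)
              for name, fn in candidates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical candidate set; APREP-S's actual imputation methods are not listed in the abstract.
CANDIDATES = {
    "mean": mean_impute,
    "forward_fill": forward_fill,
    "linear": linear_interpolate,
}

# Synthetic signal with a 24-step period, standing in for periodic sensor data.
truth = [20.0 + 5.0 * math.sin(2 * math.pi * t / 24) for t in range(48)]
hidden = [5, 6, 17, 30, 31, 40]
best, scores = select_imputer(truth, hidden, CANDIDATES)
print(best, scores)
```

Scoring only the hidden positions keeps the comparison focused on imputed values rather than on values every imputer passes through unchanged. A fuller treatment would repeat the masking over several gap patterns per periodicity (for example, weekday versus weekend segments) before choosing a model.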