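Computer science
Python (programming language)
Speedup
Set (abstract data type)
Feature (linguistics)
Artificial intelligence
Feature vector
Task (project management)
Biological data
Pattern recognition (psychology)
Data mining
Machine learning
Bioinformatics
Programming language
Biology
Parallel computing
Philosophy
Economics
Management
Linguistics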
Authors
Sare Amerifar, Mahammad Norouzi, Mahmoud Ghandi
Abstract
Advances in sequencing technologies now produce huge amounts of biological data. Analyzing data at this scale is beyond human ability, which creates a strong opportunity for machine learning methods. These methods, however, are practical only once the sequences have been converted into feature vectors. Many tools target this task, including iLearnPlus, a Python-based tool that supports a rich set of features. In this paper, we propose a holistic tool that extracts features from biological sequences (i.e., DNA, RNA, and protein). These features serve as inputs to machine learning models that predict properties, structures, or functions of the input sequences. Our tool supports all features in iLearnPlus as well as 30 additional features from the literature. Moreover, our tool is written in the R language, giving bioinformaticians an alternative for transforming sequences into feature vectors. We compared the conversion time of our tool with that of iLearnPlus and found that ours transforms sequences much faster. For nucleotide sequences, our tool achieves a median speedup of 2.8X on small sequences and 6.3X on large sequences. For amino acid sequences, it achieves a median speedup of 23.9X.
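To make the idea of converting a sequence into a feature vector concrete, the following is a minimal sketch of one common descriptor, normalized k-mer composition for a DNA sequence. It is written in Python for illustration only and is not the API of the tool described above or of iLearnPlus. The function name `kmer_composition`, its parameters, and the choice to skip windows containing ambiguous symbols are assumptions made for this example.

```python
from itertools import product


def kmer_composition(sequence, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies, ordered lexicographically by k-mer.

    Hypothetical helper for illustration; not part of any published tool.
    """
    # Enumerate every possible k-mer so the vector has a fixed length.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)

    seq = sequence.upper()
    total = 0
    # Slide a window of width k across the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows with symbols outside the alphabet (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1

    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# A DNA sequence of any length maps to a vector of length 4**k.
vector = kmer_composition("ACGTACGT", k=2)
```

Because every sequence maps to a vector of the same length (16 entries for k = 2 over a four-letter alphabet), sequences of different lengths can be fed to the same machine learning model.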