DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines

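Support vector machine; Hyperplane; Artificial intelligence; Identification (biology); Sequence (biology); Computational biology; DNA; Computer science; Protein sequencing; DNA sequencing; Chemistry; Mathematics; Combinatorics; Biology; Peptide sequence; Botany; Genetics; Gene
Authors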
Yiheng Zhu,Jun Hu,Xiaoning Song,Dong‐Jun Yu
Source
Journal: Journal of Chemical Information and Modeling [American Chemical Society]
Volume (issue): 59 (6): 3057-3071; Cited by: 67
Identifier
DOI: 10.1021/acs.jcim.8b00749
Abstract

Accurate identification of protein–DNA binding sites is significant for both understanding protein function and drug design. Machine-learning-based methods have been extensively used for the prediction of protein–DNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously restricts the performance improvements of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of protein–DNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set, each of which is used to train a support vector machine (SVM). Unlike traditional sampling algorithms, HD-US selects samples by calculating the distances between the samples and the separating hyperplane of the SVM. The second stage of E-HDSVM proposes an enhanced AdaBoost (EAdaBoost) algorithm to ensemble multiple trained SVMs. As an enhanced version of the original AdaBoost algorithm, EAdaBoost overcomes the overfitting problem. Stringent cross-validation and independent tests on benchmark data sets demonstrated the superiority of E-HDSVM over several popular imbalanced learning algorithms. Based on the proposed E-HDSVM algorithm, we further implemented a sequence-based protein–DNA binding site predictor, called DNAPred, which is freely available at http://csbio.njust.edu.cn/bioinf/dnapred/ for academic use. 
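The distance-based selection idea behind HD-US can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a linear separating hyperplane given directly as hypothetical parameters `w` and `b` (the abstract does not state the kernel), and it assumes that the majority-class samples closest to the hyperplane are the ones kept (the abstract says only that selection is based on distance, not which samples are chosen). The function names and toy data are illustrative.

```python
import math


def hyperplane_distance(x, w, b):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def undersample_by_distance(negatives, w, b, k):
    """Keep k majority-class samples, ranked by distance to the hyperplane.

    Assumption: the nearest samples are kept. The abstract does not
    specify the selection rule, so this is one plausible choice.
    """
    ranked = sorted(negatives, key=lambda x: hyperplane_distance(x, w, b))
    return ranked[:k]


# Toy example: hyperplane x1 + x2 - 1 = 0 and five majority-class points.
w, b = (1.0, 1.0), -1.0
negatives = [(0.0, 0.0), (0.4, 0.5), (-1.0, -1.0), (0.2, 0.1), (-3.0, 0.0)]
kept = undersample_by_distance(negatives, w, b, k=2)
print(kept)
```

In an iterative scheme such as the one the abstract describes, a step like this would be repeated to produce several subsets, each used to train one SVM. How the subsets differ from one iteration to the next is not stated in the abstract and is left out of the sketch.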
Computational experiments showed that our predictor achieved an average overall accuracy of 91.7% and a Matthews correlation coefficient of 0.395 on five benchmark data sets, outperforming several state-of-the-art sequence-based protein–DNA binding site predictors.
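High overall accuracy can coexist with a moderate Matthews correlation coefficient (MCC) when classes are imbalanced, because accuracy is dominated by the majority class. The sketch below computes both metrics from a confusion matrix. The counts are invented for illustration and are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: 1000 residues, of which only 50 are binding sites.
tp, fn = 20, 30   # binding residues found / missed
fp, tn = 30, 920  # nonbinding residues flagged / correctly rejected

print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
print(f"MCC      = {mcc(tp, tn, fp, fn):.3f}")
```

With these made-up counts, most of the correct predictions come from the abundant nonbinding class, which is why MCC is usually reported alongside accuracy for imbalanced problems of this kind.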