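Semantic similarity
Gene
Computational biology
Computer science
Biology
Phylogenetic tree
Genetics
Similarity (geometry)
Function (biology)
Sequence (biology)
Semantics (computer science)
DNA sequencing
Artificial intelligence
Sequence analysis
Scripting language
Sequence alignment
Directory
Sequence motif
Identification (biology)
Genome
Gene prediction
Matching (statistics)
Similarity
Data mining
Phylogenetics
Identifier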
DOI:10.24433/co.1588963.v1
Abstract
## PhytoBabel

The rich information encoded in cis-regulatory DNA sequences has not been fully exploited for gene function prediction in reverse genetics. Here we show that orthologous cis-regulatory sequences that diverged approximately 160 million years ago share little sequence similarity, yet remarkably retain semantic similarity that can be effectively captured by a deep learning model, PhytoBabel. Although trained solely on orthologous cis-regulatory sequence pairs from 15 angiosperms, PhytoBabel implicitly learned spatio-temporal gene expression patterns, conserved non-coding sequences, semantically similar fragments, and phylogenetic relationships among species. Furthermore, PhytoBabel enables the discovery of evolutionarily unrelated but semantically similar cis-regulatory sequences, facilitating the identification of novel genes with functions of interest. As a proof-of-concept, we identified in maize new somatic embryogenesis-related morphogenic regulators exhibiting semantic similarity to known Arabidopsis morphogenic regulators. By bridging the gap in the cis-regulatory sequence → semantics → gene function information chain, PhytoBabel provides a valuable tool for gene function prediction in reverse genetics.

### Semantic similarity prediction for cis-regulatory DNA sequences

`python model_predict.py -r ath_ref_gene.csv -q zma_query_gene.csv -m PhytoBabel_model -s pred_out.csv`

### Parameters

- `-r` : File containing the cis-regulatory sequences of the reference genes
- `-q` : File containing the cis-regulatory sequences of the query genes
- `-m` : Directory containing all models used for prediction
- `-s` : Path of the output file for the semantic similarity prediction results
- `-g` : Specifies GPU usage

### The scripts used for the analyses in the article can all be found in ‘code used in the manuscript’
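
For batch runs, the prediction command from the usage section above can be assembled programmatically. The following is a minimal sketch, not part of the PhytoBabel code: the helper name `build_predict_command` is hypothetical, and only the flags listed under Parameters are used. The README does not state what value `-g` expects, so the sketch assumes it takes a single value and passes it through unchanged.

```python
import shlex
from typing import List, Optional


def build_predict_command(
    reference_csv: str,
    query_csv: str,
    model_dir: str,
    output_csv: str,
    gpu: Optional[str] = None,
    script: str = "model_predict.py",
) -> List[str]:
    """Assemble the model_predict.py argument list (hypothetical helper)."""
    cmd = [
        "python", script,
        "-r", reference_csv,   # cis-regulatory sequences of the reference genes
        "-q", query_csv,       # cis-regulatory sequences of the query genes
        "-m", model_dir,       # directory containing the models
        "-s", output_csv,      # where the similarity predictions are written
    ]
    if gpu is not None:
        # Assumption: -g takes one value; its exact format is not documented.
        cmd += ["-g", str(gpu)]
    return cmd


# Reproduce the example command from the README and print it for inspection.
cmd = build_predict_command(
    "ath_ref_gene.csv", "zma_query_gene.csv", "PhytoBabel_model", "pred_out.csv"
)
print(shlex.join(cmd))

# To actually run the prediction (requires the PhytoBabel scripts and models):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.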