Seven common tools for predicting the protein-coding potential of nucleic acid sequences | ncRNAs | lncRNA

Published: 2023/12/18

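Note: each of these tools has a limited scope of application, and some were designed only for animals, so be sure to test a tool against ground-truth data before relying on it. I want to predict transcripts from a plant, so I can use the well-annotated Arabidopsis thaliana as a test set. (The test results were rather surprising.)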
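A ground-truth check of this kind can be scored with a few lines of Python. This is only a minimal sketch: the function name `confusion_metrics` and the toy label lists are hypothetical, and in practice the true labels would come from the reference annotation and the predicted labels from the tool under test.

```python
def confusion_metrics(truth, predicted, positive="coding"):
    """Return (sensitivity, specificity, accuracy) for two parallel label lists."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if t == positive and p == positive:
            tp += 1          # coding transcript called coding
        elif t == positive:
            fn += 1          # coding transcript missed
        elif p == positive:
            fp += 1          # non-coding transcript called coding
        else:
            tn += 1          # non-coding transcript called non-coding
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else float("nan")
    return sensitivity, specificity, accuracy


# Toy labels, made up for illustration only.
truth = ["coding", "coding", "coding", "noncoding", "noncoding"]
predicted = ["coding", "coding", "noncoding", "noncoding", "coding"]
sens, spec, acc = confusion_metrics(truth, predicted)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

Reporting sensitivity and specificity separately matters here, because a tool trained on animals may look accurate overall while misclassifying most plant lncRNAs.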

CPC

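(A familiar name: it turns out CPC was developed by Ge Gao and Liping Wei at Peking University.)

While searching the literature I found that CPC2 was already released in 2017.

CPC can be used online.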
a Support Vector Machine-based classifier, named Coding Potential Calculator (CPC), to assess the protein-coding potential of a transcript based on six biologically meaningful sequence features.
Coding Potential Calculator distinguish protein-coding from non-coding RNAs based on the sequence features of the input transcripts. Our preliminary performance assessment suggests the CPC can reliably discriminate the coding and non-coding transcripts in ~98% accuracy. We provide an online version of CPC here.
The authors claim about 98% accuracy.

bin/run_predict.sh (input_seq) (result_in_table) (working_dir) (result_evidence)

CPC RESULTS (The first column is the input sequence ID; the second column is the input sequence length; the third column is the coding status; and the fourth column is the coding potential score, i.e. the "distance" to the SVM classification hyperplane in the feature space.)

AF282387 528 coding 3.32462
Tsix_mus 4300 noncoding -1.30047
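To work with this table programmatically, a small parser is enough. The sketch below assumes one whitespace-separated record per line with the four columns described above; the names `CpcRow` and `parse_cpc_table` are hypothetical and not part of CPC.

```python
from typing import NamedTuple


class CpcRow(NamedTuple):
    seq_id: str
    length: int
    status: str      # "coding" or "noncoding"
    score: float     # signed distance to the SVM hyperplane


def parse_cpc_table(text):
    """Parse CPC result lines, skipping anything that is not a 4-column record."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        rows.append(CpcRow(fields[0], int(fields[1]), fields[2], float(fields[3])))
    return rows


example = """AF282387 528 coding 3.32462
Tsix_mus 4300 noncoding -1.30047"""
rows = parse_cpc_table(example)
for row in rows:
    print(row.seq_id, row.status, row.score)
```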

HOMO EVIDENCE
ORF EVIDENCE

AF282387 ORF_FRAMEFINDER 4 529 99.43 109.41 Full
Tsix_mus ORF_FRAMEFINDER 4077 4206 3.00 27.50 Full

FRAME FINDER

>AF282387 Filobasidiella neoformans calcineurin B regulatory subunit (CNB1) mRNA, complete cds [framefinder (3,528) score=109.41 used=99.43% {forward,strict} ]
MGAAESSMFNSLEKNSNFSGPELMRLKKRFMKLDKDGSGSIDKDEFLQIPQIANNPLAHR
MIAIFDEDGSGTVDFQEFVGGLSAFSSKGGRDEKLRFAFKVYDMDRDGYISNGELYLVLK
QMVGNNLKDQQLQQIVDKTIMEADKDGDGKLSFEEFTQMVASTDIVKQMTLEDLF
>Tsix_mus NR_002844.1 Mus musculus X (inactive)-specific transcript, antisense (Tsix) on chromosome X [framefinder (4076,4205) score=27.50 used=3.00% {forward,strict} ]
MKGYVLKLSSWAGEIAQWLGVLTALPEGLSSILNNFVVAHSHL

BLAST RESULT

CPC2

CPC2 runs ~1000 times faster than CPC1 and exhibits superior accuracy compared with CPC1, especially for long non-coding transcripts. Moreover, the model of CPC2 is species-neutral, making it feasible for ever-growing non-model organism transcriptomes.

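In my own tests, CPC1 is reasonably fast when BLAST is skipped, but with BLAST enabled it is painfully slow. It still calls the ancient blastall program in the background, whereas these days we find even BLAST too slow and simply use DIAMOND.

CPC2 was rewritten in Python and still calls libsvm for classification.

Roughly how CPC (here, CPC2) works: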

1. Feature selection: four intrinsic features, namely the Fickett TESTCODE score, open reading frame (ORF) length, ORF integrity and isoelectric point (pI).

2. Build the classification model with an SVM: a support vector machine (SVM) model is trained on these features.

3. Validate the model's performance on data from multiple species. Evaluation metrics: sensitivity, specificity and accuracy.
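Two of the four features, ORF length and ORF integrity, are easy to illustrate. The sketch below is not CPC2's own code: it assumes the ORF must start with ATG, scans only the three forward frames, and treats "integrity" as "the ORF ends with an in-frame stop codon". The Fickett score and the isoelectric point are left out, and the function name `longest_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (orf_length_nt, is_complete) for the longest ATG-initiated ORF.

    Only the three forward reading frames are scanned. `is_complete` is True
    when the ORF is closed by an in-frame stop codon (a stand-in for ORF
    integrity); the reported length includes that stop codon.
    """
    seq = seq.upper().replace("U", "T")
    best_len, best_complete = 0, False
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i                      # open a new ORF
            elif codon in STOP_CODONS:
                length = i + 3 - start             # close the ORF at the stop
                if length > best_len:
                    best_len, best_complete = length, True
                start = None
        if start is not None:
            # ORF runs off the 3' end: count whole codons only, mark incomplete.
            length = ((len(seq) - start) // 3) * 3
            if length > best_len:
                best_len, best_complete = length, False
    return best_len, best_complete


complete = longest_orf("CCATGAAATTTTAGGG")   # ATG AAA TTT TAG in frame 2
truncated = longest_orf("ATGAAACCC")         # no stop codon before the end
print(complete, truncated)
```

A feature vector built this way could then be passed to any SVM implementation; the source only states that CPC2 uses libsvm for that step.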

With a method this simple, don't you suddenly get the feeling that you could publish in NAR too?

PLEK

(predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme)

an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes.

There seems to be no website and no GitHub repository; the program is hosted on SourceForge.

Basic principle:

Core: k-mers and an SVM

It is suitable for vertebrates lacking high-quality genome sequences and annotation information and is especially effective for the de novo assembled transcriptome data generated by PacBio or 454 sequencing platforms.

A k-mer pattern is a specific string with k nucleotides, each can be A, C, G or T. For k = 1 to 5, we had 4 + 16 + 64 + 256 + 1024 = 1,364 patterns: 4 one-mer patterns, 16 two-mer patterns, 64 three-mer patterns, 256 four-mer patterns, and 1,024 five-mer patterns.

Five k-mer sizes are used (k = 1 to 5).
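The 1,364-dimensional feature space is easy to reproduce. The sketch below enumerates every pattern for k = 1 to 5 and computes plain relative frequencies with a sliding window of step 1. PLEK applies its own weighting to these counts, which is not reproduced here; the function names are hypothetical.

```python
from itertools import product


def kmer_patterns(max_k=5, alphabet="ACGT"):
    """All k-mer patterns for k = 1..max_k, in a fixed order."""
    return ["".join(p) for k in range(1, max_k + 1)
            for p in product(alphabet, repeat=k)]


def kmer_frequencies(seq, max_k=5):
    """Relative frequency of every pattern, normalised within each k."""
    seq = seq.upper().replace("U", "T")
    patterns = kmer_patterns(max_k)
    counts = dict.fromkeys(patterns, 0)
    windows = {k: 0 for k in range(1, max_k + 1)}
    for k in range(1, max_k + 1):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:          # skips windows containing N etc.
                counts[kmer] += 1
                windows[k] += 1
    return [counts[p] / windows[len(p)] if windows[len(p)] else 0.0
            for p in patterns]


patterns = kmer_patterns()
features = kmer_frequencies("ACGTACGT")
print(len(patterns), features[:4])
```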

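This is very conventional feature selection, and the final step is again a call to libsvm; it was published in BMC Bioinformatics. After reading it, don't you want to publish one yourself?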

CNCI

Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts

Feature selection

To distinguish protein-coding sequences from the non-coding sequences, we extracted five features, i.e. the length and S-score of MLCDS, length-percentage, score-distance and codon-bias. The length and S-score of MLCDS were used as the first two features, which assess the extent and quality of the MLCDS, respectively. Moreover, as demonstrated earlier in the text, protein-coding transcripts possess a special reading frame obviously distinct from the other five in the distribution of ANT. We analyzed six MLCDS candidates outputted by dynamic programming of the six reading frames for each transcript, with the assumption that there must exist one best MLCDS (as described earlier in the text); however, this phenomenon does not generally exist for non-coding transcripts. Thus, we defined other two features, length-percentage and score-distance, as follows:

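Test result: CNCI could not process my FASTA sequences directly; with FASTA input the output was empty. Only when I supplied a GTF file together with the genome 2bit file did it produce valid results.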

CPAT

CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model

Documentation: http://rna-cpat.sourceforge.net/

Feature selection:

The first feature was the maximum length of the open reading frame (ORF).

The second feature was ORF coverage defined as the ratio of ORF to transcript lengths.

The third feature we used was the Fickett TESTCODE score (termed ‘Fickett score’ hereafter), which is a simple linguistic feature that distinguishes protein-coding RNA and ncRNA according to the combinational effect of nucleotide composition and codon usage bias (22).

The fourth feature we used was hexamer usage bias (termed ‘hexamer score’ hereafter). This may be the most discriminating feature because of the dependence between adjacent amino acids in proteins (23).
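One common way to turn hexamer usage bias into a single number is a mean log-likelihood ratio over the in-frame hexamers of the ORF. The sketch below takes that approach as an assumption; it is not claimed to match CPAT's implementation in detail. The two frequency tables are tiny made-up examples, and `hexamer_score` and `pseudo` are hypothetical names.

```python
import math


def hexamer_score(orf, coding_freq, noncoding_freq, pseudo=1e-6):
    """Mean log ratio of coding vs non-coding hexamer frequency.

    Hexamers are read in frame (step 3). Hexamers missing from a table get
    the small pseudo-frequency `pseudo` so the ratio stays defined.
    Positive scores suggest coding-like hexamer usage.
    """
    orf = orf.upper()
    log_ratios = []
    for i in range(0, len(orf) - 5, 3):
        hexamer = orf[i:i + 6]
        f_coding = coding_freq.get(hexamer, pseudo)
        f_noncoding = noncoding_freq.get(hexamer, pseudo)
        log_ratios.append(math.log(f_coding / f_noncoding))
    return sum(log_ratios) / len(log_ratios) if log_ratios else 0.0


# Made-up frequency tables; real ones are estimated from training transcripts.
coding_freq = {"ATGGCC": 0.02, "GCCAAA": 0.01}
noncoding_freq = {"ATGGCC": 0.01, "GCCAAA": 0.01}
score = hexamer_score("ATGGCCAAA", coding_freq, noncoding_freq)
print(score)
```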

We build a logistic regression model using these four linguistic features as predictor variables. A χ² test was used to evaluate whether our logit model with predictors fits the training data significantly better than the null model, which had only an intercept.
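To make the model concrete, the sketch below shows how a logistic regression turns the four features into a coding probability. The coefficients, the intercept and the feature values are invented for illustration and are not CPAT's fitted parameters; real values come from training on labelled coding and non-coding transcripts.

```python
import math


def coding_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, for illustration only.
coefficients = {"orf_length": 0.005, "orf_coverage": 1.0,
                "fickett": 2.0, "hexamer": 3.0}
intercept = -5.0

orf_length, transcript_length = 528, 600
features = {
    "orf_length": orf_length,
    "orf_coverage": orf_length / transcript_length,  # ratio of ORF to transcript length
    "fickett": 0.9,                                  # made-up value
    "hexamer": 0.3,                                  # made-up value
}
p_coding = coding_probability(features, coefficients, intercept)
print(f"coding probability: {p_coding:.3f}")
```

A transcript is then called coding when the probability exceeds a cutoff, which has to be chosen per species or training set.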

FEELnc

FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome

OrfPredictor

OrfPredictor: predicting protein-coding regions in EST-derived sequences

PhyloCSF

PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions

Predicting the coding potential of lncRNAs: using PhyloCSF

I will test each of these tools in turn later.

To be continued~~~

Reposted from: https://www.cnblogs.com/leezx/p/8594138.html
