當前位置：首頁 > 编程资源 > 综合教程 >内容正文

综合教程

序列比对之Biostrings包

發布時間：2024/1/3 综合教程 48 生活家

生活随笔收集整理的這篇文章主要介紹了序列比对之Biostrings包小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

基本概念

Biostrings包很重要的3個功能是進行Pairwise sequence alignment 和Multiple sequence alignment及 Pattern finding in a sequence

序列比對一般有2個過程：

1）構建計分矩陣公式（the scoring matrix formulation)

2）比對(alignment itself)

global alignment methods （全局比對）：align every residue in the sequences ，例如Needleman-Wunsch algorithm.

local alignment technique（局部比對）：align regions of high similarity in the sequences，例如Smith-Waterman algorithm

安裝

if("Biostrings" %in% rownames(installed.packages()) == FALSE) {source("http://bioconductor.org/biocLite.R");biocLite("Biostrings")}
suppressMessages(library(Biostrings))
ls('package:Biostrings')

----------------Pairwise sequence alignment---------------

步驟：首先構建罰分規則，然后按照規則進行比對。用pairwiseAlignment（）函數

舉例1：核酸序列

(myScoringMat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE))#構建罰分規則
gapOpen <- 2             #gap分為2
gapExtend <- 1           #延伸gap分為1

sequence1 <- "GAATTCGGCTA"  #序列1
sequence2 <- "GATTACCTA"    #序列2
myAlignment <- pairwiseAlignment(sequence1, sequence2, 
                                 substitutionMatrix = myScoringMat, gapOpening = gapOpen,
                                 gapExtension = gapExtend, type="global", scoreOnly = FALSE)  #進行比對
myAlignment

舉例2：對蛋白序列進行比對

蛋白比對會更復雜，因此模型更多，

data(package="Biostrings")  #查看所有數據集
data(BLOSUM62)               #這里選擇BLOSUM62數據
subMat <- "BLOSUM62"         #賦值
gapOpen <- 2
gapExtend <- 1
sequence1 <- "PAWHEAE"
sequence2 <- "HEAGAWGHE"

myAlignProt <- pairwiseAlignment(sequence1, sequence2, 
                                 substitutionMatrix = subMat, gapOpening = gapOpen, gapExtension = 
                                   gapExtend, type="global", scoreOnly = FALSE)  #全局比對

myAlignProt2 <- pairwiseAlignment(sequence1, sequence2, 
                                 substitutionMatrix = subMat, gapOpening = gapOpen, gapExtension = 
                                   gapExtend, type="local",scoreOnly = FALSE)  ##局部比對

可以看到局部比對返回的是，高度相似的序列部分.

3)可視化，對于序列可以用最經典的對角線來可視化（以人和黑猩猩的hemoglobin beta為例）

library(seqinr) # 為了讀取fasta序列
myseq <- read.fasta(file = "F:/R/Bioconductor/biostrings/prtein_example_seq.fas")
dotPlot(myseq[[1]], myseq[[2]], col=c("white", "red"), xlab="Human", ylab="Chimpanzee")

##########Multiple sequence alignment############

一般多序列比對可以用于進化分析

install.packages("muscle")  #需要安裝該包,因為該包在我的版本上沒法安裝，所以這里就不講了
library(muscle)

######Phylogenetic analysis and tree plotting########

這里先不做分析

#########blast格式的解析######

install.packages("RFLPtools",dependencies=TRUE)
library(RFLPtools)
data(BLASTdata)    #先查看數據集了解一下相關數據格式情況
head(BLASTdata)
colnames(BLASTdata)

DIR <- system.file("extdata", package = "RFLPtools") #用自帶數據集
MyFile <- file.path(DIR, "BLASTexample.txt")
MyBLAST <- read.blast(file = MyFile)
mySimMat <- simMatrix(MyBLAST)   #可以根據blast結果用來生成相似性矩陣，太厲害了

#########Pattern finding in a sequence######

library(Biostrings)
mynucleotide <- DNAString("aacataatgcagtagaacccatgagccc")
matchPattern(DNAString("ATG"), mynucleotide)  #示例1
matchPattern("TAA", mynucleotide)       #示例2
##以下函數可以用來尋找orf（需要修改）
myCodonFinder <- function(sequence){
  startCodon = DNAString("ATG") # 指定起始密碼子
  stopCodons = list("TAA", "TAG", "TGA") # 指定終止密碼子
  codonPosition = list() #initialize the output to be returned as a list
  codonPosition$Start = matchPattern(startCodon, sequence) # search start codons
  x=list()
  for(i in 1:3){ # iterate over all stop codons
    x[[i]]= matchPattern(DNAString(stopCodons[[i]]), sequence)
    codonPosition$Stop=x
  }
  return(codonPosition) # returns results
}
StartStops <- myCodonFinder(mynucleotide)

總結

以上是生活随笔為你收集整理的序列比对之Biostrings包的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：负载均衡篇不同层次的负载均衡（2/3/
下一篇： WKE——Webkit精简的纯C接口的浏