airway之workflow
1)airway簡介
在該workflow中,所用的數(shù)據(jù)集來自RNA-seq,氣道平滑肌細(xì)胞(airway smooth muscle cells )用氟美松(糖皮質(zhì)激素,抗炎藥)處理。例如,哮喘患者使用糖皮質(zhì)激素來減少呼吸道炎癥,在該實驗設(shè)計中,4種細(xì)胞類型(airway smooth muscle cell lines )用1微米地塞米松處理18個小時。每一個cell lines都有對照與處理組(a treated and an untreated sample).關(guān)于實驗設(shè)計的更多信息可以查看PubMed entry 24926665 ,原始數(shù)據(jù) GEO entry GSE52778.
2)Preparing count matrices(準(zhǔn)備制作表達矩陣)
這個矩陣必須是原始的表達量,不能是任何矯正過的表達量(RPKM等)。因為DESeq2的統(tǒng)計模型在處理原始表達矩陣會展示出強大實力,且會自動處理library size differences。
2.1 Aligning reads to a reference genome
for f in 'cat files'
do
STAR --genomeDir ../STAR/ENSEMBL.homo_sapiens.release-75 --readFilesIn fastq/$f\_1.fastq fastq/$f\_2.fastq --runThreadN 12 --outFileNamePrefix aligned/$f.
done #比對
for f in 'cat files'
do
samtools view -bS aligned/$f.Aligned.out.sam -o aligned/$f.bam; done #轉(zhuǎn)bam文件
done
因為按照原文獻跑流程,則數(shù)據(jù)量太大,耗時間,因此在airway中只用了8個樣本的一些子集,來減少運行時間及內(nèi)存。
library("airway")
dir <- system.file("extdata", package="airway", mustWork=TRUE) #查看數(shù)據(jù)集所在路徑
list.files(dir)
csvfile <- file.path(dir,"sample_table.csv") #文件所在絕對路徑
(sampleTable <- read.csv(csvfile,row.names=1)) #讀取表格
filenames <- file.path(dir, paste0(sampleTable$Run, "_subset.bam"))
file.exists(filenames) #驗證這8個文件在不
library("Rsamtools")
bamfiles <- BamFileList(filenames, yieldSize=2000000) #提取2000000序列,降低時間和消耗,文件1:用于構(gòu)建assay信息
seqinfo(bamfiles[1]) #查看信息
sampleTable信息:
2.2 Defining gene models( 最主要的一步是構(gòu)建TxDb 對象)
Next, we need to read in the gene model that will be used for counting reads/fragments.Here,we want to make a list of exons grouped by gene for counting read/fragments.
如果是自己的數(shù)據(jù),首先下載gtf/gff3文件,并用GenomicFeatures 包中的makeTxDbFromGFF()函數(shù)構(gòu)建TxDb object。(TxDb對象是一個包含frange-based objects, such as exons, transcripts, and genes)
library("GenomicFeatures")
gtffile <- file.path(dir,"Homo_sapiens.GRCh37.75_subset.gtf") #airway測試自帶的gtf文件,自己的數(shù)據(jù)需要下載gtf文件
(txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character())) #讀取gtf文件構(gòu)建TxDb object對象
(ebg <- exonsBy(txdb, by="gene")) #構(gòu)建外顯子groped by gene ,文件2:rowRanges()信息
2.3 Read counting step
上述準(zhǔn)備工作做完之后,統(tǒng)計表達量非常簡單。 利用GenomicAlignments包中的summarizeOverlaps()函數(shù),返回一個包含多種信息的SummarizedExperiment 對象。同時由于數(shù)據(jù)量大,對于一個包含30 million aligned reads的單個文件,summarizeOverlaps()函數(shù)至少需要30 minutes。為了節(jié)省時間可以利用BiocParallel包中的register()函數(shù)進行多核運行。
register(SerialParam()) #設(shè)置多核運行
se <- summarizeOverlaps(features=ebg, reads=bamfiles, #創(chuàng)建SummarizedExperiment object with counts,上面文件1和2
mode="Union", # describes what kind of read overlaps will be counted.
singleEnd=FALSE, ######表示是pair-end
ignore.strand=TRUE, ##是否是鏈特異性 strand-specific
fragments=TRUE ) ##specify if unpaired reads should be counted
save(se,file='count_matrix.Rdata')
2.4 SummarizedExperiment
上述產(chǎn)生的SummarizedExperiment 對象,由3部分組成:粉色框是表達矩陣(matrix of counts,)儲存在assay中,如果是多個矩陣可以儲存在assays當(dāng)中。淺藍色框是rowRanges,包含genomic ranges的信息,及gtf中的gene model信息。綠色框包含樣本信息(information about the samples)。
dim(se) ## [1] 20 8 assayNames(se) ## "counts" head(assay(se), 3) #查看表達矩陣 colSums(assay(se)) rowRanges(se) #rowRanges信息 str(metadata(rowRanges(se))) #metadata是構(gòu)建基因模型的gtf文件信息 colData(se) #查看sample信息,因為還沒有制作所以目前是空的 (colData(se) <- DataFrame(sampleTable)) #將樣本實驗設(shè)計信息傳遞給colData(se),文件3:colData信息。涉及到實驗設(shè)計
-------------------------------------------------------------------------------------------------華麗麗的分割線----------------------------------------------------------------------
At this point, we have counted the fragments which overlap the genes in the gene model we specified。這里統(tǒng)計完之后,我們可以用一系列的包,進行差異分析,包括(limma,edgeR,DESeq2等)。下面我們將用DESeq2。在DESeq2中最主要的是構(gòu)建DESeqDataSet 對象,可以用上述 SummarizedExperiment objects轉(zhuǎn)化為 DESeqDataSet objects。兩者之間的區(qū)別是:
1)SummarizedExperiment 對象中的assay slot 被用 DESeq2中的counts accessor function代替, 且 DESeqDataSet class 強制表達矩陣中的values值為非負(fù)整數(shù)。
2)DESeqDataSet 包含design formula(設(shè)計公式)。實驗設(shè)計在最初的分析中被指定,將 告訴DESeq2在分析中如何處理samples。
A second difference is that the DESeqDataSet has an associated design formula. The experimental design is specified at the beginning of the analysis, as it will inform many of the DESeq2 functions how to treat the samples in the analysis (one exception is the size factor estimation, i.e., the adjustment for differing library sizes, which does not depend on
the design formula). The design formula tells which columns in the sample information table (colData) specify the experimental design and how these factors should be used in the analysis
最簡單的實驗設(shè)計是 ~ condition,這里condition是 column in colData(dds),指定樣本所在的分組。在airway experiment,這里我們指定實驗設(shè)計為~ cell + dex表示希望檢測對于不同cell line(cell). 進行dexamethasonecontrolling (dex,氟美松處理) 的效果。
----------------------------------------------------------------------------------------------------分割完畢-----------------------------------------------------------------------------
3、The DESeqDataSet, sample information, and the design formula
有兩種方法構(gòu)建DESeqDataSet 對象:
3.1 Starting from SummarizedExperiment( 直接將SummarizedExperiment轉(zhuǎn)化為DESeqDataSet )
以airway自帶數(shù)據(jù)為例:
data("airway")
se <- airway
se$dex <- relevel(se$dex, "untrt") #specify that untrt is the reference level for the dex variable
se$dex
round( colSums(assay(se)) / 1e6, 1 ) #count that uniquely aligned to the genes 。round表示四舍五入,1表示小數(shù)點后一位
colData(se) #因為airway數(shù)據(jù)集制作好了這個矩陣,自己的數(shù)據(jù)需要用前面的方法制作這個。
library("DESeq2")
dds <- DESeqDataSet(se, design = ~ cell + dex) #構(gòu)建DESeqDataSet對象
3.2 Starting from count matrices( 如果你之前沒有準(zhǔn)備SummarizedExperiment, 只有count matrix 和 sample information ,利用這兩個文件及DESeqDataSetFromMatrix()函數(shù)構(gòu)建 DESeqDataSet對象 )
countdata <- assay(se) ##這里從前面的assay中取出表達矩陣用于演示,countdata: a table with the fragment counts
head(countdata, 3)
coldata <- colData(se) ##這里從前面的assay中取出樣本信息用于演示,coldata: a table with information about the samples
(ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
colData = coldata,
design = ~ cell + dex) #從countdata表達矩陣和coldata樣本信息構(gòu)建DESeqDataSet對象
4)Exploratory analysis and visualization(數(shù)據(jù)探索分析及可視化)
4.1 Pre-filtering the dataset
nrow(dds) #64102 dds <- dds[ rowSums(counts(dds)) > 1, ]#remove rows of the DESeqDataSet that have no counts, or only a single count across all samples: nrow(dds) #29391
4.2 The rlog transformation
需要rlog原因:
When the expected amount of variance is approximately the same across different mean values, the data is said to be homoskedastic(同方差). For RRNA-seq raw counts, however, the variance grows with the mean. For example, if one performs PCA directly on a matrix of size-factor-normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with the very lowest counts will tend to dominate the results because, due to the strong Poisson noise inherent to small count values, and the fact that the logarithm amplifies differences for the smallest values, these low count genes will show the strongest relative differences between samples.As a solution, DESeq2 offers transformations for count data that stabilize the variance across the mean. One such transformation is the regularized-logarithm transformation or rlog2. For genes with high counts, the rlog transformation will give similar result to the ordinary log2 transformation of normalized counts. For genes with lower counts, however, the values are shrunken towards the genes’ averages across all samples. Using an empirical Bayesian prior on inter-sample differences in the form of a ridge penalty, the rlog-transformed data then becomes approximately homoskedastic, and can be used directly for computing distances between samples and making PCA plots. Another transformation, the variance stabilizing transformation17, is discussed alongside the rlog in the DESeq2 vignette.
rlog()函數(shù)返回一個SummarizedExperiment對象,其中 assay slot包含 rlog-轉(zhuǎn)換后的值。
rld <- rlog(dds, blind=FALSE) head(assay(rld), 3) par( mfrow = c( 1, 2 ) ) dds <- estimateSizeFactors(dds) #估計size factors to account for sequencing depth plot(log2(counts(dds, normalized=TRUE)[,1:2] + 1),pch=16, cex=0.3) #then specify normalized=TRUE plot(assay(rld)[,1:2],pch=16, cex=0.3) #Sequencing depth correction is done automatically for the rlog method (and for varianceStabilizingTransformation).
普通的log2(左)及rlog2(右)之間的比較
4.3 Sample distances
評估樣本間關(guān)系(即總體相似度),哪些樣本更相近,哪些更遠,是否符合實驗設(shè)計?
方法一:用歐式距離(Euclidean distance ),用rlog-transformed data(即normalized之后的矩陣)
sampleDists <- dist( t( assay(rld) ) ) #歐氏距離,為確保所有基因大致相同的contribution用rlog-transformed data sampleDists
樣本間關(guān)系,歐式距離。
library("pheatmap") ##用熱圖展示樣品間關(guān)系
library("RColorBrewer")
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep="-" )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
方法二:用PoiClaClu 包計算泊松距離(Poisson Distance),必須是原始表達矩陣。The PoissonDistance function takes the original count matrix
(not normalized) with samples as rows instead of columns, so we need to transpose the counts in dds.
library("PoiClaClu")
poisd <- PoissonDistance(t(counts(dds))) ##對數(shù)據(jù)轉(zhuǎn)置
samplePoisDistMatrix <- as.matrix( poisd$dd ) #計算樣品間關(guān)系
rownames(samplePoisDistMatrix) <- paste( rld$dex, rld$cell, sep="-" )
colnames(samplePoisDistMatrix) <- NULL
pheatmap(samplePoisDistMatrix,
clustering_distance_rows=poisd$dd,
clustering_distance_cols=poisd$dd,
col=colors)
4.4 PCA plot(另一個展示樣品間關(guān)系的圖形)
一種方法是用DESeq2包中的plotPCA()函數(shù)
plotPCA(rld, intgroup = c("dex", "cell")) #DESeq2包plotPCA()函數(shù)
另一種方法是用ggplot2.需要調(diào)用 plotPCA()返回用于畫圖的數(shù)據(jù)
(data <- plotPCA(rld, intgroup = c( "dex", "cell"), returnData=TRUE)) #調(diào)用plotPCA生成要畫圖的data
percentVar <- round(100 * attr(data, "percentVar")) #PC1,pC2
library("ggplot2") #畫圖
ggplot(data, aes(PC1, PC2, color=dex, shape=cell)) + geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance"))
data:
4.4 MDS plot(另一種PCA圖,multidimensional scaling (MDS) function )
在沒有表達矩陣值,但只有一個距離矩陣的情況下使用。This is useful when we don’t have a matrix of data, but only a matrix of distances. Here we compute the MDS for the distances calculated from the rlog transformed counts and plot these。
##用歸一化后的數(shù)據(jù)產(chǎn)生的距離矩陣(calculated from the rlog transformed) mdsData <- data.frame(cmdscale(sampleDistMatrix)) mds <- cbind(mdsData, as.data.frame(colData(rld))) ggplot(mds, aes(X1,X2,color=dex,shape=cell)) + geom_point(size=3) ###用原始數(shù)據(jù)產(chǎn)生的泊松距離矩陣(calculated from the raw) mdsPoisData <- data.frame(cmdscale(samplePoisDistMatrix)) mdsPois <- cbind(mdsPoisData, as.data.frame(colData(dds))) ggplot(mdsPois, aes(X1,X2,color=dex,shape=cell)) + geom_point(size=3)
5)Differential expression analysis(差異分析)
5,1)Running the differential expression pipeline
用DESeq()函數(shù)對原始表達矩陣進行差異表達分析(run the differential expression pipeline on the raw counts with a single call to the function DESeq)
dds <- DESeq(dds) #差異表達分析
5.2)Building the results table
利用result函數(shù)(),將返回一個數(shù)據(jù)框?qū)ο?包含有 metadata information(每一列的意義)。
(res <- results(dds)) #產(chǎn)生log2 fold change 及p值 mcols(res, use.names=TRUE) #查看meta信息 summary(res) #產(chǎn)看其他信息
每列的含義:
baseMean:is a just the average of the normalized count values, dividing by size factors, taken over all samples in the DESeqDataSet
log2FoldChange: the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples
lfcSE:the standard error estimate for the log2 fold change estimate,(the effect size estimate has an uncertainty associated with it,)
p value:statistical test , the result of this test is reported as a p value. Remember that a p value indicates the probability that a fold change as strong as the observed one, or even stronger, would be seen under the situation described by the null hypothesis.
padj:adjusted p-value
2種更嚴(yán)格的方法去篩選顯著差異基因:
降低false discovery rate threshold (the threshold on padj in the results table)
提升log2 fold change threshold(from 0 using the lfc Threshold argument of results)
#如果要降低false discovery rate threshold,將要設(shè)置的值傳遞給results(),使用該閾值進行獨立的過濾: res.05 <- results(dds, alpha=.05) #設(shè)置閾值FDR=0.05 table(res.05$padj < .05) #產(chǎn)看結(jié)果 #如果提升log2 fold change threshold show more substantial changes due to treatment, we simply supply a value on the log2 scale. resLFC1 <- results(dds, lfcThreshold=1) table(resLFC1$padj < 0.1)
Sometimes a subset of the p values in res will be NA (“not available”). This is DESeq’s way of reporting that all counts for this gene were zero, and hence no test was applied. In addition, p values can be assigned NA if the gene was excluded from analysis because it contained an extreme count outlier. For more information, see the outlier detection section of the DESeq2 vignette.
5.2)Other comparisons
可以利用result函數(shù)中的contrast參數(shù)來選擇任何2組數(shù)據(jù)進行比較
results(dds, contrast=c("cell", "N061011", "N61311"))
5.3)Multiple testing(詳細(xì)看文檔理解FDR)
sum(res$pvalue < 0.05, na.rm=TRUE) sum(!is.na(res$pvalue)) sum(res$padj < 0.1, na.rm=TRUE) resSig <- subset(res, padj < 0.1) head(resSig[ order(resSig$log2FoldChange), ]) head(resSig[ order(resSig$log2FoldChange, decreasing=TRUE), ])
5.4)Plotting results
用 plotCounts()函數(shù)查看特定基因的表達量:需要三個參數(shù),DESeqDataSet, gene name, and the group over which to plot the counts (Figure 8).
topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene=topGene, intgroup=c("dex"))
也可以用ggplot2畫傳統(tǒng)的散點圖
data <- plotCounts(dds, gene=topGene, intgroup=c("dex","cell"), returnData=TRUE)
ggplot(data, aes(x=dex, y=count, color=cell)) +
scale_y_log10() +
geom_point(position=position_jitter(width=.1,height=0), size=3)
ggplot(data, aes(x=dex, y=count, fill=dex)) +
scale_y_log10() +
geom_dotplot(binaxis="y", stackdir="center")
ggplot(data, aes(x=dex, y=count, color=cell, group=cell)) +
scale_y_log10() + geom_point(size=3) + geom_line()
An MA-plot provides a useful overview for an experiment with a two-group comparison:
plotMA(res, ylim=c(-5,5))
可以看到DESeq通過統(tǒng)計學(xué)技術(shù),narrowing垂直方向的離散(左部分)
make an MA-plot for the results table in which we raised the log2 fold change threshold,并標(biāo)記除特定的一個基因:
plotMA(resLFC1, ylim=c(-5,5))
topGene <- rownames(resLFC1)[which.min(resLFC1$padj)]
with(resLFC1[topGene, ], {
points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})
histogram of the p values:
hist(res$pvalue[res$baseMean > 1], breaks=0:20/20, col="grey50", border="white")
5.5)Plotting results
提取前20個highest variance across samples.基因進行聚類
library("genefilter")
topVarGenes <- head(order(rowVars(assay(rld)),decreasing=TRUE),20)
mat <- assay(rld)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(rld)[,c("cell","dex")])
pheatmap(mat, annotation_col=df)
5.6)Independent filtering
qs <- c(0, quantile(resLFC1$baseMean[resLFC1$baseMean > 0], 0:6/6))
bins <- cut(resLFC1$baseMean, qs)
levels(bins) <- paste0("~",round(signif(.5*qs[-1] + .5*qs[-length(qs)],2)))
ratios <- tapply(resLFC1$pvalue, bins, function(p) mean(p < .05, na.rm=TRUE))
barplot(ratios, xlab="mean normalized count", ylab="ratio of small p values")
5.7)Annotating and exporting results
因為這里只有Ensembl gene IDs,而其它的名稱也非常有用。用 mapIds()函數(shù), 其中column 參數(shù)表示需要哪種信息, multiVals參數(shù)表示如果有多個對應(yīng)值的時候如何選擇.,這里選擇第一個. 調(diào)用兩次函數(shù) add the gene symbol and Entrez ID.
library("AnnotationDbi")
library("org.Hs.eg.db")
columns(org.Hs.eg.db)
res$symbol <- mapIds(org.Hs.eg.db,
keys=row.names(res),
column="SYMBOL", #需要哪種信息
keytype="ENSEMBL",
multiVals="first") #有多個對應(yīng)值,這里選第一個
res$entrez <- mapIds(org.Hs.eg.db,
keys=row.names(res),
column="ENTREZID",
keytype="ENSEMBL",
multiVals="first")
resOrdered <- res[order(res$padj),]
head(resOrdered)
5.8)Annotating and exporting results
resOrderedDF <- as.data.frame(resOrdered)[1:100,]
write.csv(resOrderedDF, file="results.csv") #保存結(jié)果
library("ReportingTools") #生成網(wǎng)頁版結(jié)果
htmlRep <- HTMLReport(shortName="report", title="My report",
reportDirectory="./report")
publish(resOrderedDF, htmlRep)
url <- finish(htmlRep)
browseURL(url)
5.9)Plotting fold changes in genomic space
在基因組上展示差異結(jié)果
(resGR <- results(dds, lfcThreshold=1, format="GRanges"))
resGR$symbol <- mapIds(org.Hs.eg.db, names(resGR), "SYMBOL", "ENSEMBL") #加基因名
library("Gviz") #用于可視化for plotting the GRanges and associated metadata: the log fold changes due to dexam- ethasone treatment.
window <- resGR[topGene] + 1e6
strand(window) <- "*"
resGRsub <- resGR[resGR %over% window]
naOrDup <- is.na(resGRsub$symbol) | duplicated(resGRsub$symbol)
resGRsub$group <- ifelse(naOrDup, names(resGRsub), resGRsub$symbol)
sig <- factor(ifelse(resGRsub$padj < .1 & !is.na(resGRsub$padj),"sig","notsig"))
options(ucscChromosomeNames=FALSE)
g <- GenomeAxisTrack()
a <- AnnotationTrack(resGRsub, name="gene ranges", feature=sig)
d <- DataTrack(resGRsub, data="log2FoldChange", baseline=0,
type="h", name="log2 fold change", strand="+")
plotTracks(list(g,d,a), groupAnnotation="group", notsig="grey", sig="hotpink")
5.10)Removing hidden batch effects
詳細(xì)看文檔
library("sva")
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
svseq$sv
par(mfrow=c(2,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$cell,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$cell,vertical=TRUE,main="SV2")
abline(h=0)
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex
ddssva <- DESeq(ddssva)
5.11)Time course experiments
詳細(xì)看文檔
library("fission")
data("fission")
ddsTC <- DESeqDataSet(fission, ~ strain + minute + strain:minute)
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ strain + minute)
resTC <- results(ddsTC)
resTC$symbol <- mcols(ddsTC)$symbol
head(resTC[order(resTC$padj),],4)
data <- plotCounts(ddsTC, which.min(resTC$padj),
intgroup=c("minute","strain"), returnData=TRUE)
ggplot(data, aes(x=minute, y=count, color=strain, group=strain)) +
geom_point() + stat_smooth(se=FALSE,method="loess") + scale_y_log10()
resultsNames(ddsTC)
res30 <- results(ddsTC, name="strainmut.minute30", test="Wald")
res30[which.min(resTC$padj),]
furthermore cluster significant genes by their profiles:
betas <- coef(ddsTC)
colnames(betas)
library("pheatmap")
topGenes <- head(order(resTC$padj),20)
mat <- betas[topGenes, -c(1,2)]
thr <- 3
mat[mat < -thr] <- -thr
mat[mat > thr] <- thr
pheatmap(mat, breaks=seq(from=-thr, to=thr, length=101),
cluster_col=FALSE)
總結(jié)
以上是生活随笔為你收集整理的airway之workflow的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: android 原生分享界面_这些技巧和
- 下一篇: 昨天看王建硕的blog,在说《胡适日记》