當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

典型医学设计实验GEO数据分析 (step-by-step) - 数据获取到标准化

發布時間：2025/3/15 编程问答 23 豆豆

生活随笔收集整理的這篇文章主要介紹了典型医学设计实验GEO数据分析 (step-by-step) - 数据获取到标准化小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

GEO是當今最大、最全的公共基因數據資源庫，包括基因的表達、突變、修飾等信息，涵蓋幾乎所有的疾病，且單個實驗檢測樣品數目較多，是我們分析、學習的很好資源。

實驗設計

原始文章對14個潰瘍性結腸炎病人 (Ulcerative colitis, UC)和15個克羅恩病病人 (Crohn’s disease, CD)的發炎組織和未發炎組織活檢采樣，用Affy芯片檢測基因表達譜，研究發炎組織和未發炎組織的基因表達差異。(Genome-wide Pathway Analysis Using Gene Expression Data of Colonic Mucosa in Patients with Inflammatory Bowel Disease. Inflamm Bowel Dis. 雜志影響因子不高，但是領域的專業雜志)

檢測并安裝依賴的包

a = rownames(installed.packages())install_bioc <- c("Biobase","oligoClasses","ArrayExpress", "pd.hugene.1.0.st.v1", "hugene10sttranscriptcluster.db", "oligo", "arrayQualityMetrics","limma", "topGO", "ReactomePA", "clusterProfiler", "gplots", "ggplot2","geneplotter", "RColorBrewer", "pheatmap", "dplyr","stringr","genefilter")for(i in install_bioc) {if(! i %in% a) BiocManager::install(i, update=F)}# General Bioconductor packages library(Biobase) library(oligoClasses)# Annotation and data import packages library(ArrayExpress) library(pd.hugene.1.0.st.v1) library(hugene10sttranscriptcluster.db)# Quality control and pre-processing packages library(oligo) library(arrayQualityMetrics)# Analysis and statistics packages library(limma) library(topGO) library(ReactomePA) library(clusterProfiler)# Plotting and color options packages library(gplots) library(ggplot2) library(geneplotter) library(RColorBrewer) library(pheatmap)# Formatting/documentation packages #library(rmarkdown) #library(BiocStyle) library(dplyr) #library(tidyr)# Helpers: library(stringr) library(matrixStats) library(genefilter) #library(openxlsx) #library(devtools)

從ArrayExpress下載原始數據

實驗檢測采用的是Affy芯片 (A-AFFY-141 - Affymetrix GeneChip Human Gene 1.0 ST Array [HuGene-1_0-st-v1])，原始數據為CEL格式，存儲于ArrayExpress，索引號是E-MTAB-2967。

使用getAE函數下載原始數據和注釋數據到當前工作目錄 (Rmd所在目錄)，返回一個包含所有下載文件名的列表。

# type: 'raw' to download and extract only the raw data, 'processed' to download and extract only the processed data or 'full' to have both raw and processed data. # anno_AE <- getAE("E-MTAB-2967", type = "raw")

若自動下載不成功，手動點擊下載圖中File部分所有文件，放置到當前目錄。(也可去后臺回復affy數據?獲取)

# type: 'raw' to download and extract only the raw data, 'processed' to download and extract only the processed data or 'full' to have both raw and processed data. # 修改參數local=T，讀入當前目錄的數據 anno_AE <- getAE("E-MTAB-2967", type = "raw", local=T)## Warning in getAE("E-MTAB-2967", type = "raw", local = T): No processed data ## files found in directory /disk1/train/GEO/f1000

anno_AE里面存儲了所有的文件的路徑和名字信息。并把原始數據解壓縮，釋放出CEL文件到當前目錄，方便后續讀取。

anno_AE## $path ## [1] "/disk1/train/GEO/f1000" ## ## $rawFiles ## [1] "164_I_.CEL" "164_II.CEL" "183_I.CEL" "183_II.CEL" "2114_I.CEL" ## [6] "2114_II.CEL" "2209_A.CEL" "2209_B.CEL" "2255_I.CEL" "2255_II.CEL" ## [11] "2400_I.CEL" "2400_II.CEL" "2424_A.CEL" "2424_B.CEL" "255_I.CEL" ## [16] "255_II.CEL" "2826_I.CEL" "2826_II.CEL" "2853_I.CEL" "2853_II.CEL" ## [21] "2978_I.CEL" "2978_II.CEL" "2987_I.CEL" "2987_II.CEL" "2992_I.CEL" ## [26] "2992_II.CEL" "2995_I.CEL" "2995_II.CEL" "321_I.CEL" "321_II.CEL" ## [31] "3222_I.CEL" "3222_II.CEL" "3223_I.CEL" "3223_II.CEL" "3226_I.CEL" ## [36] "3226_II.CEL" "3233_I.CEL" "3233_II.CEL" "3258_I.CEL" "3258_II.CEL" ## [41] "3259_I.CEL" "3259_II.CEL" "3262_I.CEL" "3262_II.CEL" "3266_I.CEL" ## [46] "3266_II.CEL" "3269_I.CEL" "3269_II.CEL" "3271_I.CEL" "3271_II.CEL" ## [51] "3302_I.CEL" "3302_II.CEL" "3332_I.CEL" "3332_II.CEL" "848_A.CEL" ## [56] "848_B.CEL" "888_I.CEL" "888_II.CEL" ## ## $rawArchive ## [1] "E-MTAB-2967.raw.1.zip" "E-MTAB-2967.raw.2.zip" ## ## $processedFiles ## NULL ## ## $processedArchive ## character(0) ## ## $sdrf ## [1] "E-MTAB-2967.sdrf.txt" ## ## $idf ## [1] "E-MTAB-2967.idf.txt" ## ## $adf ## [1] "A-AFFY-141.adf.txt"

原始數據解釋

ArrayExpress的每個數據根據MAGE-TAB?(MicroArray Gene Expression Tabular)指南存儲，包含5種不同類型的文件：

IDF (Investigation Description Format)：研究描述文件，包含實驗信息如題目、描述、提交者聯系方式、實驗操作過程等。
ADF (Array Design Format): 芯片設計文件
SDRF (Sample and Data Relationship Format): 樣本屬性文件，如實驗分組、處理方式、取樣部位等。
原始數據文件 (raw)
加工后的數據文件 (processed data files)

ExpressionSets數據結構描述

組學數據通常比較復雜，包含很多不同的部分，如實驗樣品信息、基因組注釋信息和實驗數據，對應到芯片數據是樣品屬性和分組信息、不同來源的基因ID信息和功能注釋信息、基因表達矩陣。

為了更好的組織這些數據，Bioconductor的Biobase包定義了一個標準化的數據結構 (ExpressionSet類)存儲這些數據。ExpressionSet類包含下面幾部分：

assayData: 芯片實驗的表達數據，探針在行，樣品名字在列，用函數exprs獲取
metaData
phenoData: 樣品描述信息，樣品名字在行，實驗分組、處理方式、取樣部位在列。通常是SDRF文件的信息的讀入。用函數pData獲取。
featureData: 基因注釋特征信息，行為探針名字，列為探針對應的基因、轉錄本名字和相應的注釋信息等。用函數fData獲取。
experimentData: 對實驗描述的其它信息。

ExpressionSet類可以幫我們協調數據修改、提取過程中的所有信息的一致性。但是要注意phenoData的行名字必須與assayData的列名字一致，都是樣品標識符；assayData的行名字必須與featureData的行名字一致，都是基因標識符，具體見下圖：

導入數據，存儲為”ExpressionSet”

讀入SDRF數據。

sdrf_location <- file.path("E-MTAB-2967.sdrf.txt") SDRF <- read.delim(sdrf_location)

我們查看下，讀進來的數據長什么樣子，有哪些信息？

SDRF[1:3,1:5]

總共有哪些樣品相關的信息？取樣個體標記、物種、疾病、取樣部分、是否患病、raw data存儲的位置和名字等。

t(SDRF[1, ])## 1 ## Source.Name "164_I" ## Characteristics.individual. "164" ## Characteristics.organism. "Homo sapiens" ## Characteristics.disease. "Crohn's disease" ## Characteristics.organism.part. "colon" ## Characteristics.phenotype. "non-inflamed colonic mucosa" ## Material.Type "organism part" ## Protocol.REF "P-MTAB-41361" ## Protocol.REF.1 "P-MTAB-41363" ## Extract.Name "164_I" ## Protocol.REF.2 "P-MTAB-41364" ## Labeled.Extract.Name "164_I:Biotin" ## Label "biotin " ## Protocol.REF.3 "P-MTAB-41366" ## Assay.Name "164_I_" ## Technology.Type "array assay" ## Array.Design.REF "A-AFFY-141" ## Term.Source.REF "ArrayExpress" ## Protocol.REF.4 "P-MTAB-41367" ## Array.Data.File "164_I_.CEL" ## Comment..ArrayExpress.FTP.file. "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip" ## Factor.Value.disease. "Crohn's disease" ## Factor.Value.phenotype. "non-inflamed colonic mucosa"

Array.Data.File存儲了樣品名字信息 (CEL文件名)，用作行名字。把SDRF數據表轉為AnnotatedDataFrame格式用于下游構建ExpressionSet對象。

rownames(SDRF) <- SDRF$Array.Data.File SDRF <- AnnotatedDataFrame(SDRF)

Affymetrix芯片的原始數據是CEL文件，里面包含檢測到的探針雜交密度值，代表原始基因表達量。另外每個CEL還含有額外信息，如芯片類型、掃描時間等，常用于做批次校正。

oligo包的read.celfiles函數可以讀取這些文件。(雖然不影響，但有幾個文件命名不規律，如164_I_多寫了個下劃線，有幾個樣品用A,B代替了I,II。不規范的名字是不太利于批量分析和下游識別的，引以為戒。)

raw_data <- oligo::read.celfiles(filenames = as.character(SDRF$Array.Data.File),verbose = FALSE, phenoData = SDRF) stopifnot(validObject(raw_data))

獲得的raw_data就是一個ExpressionSet對象。raw_data@phenoData@data(或者pData(raw_data))是對應的樣品屬性信息，與SDRF信息一致。raw_data@annotation獲取注釋平臺信息pd.hugene.1.0.st.v1。

str(raw_data)## Formal class 'GeneFeatureSet' [package "oligoClasses"] with 9 slots ## ..@ manufacturer : chr "Affymetrix" ## ..@ intensityFile : chr NA ## ..@ assayData :<environment: 0x815aba0> ## ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots ## .. .. ..@ varMetadata :'data.frame': 23 obs. of 2 variables: ## .. .. .. ..$ labelDescription: chr [1:23] NA NA NA NA ... ## .. .. .. ..$ channel : Factor w/ 2 levels "exprs","_ALL_": 2 2 2 2 2 2 2 2 2 2 ... ## .. .. ..@ data :'data.frame': 58 obs. of 23 variables: ## .. .. .. ..$ Source.Name : Factor w/ 58 levels "164_I","164_II",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ Characteristics.individual. : int [1:58] 164 164 183 183 2114 2114 2209 2209 2255 2255 ... ## .. .. .. ..$ Characteristics.organism. : Factor w/ 1 level "Homo sapiens": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Characteristics.disease. : Factor w/ 2 levels "Crohn's disease",..: 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Characteristics.organism.part. : Factor w/ 1 level "colon": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Characteristics.phenotype. : Factor w/ 2 levels "inflamed colonic mucosa",..: 2 1 2 1 2 1 2 1 2 1 ... ## .. .. .. ..$ Material.Type : Factor w/ 1 level "organism part": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Protocol.REF : Factor w/ 1 level "P-MTAB-41361": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Protocol.REF.1 : Factor w/ 1 level "P-MTAB-41363": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Extract.Name : Factor w/ 58 levels "164_I","164_II",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ Protocol.REF.2 : Factor w/ 1 level "P-MTAB-41364": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Labeled.Extract.Name : Factor w/ 58 levels "164_I:Biotin",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ Label : Factor w/ 1 level "biotin ": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Protocol.REF.3 : Factor w/ 1 level "P-MTAB-41366": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Assay.Name : Factor w/ 58 levels "164_I_","164_II",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ Technology.Type : Factor w/ 1 level "array assay": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Array.Design.REF : Factor w/ 1 level "A-AFFY-141": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Term.Source.REF : Factor w/ 1 level "ArrayExpress": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Protocol.REF.4 : Factor w/ 1 level "P-MTAB-41367": 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Array.Data.File : Factor w/ 58 levels "164_I_.CEL","164_II.CEL",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ Comment..ArrayExpress.FTP.file.: Factor w/ 2 levels "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip",..: 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Factor.Value.disease. : Factor w/ 2 levels "Crohn's disease",..: 1 1 1 1 1 1 1 1 1 1 ... ## .. .. .. ..$ Factor.Value.phenotype. : Factor w/ 2 levels "inflamed colonic mucosa",..: 2 1 2 1 2 1 2 1 2 1 ... ## .. .. ..@ dimLabels : chr [1:2] "rowNames" "columnNames" ## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot ## .. .. .. .. ..@ .Data:List of 1 ## .. .. .. .. .. ..$ : int [1:3] 1 1 0 ## ..@ featureData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots ## .. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable: ## .. .. .. ..$ labelDescription: chr(0) ## .. .. ..@ data :'data.frame': 1102500 obs. of 0 variables ## .. .. ..@ dimLabels : chr [1:2] "featureNames" "featureColumns" ## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot ## .. .. .. .. ..@ .Data:List of 1 ## .. .. .. .. .. ..$ : int [1:3] 1 1 0 ## ..@ experimentData :Formal class 'MIAME' [package "Biobase"] with 13 slots ## .. .. ..@ name : chr "" ## .. .. ..@ lab : chr "" ## .. .. ..@ contact : chr "" ## .. .. ..@ title : chr "" ## .. .. ..@ abstract : chr "" ## .. .. ..@ url : chr "" ## .. .. ..@ pubMedIds : chr "" ## .. .. ..@ samples : list() ## .. .. ..@ hybridizations : list() ## .. .. ..@ normControls : list() ## .. .. ..@ preprocessing : list() ## .. .. ..@ other : list() ## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot ## .. .. .. .. ..@ .Data:List of 2 ## .. .. .. .. .. ..$ : int [1:3] 1 0 0 ## .. .. .. .. .. ..$ : int [1:3] 1 1 0 ## ..@ annotation : chr "pd.hugene.1.0.st.v1" ## ..@ protocolData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots ## .. .. ..@ varMetadata :'data.frame': 2 obs. of 2 variables: ## .. .. .. ..$ labelDescription: chr [1:2] "Names of files used in 'exprs'" "Run dates for files used in 'exprs'" ## .. .. .. ..$ channel : Factor w/ 2 levels "exprs","_ALL_": 2 2 ## .. .. ..@ data :'data.frame': 58 obs. of 2 variables: ## .. .. .. ..$ exprs: Factor w/ 58 levels "164_I_.CEL","164_II.CEL",..: 1 2 3 4 5 6 7 8 9 10 ... ## .. .. .. ..$ dates: Factor w/ 58 levels "2010-06-09T11:39:13Z",..: 56 55 9 10 6 8 58 57 5 7 ... ## .. .. ..@ dimLabels : chr [1:2] "rowNames" "columnNames" ## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot ## .. .. .. .. ..@ .Data:List of 1 ## .. .. .. .. .. ..$ : int [1:3] 1 1 0 ## ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot ## .. .. ..@ .Data:List of 4 ## .. .. .. ..$ : int [1:3] 3 5 1 ## .. .. .. ..$ : int [1:3] 2 42 0 ## .. .. .. ..$ : int [1:3] 1 3 0 ## .. .. .. ..$ : int [1:3] 1 0 0# 如果用Rstudio, View(raw_data)顯示更好 # View(raw_data)

查看樣本屬性信息

head(Biobase::pData(raw_data))## Source.Name Characteristics.individual. Characteristics.organism. ## 164_I_.CEL 164_I 164 Homo sapiens ## 164_II.CEL 164_II 164 Homo sapiens ## 183_I.CEL 183_I 183 Homo sapiens ## 183_II.CEL 183_II 183 Homo sapiens ## 2114_I.CEL 2114_I 2114 Homo sapiens ## 2114_II.CEL 2114_II 2114 Homo sapiens ## Characteristics.disease. Characteristics.organism.part. ## 164_I_.CEL Crohn's disease colon ## 164_II.CEL Crohn's disease colon ## 183_I.CEL Crohn's disease colon ## 183_II.CEL Crohn's disease colon ## 2114_I.CEL Crohn's disease colon ## 2114_II.CEL Crohn's disease colon ## Characteristics.phenotype. Material.Type Protocol.REF ## 164_I_.CEL non-inflamed colonic mucosa organism part P-MTAB-41361 ## 164_II.CEL inflamed colonic mucosa organism part P-MTAB-41361 ## 183_I.CEL non-inflamed colonic mucosa organism part P-MTAB-41361 ## 183_II.CEL inflamed colonic mucosa organism part P-MTAB-41361 ## 2114_I.CEL non-inflamed colonic mucosa organism part P-MTAB-41361 ## 2114_II.CEL inflamed colonic mucosa organism part P-MTAB-41361 ## Protocol.REF.1 Extract.Name Protocol.REF.2 Labeled.Extract.Name ## 164_I_.CEL P-MTAB-41363 164_I P-MTAB-41364 164_I:Biotin ## 164_II.CEL P-MTAB-41363 164_II P-MTAB-41364 164_II:Biotin ## 183_I.CEL P-MTAB-41363 183_I P-MTAB-41364 183_I:Biotin ## 183_II.CEL P-MTAB-41363 183_II P-MTAB-41364 183_II:Biotin ## 2114_I.CEL P-MTAB-41363 2114_I P-MTAB-41364 2114_I:Biotin ## 2114_II.CEL P-MTAB-41363 2114_II P-MTAB-41364 2114_II:Biotin ## Label Protocol.REF.3 Assay.Name Technology.Type Array.Design.REF ## 164_I_.CEL biotin P-MTAB-41366 164_I_ array assay A-AFFY-141 ## 164_II.CEL biotin P-MTAB-41366 164_II array assay A-AFFY-141 ## 183_I.CEL biotin P-MTAB-41366 183_I array assay A-AFFY-141 ## 183_II.CEL biotin P-MTAB-41366 183_II array assay A-AFFY-141 ## 2114_I.CEL biotin P-MTAB-41366 2114_I array assay A-AFFY-141 ## 2114_II.CEL biotin P-MTAB-41366 2114_II array assay A-AFFY-141 ## Term.Source.REF Protocol.REF.4 Array.Data.File ## 164_I_.CEL ArrayExpress P-MTAB-41367 164_I_.CEL ## 164_II.CEL ArrayExpress P-MTAB-41367 164_II.CEL ## 183_I.CEL ArrayExpress P-MTAB-41367 183_I.CEL ## 183_II.CEL ArrayExpress P-MTAB-41367 183_II.CEL ## 2114_I.CEL ArrayExpress P-MTAB-41367 2114_I.CEL ## 2114_II.CEL ArrayExpress P-MTAB-41367 2114_II.CEL ## Comment..ArrayExpress.FTP.file. ## 164_I_.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## 164_II.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## 183_I.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## 183_II.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## 2114_I.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## 2114_II.CEL ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip ## Factor.Value.disease. Factor.Value.phenotype. ## 164_I_.CEL Crohn's disease non-inflamed colonic mucosa ## 164_II.CEL Crohn's disease inflamed colonic mucosa ## 183_I.CEL Crohn's disease non-inflamed colonic mucosa ## 183_II.CEL Crohn's disease inflamed colonic mucosa ## 2114_I.CEL Crohn's disease non-inflamed colonic mucosa ## 2114_II.CEL Crohn's disease inflamed colonic mucosa

這么展示更利于查看

t(Biobase::pData(raw_data)[1,])## 164_I_.CEL ## Source.Name "164_I" ## Characteristics.individual. "164" ## Characteristics.organism. "Homo sapiens" ## Characteristics.disease. "Crohn's disease" ## Characteristics.organism.part. "colon" ## Characteristics.phenotype. "non-inflamed colonic mucosa" ## Material.Type "organism part" ## Protocol.REF "P-MTAB-41361" ## Protocol.REF.1 "P-MTAB-41363" ## Extract.Name "164_I" ## Protocol.REF.2 "P-MTAB-41364" ## Labeled.Extract.Name "164_I:Biotin" ## Label "biotin " ## Protocol.REF.3 "P-MTAB-41366" ## Assay.Name "164_I_" ## Technology.Type "array assay" ## Array.Design.REF "A-AFFY-141" ## Term.Source.REF "ArrayExpress" ## Protocol.REF.4 "P-MTAB-41367" ## Array.Data.File "164_I_.CEL" ## Comment..ArrayExpress.FTP.file. "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-2967/E-MTAB-2967.raw.1.zip" ## Factor.Value.disease. "Crohn's disease" ## Factor.Value.phenotype. "non-inflamed colonic mucosa"

篩選并保留關心的樣本屬性信息，個體信息 (Source.Name,Characteristics.individual.)，疾病信息 (Factor.Value.disease.), 發炎與否 (Factor.Value.phenotype.)。

Biobase::pData(raw_data) <- Biobase::pData(raw_data)[, c("Source.Name", "Characteristics.individual.", "Factor.Value.disease.", "Factor.Value.phenotype.")]

原始數據質控

exprs(raw_data)可獲取原始的表達信息，行代表芯片探針在芯片上的位置，列代表每個樣品。

Biobase::exprs(raw_data)[1:5, 1:5]## 164_I_.CEL 164_II.CEL 183_I.CEL 183_II.CEL 2114_I.CEL ## 1 4496 5310 4492 4511 2872 ## 2 181 280 137 101 91 ## 3 4556 5104 4379 4608 2972 ## 4 167 217 99 79 82 ## 5 89 110 69 58 47

對表達數據取對數，并進行主成分分析。每個點代表一個樣品，顏色代表是否發炎，形狀代表疾病類型。從圖中可以看出，在第一主成分兩種疾病區分比較明顯，在第二主成分疾病的區別也略大于是否發炎。而我們的關注點是發炎，而不是不同類型的疾病，需要在下游分析中考慮移除疾病的影響。

exp_raw <- log2(Biobase::exprs(raw_data)) PCA_raw <- prcomp(t(exp_raw), scale. = FALSE)percentVar <- round(100*PCA_raw$sdev^2/sum(PCA_raw$sdev^2),1) sd_ratio <- sqrt(percentVar[2] / percentVar[1])dataGG <- data.frame(PC1 = PCA_raw$x[,1], PC2 = PCA_raw$x[,2],Disease = pData(raw_data)$Factor.Value.disease.,Phenotype = pData(raw_data)$Factor.Value.phenotype.,Individual = pData(raw_data)$Characteristics.individual.)ggplot(dataGG, aes(PC1, PC2)) +geom_point(aes(shape = Disease, colour = Phenotype)) +ggtitle("PCA plot of the log-transformed raw expression data") +xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +theme(plot.title = element_text(hjust = 0.5))+coord_fixed(ratio = sd_ratio) +scale_shape_manual(values = c(4,15)) + scale_color_manual(values = c("darkorange2", "dodgerblue4"))

另外，需要看下原始數據的表達分布。從圖中看出，不同的芯片原始表達分布不均一，需要做合適的標準化處理。

# oligo::boxplot(raw_data, target = "core", # main = "Boxplot of log2-intensitites for the raw data") dataBoxplot <- data.frame(t(exp_raw), Sample=colnames(exp_raw),Disease = pData(raw_data)$Factor.Value.disease.,Phenotype = pData(raw_data)$Factor.Value.phenotype.,Individual = pData(raw_data)$Characteristics.individual.)dataBoxplot <- reshape2::melt(dataBoxplot, id.vars=c("Sample","Disease","Phenotype","Individual"))ggplot(dataBoxplot, aes(x=Sample, y=value)) + geom_boxplot(aes(color=Disease)) + theme_bw() + theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1))

arrayQualityMetrics可以對芯片進行更多層面的檢測，并生成一個檢測報告。

arrayQualityMetrics(expressionset = raw_data,outdir = "ysx",force = TRUE, do.logtransform = TRUE,intgroup = c("Factor.Value.disease.", "Factor.Value.phenotype."))

背景校正

鑒于探針的熒光強度會受到非特異性雜交和光學檢測系統噪音的影響，因此需要對信號強度進行背景校正，從而能更真實的反映和比較特異性雜交產生的信號，更好的反應基因表達情況。

芯片間標準化 (calibration)

不同的芯片雜交受很多因素影響，比如反轉錄效率，標記或雜交反應，芯片物理性質，試劑批次和實驗室環境等，校正后，才可以在不同芯片之間可比。

Summarization

在Affymetrix平臺，一個轉錄本通常用多個探針檢測。對每個基因來說，需要根據所有探針的信息估算出其相對于總的轉錄本的表達值，也就是Summarization。隨后就可以使用注釋數據集獲得基因的symbols或ENSEMBL ID。我們這個平臺的注釋信息存儲在hugene10sttranscriptcluster.db包中。

head(ls("package:hugene10sttranscriptcluster.db"))## [1] "hugene10sttranscriptcluster" ## [2] "hugene10sttranscriptcluster_dbconn" ## [3] "hugene10sttranscriptcluster_dbfile" ## [4] "hugene10sttranscriptcluster_dbInfo" ## [5] "hugene10sttranscriptcluster_dbschema" ## [6] "hugene10sttranscriptcluster.db"

Affymetrix芯片”probesets”的變更

在采用芯片檢測樣品前，需要先提取RNA，加上一個熒光標簽，并進行至少一輪PCR擴增以增加檢測量。傳統上這一步是根據mRNA``3'端的poly-A序列設計引物完成反轉。這樣最終的擴增產物里面3’端的序列會更多些。而且如果起始RNA片段化了或降解了，5’端被檢測到的幾率更低了。因此設計探針時，更多的探針也是針對3’端設置的，這就是3' IVT Affymetrix芯片。它是基于探針集的，探針集的一組固定探針用于檢測一個特定的基因或轉錄本。在一些情況下，如選擇性Poly-A位點出現時，一個基因需要設計多個探針集檢測。

而后來的Gene/ExonAffymetrix芯片是基于外顯子的，探針集來源于外顯子區。這時為了獲得基因的表達，需要做兩步summarization，第一步是從探針集獲得外顯子的表達，然后根據多個來源于同一個基因或轉錄本的外顯子獲取基因的表達。因此需要合適的基因注釋包，如我們這用到的hugene10sttranscriptcluster.db。

Gene芯片相比于Exon芯片探針更少，只保留Exon芯片的一部分”good”探針。在Exon芯片上至少需要4個探針來代表一個Exon,而在Gene芯片上，很多代表一個基因的探針集只有3個或更少的探針。

如下圖所示，用單一顏色代表一個探針集，一個基因的探針集用一個顏色集合表示，如所有黃色探針代表一個外顯子，所有黃色、橙色、紅色探針集都用于檢測同一個基因。

左圖中，每個外顯子或探針集包含很多探針 (同樣顏色的塊出現多次)，因此做探針集或外顯子水平的summarization是合適的。而對于gene芯片，每個探針集的探針數目很少，因此在探針集或外顯子水平的summarization是不推薦的，雖然也可以通過包hugene10stprobeset.db來完成。

而且在Gene和Exon芯片上不再有錯配探針 (mismatch probes)。錯配探針本來是用做背景校正排除非特異性雜交和背景噪音的，但被更好的計算方式取代。

一步完成背景校正、標準化、表達整合

oligo包提供了一個函數rma允許一步完成基于去卷積的背景校正、分位數校正標準化 (quantile normalization)和RMA (robust multichip average) 表達整合 (summarization)。

相對log表達 (RLE) 質控分析

我們先利用rma函數進行背景校正和summarization獲得基因的表達量，而略過標準化。rma函數輸出的結果已經做過對數處理。

palmieri_eset <- oligo::rma(raw_data, target = "core", normalize = FALSE)## Background correcting ## Calculating Expression

然后進行RLE (Relative Log Expression)分析，首先計算每個轉錄本在所有樣品中的表達中位數，然后每個轉錄本的表達減去這個中位數，整理格式進行繪圖。

row_medians_assayData <- Biobase::rowMedians(as.matrix(Biobase::exprs(palmieri_eset)))# 減去中值 RLE_data <- sweep(Biobase::exprs(palmieri_eset), 1, row_medians_assayData)RLE_data <- as.data.frame(RLE_data) # 轉換為長格式 RLE_data_gathered <- tidyr::gather(RLE_data, patient_array, log2_expression_deviation)ggplot2::ggplot(RLE_data_gathered, aes(patient_array, log2_expression_deviation)) + geom_boxplot(outlier.shape = NA) + ylim(c(-2, 2)) + theme(axis.text.x = element_text(colour = "aquamarine4", angle = 60, size = 6.5, hjust = 1 ,face = "bold"))

圖中，Y-軸代表每個芯片中轉錄本表達與該轉錄本在所有芯片表達中位數的偏差。箱體越延展表明芯片中轉錄本的表達量與中位表達差別越大，箱體在Y-軸的偏移表示大部分轉錄本的表達定量存在系統的增高或降低。這可能存在質量問題或是檢測批次效應造成的。因此如果一個樣品的boxplot的形狀和中值與其它樣品差別很大，需要進一步評估是否為異常樣本，并考慮移除。

圖中5個中位數在-0.7~-0.8之間的樣品?2826_I,?2826_II,?3262_II,3302_II和3332_II可能是異常樣品。需要結合后面的聚類分析，再確認是否移除。

校正三部曲

background-correct, normalize and summarize.

palmieri_eset_norm <- oligo::rma(raw_data, target = "core")## Background correcting ## Normalizing ## Calculating Expression

參數target定義summarization的程度，默認是core表示獲得基因水平的表達 (using transcript clusters containing “safely” annotated genes) (對gene芯片，只有這么一個選項)。如果是外顯子芯片，還可以選擇extended和full。如果想獲得外顯子水平的表達，也可以用probeset做參數，但是一般不推薦。

除了RMA還有其它的背景校正和標準化方法，但RMA通常是一個好的默認選擇。RMA在所有芯片間使用quantile normalization標準化方法使得不同樣品檢測的表達值可比。

出校準化后的表達值

norm_exprs <- Biobase::exprs(palmieri_eset_norm) norm_exprs <- data.frame(ID=rownames(norm_exprs), norm_exprs) write.table(norm_exprs, "normalized_probe_expr.tsv", sep="\t", row.names=F, quote=F)

校準化后數據的質量評估

主成分分析

exp_palmieri <- Biobase::exprs(palmieri_eset_norm) PCA <- prcomp(t(exp_palmieri), scale = FALSE)percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1) sd_ratio <- sqrt(percentVar[2] / percentVar[1])dataGG <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],Disease = Biobase::pData(palmieri_eset_norm)$Factor.Value.disease.,Phenotype = Biobase::pData(palmieri_eset_norm)$Factor.Value.phenotype.)ggplot(dataGG, aes(PC1, PC2)) +geom_point(aes(shape = Disease, colour = Phenotype)) +ggtitle("PCA plot of the calibrated, summarized data") +xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) +ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) +theme(plot.title = element_text(hjust = 0.5)) +coord_fixed(ratio = sd_ratio) +scale_shape_manual(values = c(4,15)) + scale_color_manual(values = c("darkorange2", "dodgerblue4"))

第一主坐標軸大體可以分開發炎與未發炎樣品，第二主坐標軸分開兩種不同的疾病。(從圖上看，還是疾病的區分更明顯)

熱圖聚類分析

通過熱圖展示樣品之間的相似度關系，并且標記樣品的屬性如是否發炎或是否來自同一個疾病。

# 生成行注釋 phenotype_names <- ifelse(str_detect(pData(palmieri_eset_norm)$Factor.Value.phenotype., "non"), "non_infl.", "infl.")disease_names <- ifelse(str_detect(pData(palmieri_eset_norm)$Factor.Value.disease., "Crohn"), "CD", "UC")annotation_for_heatmap <- data.frame(Phenotype = phenotype_names, Disease = disease_names)row.names(annotation_for_heatmap) <- row.names(pData(palmieri_eset_norm))

計算Manhattan距離，并進行聚類分析

# 使用dist計算樣本之間的距離，首選做下轉置，因為dist是對列進行操作的 # dist默認是歐式距離，反應直線路徑的平方距離 # 這里用Manhattan距離，計算的是直角路徑的絕對距離，計算結果更穩定 dists <- as.matrix(dist(t(exp_palmieri), method = "manhattan"))rownames(dists) <- row.names(pData(palmieri_eset_norm)) hmcol <- rev(colorRampPalette(RColorBrewer::brewer.pal(9, "YlOrRd"))(255)) colnames(dists) <- NULL # 設置對角距離為NA，增強顏色的展示力 diag(dists) <- NA# www.ehbio.com/ImageGP ann_colors <- list(Phenotype = c(non_infl. = "chartreuse4", infl. = "burlywood3"),Disease = c(CD = "blue4", UC = "cadetblue2")) pheatmap(dists, col = (hmcol), annotation_row = annotation_for_heatmap,annotation_colors = ann_colors,legend = TRUE, treeheight_row = 0,legend_breaks = c(min(dists, na.rm = TRUE), max(dists, na.rm = TRUE)), legend_labels = (c("small distance", "large distance")),main = "Clustering heatmap for the calibrated samples")

熱圖中上部黃色的條帶可能是異常樣品 (2826_II, 3262_II, 3271_I, 2978_II 和 3332_II)，部分樣品與之前RLE圖推測可能是異常的樣品重合，可以考慮移除。這里為了跟原文一致，所有樣品都保留了下來。

另外arrayQualityMetrics有更精確的計算異常樣品的方式，我們后續也會提供基于網絡圖距離的異常值剔除方法。

過濾低表達基因

芯片數據中有一大部分探針的信號值與背景值相差不大，而且在不同樣品中差別不大，不能給我們的生物問題提供更多信息。(注：雖然有部分表達低的基因表達變化比較小時就可以有比較大的作用，但因為其表達低，檢測噪音也大，結果可信度一般)

首選繪制下每個轉錄本在所有芯片中的表達中位數的分布 (共計33,297轉錄本)。可以看到左側區域富集低表達的轉錄本，具體選擇移除的閾值主觀性比較強，這里選擇表達中位數需要大于4，給后續分析多保留一些基因。也可以排序一下，保留前75%的基因。

palmieri_medians <- rowMedians(Biobase::exprs(palmieri_eset_norm))hist_res <- hist(palmieri_medians, 100, col = "cornsilk1", freq = FALSE, main = "Histogram of the median intensities", border = "antiquewhite4",xlab = "Median intensities")

man_threshold <- 4hist_res <- hist(palmieri_medians, 100, col = "cornsilk", freq = FALSE, main = "Histogram of the median intensities",border = "antiquewhite4",xlab = "Median intensities")abline(v = man_threshold, col = "coral4", lwd = 2)

假如最小的實驗組有5個樣品，如果一個轉錄本檢測到表達值大于之前設定的閾值 (man_threshold)的樣本數小于5，則移除。

首先獲得各個實驗組的樣品數

no_of_samples <- table(paste0(pData(palmieri_eset_norm)$Factor.Value.disease., "_", pData(palmieri_eset_norm)$Factor.Value.phenotype.)) no_of_samples ## ## Crohn's disease_inflamed colonic mucosa ## 15 ## Crohn's disease_non-inflamed colonic mucosa ## 15 ## ulcerative colitis_inflamed colonic mucosa ## 14 ## ulcerative colitis_non-inflamed colonic mucosa ## 14# 獲得最小樣品數 samples_cutoff <- min(no_of_samples)# sum(x > man_threshold)：轉錄本在多少樣品中表達值高于給定閾值 idx_man_threshold <- apply(Biobase::exprs(palmieri_eset_norm), 1,function(x){sum(x > man_threshold) >= samples_cutoff}) table(idx_man_threshold)## idx_man_threshold ## FALSE TRUE ## 10493 22804# 提取過濾轉錄本 palmieri_manfiltered <- subset(palmieri_eset_norm, idx_man_threshold)

注釋轉錄本簇

在我們構建線性模型做差異分析之前，先把探針映射到Gene symbol，并只保留這部分基因用于后續分析。

# 使用select函數，獲得探針ID對應的Gene symbol和描述 anno_palmieri <- AnnotationDbi::select(hugene10sttranscriptcluster.db,keys = (featureNames(palmieri_manfiltered)),columns = c("SYMBOL", "GENENAME"),keytype = "PROBEID") # 過濾掉沒有Symbol的探針 anno_palmieri <- subset(anno_palmieri, !is.na(SYMBOL)) head(anno_palmieri)

如果一個探針對應于多個Gene symbol，我們沒有辦法判斷到底是哪個，只能移除。

# 按探針ID分組 anno_grouped <- group_by(anno_palmieri, PROBEID)# 計算每組Symbol的個數 anno_summarized <- dplyr::summarize(anno_grouped, no_of_matches = n_distinct(SYMBOL))head(anno_summarized)# 過濾Symbol個數大于1的 anno_filtered <- filter(anno_summarized, no_of_matches > 1)head(anno_filtered)probe_stats <- anno_filtered nrow(probe_stats)## [1] 1763

有1763個探針ID對應多個Gene Symbol，從Assya數據中移除這些探針ID。

# featureNames(palmieri_manfiltered): 獲取所有探針名字 ids_to_exlude <- (featureNames(palmieri_manfiltered) %in% probe_stats$PROBEID)table(ids_to_exlude)## ids_to_exlude ## FALSE TRUE ## 21041 1763palmieri_final <- subset(palmieri_manfiltered, !ids_to_exlude)validObject(palmieri_final)## [1] TRUE

前面是從assay數據中移除了這些探針ID，后面需要從feature data中移除:

head(anno_palmieri)

現在的feature data是空的，增加一列PROBID。

# 增加一列PROBID fData(palmieri_final)$PROBEID <- rownames(fData(palmieri_final)) # 使用left_join根據PROBID列，合并，增加注釋信息 fData(palmieri_final) <- left_join(fData(palmieri_final), anno_palmieri)## Joining, by = "PROBEID"# restore rownames after left_join rownames(fData(palmieri_final)) <- fData(palmieri_final)$PROBEID validObject(palmieri_final)## [1] TRUE

總結

以上是生活随笔為你收集整理的典型医学设计实验GEO数据分析 (step-by-step) - 数据获取到标准化的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇： PATH和path，傻傻分不清
下一篇：改写教科书！人类细胞可将RNA序列写入D