當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

TCGA_改版后STAR-count处理方法

發(fā)布時(shí)間：2023/12/8 编程问答 31 豆豆

生活随笔收集整理的這篇文章主要介紹了 TCGA_改版后STAR-count处理方法小編覺得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

TCGA改版后，workflow.type只有STAR-counts數(shù)據(jù)，先對(duì)所嘗試的幾種處理方法進(jìn)行記錄：

R version 4.1.2 ； TCGAbiolinks version?2.23.11

方法1

最新版TCGA 矩陣整理，百分百復(fù)現(xiàn)成功_sayhello1025的博客-CSDN博客

一、從TCGA網(wǎng)站上下載tsv文件和json文件

1 tsv文件?

query <- GDCquery(project = "TCGA-PRAD", #項(xiàng)目名data.category = "Transcriptome Profiling",data.type = "Gene Expression Quantification",workflow.type = "STAR - Counts")t_samplesDown <- getResults(query,cols=c("cases")) #從sampleDown檢索出TP t_dataSmTP <- TCGAquery_SampleTypes(barcode = t_samplesDown,typesample = "TP") #可選#篩選完成的query queryDown <- GDCquery(project = "TCGA-PRAD",data.category = "Transcriptome Profiling",data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", barcode = t_dataSmTP) #下載GDCquery的結(jié)果 GDCdownload(queryDown,method = "api",directory = "GDCdata_star_count",files.per.chunk = 6) #網(wǎng)速慢可以設(shè)小一點(diǎn)

?跟如下從download 中下載cart效果一樣

2 json文件

點(diǎn)擊Metadata下載json文件

?二、整合tsv文件、json文件到一個(gè)文件夾

1 右上角搜索“.tsv”

2?復(fù)制到新文件夾里面，整理如下：

本機(jī)路徑：'E:/R/PRAD Data Mining/PRAD_data_mining/TCGA/GDCdata_star_count.tsv/all/'

?三、提取count矩陣

rm(list=ls()) options(stringsAsFactors = F)library("rjson")

1 整理“all”中的文件

result <- fromJSON(file = "E:/R/PRAD Data Mining/PRAD_data_mining/TCGA/GDCdata_star_count.tsv/metadata.cart.2022-04-17.json") metadata <- data.frame(t(sapply(result,function(x){id <- x$associated_entities[[1]]$entity_submitter_idfile_name <- x$file_nameall <- cbind(id,file_name) }))) rownames(metadata) <- metadata[,2]t_dir <- 'E:/R/PRAD Data Mining/PRAD_data_mining/TCGA/GDCdata_star_count.tsv/all/' t_samples=list.files(t_dir) sampledir <- paste0(t_dir,t_samples) #各個(gè)文件路徑

View(metadata)

metadata中記錄barcode文件名的對(duì)應(yīng)關(guān)系

example <- data.table::fread('E:/R/PRAD Data Mining/PRAD_data_mining/TCGA/GDCdata_star_count.tsv/all/005d2b9e-722c-40bd-aa5c-bd4e8842cb04.rna_seq.augmented_star_gene_counts.tsv',data.table = F)#讀入一個(gè)tsv文件，查看需要的列數(shù)，“unstranded”raw <- do.call(cbind,lapply(sampledir, function(x){rt <- data.table::fread(x,data.table = F) #data.table::fread函數(shù)rownames(rt) <- rt[,1]rt <- rt[,4]###第4列為“unstranded” }))

View(example)

載入一個(gè)例子，需要的 ”unstranded count”?值數(shù)據(jù)是第4列；同時(shí)第5行之后才可讀

?View(raw)

行名、列名未注釋??

2?替換列名和行名

重點(diǎn)說一下sapply(strsplit(sampledir,'/'),'[',8) #數(shù)字可選。意思是將每個(gè)sampledir按“/”分隔后，取第“8”個(gè)，對(duì)應(yīng)的是tsv文件的文件名。比方說我的sampledir[1]：

'E:/R/PRAD Data Mining/PRAD_data_mining/TCGA/GDCdata_star_count.tsv/all/005d2b9e-722c-40bd-aa5c-bd4e8842cb04.rna_seq.augmented_star_gene_counts.tsv'

按“/”分隔后，第“8”個(gè)部分是tsv文件的文件名，所以我這里的參數(shù)為“8”

colnames(raw)=sapply(strsplit(sampledir,'/'),'[',8) #數(shù)字可選,'8'為文件名005d2b9e-722c-40bd-aa5c-bd4e8842cb04.rna_seq.augmented_star_gene_counts.tsv rownames(raw) <- example$gene_id ##行名 raw_t <- t(raw)t_same <- intersect(rownames(metadata),rownames(raw_t))dataPrep2 <- cbind(metadata[t_same,],raw_t[t_same,]) rownames(dataPrep2) <- dataPrep2[,1] dataPrep2 <- t(dataPrep2) dataPrep2 <-dataPrep2[-c(1:6),] #dataPrep2為未注釋count矩陣

View(raw)

View(dataPrep2)

?dataPrep2為未基因注釋的count矩陣，格式為“matrix”

3?dataPrep2中數(shù)據(jù)類型為“character”，需要轉(zhuǎn)為“numeric”

puried_data=apply(dataPrep2,2,as.numeric)

?View(puried_data)

行名沒了?

4 基因注釋

puried_data的格式為“matrix”：“matrix”行名可以重復(fù)，因此直接替換行名。

此處仍然借用example進(jìn)行基因注釋

rownames(puried_data)=example[5:nrow(example),'gene_name']

View(puried_data)

dim(puried_data)

?[1] 60660 ? 490

5 去除重復(fù)基因名

取行平均count值最大的行

t_index=order(rowMeans(puried_data),decreasing = T)#計(jì)算所有行平均值，按降序排列 t_data_order=puried_data[t_index,]#調(diào)整表達(dá)譜的基因順序 keep=!duplicated(rownames(t_data_order))#對(duì)于有重復(fù)的基因，保留第一次出現(xiàn)的那個(gè)，即行平均值大的那個(gè) puried_data=t_data_order[keep,]#得到最后處理之后的表達(dá)譜矩陣

dim(puried_data)

?[1] 59427 ? 490

6 預(yù)處理

需要注意DESeq2包要求數(shù)據(jù)未經(jīng)過標(biāo)準(zhǔn)化

dataFilt =puried_data[rowMeans(puried_data)>10,] #剔除低表達(dá)基因write.csv(dataFilt,file = "dataFilt.csv",quote = FALSE)

得到?dataFilt?后就可以正常分析了！

?方法2（未成功）

嘗試使用TCGAbiolinksR包

library("BiocManager") library("TCGAbiolinks") library("SummarizedExperiment")BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data") BiocManager::install("BioinformaticsFMRP/TCGAbiolinks") #比BiocManager::install(TCGAbiolinks)安裝更新版的TCGAbiolinks包 query <- GDCquery(project = "TCGA-PRAD", #項(xiàng)目名data.category = "Transcriptome Profiling",data.type = "Gene Expression Quantification",workflow.type = "STAR - Counts")t_samplesDown <- getResults(query,cols=c("cases")) #從sampleDown檢索出TP t_dataSmTP <- TCGAquery_SampleTypes(barcode = t_samplesDown,typesample = "TP") #可選#篩選完成的query queryDown <- GDCquery(project = "TCGA-PRAD",data.category = "Transcriptome Profiling",data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", barcode = t_dataSmTP) #下載GDCquery的結(jié)果 GDCdownload(queryDown,method = "api",directory = "GDCdata_star_count",files.per.chunk = 6) #網(wǎng)速慢可以設(shè)小一點(diǎn)dataPrep1 <- GDCprepare(query = queryDown, save = T,directory = 'GDCdata_star_count', #默認(rèn)為“GDCdata”save.filename ="dataPrep1_PRAD_star.rda")dataPrep2 <- TCGAanalyze_Preprocessing(object = dataPrep1,cor.cut = 0.6#datatype = "HTSeq - Counts")

View(dataPrep2)

t_purityDATA <- TCGAtumor_purity(colnames(dataPrep1), 0, 0, 0, 0, 0.6) t_purity.barcodes<-t_purityDATA$pure_barcodes #腫瘤樣本barcodes，為“character” t_normal.barcodes<-t_purityDATA$filtered #正常組織的數(shù)據(jù)barcodes，為“character”puried_data <-dataPrep2[,t_purity.barcodes]#篩選后數(shù)據(jù)rownames(puried_data)<-rowData(dataPrep1)$external_gene_name

rownames(puried_data)<-rowData(dataPrep1)$external_gene_name （基因注釋步驟）

View(puried_data)

?基因注釋未成功，暫時(shí)沒有找到解決方法

總結(jié)

以上是生活随笔為你收集整理的TCGA_改版后STAR-count处理方法的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇：功能测试与抓包工具Fiddler(htt
下一篇：微信第三方开发行业解决方案