Downloading and Organizing RNA-Seq STAR-Counts Data After the Major 2022 Update of the TCGA Database

Published: 2023/12/8


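A reader recently left a message saying that the TCGA database has been updated and that the downloaded data differ from before. For the transcriptome, for example, the data used to come from the HTSeq workflow and now come from STAR-Counts. For details of the change, see: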
https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-320

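Each downloaded file, once opened, holds all the quantification types (raw counts, TPM, FPKM and so on) together in a single file.

Here is how to extract the data.

The data are downloaded exactly as in the earlier tutorial [14 - Downloading and organizing TCGA database data], except that STAR-Counts is now the workflow to select. After adding the files to the cart, download the data archive and the metadata JSON file.

I start by writing two functions. The first reads the JSON file, which records the relationship between each file and its sample barcode.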

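library(rjson)

# Build a lookup table from data file name to TCGA barcode using the metadata JSON.
processingJsonFiles <- function(jsonFile) {
  metadata_json_File <- fromJSON(file = jsonFile)
  json_File_Info <- data.frame(filesName = c(), TCGA_Barcode = c())
  for (i in seq_along(metadata_json_File)) {
    # Barcode of the first entity associated with this file
    TCGA_Barcode <- metadata_json_File[[i]][["associated_entities"]][[1]][["entity_submitter_id"]]
    file_name <- metadata_json_File[[i]][["file_name"]]
    # stringsAsFactors = FALSE keeps barcodes as plain strings on R versions before 4.0
    json_File_Info <- rbind(json_File_Info,
                            data.frame(filesName = file_name,
                                       TCGA_Barcode = TCGA_Barcode,
                                       stringsAsFactors = FALSE))
  }
  # Use the file names as row names, then keep only the barcode column
  rownames(json_File_Info) <- json_File_Info[, 1]
  json_File_Info <- json_File_Info[-1]
  return(json_File_Info)
}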

jsonFile is the full path to the downloaded JSON file.
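To make the nesting that processingJsonFiles walks through easier to see, here is a minimal sketch that uses a hand-made list in place of the parsed JSON. The file names and barcodes are hypothetical placeholders, not real GDC records, and the vectorized vapply form is an alternative illustration rather than the function used in this post.

```r
# Hypothetical stand-in for the list that rjson::fromJSON() returns for the metadata file.
# Only the two fields used in this post are included; real entries carry many more.
metadata <- list(
  list(file_name = "aaa.rna_seq.augmented_star_gene_counts.tsv",
       associated_entities = list(
         list(entity_submitter_id = "TCGA-XX-0001-01A-11R-0001-07"))),
  list(file_name = "bbb.rna_seq.augmented_star_gene_counts.tsv",
       associated_entities = list(
         list(entity_submitter_id = "TCGA-XX-0002-11A-11R-0001-07")))
)

# Pull the barcode of the first associated entity from every entry
file_to_barcode <- vapply(
  metadata,
  function(x) x[["associated_entities"]][[1]][["entity_submitter_id"]],
  character(1)
)
# Name each barcode by the file it belongs to, so lookups go by file name
names(file_to_barcode) <- vapply(metadata, function(x) x[["file_name"]], character(1))

file_to_barcode[["bbb.rna_seq.augmented_star_gene_counts.tsv"]]
```

A named character vector like this supports the same lookup by file name as the row-named data frame; checking anyDuplicated(names(file_to_barcode)) before using it would catch repeated file names.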

The second function extracts the data.

getTCGA_RNAseq_data <- function(filepath, jsonFileInfo, data_type) {
  datamatrix <- data.frame()
  for (wd in filepath) {
    # Each iteration reads one file
    filename <- basename(wd)
    message("WeChat official account MedBioInfoCloud: reading file:\n", filename)
    oneSampExp <- read.table(wd, comment.char = "#", header = TRUE, sep = "\t")
    # Drop the first four rows, which hold the N_* mapping summary rather than genes
    oneSampExp <- oneSampExp[-c(1:4), ]
    # Name the value column with the barcode that jsonFileInfo maps this file name to
    if (wd == filepath[1]) {
      # First file: keep the gene annotation columns as well
      oneSampExp <- oneSampExp[, c("gene_id", "gene_name", "gene_type", data_type)]
      colnames(oneSampExp) <- c("gene_id", "gene_name", "gene_type",
                                jsonFileInfo[filename, "TCGA_Barcode"])
      datamatrix <- oneSampExp
    } else {
      # Later files: keep only the requested column and merge it in by gene_id
      oneSampExp <- oneSampExp[, c("gene_id", data_type)]
      colnames(oneSampExp) <- c("gene_id", jsonFileInfo[filename, "TCGA_Barcode"])
      datamatrix <- merge(datamatrix, oneSampExp, by = "gene_id")
    }
  }
  return(datamatrix)
}
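The per-file parsing step inside the loop can be tried on its own. The sketch below writes a tiny toy file that mimics the layout of a STAR-Counts file (a comment line, a header, four N_* summary rows, then gene rows) and reads it the same way. All numbers are made up for illustration, and tsv_row is a hypothetical helper that exists only in this example.

```r
# Hypothetical helper: join fields into one tab-separated line
tsv_row <- function(...) paste(c(...), collapse = "\t")

# Toy file mimicking the STAR-Counts layout; every value here is invented
toy_lines <- c(
  "# gene-model: GENCODE v36",
  tsv_row("gene_id", "gene_name", "gene_type", "unstranded", "stranded_first",
          "stranded_second", "tpm_unstranded", "fpkm_unstranded", "fpkm_uq_unstranded"),
  tsv_row("N_unmapped", "", "", "10", "10", "10", "", "", ""),
  tsv_row("N_multimapping", "", "", "20", "20", "20", "", "", ""),
  tsv_row("N_noFeature", "", "", "30", "30", "30", "", "", ""),
  tsv_row("N_ambiguous", "", "", "40", "40", "40", "", "", ""),
  tsv_row("ENSG00000000003.15", "TSPAN6", "protein_coding",
          "120", "60", "60", "12.5", "4.2", "4.5"),
  tsv_row("ENSG00000000005.6", "TNMD", "protein_coding",
          "0", "0", "0", "0.0", "0.0", "0.0")
)
toy_file <- tempfile(fileext = ".tsv")
writeLines(toy_lines, toy_file)

# Same call as in the function: the "#" line is skipped as a comment
toy_exp <- read.table(toy_file, comment.char = "#", header = TRUE, sep = "\t")
# Remove the four N_* summary rows so that only gene rows remain
toy_genes <- toy_exp[-c(1:4), ]
# Keep the identifier plus one quantification column
toy_genes <- toy_genes[, c("gene_id", "tpm_unstranded")]
print(toy_genes)
```

The blank TPM and FPKM fields in the N_* rows are read as NA, which is harmless here because those rows are dropped before anything is merged.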

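filepath is a vector of paths to the downloaded data files, obtained with dir or a similar function. For example, the download is a compressed archive; after extracting it, rename the resulting folder to data.

filepath <- dir(path = "./data", pattern = "counts.tsv$", full.names = TRUE, recursive = TRUE)

jsonFileInfo is the result returned by the processingJsonFiles function.

data_type is one of the following: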

"unstranded";
"stranded_first";
"stranded_second";
"tpm_unstranded";
"fpkm_unstranded";
"fpkm_uq_unstranded"

These correspond to the column names in each file.
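Because data_type is used directly as a column name, a misspelled value only fails once the first file has been read. A small guard can reject it earlier. This is a sketch: check_data_type is a hypothetical helper and is not part of the functions in this post.

```r
# The six quantification columns listed above
valid_types <- c("unstranded", "stranded_first", "stranded_second",
                 "tpm_unstranded", "fpkm_unstranded", "fpkm_uq_unstranded")

# Hypothetical helper: stop early unless data_type is exactly one valid column name
check_data_type <- function(data_type) {
  if (length(data_type) != 1 || !(data_type %in% valid_types)) {
    stop("data_type must be one of: ", paste(valid_types, collapse = ", "))
  }
  data_type
}

check_data_type("tpm_unstranded")
```

An exact match is used here instead of match.arg so that an abbreviation such as "tpm" is rejected rather than silently expanded.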

Now the data can be retrieved, whichever type you need. TPM and FPKM are the usual choices.

jsonFileInfo <- processingJsonFiles(jsonFile = "metadata.cart.2022-04-05.json")
filepath <- dir(path = "./data", pattern = "counts.tsv$", full.names = TRUE, recursive = TRUE)
dat <- getTCGA_RNAseq_data(filepath = filepath, jsonFileInfo = jsonFileInfo, data_type = "fpkm_unstranded")
head(dat)[, 1:5]
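Since the sample columns of the result are named by TCGA barcode, a common next step is to separate tumor from normal samples. The sketch below relies on the TCGA barcode convention that characters 14 and 15 hold the sample type code (01 to 09 tumor, 10 to 19 normal, 20 to 29 control). The barcodes are hypothetical placeholders, and this step is an optional addition, not part of the original workflow.

```r
# Hypothetical barcodes standing in for the sample column names of the merged table
barcodes <- c("TCGA-XX-0001-01A-11R-0001-07",
              "TCGA-XX-0002-11A-11R-0001-07",
              "TCGA-XX-0003-06A-11R-0001-07")

# Characters 14-15 of a TCGA barcode are the sample type code
sample_code <- as.integer(substr(barcodes, 14, 15))

# 01-09 tumor, 10-19 normal, anything from 20 up is treated as control
sample_group <- ifelse(sample_code < 10, "tumor",
                       ifelse(sample_code < 20, "normal", "control"))

data.frame(barcode = barcodes, code = sample_code, group = sample_group)
```

With the real result, the barcodes would be taken from the column names after the three annotation columns, for example colnames(dat)[-(1:3)].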

Whether the TCGAbiolinks package, previously used for TCGA downloads, can still process these data I have not yet tested, but downloading with it should still work.

For data from the previous version, already-processed data are available for download in my earlier article [Database data | transcriptome profiling (RNA-Seq) data for 33 cancer types in the TCGA database].

Finally, if this was useful, please show your appreciation!
