Assembling an RMA Quantification Matrix for the TCGA Ovarian Data ================================================================= Keith A. Baggerly ## 1 Executive Summary ### 1.1 Introduction We want to produce an RData file with a matrix of robust multi-array average (RMA) expression values for the TCGA ovarian cancer samples profiled with Affymetrix HT\_HG-U133A arrays. ### 1.2 Methods We acquired the 14 gzipped tarballs containing the individual Level 1 data files (CEL files) from the TCGA open-access http page, [https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/cgcc/broad.mit.edu/ht_hg-u133a/transcriptome/](https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/cgcc/broad.mit.edu/ht_hg-u133a/transcriptome/), on September 2, 2012. According to the page, these files were last updated on June 24, 2011. At the same time, we also acquired the gzipped tarball with the MageTab (annotation) data. Explicit lists of the batch and version numbers of the tarballs used are given in the text below. We load the individual CEL files by folder, recording the folder (batch) information as we go, and use justRMA to compute RMA fits for the set. We extract the expression matrix and use the mapping information from the sample and data relationship format (sdrf) file in the MageTab folder to update the column names. ### 1.3 Results We save tcgaSampleInfo, tcgaDataDirs, tcgaFiles, and tcgaExpression to the RData file "tcgaExpression.RData". In passing, we note that our quantifications match the reported TCGA Level 2 quantifications quite well (essentially to within roundoff error) when we restrict justRMA to the 594 (of 598) CEL files that are "used in analysis" per the sdrf file. ## 2 Libraries We first load the libraries we will use in this report. ```r library(affy) library(hthgu133acdf) ``` ## 3 Specifying the Raw Data Location Here, we specify the location of the data we acquired from TCGA on our local system. You will need to acquire these files and adjust this path before running this report yourself. ```r pathToTCGAData <- file.path("RawData", "TCGA", "CEL_Files") ``` ## 4 The SDRF file from the MageTab Folder We now load the sample description (sdrf) information. ```r sdrf <- read.table(file.path(pathToTCGAData, "broad.mit.edu_OV.HT_HG-U133A.mage-tab.1.1007.0", "broad.mit.edu_OV.HT_HG-U133A.sdrf.txt"), header = TRUE, sep = "\t") dim(sdrf) ``` ``` ## [1] 598 33 ``` ```r sdrf[1, ] ``` ``` ## Extract.Name Protocol.REF ## 1 TCGA-29-1704-02A-01R-0808-01 broad.mit.edu:labeling:HT_HG-U133A:01 ## Labeled.Extract.Name Label Term.Source.REF ## 1 TCGA-29-1704-02A-01R-0808-01 biotin MGED Ontology ## Protocol.REF.1 ## 1 broad.mit.edu:hybridization:HT_HG-U133A:01 ## Hybridization.Name ## 1 URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990 ## Array.Design.REF Term.Source.REF.1 ## 1 Affymetrix.com:PhysicalArrayDesign:HT_HG-U133A caArray ## Protocol.REF.2 ## 1 broad.mit.edu:image_acquisition:HT_HG-U133A:01 ## Scan.Name ## 1 TCGA-29-1704-02A-01R-0808-01 ## Array.Data.File ## 1 URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990.CEL ## Comment..TCGA.Archive.Name. Comment..TCGA.Data.Type. ## 1 broad.mit.edu_OV.HT_HG-U133A.Level_1.27.1007.0 Expression-Gene ## Comment..TCGA.Data.Level. Comment..TCGA.Include.for.Analysis. ## 1 Level 1 no ## Comment..md5. Protocol.REF.3 Normalization.Name ## 1 d47d925a83a2749fb09448736afa8a4b -> -> ## Derived.Array.Data.Matrix.File Comment..TCGA.Archive.Name..1 ## 1 -> -> ## Comment..TCGA.Data.Type..1 Comment..TCGA.Data.Level..1 ## 1 -> -> ## Comment..TCGA.Include.for.Analysis..1 Comment..md5..1 Protocol.REF.4 ## 1 -> -> -> ## Normalization.Name.1 Derived.Array.Data.Matrix.File.1 ## 1 -> -> ## Comment..TCGA.Archive.Name..2 Comment..TCGA.Data.Type..2 ## 1 -> -> ## Comment..TCGA.Data.Level..2 Comment..TCGA.Include.for.Analysis..2 ## 1 -> -> ## Comment..md5..2 ## 1 -> ``` ```r length(unique(sdrf[, 1])) ``` ``` ## [1] 597 ``` There were 598 arrays run, but only 597 distinct samples; one sample was run twice. We now check which sample this was, and whether we care. ```r which(sdrf[, 1] == sdrf[which(duplicated(sdrf[, 1])), 1]) ``` ``` ## [1] 1 207 ``` ```r as.character(sdrf[which(sdrf[, 1] == sdrf[which(duplicated(sdrf[, 1])), 1]), "Hybridization.Name"]) ``` ``` ## [1] "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990" ## [2] "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106" ``` ```r which(sdrf[, "Comment..TCGA.Include.for.Analysis."] == "no") ``` ``` ## [1] 1 157 171 207 ``` ```r as.character(sdrf[which(sdrf[, "Comment..TCGA.Include.for.Analysis."] == "no"), "Hybridization.Name"]) ``` ``` ## [1] "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990" ## [2] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_F04_516474" ## [3] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_G06_516372" ## [4] "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106" ``` As it happens, four of the CEL files (including the two where the same sample was run) are excluded from later analyses. Spot checking the files in Batch 27 (where both of the duplicates were) shows these samples are present in the Level 1 but not in the Level 2 data. We restrict our quantification to just the 594 samples used. ## 5 Quantifying The CEL Files ### 5.1 Identifying the Data Directories Next, we turn to the individual Level 1 data files (CEL files). These are stored in 14 folders, corresponding to run batches. Here, we identify the folders and sort them in rough chronological order. Since this is a "freeze" of what we use to generate our RData file, we hardcode the directories used. ```r tcgaDataDirs <- c("broad.mit.edu_OV.HT_HG-U133A.Level_1.9.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.11.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.12.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.13.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.14.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.15.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.17.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.18.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.19.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.21.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.22.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.24.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.27.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.40.1007.0") nBatches <- length(tcgaDataDirs) batchNumber <- strsplit(tcgaDataDirs, "\\.") batchNumber <- unlist(lapply(batchNumber, function(x) { x[length(x) - 2] })) batchNumber <- as.numeric(batchNumber) batchNumber ``` ``` ## [1] 9 11 12 13 14 15 17 18 19 21 22 24 27 40 ``` ```r tcgaDataDirs <- tcgaDataDirs[order(batchNumber)] tcgaDataDirs ``` ``` ## [1] "broad.mit.edu_OV.HT_HG-U133A.Level_1.9.1007.0" ## [2] "broad.mit.edu_OV.HT_HG-U133A.Level_1.11.1007.0" ## [3] "broad.mit.edu_OV.HT_HG-U133A.Level_1.12.1007.0" ## [4] "broad.mit.edu_OV.HT_HG-U133A.Level_1.13.1007.0" ## [5] "broad.mit.edu_OV.HT_HG-U133A.Level_1.14.1007.0" ## [6] "broad.mit.edu_OV.HT_HG-U133A.Level_1.15.1007.0" ## [7] "broad.mit.edu_OV.HT_HG-U133A.Level_1.17.1007.0" ## [8] "broad.mit.edu_OV.HT_HG-U133A.Level_1.18.1007.0" ## [9] "broad.mit.edu_OV.HT_HG-U133A.Level_1.19.1007.0" ## [10] "broad.mit.edu_OV.HT_HG-U133A.Level_1.21.1007.0" ## [11] "broad.mit.edu_OV.HT_HG-U133A.Level_1.22.1007.0" ## [12] "broad.mit.edu_OV.HT_HG-U133A.Level_1.24.1007.0" ## [13] "broad.mit.edu_OV.HT_HG-U133A.Level_1.27.1007.0" ## [14] "broad.mit.edu_OV.HT_HG-U133A.Level_1.40.1007.0" ``` ```r batchNumber <- sort(batchNumber) ``` ### 5.2 Grabbing the CEL File Names Next, we get all of the individual filenames contained in each folder. ```r tcgaFiles <- vector("list", length(batchNumber)) names(tcgaFiles) <- paste("Batch", batchNumber, sep = ".") for (i1 in 1:nBatches) { tcgaFiles[[i1]] <- dir(file.path(pathToTCGAData, tcgaDataDirs[i1]), pattern = "CEL$") } unlist(lapply(tcgaFiles, length)) ``` ``` ## Batch.9 Batch.11 Batch.12 Batch.13 Batch.14 Batch.15 Batch.17 Batch.18 ## 45 37 46 47 46 22 47 47 ## Batch.19 Batch.21 Batch.22 Batch.24 Batch.27 Batch.40 ## 47 46 47 46 24 51 ``` ```r nFiles <- sum(unlist(lapply(tcgaFiles, length))) nFiles ``` ``` ## [1] 598 ``` ```r sampleBatch <- rep(batchNumber, times = unlist(lapply(tcgaFiles, length))) ``` There are 598 filenames, but (as noted above) these include samples not used in the analyses. We list out the full paths in a character vector for feeding to justRMA. ```r celFileNames <- unlist(tcgaFiles) celFileDirs <- rep(tcgaDataDirs, times = unlist(lapply(tcgaFiles, length))) celFilePaths <- file.path(pathToTCGAData, celFileDirs, celFileNames) unusedCELs <- as.character(sdrf[sdrf[, "Comment..TCGA.Include.for.Analysis."] == "no", "Array.Data.File"]) celFilePathsReduced <- celFilePaths[-match(unusedCELs, celFileNames)] ``` ### 5.3 Running justRMA Now we use justRMA to summarize expression at the probeset level. We exclude the 4 CEL files not included for analysis per the sdrf file. ```r d1 <- date() tcgaExpression <- justRMA(filenames = celFilePathsReduced) tcgaExpression <- exprs(tcgaExpression) d2 <- date() c(d1, d2) ``` ``` ## [1] "Wed Nov 20 10:53:29 2013" "Wed Nov 20 10:57:17 2013" ``` The justRMA computation takes between 5 and 6 minutes on my MacBook Pro. As an aside, we note that the RMA values computed here match the Level 2 values reported by TCGA quite well (to about 4 decimal places). Given that the group at the Broad is using a distinct implementation of RMA written for GenePattern, the differences are within roundoff error, and should have no substantive effect on any analyses. This difference in coding may also explain why the row (probeset) ordering produced by justRMA differs from that reported in the Level 2 files. ## 6 Mapping CEL Names to Sample Barcodes We now identify the sample barcodes using the sdrf file and parse them for more information. ```r barcodeRows <- match(celFileNames, as.character(sdrf[, "Array.Data.File"])) sampleBarcodes <- as.character(sdrf[barcodeRows, "Extract.Name"]) sum(duplicated(sampleBarcodes)) ``` ``` ## [1] 1 ``` ```r sampleBarcodes[sampleBarcodes == sampleBarcodes[duplicated(sampleBarcodes)]] ``` ``` ## [1] "TCGA-29-1704-02A-01R-0808-01" "TCGA-29-1704-02A-01R-0808-01" ``` ```r celFileNames[sampleBarcodes == sampleBarcodes[duplicated(sampleBarcodes)]] ``` ``` ## Batch.2720 ## "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106.CEL" ## Batch.2724 ## "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990.CEL" ``` ```r unusedCELs ``` ``` ## [1] "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990.CEL" ## [2] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_F04_516474.CEL" ## [3] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_G06_516372.CEL" ## [4] "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106.CEL" ``` ```r sampleBarcodes[duplicated(sampleBarcodes)] <- paste(sampleBarcodes[duplicated(sampleBarcodes)], "Rep", sep = ".") sourceSite <- substr(sampleBarcodes, 6, 7) patientID <- substr(sampleBarcodes, 9, 12) sampleType <- substr(sampleBarcodes, 14, 15) sampleTypeText <- rep("primaryTumor", nFiles) sampleTypeText[sampleType == "02"] <- "recurrentTumor" sampleTypeText[sampleType == "11"] <- "normalTissue" sampleUsed <- rep("yes", nFiles) sampleUsed[match(unusedCELs, celFileNames)] <- "no" ``` As noted above, one of the barcodes is used twice. To allow the barcodes to be used as sample IDs (rownames in a data frame), we add a suffix to the latter occurrence. Since the two CEL files for the same sample are among the four CEL files omitted from the analysis, the point is somewhat moot. We now bundle these bits of information into a data frame. ```r tcgaSampleInfo <- data.frame(sourceSite = sourceSite, patientID = patientID, sampleType = sampleType, sampleTypeText = sampleTypeText, sampleBatch = sampleBatch, row.names = sampleBarcodes) tcgaSampleInfo <- tcgaSampleInfo[sampleUsed == "yes", ] tcgaSampleInfo[1:4, ] ``` ``` ## sourceSite patientID sampleType ## TCGA-13-0758-01A-01R-0362-01 13 0758 01 ## TCGA-09-0364-01A-02R-0362-01 09 0364 01 ## TCGA-13-0723-01A-02R-0362-01 13 0723 01 ## TCGA-13-0757-01A-01R-0362-01 13 0757 01 ## sampleTypeText sampleBatch ## TCGA-13-0758-01A-01R-0362-01 primaryTumor 9 ## TCGA-09-0364-01A-02R-0362-01 primaryTumor 9 ## TCGA-13-0723-01A-02R-0362-01 primaryTumor 9 ## TCGA-13-0757-01A-01R-0362-01 primaryTumor 9 ``` ## 7 Saving RData Now we save the relevant information to an RData object. ```r colnames(tcgaExpression) <- as.character(sdrf[match(colnames(tcgaExpression), as.character(sdrf[, "Array.Data.File"])), "Extract.Name"]) all(colnames(tcgaExpression) == rownames(tcgaSampleInfo)) ``` ``` ## [1] TRUE ``` ```r save(tcgaSampleInfo, tcgaDataDirs, tcgaFiles, tcgaExpression, file = file.path("RDataObjects", "tcgaExpression.RData")) ``` ## 8 Appendix ### 8.1 File Location ```r getwd() ``` ``` ## [1] "/Users/slt/SLT WORKSPACE/EXEMPT/OVARIAN/Ovarian residual disease study 2012/RD manuscript/Web page for paper/Webpage" ``` ### 8.2 SessionInfo ```r sessionInfo() ``` ``` ## R version 3.0.2 (2013-09-25) ## Platform: x86_64-apple-darwin10.8.0 (64-bit) ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## attached base packages: ## [1] parallel stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] hthgu133acdf_2.12.0 AnnotationDbi_1.22.6 affy_1.38.1 ## [4] Biobase_2.20.1 BiocGenerics_0.6.0 knitr_1.5 ## ## loaded via a namespace (and not attached): ## [1] affyio_1.28.0 BiocInstaller_1.10.4 DBI_0.2-7 ## [4] evaluate_0.5.1 formatR_0.9 IRanges_1.18.4 ## [7] preprocessCore_1.22.0 RSQLite_0.11.4 stats4_3.0.2 ## [10] stringr_0.6.2 tools_3.0.2 zlibbioc_1.6.0 ```