Assembling an RMA Quantification Matrix for the TCGA Ovarian Data

Keith A. Baggerly

1 Executive Summary

1.1 Introduction

We want to produce an RData file with a matrix of robust multi-array average (RMA) expression values for the TCGA ovarian cancer samples profiled with Affymetrix HT_HG-U133A arrays.

1.2 Methods

We acquired the 14 gzipped tarballs containing the individual Level 1 data files (CEL files) from the TCGA open-access http page, https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/cgcc/broad.mit.edu/ht_hg-u133a/transcriptome/, on September 2, 2012. According to the page, these files were last updated on June 24, 2011. At the same time, we also acquired the gzipped tarball with the MageTab (annotation) data.

Explicit lists of the batch and version numbers of the tarballs used are given in the text below.

We load the individual CEL files by folder, recording the folder (batch) information as we go, and use justRMA to compute RMA fits for the set. We extract the expression matrix and use the mapping information from the sample and data relationship format (sdrf) file in the MageTab folder to update the column names.

1.3 Results

We save tcgaSampleInfo, tcgaDataDirs, tcgaFiles, and tcgaExpression to the RData file “tcgaExpression.RData”.

In passing, we note that our quantifications match the reported TCGA Level 2 quantifications quite well (essentially to within roundoff error) when we restrict justRMA to the 594 (of 598) CEL files that are “used in analysis” per the sdrf file.

2 Libraries

We first load the libraries we will use in this report.


library(affy)
library(hthgu133acdf)

3 Specifying the Raw Data Location

Here, we specify the location of the data we acquired from TCGA on our local system. You will need to acquire these files and adjust this path before running this report yourself.


pathToTCGAData <- file.path("RawData", "TCGA", "CEL_Files")

4 The SDRF file from the MageTab Folder

We now load the sample description (sdrf) information.


sdrf <- read.table(file.path(pathToTCGAData, "broad.mit.edu_OV.HT_HG-U133A.mage-tab.1.1007.0", 
    "broad.mit.edu_OV.HT_HG-U133A.sdrf.txt"), header = TRUE, sep = "\t")
dim(sdrf)
## [1] 598  33
sdrf[1, ]
##                   Extract.Name                          Protocol.REF
## 1 TCGA-29-1704-02A-01R-0808-01 broad.mit.edu:labeling:HT_HG-U133A:01
##           Labeled.Extract.Name  Label Term.Source.REF
## 1 TCGA-29-1704-02A-01R-0808-01 biotin   MGED Ontology
##                               Protocol.REF.1
## 1 broad.mit.edu:hybridization:HT_HG-U133A:01
##                                       Hybridization.Name
## 1 URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990
##                                 Array.Design.REF Term.Source.REF.1
## 1 Affymetrix.com:PhysicalArrayDesign:HT_HG-U133A           caArray
##                                   Protocol.REF.2
## 1 broad.mit.edu:image_acquisition:HT_HG-U133A:01
##                      Scan.Name
## 1 TCGA-29-1704-02A-01R-0808-01
##                                              Array.Data.File
## 1 URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990.CEL
##                      Comment..TCGA.Archive.Name. Comment..TCGA.Data.Type.
## 1 broad.mit.edu_OV.HT_HG-U133A.Level_1.27.1007.0          Expression-Gene
##   Comment..TCGA.Data.Level. Comment..TCGA.Include.for.Analysis.
## 1                   Level 1                                  no
##                      Comment..md5. Protocol.REF.3 Normalization.Name
## 1 d47d925a83a2749fb09448736afa8a4b             ->                 ->
##   Derived.Array.Data.Matrix.File Comment..TCGA.Archive.Name..1
## 1                             ->                            ->
##   Comment..TCGA.Data.Type..1 Comment..TCGA.Data.Level..1
## 1                         ->                          ->
##   Comment..TCGA.Include.for.Analysis..1 Comment..md5..1 Protocol.REF.4
## 1                                    ->              ->             ->
##   Normalization.Name.1 Derived.Array.Data.Matrix.File.1
## 1                   ->                               ->
##   Comment..TCGA.Archive.Name..2 Comment..TCGA.Data.Type..2
## 1                            ->                         ->
##   Comment..TCGA.Data.Level..2 Comment..TCGA.Include.for.Analysis..2
## 1                          ->                                    ->
##   Comment..md5..2
## 1              ->

length(unique(sdrf[, 1]))
## [1] 597

There were 598 arrays run, but only 597 distinct samples; one sample was run twice. We now check which sample this was, and whether we care.


which(sdrf[, 1] == sdrf[which(duplicated(sdrf[, 1])), 1])
## [1]   1 207
as.character(sdrf[which(sdrf[, 1] == sdrf[which(duplicated(sdrf[, 1])), 1]), 
    "Hybridization.Name"])
## [1] "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990"         
## [2] "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106"
which(sdrf[, "Comment..TCGA.Include.for.Analysis."] == "no")
## [1]   1 157 171 207
as.character(sdrf[which(sdrf[, "Comment..TCGA.Include.for.Analysis."] == "no"), 
    "Hybridization.Name"])
## [1] "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990"         
## [2] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_F04_516474"   
## [3] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_G06_516372"   
## [4] "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106"

As it happens, four of the CEL files (including the two where the same sample was run) are excluded from later analyses. Spot checking the files in Batch 27 (where both of the duplicates were) shows these samples are present in the Level 1 but not in the Level 2 data. We restrict our quantification to just the 594 samples used.

5 Quantifying The CEL Files

5.1 Identifying the Data Directories

Next, we turn to the individual Level 1 data files (CEL files). These are stored in 14 folders, corresponding to run batches. Here, we identify the folders and sort them in rough chronological order. Since this is a “freeze” of what we use to generate our RData file, we hardcode the directories used.


tcgaDataDirs <- c("broad.mit.edu_OV.HT_HG-U133A.Level_1.9.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.11.1007.0", 
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.12.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.13.1007.0", 
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.14.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.15.1007.0", 
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.17.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.18.1007.0", 
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.19.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.21.1007.0", 
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.22.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.24.1007.0", 
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.27.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.40.1007.0")

nBatches <- length(tcgaDataDirs)

batchNumber <- strsplit(tcgaDataDirs, "\\.")
batchNumber <- unlist(lapply(batchNumber, function(x) {
    x[length(x) - 2]
}))
batchNumber <- as.numeric(batchNumber)
batchNumber
##  [1]  9 11 12 13 14 15 17 18 19 21 22 24 27 40

tcgaDataDirs <- tcgaDataDirs[order(batchNumber)]
tcgaDataDirs
##  [1] "broad.mit.edu_OV.HT_HG-U133A.Level_1.9.1007.0" 
##  [2] "broad.mit.edu_OV.HT_HG-U133A.Level_1.11.1007.0"
##  [3] "broad.mit.edu_OV.HT_HG-U133A.Level_1.12.1007.0"
##  [4] "broad.mit.edu_OV.HT_HG-U133A.Level_1.13.1007.0"
##  [5] "broad.mit.edu_OV.HT_HG-U133A.Level_1.14.1007.0"
##  [6] "broad.mit.edu_OV.HT_HG-U133A.Level_1.15.1007.0"
##  [7] "broad.mit.edu_OV.HT_HG-U133A.Level_1.17.1007.0"
##  [8] "broad.mit.edu_OV.HT_HG-U133A.Level_1.18.1007.0"
##  [9] "broad.mit.edu_OV.HT_HG-U133A.Level_1.19.1007.0"
## [10] "broad.mit.edu_OV.HT_HG-U133A.Level_1.21.1007.0"
## [11] "broad.mit.edu_OV.HT_HG-U133A.Level_1.22.1007.0"
## [12] "broad.mit.edu_OV.HT_HG-U133A.Level_1.24.1007.0"
## [13] "broad.mit.edu_OV.HT_HG-U133A.Level_1.27.1007.0"
## [14] "broad.mit.edu_OV.HT_HG-U133A.Level_1.40.1007.0"

batchNumber <- sort(batchNumber)

5.2 Grabbing the CEL File Names

Next, we get all of the individual filenames contained in each folder.


tcgaFiles <- vector("list", length(batchNumber))
names(tcgaFiles) <- paste("Batch", batchNumber, sep = ".")
for (i1 in 1:nBatches) {
    tcgaFiles[[i1]] <- dir(file.path(pathToTCGAData, tcgaDataDirs[i1]), pattern = "CEL$")
}
unlist(lapply(tcgaFiles, length))
##  Batch.9 Batch.11 Batch.12 Batch.13 Batch.14 Batch.15 Batch.17 Batch.18 
##       45       37       46       47       46       22       47       47 
## Batch.19 Batch.21 Batch.22 Batch.24 Batch.27 Batch.40 
##       47       46       47       46       24       51
nFiles <- sum(unlist(lapply(tcgaFiles, length)))
nFiles
## [1] 598

sampleBatch <- rep(batchNumber, times = unlist(lapply(tcgaFiles, length)))

There are 598 filenames, but (as noted above) these include samples not used in the analyses.

We list out the full paths in a character vector for feeding to justRMA.


celFileNames <- unlist(tcgaFiles)
celFileDirs <- rep(tcgaDataDirs, times = unlist(lapply(tcgaFiles, length)))
celFilePaths <- file.path(pathToTCGAData, celFileDirs, celFileNames)

unusedCELs <- as.character(sdrf[sdrf[, "Comment..TCGA.Include.for.Analysis."] == 
    "no", "Array.Data.File"])

celFilePathsReduced <- celFilePaths[-match(unusedCELs, celFileNames)]

5.3 Running justRMA

Now we use justRMA to summarize expression at the probeset level. We exclude the 4 CEL files not included for analysis per the sdrf file.


d1 <- date()
tcgaExpression <- justRMA(filenames = celFilePathsReduced)
tcgaExpression <- exprs(tcgaExpression)
d2 <- date()
c(d1, d2)
## [1] "Wed Nov 20 10:53:29 2013" "Wed Nov 20 10:57:17 2013"

The justRMA computation takes between 5 and 6 minutes on my MacBook Pro.

As an aside, we note that the RMA values computed here match the Level 2 values reported by TCGA quite well (to about 4 decimal places). Given that the group at the Broad is using a distinct implementation of RMA written for GenePattern, the differences are within roundoff error, and should have no substantive effect on any analyses. This difference in coding may also explain why the row (probeset) ordering produced by justRMA differs from that reported in the Level 2 files.

6 Mapping CEL Names to Sample Barcodes

We now identify the sample barcodes using the sdrf file and parse them for more information.


barcodeRows <- match(celFileNames, as.character(sdrf[, "Array.Data.File"]))
sampleBarcodes <- as.character(sdrf[barcodeRows, "Extract.Name"])

sum(duplicated(sampleBarcodes))
## [1] 1
sampleBarcodes[sampleBarcodes == sampleBarcodes[duplicated(sampleBarcodes)]]
## [1] "TCGA-29-1704-02A-01R-0808-01" "TCGA-29-1704-02A-01R-0808-01"
celFileNames[sampleBarcodes == sampleBarcodes[duplicated(sampleBarcodes)]]
##                                                            Batch.2720 
## "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106.CEL" 
##                                                            Batch.2724 
##          "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990.CEL"
unusedCELs
## [1] "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990.CEL"         
## [2] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_F04_516474.CEL"   
## [3] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_G06_516372.CEL"   
## [4] "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106.CEL"
sampleBarcodes[duplicated(sampleBarcodes)] <- paste(sampleBarcodes[duplicated(sampleBarcodes)], 
    "Rep", sep = ".")

sourceSite <- substr(sampleBarcodes, 6, 7)
patientID <- substr(sampleBarcodes, 9, 12)
sampleType <- substr(sampleBarcodes, 14, 15)
sampleTypeText <- rep("primaryTumor", nFiles)
sampleTypeText[sampleType == "02"] <- "recurrentTumor"
sampleTypeText[sampleType == "11"] <- "normalTissue"

sampleUsed <- rep("yes", nFiles)
sampleUsed[match(unusedCELs, celFileNames)] <- "no"

As noted above, one of the barcodes is used twice. To allow the barcodes to be used as sample IDs (rownames in a data frame), we add a suffix to the latter occurrence. Since the two CEL files for the same sample are among the four CEL files omitted from the analysis, the point is somewhat moot.

We now bundle these bits of information into a data frame.


tcgaSampleInfo <- data.frame(sourceSite = sourceSite, patientID = patientID, 
    sampleType = sampleType, sampleTypeText = sampleTypeText, sampleBatch = sampleBatch, 
    row.names = sampleBarcodes)
tcgaSampleInfo <- tcgaSampleInfo[sampleUsed == "yes", ]

tcgaSampleInfo[1:4, ]
##                              sourceSite patientID sampleType
## TCGA-13-0758-01A-01R-0362-01         13      0758         01
## TCGA-09-0364-01A-02R-0362-01         09      0364         01
## TCGA-13-0723-01A-02R-0362-01         13      0723         01
## TCGA-13-0757-01A-01R-0362-01         13      0757         01
##                              sampleTypeText sampleBatch
## TCGA-13-0758-01A-01R-0362-01   primaryTumor           9
## TCGA-09-0364-01A-02R-0362-01   primaryTumor           9
## TCGA-13-0723-01A-02R-0362-01   primaryTumor           9
## TCGA-13-0757-01A-01R-0362-01   primaryTumor           9

7 Saving RData

Now we save the relevant information to an RData object.


colnames(tcgaExpression) <- as.character(sdrf[match(colnames(tcgaExpression), 
    as.character(sdrf[, "Array.Data.File"])), "Extract.Name"])

all(colnames(tcgaExpression) == rownames(tcgaSampleInfo))
## [1] TRUE

save(tcgaSampleInfo, tcgaDataDirs, tcgaFiles, tcgaExpression, file = file.path("RDataObjects", 
    "tcgaExpression.RData"))

8 Appendix

8.1 File Location


getwd()
## [1] "/Users/slt/SLT WORKSPACE/EXEMPT/OVARIAN/Ovarian residual disease study 2012/RD manuscript/Web page for paper/Webpage"

8.2 SessionInfo


sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] hthgu133acdf_2.12.0  AnnotationDbi_1.22.6 affy_1.38.1         
## [4] Biobase_2.20.1       BiocGenerics_0.6.0   knitr_1.5           
## 
## loaded via a namespace (and not attached):
##  [1] affyio_1.28.0         BiocInstaller_1.10.4  DBI_0.2-7            
##  [4] evaluate_0.5.1        formatR_0.9           IRanges_1.18.4       
##  [7] preprocessCore_1.22.0 RSQLite_0.11.4        stats4_3.0.2         
## [10] stringr_0.6.2         tools_3.0.2           zlibbioc_1.6.0