Keith A. Baggerly
We want to produce an RData file with a matrix of robust multi-array average (RMA) expression values for the TCGA ovarian cancer samples profiled with Affymetrix HT_HG-U133A arrays.
We acquired the 14 gzipped tarballs containing the individual Level 1 data files (CEL files) from the TCGA open-access http page, https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/cgcc/broad.mit.edu/ht_hg-u133a/transcriptome/, on September 2, 2012. According to the page, these files were last updated on June 24, 2011. At the same time, we also acquired the gzipped tarball with the MageTab (annotation) data.
Explicit lists of the batch and version numbers of the tarballs used are given in the text below.
We load the individual CEL files by folder, recording the folder (batch) information as we go, and use justRMA to compute RMA fits for the set. We extract the expression matrix and use the mapping information from the sample and data relationship format (sdrf) file in the MageTab folder to update the column names.
We save tcgaSampleInfo, tcgaDataDirs, tcgaFiles, and tcgaExpression to the RData file “tcgaExpression.RData”.
In passing, we note that our quantifications match the reported TCGA Level 2 quantifications quite well (essentially to within roundoff error) when we restrict justRMA to the 594 (of 598) CEL files that are “used in analysis” per the sdrf file.
We first load the libraries we will use in this report.
library(affy)
library(hthgu133acdf)
Here, we specify the location of the data we acquired from TCGA on our local system. You will need to acquire these files and adjust this path before running this report yourself.
pathToTCGAData <- file.path("RawData", "TCGA", "CEL_Files")
We now load the sample description (sdrf) information.
sdrf <- read.table(file.path(pathToTCGAData, "broad.mit.edu_OV.HT_HG-U133A.mage-tab.1.1007.0",
"broad.mit.edu_OV.HT_HG-U133A.sdrf.txt"), header = TRUE, sep = "\t")
dim(sdrf)
## [1] 598 33
sdrf[1, ]
## Extract.Name Protocol.REF
## 1 TCGA-29-1704-02A-01R-0808-01 broad.mit.edu:labeling:HT_HG-U133A:01
## Labeled.Extract.Name Label Term.Source.REF
## 1 TCGA-29-1704-02A-01R-0808-01 biotin MGED Ontology
## Protocol.REF.1
## 1 broad.mit.edu:hybridization:HT_HG-U133A:01
## Hybridization.Name
## 1 URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990
## Array.Design.REF Term.Source.REF.1
## 1 Affymetrix.com:PhysicalArrayDesign:HT_HG-U133A caArray
## Protocol.REF.2
## 1 broad.mit.edu:image_acquisition:HT_HG-U133A:01
## Scan.Name
## 1 TCGA-29-1704-02A-01R-0808-01
## Array.Data.File
## 1 URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990.CEL
## Comment..TCGA.Archive.Name. Comment..TCGA.Data.Type.
## 1 broad.mit.edu_OV.HT_HG-U133A.Level_1.27.1007.0 Expression-Gene
## Comment..TCGA.Data.Level. Comment..TCGA.Include.for.Analysis.
## 1 Level 1 no
## Comment..md5. Protocol.REF.3 Normalization.Name
## 1 d47d925a83a2749fb09448736afa8a4b -> ->
## Derived.Array.Data.Matrix.File Comment..TCGA.Archive.Name..1
## 1 -> ->
## Comment..TCGA.Data.Type..1 Comment..TCGA.Data.Level..1
## 1 -> ->
## Comment..TCGA.Include.for.Analysis..1 Comment..md5..1 Protocol.REF.4
## 1 -> -> ->
## Normalization.Name.1 Derived.Array.Data.Matrix.File.1
## 1 -> ->
## Comment..TCGA.Archive.Name..2 Comment..TCGA.Data.Type..2
## 1 -> ->
## Comment..TCGA.Data.Level..2 Comment..TCGA.Include.for.Analysis..2
## 1 -> ->
## Comment..md5..2
## 1 ->
length(unique(sdrf[, 1]))
## [1] 597
There were 598 arrays run, but only 597 distinct samples; one sample was run twice. We now check which sample this was, and whether we care.
which(sdrf[, 1] == sdrf[which(duplicated(sdrf[, 1])), 1])
## [1] 1 207
as.character(sdrf[which(sdrf[, 1] == sdrf[which(duplicated(sdrf[, 1])), 1]),
"Hybridization.Name"])
## [1] "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990"
## [2] "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106"
which(sdrf[, "Comment..TCGA.Include.for.Analysis."] == "no")
## [1] 1 157 171 207
as.character(sdrf[which(sdrf[, "Comment..TCGA.Include.for.Analysis."] == "no"),
"Hybridization.Name"])
## [1] "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990"
## [2] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_F04_516474"
## [3] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_G06_516372"
## [4] "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106"
As it happens, four of the CEL files (including the two where the same sample was run) are excluded from later analyses. Spot checking the files in Batch 27 (where both of the duplicates were) shows these samples are present in the Level 1 but not in the Level 2 data. We restrict our quantification to just the 594 samples used.
Next, we turn to the individual Level 1 data files (CEL files). These are stored in 14 folders, corresponding to run batches. Here, we identify the folders and sort them in rough chronological order. Since this is a “freeze” of what we use to generate our RData file, we hardcode the directories used.
tcgaDataDirs <- c("broad.mit.edu_OV.HT_HG-U133A.Level_1.9.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.11.1007.0",
"broad.mit.edu_OV.HT_HG-U133A.Level_1.12.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.13.1007.0",
"broad.mit.edu_OV.HT_HG-U133A.Level_1.14.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.15.1007.0",
"broad.mit.edu_OV.HT_HG-U133A.Level_1.17.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.18.1007.0",
"broad.mit.edu_OV.HT_HG-U133A.Level_1.19.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.21.1007.0",
"broad.mit.edu_OV.HT_HG-U133A.Level_1.22.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.24.1007.0",
"broad.mit.edu_OV.HT_HG-U133A.Level_1.27.1007.0", "broad.mit.edu_OV.HT_HG-U133A.Level_1.40.1007.0")
nBatches <- length(tcgaDataDirs)
batchNumber <- strsplit(tcgaDataDirs, "\\.")
batchNumber <- unlist(lapply(batchNumber, function(x) {
x[length(x) - 2]
}))
batchNumber <- as.numeric(batchNumber)
batchNumber
## [1] 9 11 12 13 14 15 17 18 19 21 22 24 27 40
tcgaDataDirs <- tcgaDataDirs[order(batchNumber)]
tcgaDataDirs
## [1] "broad.mit.edu_OV.HT_HG-U133A.Level_1.9.1007.0"
## [2] "broad.mit.edu_OV.HT_HG-U133A.Level_1.11.1007.0"
## [3] "broad.mit.edu_OV.HT_HG-U133A.Level_1.12.1007.0"
## [4] "broad.mit.edu_OV.HT_HG-U133A.Level_1.13.1007.0"
## [5] "broad.mit.edu_OV.HT_HG-U133A.Level_1.14.1007.0"
## [6] "broad.mit.edu_OV.HT_HG-U133A.Level_1.15.1007.0"
## [7] "broad.mit.edu_OV.HT_HG-U133A.Level_1.17.1007.0"
## [8] "broad.mit.edu_OV.HT_HG-U133A.Level_1.18.1007.0"
## [9] "broad.mit.edu_OV.HT_HG-U133A.Level_1.19.1007.0"
## [10] "broad.mit.edu_OV.HT_HG-U133A.Level_1.21.1007.0"
## [11] "broad.mit.edu_OV.HT_HG-U133A.Level_1.22.1007.0"
## [12] "broad.mit.edu_OV.HT_HG-U133A.Level_1.24.1007.0"
## [13] "broad.mit.edu_OV.HT_HG-U133A.Level_1.27.1007.0"
## [14] "broad.mit.edu_OV.HT_HG-U133A.Level_1.40.1007.0"
batchNumber <- sort(batchNumber)
Next, we get all of the individual filenames contained in each folder.
tcgaFiles <- vector("list", length(batchNumber))
names(tcgaFiles) <- paste("Batch", batchNumber, sep = ".")
for (i1 in 1:nBatches) {
tcgaFiles[[i1]] <- dir(file.path(pathToTCGAData, tcgaDataDirs[i1]), pattern = "CEL$")
}
unlist(lapply(tcgaFiles, length))
## Batch.9 Batch.11 Batch.12 Batch.13 Batch.14 Batch.15 Batch.17 Batch.18
## 45 37 46 47 46 22 47 47
## Batch.19 Batch.21 Batch.22 Batch.24 Batch.27 Batch.40
## 47 46 47 46 24 51
nFiles <- sum(unlist(lapply(tcgaFiles, length)))
nFiles
## [1] 598
sampleBatch <- rep(batchNumber, times = unlist(lapply(tcgaFiles, length)))
There are 598 filenames, but (as noted above) these include samples not used in the analyses.
We list out the full paths in a character vector for feeding to justRMA.
celFileNames <- unlist(tcgaFiles)
celFileDirs <- rep(tcgaDataDirs, times = unlist(lapply(tcgaFiles, length)))
celFilePaths <- file.path(pathToTCGAData, celFileDirs, celFileNames)
unusedCELs <- as.character(sdrf[sdrf[, "Comment..TCGA.Include.for.Analysis."] ==
"no", "Array.Data.File"])
celFilePathsReduced <- celFilePaths[-match(unusedCELs, celFileNames)]
Now we use justRMA to summarize expression at the probeset level. We exclude the 4 CEL files not included for analysis per the sdrf file.
d1 <- date()
tcgaExpression <- justRMA(filenames = celFilePathsReduced)
tcgaExpression <- exprs(tcgaExpression)
d2 <- date()
c(d1, d2)
## [1] "Wed Nov 20 10:53:29 2013" "Wed Nov 20 10:57:17 2013"
The justRMA computation takes between 5 and 6 minutes on my MacBook Pro.
As an aside, we note that the RMA values computed here match the Level 2 values reported by TCGA quite well (to about 4 decimal places). Given that the group at the Broad is using a distinct implementation of RMA written for GenePattern, the differences are within roundoff error, and should have no substantive effect on any analyses. This difference in coding may also explain why the row (probeset) ordering produced by justRMA differs from that reported in the Level 2 files.
We now identify the sample barcodes using the sdrf file and parse them for more information.
barcodeRows <- match(celFileNames, as.character(sdrf[, "Array.Data.File"]))
sampleBarcodes <- as.character(sdrf[barcodeRows, "Extract.Name"])
sum(duplicated(sampleBarcodes))
## [1] 1
sampleBarcodes[sampleBarcodes == sampleBarcodes[duplicated(sampleBarcodes)]]
## [1] "TCGA-29-1704-02A-01R-0808-01" "TCGA-29-1704-02A-01R-0808-01"
celFileNames[sampleBarcodes == sampleBarcodes[duplicated(sampleBarcodes)]]
## Batch.2720
## "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106.CEL"
## Batch.2724
## "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990.CEL"
unusedCELs
## [1] "URGED_p_TCGA_Pop_May2011_HT_HG-U133A_96-HTA_D05_786990.CEL"
## [2] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_F04_516474.CEL"
## [3] "FEAST_p_TCGA_B20_21_Expression_HT_HG-U133A_96-HTA_G06_516372.CEL"
## [4] "TARRE_p_MultiPlate_TCGA_SS_MA_Ref_HT_HG-U133A_96-HTA_F05_586106.CEL"
sampleBarcodes[duplicated(sampleBarcodes)] <- paste(sampleBarcodes[duplicated(sampleBarcodes)],
"Rep", sep = ".")
sourceSite <- substr(sampleBarcodes, 6, 7)
patientID <- substr(sampleBarcodes, 9, 12)
sampleType <- substr(sampleBarcodes, 14, 15)
sampleTypeText <- rep("primaryTumor", nFiles)
sampleTypeText[sampleType == "02"] <- "recurrentTumor"
sampleTypeText[sampleType == "11"] <- "normalTissue"
sampleUsed <- rep("yes", nFiles)
sampleUsed[match(unusedCELs, celFileNames)] <- "no"
As noted above, one of the barcodes is used twice. To allow the barcodes to be used as sample IDs (rownames in a data frame), we add a suffix to the latter occurrence. Since the two CEL files for the same sample are among the four CEL files omitted from the analysis, the point is somewhat moot.
We now bundle these bits of information into a data frame.
tcgaSampleInfo <- data.frame(sourceSite = sourceSite, patientID = patientID,
sampleType = sampleType, sampleTypeText = sampleTypeText, sampleBatch = sampleBatch,
row.names = sampleBarcodes)
tcgaSampleInfo <- tcgaSampleInfo[sampleUsed == "yes", ]
tcgaSampleInfo[1:4, ]
## sourceSite patientID sampleType
## TCGA-13-0758-01A-01R-0362-01 13 0758 01
## TCGA-09-0364-01A-02R-0362-01 09 0364 01
## TCGA-13-0723-01A-02R-0362-01 13 0723 01
## TCGA-13-0757-01A-01R-0362-01 13 0757 01
## sampleTypeText sampleBatch
## TCGA-13-0758-01A-01R-0362-01 primaryTumor 9
## TCGA-09-0364-01A-02R-0362-01 primaryTumor 9
## TCGA-13-0723-01A-02R-0362-01 primaryTumor 9
## TCGA-13-0757-01A-01R-0362-01 primaryTumor 9
Now we save the relevant information to an RData object.
colnames(tcgaExpression) <- as.character(sdrf[match(colnames(tcgaExpression),
as.character(sdrf[, "Array.Data.File"])), "Extract.Name"])
all(colnames(tcgaExpression) == rownames(tcgaSampleInfo))
## [1] TRUE
save(tcgaSampleInfo, tcgaDataDirs, tcgaFiles, tcgaExpression, file = file.path("RDataObjects",
"tcgaExpression.RData"))
getwd()
## [1] "/Users/slt/SLT WORKSPACE/EXEMPT/OVARIAN/Ovarian residual disease study 2012/RD manuscript/Web page for paper/Webpage"
sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] hthgu133acdf_2.12.0 AnnotationDbi_1.22.6 affy_1.38.1
## [4] Biobase_2.20.1 BiocGenerics_0.6.0 knitr_1.5
##
## loaded via a namespace (and not attached):
## [1] affyio_1.28.0 BiocInstaller_1.10.4 DBI_0.2-7
## [4] evaluate_0.5.1 formatR_0.9 IRanges_1.18.4
## [7] preprocessCore_1.22.0 RSQLite_0.11.4 stats4_3.0.2
## [10] stringr_0.6.2 tools_3.0.2 zlibbioc_1.6.0