Assembling an RMA Quantification Matrix for the TCGA Ovarian Data
=================================================================

Keith A. Baggerly

## 1 Executive Summary

### 1.1 Introduction

We want to produce an RData file with a matrix of robust multi-array average (RMA)
expression values for the TCGA ovarian cancer samples
profiled with Affymetrix HT\_HG-U133A arrays.

### 1.2 Methods

We acquired the 14 gzipped tarballs containing the
individual Level 1 data files (CEL files) from the
TCGA open-access HTTP page,
[https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/cgcc/broad.mit.edu/ht_hg-u133a/transcriptome/](https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/cgcc/broad.mit.edu/ht_hg-u133a/transcriptome/),
on September 2, 2012.
According to the page,
these files were last updated on June 24, 2011.
At the same time, we also acquired the
gzipped tarball with the
MageTab (annotation) data.

Explicit lists of the batch and version numbers
of the tarballs used are given in the text below.

We load the individual CEL files by folder,
recording the folder (batch) information as we go,
and use justRMA to compute RMA fits for the set.
We extract the expression matrix and use the
mapping information from the sample and data relationship format (sdrf) file in the
MageTab folder to update the column names.

### 1.3 Results

We save tcgaSampleInfo,
     tcgaDataDirs,
     tcgaFiles, and
     tcgaExpression
to the RData file "tcgaExpression.RData".

In passing, we note that our quantifications
match the reported TCGA Level 2 quantifications
quite well (essentially to within roundoff error)
when we restrict justRMA to the 594 (of 598) CEL
files that are "used in analysis" per the sdrf file.

## 2 Libraries

We first load the libraries we will use
in this report.

```{r libraries, message=FALSE}

library(affy)
library(hthgu133acdf)

```

## 3 Specifying the Raw Data Location

Here, we specify the location of the data we acquired
from TCGA on our local system. You will need to acquire
these files and adjust this path before running
this report yourself.

```{r pathToTCGAData}

pathToTCGAData <-
    file.path("RawData","TCGA","CEL_Files")

```

## 4 The SDRF file from the MageTab Folder

We now load the sample and data relationship format (sdrf) information.

```{r loadSDRF}

sdrf <-
    read.table(file.path(pathToTCGAData,
                         "broad.mit.edu_OV.HT_HG-U133A.mage-tab.1.1007.0",
                         "broad.mit.edu_OV.HT_HG-U133A.sdrf.txt"),
                   header=TRUE, sep="\t")
dim(sdrf)
sdrf[1,]

length(unique(sdrf[,1]))

```

There were 598 arrays run, but only 597 distinct samples;
one sample was run twice. We now check which sample this
was, and whether we care.

```{r checkDuplicates}

# Rows for the sample that was run twice, and their hybridization names
which(sdrf[,1]==sdrf[which(duplicated(sdrf[,1])),1])
as.character(sdrf[which(sdrf[,1]==sdrf[which(duplicated(sdrf[,1])),1]),
                  "Hybridization.Name"])
# Rows TCGA flags as not included for analysis, and their hybridization names
which(sdrf[,"Comment..TCGA.Include.for.Analysis."]=="no")
as.character(sdrf[which(sdrf[,"Comment..TCGA.Include.for.Analysis."]=="no"),
                  "Hybridization.Name"])

```

As it happens, four of the CEL files (including the
two where the same sample was run) are excluded from
later analyses. Spot checking the files in Batch 27
(where both of the duplicates were) shows these samples
are present in the Level 1 but not in the Level 2 data.
We restrict our quantification to just the 594 samples
used.
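
The equality comparison in the check above relies on there being exactly one repeated sample ID. A minimal sketch of a variant that also copes with several repeated IDs, using %in% on a toy data frame (toySdrf and its values are made up for illustration and are not taken from the TCGA sdrf file):

```{r sketchDuplicateRows, eval=FALSE}
# Toy stand-in for the sdrf table; names and values are hypothetical.
toySdrf <- data.frame(
    Extract.Name = c("s1", "s2", "s1", "s3", "s2"),
    Hybridization.Name = c("h1", "h2", "h3", "h4", "h5"),
    stringsAsFactors = FALSE)

# Identifiers that occur more than once.
dupIDs <- unique(toySdrf$Extract.Name[duplicated(toySdrf$Extract.Name)])

# All rows (first and later occurrences) carrying a repeated identifier.
dupRows <- which(toySdrf$Extract.Name %in% dupIDs)
dupRows
toySdrf[dupRows, "Hybridization.Name"]
```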

## 5 Quantifying The CEL Files

### 5.1 Identifying the Data Directories

Next, we turn to the individual Level 1 data files (CEL files).
These are stored in 14 folders, corresponding
to run batches. Here, we identify
the folders and sort them in rough chronological
order. Since this is a "freeze" of what we
use to generate our RData file, we hardcode
the directories used.

```{r getDataDirs}

tcgaDataDirs <-
    c(
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.9.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.11.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.12.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.13.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.14.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.15.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.17.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.18.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.19.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.21.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.22.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.24.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.27.1007.0",
    "broad.mit.edu_OV.HT_HG-U133A.Level_1.40.1007.0"
    )

nBatches <- length(tcgaDataDirs)

# Split each folder name on "." and take the third-from-last field,
# which holds the batch number
batchNumber <- strsplit(tcgaDataDirs,"\\.")
batchNumber <- unlist(lapply(batchNumber,function(x){x[length(x)-2]}))
batchNumber <- as.numeric(batchNumber)
batchNumber

tcgaDataDirs <- tcgaDataDirs[order(batchNumber)]
tcgaDataDirs

batchNumber <- sort(batchNumber)

```
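
Converting the batch labels to numbers before sorting matters: sorted as character strings, "11" would come before "9". A small sketch of this point, which also pulls out the batch number with a regular expression rather than strsplit (the regular expression and the names demoDirs and demoBatch are illustrative assumptions, not part of the analysis above):

```{r sketchBatchOrder, eval=FALSE}
demoDirs <- c("broad.mit.edu_OV.HT_HG-U133A.Level_1.9.1007.0",
              "broad.mit.edu_OV.HT_HG-U133A.Level_1.11.1007.0",
              "broad.mit.edu_OV.HT_HG-U133A.Level_1.40.1007.0")

# Capture the digits that follow "Level_1." in each folder name.
demoBatch <- sub(".*Level_1\\.([0-9]+)\\..*", "\\1", demoDirs)
demoBatch

# Character sorting puts "11" ahead of "9"; numeric sorting does not.
sort(demoBatch)
sort(as.numeric(demoBatch))
```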

### 5.2 Grabbing the CEL File Names

Next, we get all of the individual filenames
contained in each folder.

```{r grabFilenames}

tcgaFiles <- vector("list",length(batchNumber))
names(tcgaFiles) <- paste("Batch",batchNumber,sep=".")
for(i1 in 1:nBatches){
    tcgaFiles[[i1]] <-
        dir(file.path(pathToTCGAData,tcgaDataDirs[i1]),pattern="CEL$")
}
unlist(lapply(tcgaFiles,length))
nFiles <- sum(unlist(lapply(tcgaFiles,length)))
nFiles

sampleBatch <- rep(batchNumber,times=unlist(lapply(tcgaFiles,length)))

```

There are 598 filenames, but (as noted above) these include samples
not used in the analyses.
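
The bookkeeping above (one vector of filenames per folder, then one batch label per file) can be sketched on a throwaway directory tree. Everything here is made up for illustration: the folder names, the empty placeholder files, and the demo object names.

```{r sketchFileListing, eval=FALSE}
demoRoot <- file.path(tempdir(), "demoCEL")
demoFolders <- c("batch.9", "batch.11")
demoContents <- list(c("a.CEL", "b.CEL", "notes.txt"), c("c.CEL"))

# Create the folders and fill them with empty placeholder files.
for (i1 in seq_along(demoFolders)) {
    dir.create(file.path(demoRoot, demoFolders[i1]), recursive = TRUE,
               showWarnings = FALSE)
    file.create(file.path(demoRoot, demoFolders[i1], demoContents[[i1]]))
}

# List only names ending in "CEL", one list element per folder.
demoFiles <- lapply(file.path(demoRoot, demoFolders), dir, pattern = "CEL$")
names(demoFiles) <- demoFolders

# lengths() gives the per-folder file counts; rep() expands the
# folder-level batch label to one entry per file.
demoCounts <- lengths(demoFiles)
demoBatchPerFile <- rep(c(9, 11), times = demoCounts)
demoCounts
demoBatchPerFile
```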

We list out the full paths in a character vector for
feeding to justRMA.

```{r celFilePaths}

celFileNames <- unlist(tcgaFiles)
celFileDirs <- rep(tcgaDataDirs,times=unlist(lapply(tcgaFiles,length)))
celFilePaths <- file.path(pathToTCGAData,celFileDirs,celFileNames)

unusedCELs <-
    as.character(sdrf[sdrf[,"Comment..TCGA.Include.for.Analysis."]=="no",
                      "Array.Data.File"])

celFilePathsReduced <- celFilePaths[-match(unusedCELs,celFileNames)]

```
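
One caveat with negative indexing via match: if no file were marked as unused, match would return an empty vector and the negative index would select nothing rather than everything. A hedged sketch of the pitfall and of a logical-index alternative, on toy values (the demo names are hypothetical):

```{r sketchNegativeMatch, eval=FALSE}
demoNames <- c("a.CEL", "b.CEL", "c.CEL")
demoPaths <- file.path("someDir", demoNames)

# With something to exclude, both approaches agree.
demoUnused <- "b.CEL"
demoPaths[-match(demoUnused, demoNames)]
demoPaths[!(demoNames %in% demoUnused)]

# With nothing to exclude, negative indexing returns an empty vector,
# whereas the logical index keeps all of the paths.
demoNone <- character(0)
length(demoPaths[-match(demoNone, demoNames)])
length(demoPaths[!(demoNames %in% demoNone)])
```

For the sdrf file used here, four files are flagged, so the two forms give the same result.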

### 5.3 Running justRMA

Now we use justRMA to summarize expression at the probeset level.
We exclude the 4 CEL files not included for analysis per the sdrf file.

```{r fitRMA,message=FALSE}

d1 <- date()
tcgaExpression <- justRMA(filenames=celFilePathsReduced)
tcgaExpression <- exprs(tcgaExpression)
d2 <- date()
c(d1,d2)

```

The justRMA computation takes between 5 and
6 minutes on my MacBook Pro.

As an aside, we note that
the RMA values computed here match the Level 2 values
reported by TCGA quite well (to about 4 decimal places).
The group at the Broad is using a distinct implementation
of RMA written for GenePattern, so small numerical
discrepancies are to be expected. The differences we see are within
roundoff error, and should have no substantive effect on
any analyses. This difference in implementation may also explain
why the row (probeset) ordering produced by justRMA
differs from that reported in the Level 2 files.
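
A comparison of this kind can be sketched as follows: align two matrices on their shared row and column names, then report the largest absolute difference. This is only an illustration on simulated numbers; the function name compareExpression is hypothetical, and the code used to read the TCGA Level 2 files is not shown in this report.

```{r sketchCompareMatrices, eval=FALSE}
compareExpression <- function(x, y) {
    # Restrict to probesets (rows) and samples (columns) present in both.
    commonRows <- intersect(rownames(x), rownames(y))
    commonCols <- intersect(colnames(x), colnames(y))
    # Indexing by name puts both matrices in the same order.
    delta <- x[commonRows, commonCols, drop = FALSE] -
        y[commonRows, commonCols, drop = FALSE]
    max(abs(delta))
}

set.seed(1)
demoA <- matrix(rnorm(12, mean = 8), nrow = 4,
                dimnames = list(paste0("probe", 1:4), paste0("s", 1:3)))
# Same values with rows shuffled and rounded to 4 decimal places.
demoB <- round(demoA[c(3, 1, 4, 2), ], 4)

compareExpression(demoA, demoB)
```

Matching on names rather than position is what makes the differing probeset order harmless.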

## 6 Mapping CEL Names to Sample Barcodes

We now identify the sample barcodes using the sdrf
file and parse them for more information.

```{r buildSampleInfoPt1}

barcodeRows <- match(celFileNames,as.character(sdrf[,"Array.Data.File"]))
sampleBarcodes <- as.character(sdrf[barcodeRows,"Extract.Name"])

sum(duplicated(sampleBarcodes))
sampleBarcodes[sampleBarcodes==sampleBarcodes[duplicated(sampleBarcodes)]]
celFileNames[sampleBarcodes==sampleBarcodes[duplicated(sampleBarcodes)]]
unusedCELs
sampleBarcodes[duplicated(sampleBarcodes)] <-
    paste(sampleBarcodes[duplicated(sampleBarcodes)],"Rep",sep=".")

# Barcode fields are read off by character position
sourceSite <- substr(sampleBarcodes,6,7)
patientID  <- substr(sampleBarcodes,9,12)
sampleType <- substr(sampleBarcodes,14,15)
# Any sample type code other than 02 or 11 is labeled primaryTumor
sampleTypeText <- rep("primaryTumor",nFiles)
sampleTypeText[sampleType=="02"] <- "recurrentTumor"
sampleTypeText[sampleType=="11"] <- "normalTissue"

sampleUsed <- rep("yes",nFiles)
sampleUsed[match(unusedCELs,celFileNames)] <- "no"

```
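
The substr calls above rely on the fixed layout of the barcode: characters 6 to 7 hold the source site, 9 to 12 the patient ID, and 14 to 15 the sample type. A sketch on a made-up barcode (the value below follows the TCGA pattern but is not a real sample), with a cross-check that splits on the hyphens instead:

```{r sketchBarcodeFields, eval=FALSE}
# Hypothetical barcode, for illustration only.
demoBarcode <- "TCGA-AB-1234-01A-01R-0001-01"

# Fixed-position extraction, as in the chunk above.
c(sourceSite = substr(demoBarcode, 6, 7),
  patientID  = substr(demoBarcode, 9, 12),
  sampleType = substr(demoBarcode, 14, 15))

# Equivalent extraction by splitting on "-"; the sample type is the
# first two characters of the fourth field.
demoFields <- strsplit(demoBarcode, "-", fixed = TRUE)[[1]]
c(sourceSite = demoFields[2],
  patientID  = demoFields[3],
  sampleType = substr(demoFields[4], 1, 2))
```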

As noted above, one of the barcodes is used twice.
To allow the barcodes to be used as sample
IDs (rownames in a data frame), we add a suffix
to the second occurrence. Since the two CEL files
for the same sample are among the four CEL files
omitted from the analysis, the point is somewhat moot.
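
With a single repeated barcode, pasting on one suffix is enough. If a barcode occurred three or more times, the same suffix would be added to each later copy and the names would still collide. As a hedged alternative, base R's make.unique appends a running counter instead; a sketch on toy IDs (this is a swapped-in technique, not what the chunk above does):

```{r sketchMakeUnique, eval=FALSE}
demoIDs <- c("x", "y", "x", "x")

# The suffix approach leaves the two later copies identical.
demoRep <- demoIDs
demoRep[duplicated(demoRep)] <-
    paste(demoRep[duplicated(demoRep)], "Rep", sep = ".")
demoRep
anyDuplicated(demoRep) > 0

# make.unique numbers the later copies, so all names are distinct.
demoUnique <- make.unique(demoIDs)
demoUnique
anyDuplicated(demoUnique) > 0
```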

We now bundle these bits of information into a
data frame.

```{r buildSampleInfoPt2}

tcgaSampleInfo <-
    data.frame(sourceSite=sourceSite,
               patientID=patientID,
               sampleType=sampleType,
               sampleTypeText=sampleTypeText,
               sampleBatch=sampleBatch,
               row.names=sampleBarcodes)
tcgaSampleInfo <- tcgaSampleInfo[sampleUsed=="yes",]

tcgaSampleInfo[1:4,]

```

## 7 Saving RData

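Now we relabel the columns of the expression matrix with the sample
barcodes, check that they agree with the row names of tcgaSampleInfo,
and save the relevant information to an RData file.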

```{r saveTCGAExpression}

colnames(tcgaExpression) <-
    as.character(sdrf[match(colnames(tcgaExpression),
                            as.character(sdrf[,"Array.Data.File"])),
                      "Extract.Name"])

all(colnames(tcgaExpression)==rownames(tcgaSampleInfo))

save(tcgaSampleInfo,
     tcgaDataDirs,
     tcgaFiles,
     tcgaExpression,
     file=file.path("RDataObjects","tcgaExpression.RData"))

```
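
Two practical points about save: the target folder must already exist, and a saved file can be checked by loading it into a separate environment so that nothing in the workspace is overwritten. A minimal sketch using a temporary folder and toy objects (the demo names and the temporary path are assumptions for illustration):

```{r sketchSaveCheck, eval=FALSE}
demoDir <- file.path(tempdir(), "demoRData")
# save() does not create missing folders, so make sure it is there.
dir.create(demoDir, recursive = TRUE, showWarnings = FALSE)

demoMatrix <- matrix(1:6, nrow = 2,
                     dimnames = list(c("p1", "p2"), c("s1", "s2", "s3")))
demoInfo <- data.frame(batch = c(9, 9, 11),
                       row.names = c("s1", "s2", "s3"))
demoFile <- file.path(demoDir, "demo.RData")
save(demoMatrix, demoInfo, file = demoFile)

# Load into a fresh environment and compare with the originals.
demoEnv <- new.env()
load(demoFile, envir = demoEnv)
sort(ls(demoEnv))
identical(demoEnv$demoMatrix, demoMatrix)
all(colnames(demoEnv$demoMatrix) == rownames(demoEnv$demoInfo))
```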


## 8 Appendix

### 8.1 File Location

```{r getLocation}

getwd()

```

### 8.2 SessionInfo

```{r sessionInfo}

sessionInfo()

```