Assembling Clinical Information for the CCLE Data
=================================================

by Keith A. Baggerly

## 1 Executive Summary

### 1.1 Introduction

We want to produce an RData file with the
clinical (annotation) information for the
cancer cell lines profiled as part of the
Cancer Cell Line Encyclopedia [(CCLE)](#ccle12).

### 1.2 Methods

We use GEOquery to download the series records
for the 917 cell lines posted at the Gene Expression
Omnibus (GEO) as part of GSE36133:
[http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36133](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36133).
From the annotation information contained in the
individual GSM files, we extract the
cell line name, GSM sample id, site of primary tumor,
histology, and histological subtype (when applicable).

We save these results both as a data frame and a csv file.

### 1.3 Results

We save the data frame
ccleClinical
to the RData file "ccleClinical.RData" in RDataObjects,
and also export the table to ccleClinical.csv in RawData/CCLE/Clinical.

## 2 Libraries

We first load the library we will use
in this report.

```{r libraries, message=FALSE}

library(GEOquery)

```


## 3 Loading the Data

Here we simply use the GEOquery package to download the
annotation information (and posted quantifications) directly
from GEO. Since the posted quantifications are based on a nonstandard
CDF file, we prefer to build our own from the CEL files,
so only the annotation is of interest here.
Since the number of CEL files is large, GEO partitions the
results into component series files; each contains info
on at most 255 entries, so there are 4 files for the CCLE data.
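
As a quick sanity check on the file count, the arithmetic can be sketched as follows. The 917 samples and the 255-entry cap are taken from the text above; the variable names are ours and are used only for illustration.

```r
# Number of samples in GSE36133 and the per-file cap described above.
nSamples   <- 917
maxPerFile <- 255

# Smallest number of series files that can hold all of the samples.
nSeriesFiles <- ceiling(nSamples / maxPerFile)
nSeriesFiles
```

Three files would hold at most 765 entries, so a fourth is needed.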

```{r loadCCLEClinical, message=FALSE}

d1 <- date()
ccleFromGEO <-
    getGEO("GSE36133")
d2 <- date()
c(d1,d2)
length(ccleFromGEO)
names(ccleFromGEO)
class(ccleFromGEO)
class(ccleFromGEO[[1]])

```

Obtaining the data takes about 30 seconds on my MacBook Pro
using a high-speed home DSL connection.
Judging timing here is a bit tricky, in that it relies on the
speed of your internet connection as well as your
computer's processing power.
We now have a list of ExpressionSet objects to work with.
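
If a numeric elapsed time is preferred over a pair of date stamps, base R's system.time() is one option. The sketch below is a minimal illustration, not part of the report's pipeline: slowStep is a hypothetical stand-in for the download, so that the example runs without an internet connection.

```r
# Hypothetical stand-in for a slow step; when online, the body
# could be replaced with the real call, e.g. getGEO("GSE36133").
slowStep <- function() {
    Sys.sleep(0.1)
    invisible(NULL)
}

# system.time() reports user, system, and elapsed (wall clock) time;
# elapsed is the one that reflects waiting on a network transfer.
timing <- system.time(slowStep())
timing["elapsed"]
```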

## 4 Extracting the Annotation

Since what we really want is the annotation, we need to
extract the phenoData from each ExpressionSet and look
at the pData from each phenoData object.

### 4.1 Identifying Annotation Fields of Interest

Before simply bundling the annotation across all files,
we examine the results for a few files to see which
fields are actually informative.

We first look at the information supplied for a single
file.

```{r examineFirstFile, message=FALSE}

annotBlock1 <- pData(phenoData(ccleFromGEO[[1]]))
dim(annotBlock1)
colnames(annotBlock1)
annotBlock1[1,]

```

There's quite a bit of annotation here, but most of it
isn't unique to the given cell line, and is thus of
less interest to us. We compare the annotations for the
first two samples (rows) to see which fields change.

```{r compareFirstTwoFiles, message=FALSE}

annotBlock1[1,]==annotBlock1[2,]
sum(annotBlock1[1,]!=annotBlock1[2,])

```

There are 7 fields whose values change, but two of
these (geo\_accession and supplementary\_file) reflect
the fact that the GSM number is different, and this
information is already in the row names. This leaves
title (the cell line name),
source\_name\_ch1 (where the cell line came from),
characteristics\_ch1 (the organ location of the primary tumor),
characteristics\_ch1.1 (the tumor histology), and
characteristics\_ch1.2 (the histologic subtype, if applicable).
We extract these fields for our annotation table.
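
Comparing only two samples could, in principle, miss a field whose values happen to agree for that pair. A minimal sketch of a check across all rows is given below. It uses a small toy data frame in place of the real pData table; the toy values (including the platform id) are invented for illustration.

```r
# Toy stand-in for a pData table; all values are illustrative only.
toyAnnot <- data.frame(
    title               = c("lineA", "lineB", "lineC"),
    type                = c("RNA", "RNA", "RNA"),
    characteristics_ch1 = c("primary site: lung",
                            "primary site: skin",
                            "primary site: lung"),
    platform_id         = c("GPL0000", "GPL0000", "GPL0000"),
    row.names           = c("GSM1", "GSM2", "GSM3"),
    stringsAsFactors    = FALSE
)

# A column is informative if it takes more than one distinct value.
isVarying <- vapply(toyAnnot,
                    function(x) length(unique(x)) > 1,
                    logical(1))
names(toyAnnot)[isVarying]
```

Applied to annotBlock1, the same vapply() call would flag every field that varies anywhere in the block, not just between the first two rows.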

### 4.2 Grabbing Interesting Columns

Now we grab the columns of interest from each ExpressionSet,
convert them to character matrices, and bind them together
into a single object.

```{r grabAndBind, message=FALSE}

annotBlock2 <- pData(phenoData(ccleFromGEO[[2]]))
annotBlock3 <- pData(phenoData(ccleFromGEO[[3]]))
annotBlock4 <- pData(phenoData(ccleFromGEO[[4]]))
keyColumns <-
    c("title","source_name_ch1","characteristics_ch1",
      "characteristics_ch1.1","characteristics_ch1.2")
allAnnot <-
    rbind(as.matrix(annotBlock1[,keyColumns]),
          as.matrix(annotBlock2[,keyColumns]),
          as.matrix(annotBlock3[,keyColumns]),
          as.matrix(annotBlock4[,keyColumns]))
dim(allAnnot)
allAnnot[1:3,]

```

We have extracted the information desired.
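
The same subset-and-bind step can be written without hard-coding the number of blocks. The sketch below is one way to do it, shown on toy data frames that stand in for the per-file pData tables; toyBlocks and toyKeyColumns are hypothetical names, and the values are invented.

```r
# Toy stand-ins for the per-file pData tables; the real blocks would
# come from pData(phenoData(x)) for each element of the list.
toyBlocks <- list(
    data.frame(title = c("lineA", "lineB"),
               source_name_ch1 = c("src1", "src2"),
               other = c("x", "x"),
               row.names = c("GSM1", "GSM2"),
               stringsAsFactors = FALSE),
    data.frame(title = "lineC",
               source_name_ch1 = "src1",
               other = "x",
               row.names = "GSM3",
               stringsAsFactors = FALSE)
)
toyKeyColumns <- c("title", "source_name_ch1")

# Subset each block to the key columns, coerce to a character matrix,
# then stack the blocks, however many there are.
toyAllAnnot <-
    do.call(rbind,
            lapply(toyBlocks,
                   function(b) as.matrix(b[, toyKeyColumns])))
dim(toyAllAnnot)
rownames(toyAllAnnot)
```

This form would keep working if GEO were to repartition the series into a different number of files.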

## 5 Rearranging the Annotation in a Data Frame

While we have all of the information we want, it's not
yet arranged the way we want it. We'd prefer to use the
cell line names as row names, as opposed to the GEO ids,
and several parts of the text strings (e.g., "primary site:")
appear redundant.

Here we clean up the data and reorder the columns.

```{r cleanColumns, message=FALSE}

GEO.ID            <- rownames(allAnnot)
cellLineNames     <- allAnnot[,"title"]
sourceName        <- allAnnot[,"source_name_ch1"]
primarySite       <- allAnnot[,"characteristics_ch1"]
histology         <- allAnnot[,"characteristics_ch1.1"]
subtype           <- allAnnot[,"characteristics_ch1.2"]

table(sourceName)

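## Each field starts with a fixed label (e.g., "primary site: ").
## Tabulate the leading characters to confirm the label is the
## same for every entry, then strip it.

## 14-character label on the primary site
table(substr(primarySite,1,14))
primarySite <- substr(primarySite,15,nchar(primarySite))

## 11-character label on the histology
table(substr(histology,1,11))
histology <- substr(histology,12,nchar(histology))

## 20-character label on the histologic subtype
table(substr(subtype,1,20))
subtype <- substr(subtype,21,nchar(subtype))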

ccleClinical <-
    data.frame(GEO.ID=GEO.ID,
               sourceName=sourceName,
               primarySite=primarySite,
               histology=histology,
               subtype=subtype,
               row.names=cellLineNames)

ccleClinical[1:3,]

```
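
The fixed-width substr() calls above depend on knowing the exact label lengths. As an alternative sketch, a regular expression can strip everything up to the first ": " regardless of length. The helper stripLabel is a hypothetical name, and the example strings are illustrative rather than copied from GEO.

```r
# Illustrative strings in the "label: value" form described above.
toySite    <- c("primary site: lung", "primary site: skin")
toySubtype <- c("histology subtype1: adenocarcinoma",
                "histology subtype1: NS")

# Drop everything up to and including the first ": ".
stripLabel <- function(x) sub("^[^:]*: ", "", x)

stripLabel(toySite)
stripLabel(toySubtype)

# The fixed-width approach gives the same answer when the label
# length is constant (14 characters for "primary site: ").
identical(stripLabel(toySite),
          substr(toySite, 15, nchar(toySite)))
```

The tabulation step in the report is still worth keeping, since it confirms that every entry carries the expected label in the first place.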

## 6 Saving RData and csv Files

Now we save the annotation table to an RData file
and to a csv file; the latter is for use when we don't
trust our internet connection.

```{r saveCCLEClinical, message=FALSE}

save(ccleClinical,
     file=file.path("RDataObjects","ccleClinical.RData"))

write.csv(ccleClinical,
          file=file.path("RawData","CCLE","Clinical","ccleClinical.csv"))

```
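
Both calls above assume that the target directories already exist. The sketch below shows one way to create a nested output directory and to read the csv file back with the cell line names restored as row names. It writes a toy table under tempdir() so that it leaves nothing behind; toyClinical and its values are invented for illustration.

```r
# Toy table standing in for ccleClinical; values are illustrative.
toyClinical <- data.frame(GEO.ID = c("GSM1", "GSM2"),
                          primarySite = c("lung", "skin"),
                          row.names = c("lineA", "lineB"),
                          stringsAsFactors = FALSE)

# Create the nested directory if needed, then write the csv file.
outDir <- file.path(tempdir(), "RawData", "CCLE", "Clinical")
dir.create(outDir, recursive = TRUE, showWarnings = FALSE)
csvFile <- file.path(outDir, "toyClinical.csv")
write.csv(toyClinical, file = csvFile)

# Read the table back; the first csv column holds the row names.
toyReloaded <- read.csv(csvFile, row.names = 1,
                        stringsAsFactors = FALSE)
all.equal(toyClinical, toyReloaded)
```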


## 7 Appendix

### 7.1 File Location

```{r getLocation}

getwd()

```

### 7.2 SessionInfo

```{r sessionInfo}

sessionInfo()

```

## 8 References

> <p id="ccle12"> [1] Barretina J, Caponigro G, Stransky N, Venkatesan K et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. <em>Nature</em>, <b>483(7391)</b>:603-7, 2012. PMID: 22460905.</p>