Assembling Clinical Information for the CCLE Data
=================================================

by Keith A. Baggerly

## 1 Executive Summary

### 1.1 Introduction

We want to produce an RData file with the
clinical (annotation) information for the
cancer cell lines profiled as part of the
Cancer Cell Line Encylcopedia [(CCLE)](#ccle12).

### 1.2 Methods

We use GEOquery to parse the annotation information
for the 917 cell lines posted at the Gene Expression
Omnibus (GEO) as part of GSE36133:
[http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36133](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36133).
We use GEOquery to extract the annotation information
contained in the individual GSM files, including
cell line name, GSM sample id, site of primary tumor,
histology, and histological subtype (when applicable).

We save these results both as a data frame and a csv file.

### 1.3 Results

We save
ccleClinical
to the RData file "ccleClinical.RData",
and also export the table to ccleClinical.csv in RawData.

## 2 Libraries

We first load the options and libraries we will use
in this report.


```r

library(GEOquery)
```



## 3 Loading the Data

Here we simply use the GEOquery package to download the
annotation information (and posted quantifications) directly
from GEO. Since the quantifications are based on a nonstandard
CDF file, we prefer to build our own from the CEL files.
Since the number of CEL files is large, GEO partitions the
results into component series files -- each contains info
on at most 255 entries, so there are 4 files for the CCLE data.


```r

d1 <- date()
ccleFromGEO <- getGEO("GSE36133")
d2 <- date()
c(d1, d2)
```

```
## [1] "Wed Jun 12 14:54:36 2013" "Wed Jun 12 14:55:22 2013"
```

```r
length(ccleFromGEO)
```

```
## [1] 4
```

```r
names(ccleFromGEO)
```

```
## [1] "GSE36133_series_matrix-1.txt.gz" "GSE36133_series_matrix-2.txt.gz"
## [3] "GSE36133_series_matrix-3.txt.gz" "GSE36133_series_matrix-4.txt.gz"
```

```r
class(ccleFromGEO)
```

```
## [1] "list"
```

```r
class(ccleFromGEO[[1]])
```

```
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
```


Obtaining the data takes about 30 seconds on my MacBook Pro
using a high-speed home DSL connection.
Judging timing here is a bit tricky, in that it relies on the
speed of your internet connection as well as your
computer's processing power.
We now have a list of ExpressionSet objects to work with.

## 4 Extracting the Annotation

Since what we really want is the annotation, we need to
extract the phenoData from each ExpressionSet and look
at the pData from each phenoData object.

## 4.1 Identifying Annotation Fields of Interest

Before simply bundling the annotation across all files,
we examine the results for a few files to see which
fields are actually informative.

We first look at the information supplied for a single
file.


```r

annotBlock1 <- pData(phenoData(ccleFromGEO[[1]]))
dim(annotBlock1)
```

```
## [1] 255  37
```

```r
colnames(annotBlock1)
```

```
##  [1] "title"                   "geo_accession"          
##  [3] "status"                  "submission_date"        
##  [5] "last_update_date"        "type"                   
##  [7] "channel_count"           "source_name_ch1"        
##  [9] "organism_ch1"            "characteristics_ch1"    
## [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
## [13] "treatment_protocol_ch1"  "growth_protocol_ch1"    
## [15] "molecule_ch1"            "extract_protocol_ch1"   
## [17] "label_ch1"               "label_protocol_ch1"     
## [19] "taxid_ch1"               "hyb_protocol"           
## [21] "scan_protocol"           "description"            
## [23] "data_processing"         "platform_id"            
## [25] "contact_name"            "contact_email"          
## [27] "contact_laboratory"      "contact_department"     
## [29] "contact_institute"       "contact_address"        
## [31] "contact_city"            "contact_state"          
## [33] "contact_zip/postal_code" "contact_country"        
## [35] "contact_web_link"        "supplementary_file"     
## [37] "data_row_count"
```

```r
annotBlock1[1, ]
```

```
##            title geo_accession                status submission_date
## GSM886835 1321N1     GSM886835 Public on Mar 20 2012     Mar 06 2012
##           last_update_date type channel_count source_name_ch1 organism_ch1
## GSM886835      Mar 20 2012  RNA             1           ECACC Homo sapiens
##                            characteristics_ch1 characteristics_ch1.1
## GSM886835 primary site: central_nervous_system     histology: glioma
##                     characteristics_ch1.2 treatment_protocol_ch1
## GSM886835 histology subtype1: astrocytoma                   None
##                                                                                                                                                                                                                                                                                       growth_protocol_ch1
## GSM886835 Cells lines were cultured following growth recommendations from the cell line source vendors. Cells were incubated at 37 ºC at 5% CO2 until 70% confluency was reached.  Pellets were harvested 48 hours post media change, flash frozen, and stored at -80 ºC until nucleic acid extraction.
##           molecule_ch1
## GSM886835    total RNA
##                                                                                                                                                                                                                                                                                                 extract_protocol_ch1
## GSM886835 RNA was isolated from frozen cell pellets, containing an average cell count of 4.5 million cells, via Trizol (Invitrogen) extraction following the manufacturers instructions. Samples were quantified using both the Nanodrop 8000 spectrophotometer (ThermoScientific) and via Agilent 2100 Bioanalyzer.
##           label_ch1
## GSM886835    Biotin
##                                                                                                                                                                                                  label_protocol_ch1
## GSM886835 The Affymetrix One-Cycle Target labeling assay was used to synthesize hybridization target using 1 mg of total RNA. Labeled cRNA hybridization target was produced using a 3-day, semi-automated workflow
##           taxid_ch1
## GSM886835      9606
##                                                                                                                                                                                                                                                                                                             hyb_protocol
## GSM886835 Fragmentation of the biotin-labeled cRNA prepared the sample for hybridization on the GeneChip Human U133 2.0 (PN 900467). The washing and staining was performed on the GeneChip Fluidics station using standard Affymetrix protocols outlined in the Expression Analysis Technical Manual, 2001, Affymetrix.
##                                                           scan_protocol
## GSM886835 The GeneChip® Scanner 3000 was used for scanning the arrays.
##                                  description
## GSM886835 Gene expression data from the CCLE
##                                                                                                                                                                                                                                                                              data_processing
## GSM886835 The data were analyzed with 2.14.0, using the packages affyio_1.22.0 and affy_1.32.0, and RMA as normalization method. rma() was used with default options (normalize=TRUE, background=TRUE) and with the package hgu133plus2hsentrezgcdf_15.0.0 from Brainarray for the CDF file.
##           platform_id      contact_name               contact_email
## GSM886835    GPL15308 Nicolas,,Stransky stransky@broadinstitute.org
##           contact_laboratory contact_department contact_institute
## GSM886835      Levi Garraway     Cancer Program   Broad Institute
##              contact_address contact_city contact_state
## GSM886835 7 Cambridge center    Cambridge            MA
##           contact_zip/postal_code contact_country
## GSM886835                   02118             USA
##                      contact_web_link
## GSM886835 www.broadinstitute.org/ccle
##                                                                                           supplementary_file
## GSM886835 ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM886nnn/GSM886835/GSM886835.CEL.gz
##           data_row_count
## GSM886835          18926
```


There's quite a bit of annotation here, but most of it
isn't unique to the given cell line, and is thus of
less interest to us. We compare annotations for the
first two files to see which bits change.


```r

annotBlock1[1, ] == annotBlock1[2, ]
```

```
##           title geo_accession status submission_date last_update_date type
## GSM886835 FALSE         FALSE   TRUE            TRUE             TRUE TRUE
##           channel_count source_name_ch1 organism_ch1 characteristics_ch1
## GSM886835          TRUE           FALSE         TRUE               FALSE
##           characteristics_ch1.1 characteristics_ch1.2
## GSM886835                 FALSE                 FALSE
##           treatment_protocol_ch1 growth_protocol_ch1 molecule_ch1
## GSM886835                   TRUE                TRUE         TRUE
##           extract_protocol_ch1 label_ch1 label_protocol_ch1 taxid_ch1
## GSM886835                 TRUE      TRUE               TRUE      TRUE
##           hyb_protocol scan_protocol description data_processing
## GSM886835         TRUE          TRUE        TRUE            TRUE
##           platform_id contact_name contact_email contact_laboratory
## GSM886835        TRUE         TRUE          TRUE               TRUE
##           contact_department contact_institute contact_address
## GSM886835               TRUE              TRUE            TRUE
##           contact_city contact_state contact_zip/postal_code
## GSM886835         TRUE          TRUE                    TRUE
##           contact_country contact_web_link supplementary_file
## GSM886835            TRUE             TRUE              FALSE
##           data_row_count
## GSM886835           TRUE
```

```r
sum(annotBlock1[1, ] != annotBlock1[2, ])
```

```
## [1] 7
```


There are 7 fields whose values change, but two of
these (geo\_accession and supplementary\_file) reflect
the fact that the GSM number is different, and this
information is already in the row names. This leaves
title (the cell line name),
source\_name\_ch1 (where the cell line came from),
characteristics\_ch1 (the organ location of the primary tumor),
characteristics\_ch1.1 (the tumor histology), and
characteristics\_ch1.2 (the histologic subtype, if applicable).
We extract these fields for our annotation table.

## 4.2 Grabbing Interesting Columns

Now we grab the columns of interest from each ExpressionSet,
convert them to character matrices, and bind them together
into a single object.


```r

annotBlock2 <- pData(phenoData(ccleFromGEO[[2]]))
annotBlock3 <- pData(phenoData(ccleFromGEO[[3]]))
annotBlock4 <- pData(phenoData(ccleFromGEO[[4]]))
keyColumns <- c("title", "source_name_ch1", "characteristics_ch1", "characteristics_ch1.1", 
    "characteristics_ch1.2")
allAnnot <- rbind(as.matrix(annotBlock1[, keyColumns]), as.matrix(annotBlock2[, 
    keyColumns]), as.matrix(annotBlock3[, keyColumns]), as.matrix(annotBlock4[, 
    keyColumns]))
dim(allAnnot)
```

```
## [1] 917   5
```

```r
allAnnot[1:3, ]
```

```
##           title    source_name_ch1 characteristics_ch1                   
## GSM886835 "1321N1" "ECACC"         "primary site: central_nervous_system"
## GSM886836 "143B"   "ATCC"          "primary site: bone"                  
## GSM886837 "22Rv1"  "ATCC"          "primary site: prostate"              
##           characteristics_ch1.1     characteristics_ch1.2            
## GSM886835 "histology: glioma"       "histology subtype1: astrocytoma"
## GSM886836 "histology: osteosarcoma" ""                               
## GSM886837 "histology: carcinoma"    ""
```


We have extracted the information desired.

## 5 Rearranging the Annotation in a Data Frame

While we have all of the information we want, it's not
yet arranged the way we want it. We'd prefer to use the
cell line names as row names, as opposed to the GEO ids,
and several parts of the text strings (e.g., "primary site:")
appear redundant.

Here we clean up the data and reorder the columns.


```r

GEO.ID <- rownames(allAnnot)
cellLineNames <- allAnnot[, "title"]
sourceName <- allAnnot[, "source_name_ch1"]
primarySite <- allAnnot[, "characteristics_ch1"]
histology <- allAnnot[, "characteristics_ch1.1"]
subtype <- allAnnot[, "characteristics_ch1.2"]

table(sourceName)
```

```
## sourceName
## Academic Lab         ATCC         DSMZ        ECACC        HSRRB 
##           10          423          209           48          112 
##         ICLC         KCLB     NCI/DCTD        RIKEN 
##            6           43            7           59
```

```r

table(substr(primarySite, 1, 14))
```

```
## 
## primary site:  
##            917
```

```r
primarySite <- substr(primarySite, 15, nchar(primarySite))

table(substr(histology, 1, 11))
```

```
## 
## histology:  
##         917
```

```r
histology <- substr(histology, 12, nchar(histology))

table(substr(subtype, 1, 20))
```

```
## 
##                      histology subtype1:  
##                  237                  680
```

```r
subtype <- substr(subtype, 21, nchar(subtype))

ccleClinical <- data.frame(GEO.ID = GEO.ID, sourceName = sourceName, primarySite = primarySite, 
    histology = histology, subtype = subtype, row.names = cellLineNames)

ccleClinical[1:3, ]
```

```
##           GEO.ID sourceName            primarySite    histology
## 1321N1 GSM886835      ECACC central_nervous_system       glioma
## 143B   GSM886836       ATCC                   bone osteosarcoma
## 22Rv1  GSM886837       ATCC               prostate    carcinoma
##            subtype
## 1321N1 astrocytoma
## 143B              
## 22Rv1
```


## 6 Saving RData and csv Files

Now we save the relevant information to an RData object
and to a csv file; the latter for use when we don't
trust our internet connection.


```r

save(ccleClinical, file = file.path("RDataObjects", "ccleClinical.RData"))

write.csv(ccleClinical, file = file.path("RawData", "CCLE", "Clinical", "ccleClinical.csv"))
```



## 7 Appendix

### 7.1 File Location


```r

getwd()
```

```
## [1] "\\\\mdadqsfs02/workspace/kabagg/RDPaper/Webpage/ResidualDisease"
```


### 7.2 SessionInfo


```r

sessionInfo()
```

```
## R version 2.15.3 (2013-03-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] GEOquery_2.24.1    Biobase_2.18.0     BiocGenerics_0.4.0
## [4] knitr_1.2         
## 
## loaded via a namespace (and not attached):
## [1] digest_0.6.3   evaluate_0.4.3 formatR_0.7    RCurl_1.95-4.1
## [5] stringr_0.6.2  tools_2.15.3   XML_3.96-1.1
```


## 8 References

> <p id="ccle12"> [1] Barretina J, Caponigro G, Stransky N, Venkatesan K et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. <em>Nature</em>, <b>483(7391)</b>:603-7, 2012. PMID: 22460905.</p>