Assembling Clinical Information for the CCLE Data
=================================================

by Keith A. Baggerly

## 1 Executive Summary

### 1.1 Introduction

We want to produce an RData file with the clinical (annotation) information for the cancer cell lines profiled as part of the Cancer Cell Line Encyclopedia [(CCLE)](#ccle12).

### 1.2 Methods

We use GEOquery to parse the annotation information for the 917 cell lines posted at the Gene Expression Omnibus (GEO) as part of GSE36133: [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36133](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36133).

We extract the annotation information contained in the individual GSM files, including cell line name, GSM sample id, site of primary tumor, histology, and histological subtype (when applicable).

We save these results both as a data frame and a csv file.

### 1.3 Results

We save ccleClinical to the RData file "ccleClinical.RData" in RDataObjects, and also export the table to ccleClinical.csv in RawData/CCLE/Clinical.

## 2 Libraries

We first load the libraries we will use in this report.

```{r libraries, message=FALSE}
library(GEOquery)
```

## 3 Loading the Data

Here we simply use the GEOquery package to download the annotation information (and posted quantifications) directly from GEO. Since the quantifications are based on a nonstandard CDF file, we prefer to build our own from the CEL files. Since the number of CEL files is large, GEO partitions the results into component series files; each contains info on at most 255 entries, so there are 4 files for the CCLE data.

```{r loadCCLEClinical, message=FALSE}
# Record wall-clock time before and after the download
d1 <- date()
ccleFromGEO <- getGEO("GSE36133")
d2 <- date()
c(d1,d2)

# Inspect what getGEO returned
length(ccleFromGEO)
names(ccleFromGEO)
class(ccleFromGEO)
class(ccleFromGEO[[1]])
```

Obtaining the data takes about 30 seconds on my MacBook Pro using a high-speed home DSL connection.
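
The two `date()` stamps give start and end times as strings; elapsed time can also be measured directly with `system.time()`. The following is only a minimal sketch: `slowStep()` is a hypothetical stand-in for the download, used so that the example runs without a network connection.

```r
# Hypothetical stand-in for a slow step such as a download;
# slowStep() is not part of the original analysis.
slowStep <- function() {
  Sys.sleep(0.2)
  "done"
}

# system.time() evaluates the expression and reports user, system,
# and elapsed (wall-clock) time in seconds.
timing <- system.time(result <- slowStep())
timing[["elapsed"]]
```

Elapsed time is the figure most affected by connection speed, which is why it varies from run to run.
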
Judging timing here is a bit tricky, in that it depends on the speed of your internet connection as well as your computer's processing power.

We now have a list of ExpressionSet objects to work with.

## 4 Extracting the Annotation

Since what we really want is the annotation, we need to extract the phenoData from each ExpressionSet and look at the pData from each phenoData object.

### 4.1 Identifying Annotation Fields of Interest

Before simply bundling the annotation across all files, we examine the results for a few files to see which fields are actually informative.

We first look at the information supplied for a single file.

```{r examineFirstFile, message=FALSE}
# Annotation table for the first series file; one row per GSM sample
annotBlock1 <- pData(phenoData(ccleFromGEO[[1]]))
dim(annotBlock1)
colnames(annotBlock1)
annotBlock1[1,]
```

There's quite a bit of annotation here, but most of it isn't unique to the given cell line, and is thus of less interest to us. We compare annotations for the first two files to see which bits change.

```{r compareFirstTwoFiles, message=FALSE}
# Field-by-field comparison of the first two samples
annotBlock1[1,]==annotBlock1[2,]
# Number of fields whose values differ
sum(annotBlock1[1,]!=annotBlock1[2,])
```

There are 7 fields whose values change, but two of these (geo\_accession and supplementary\_file) reflect the fact that the GSM number is different, and this information is already in the row names. This leaves

* title (the cell line name),
* source\_name\_ch1 (where the cell line came from),
* characteristics\_ch1 (the organ location of the primary tumor),
* characteristics\_ch1.1 (the tumor histology), and
* characteristics\_ch1.2 (the histologic subtype, if applicable).

We extract these fields for our annotation table.

### 4.2 Grabbing Interesting Columns

Now we grab the columns of interest from each ExpressionSet, convert them to character matrices, and bind them together into a single object.

```{r grabAndBind, message=FALSE}
# Annotation tables for the remaining three series files
annotBlock2 <- pData(phenoData(ccleFromGEO[[2]]))
annotBlock3 <- pData(phenoData(ccleFromGEO[[3]]))
annotBlock4 <- pData(phenoData(ccleFromGEO[[4]]))

keyColumns <- c("title","source_name_ch1","characteristics_ch1",
                "characteristics_ch1.1","characteristics_ch1.2")

# Stack the key columns from all four blocks into one character matrix
allAnnot <- rbind(as.matrix(annotBlock1[,keyColumns]),
                  as.matrix(annotBlock2[,keyColumns]),
                  as.matrix(annotBlock3[,keyColumns]),
                  as.matrix(annotBlock4[,keyColumns]))
dim(allAnnot)
allAnnot[1:3,]
```

We have extracted the information desired.

## 5 Rearranging the Annotation in a Data Frame

While we have all of the information we want, it's not yet arranged the way we want it. We'd prefer to use the cell line names as row names, as opposed to the GEO ids, and several parts of the text strings (e.g., "primary site:") appear redundant. Here we clean up the data and reorder the columns.

```{r cleanColumns, message=FALSE}
GEO.ID <- rownames(allAnnot)
cellLineNames <- allAnnot[,"title"]
sourceName    <- allAnnot[,"source_name_ch1"]
primarySite   <- allAnnot[,"characteristics_ch1"]
histology     <- allAnnot[,"characteristics_ch1.1"]
subtype       <- allAnnot[,"characteristics_ch1.2"]

table(sourceName)

# For each field, tabulate the leading label to check that it is
# constant, then strip it off
table(substr(primarySite,1,14))
primarySite <- substr(primarySite,15,nchar(primarySite))

table(substr(histology,1,11))
histology <- substr(histology,12,nchar(histology))

table(substr(subtype,1,20))
subtype <- substr(subtype,21,nchar(subtype))

ccleClinical <- data.frame(GEO.ID=GEO.ID,
                           sourceName=sourceName,
                           primarySite=primarySite,
                           histology=histology,
                           subtype=subtype,
                           row.names=cellLineNames)
ccleClinical[1:3,]
```

## 6 Saving RData and csv Files

Now we save the relevant information to an RData object and to a csv file; the latter for use when we don't trust our internet connection.

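
Note that `save()` and `write.csv()` do not create missing directories, so the output folders must exist before the files are written. The following is a minimal sketch of one way to ensure that. `baseDir` is a hypothetical root chosen so the example has no side effects; the report itself uses paths relative to the working directory.

```r
# baseDir is a hypothetical root used only for illustration; the report
# writes to paths relative to the working directory instead.
baseDir <- tempdir()

outDirs <- c(file.path(baseDir, "RDataObjects"),
             file.path(baseDir, "RawData", "CCLE", "Clinical"))

# Create each directory (and any missing parents) if it is not there yet
for (d in outDirs) {
  if (!dir.exists(d)) {
    dir.create(d, recursive = TRUE)
  }
}

all(dir.exists(outDirs))
```
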
```{r saveCCLEClinical, message=FALSE}
save(ccleClinical,
     file=file.path("RDataObjects","ccleClinical.RData"))
write.csv(ccleClinical,
          file=file.path("RawData","CCLE","Clinical","ccleClinical.csv"))
```

## 7 Appendix

### 7.1 File Location

```{r getLocation}
getwd()
```

### 7.2 SessionInfo

```{r sessionInfo}
sessionInfo()
```

## 8 References

[1] Barretina J, Caponigro G, Stransky N, Venkatesan K et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603-7, 2012. PMID: 22460905.