Assembling an RMA Quantification Matrix for the CCLE Data
========================================================

Keith A. Baggerly

## 1 Executive Summary

### 1.1 Introduction

We want to produce an RData file with a matrix of RMA expression values for the cancer cell lines profiled as part of the Cancer Cell Line Encyclopedia [(CCLE)](#ccle12) on Affymetrix U133+2 arrays.

### 1.2 Methods

We acquired a tarball of the 917 gzipped CEL files from the Gene Expression Omnibus (GEO) page for GSE36133, [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36133](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36133), on Sep 14, 2012. (Warning: this file is over 4 GB, so it may not download properly to a 32-bit machine.)

We used justRMA to compute RMA fits, and used our previously assembled clinical information to map the GEO GSM ids to sample ids, which we then used as the column (sample) names.

### 1.3 Results

We save ccleExpression to the RData file "ccleExpression.RData".

## 2 Libraries

We first load the libraries we will use in this report.

```r
library(affy)
library(hgu133plus2cdf)
```

## 3 Loading Clinical Information

Next, we load our previously assembled clinical information.

```r
load(file.path("RDataObjects", "ccleClinical.RData"))
```

## 4 Specifying the Raw Data Location

Here, we specify the location of the data we acquired from GEO on our local system. You will need to acquire these files and adjust this path before running this report yourself.

```r
pathToCCLEData <- file.path("RawData", "CCLE", "CEL_Files")
```

## 5 Quantifying The CEL Files

First, we specify the CEL file paths in a character vector for passing to justRMA.

```r
# List the CEL files (names begin with the GEO GSM id) and build full paths
celFileNames <- dir(pathToCCLEData, pattern = "^GSM")
celFilePaths <- file.path(pathToCCLEData, celFileNames)
```

Now we use justRMA to summarize expression at the probeset level.
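Because the quantification is a long run, it may be worth confirming first that the file listing looks as expected. The following is a minimal sketch on toy input, not part of the original analysis: `checkCelFileNames` and `toyNames` are hypothetical names introduced here, and the check assumes the files are named `<GSM id>.CEL.gz`, as the column names in this report suggest.

```r
# Hypothetical helper (not part of the original report): check that a vector
# of CEL file names has the expected length, that every name ends in
# ".CEL.gz", and that no GSM id appears twice.
checkCelFileNames <- function(celFileNames, expectedN) {
  gsmIds <- sub("\\.CEL\\.gz$", "", celFileNames)
  list(
    countOK  = length(celFileNames) == expectedN,
    suffixOK = all(grepl("\\.CEL\\.gz$", celFileNames)),
    uniqueOK = !any(duplicated(gsmIds))
  )
}

# Toy input standing in for dir(pathToCCLEData, pattern = "^GSM")
toyNames <- c("GSM886835.CEL.gz", "GSM886836.CEL.gz", "GSM886837.CEL.gz")
toyCheck <- checkCelFileNames(toyNames, expectedN = 3)
stopifnot(all(unlist(toyCheck)))
```

On the real listing, the corresponding call would be `checkCelFileNames(celFileNames, expectedN = 917)`, with 917 taken from the Methods above. With the listing confirmed, the quantification itself follows.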
```r
d1 <- date()  # start time
ccleExpression <- justRMA(filenames = celFilePaths, compress = TRUE)
# Extract the probeset-by-sample expression matrix from the ExpressionSet
ccleExpression <- exprs(ccleExpression)
d2 <- date()  # end time
```

```r
c(d1, d2)
```

```
## [1] "Thu Jun 13 07:53:42 2013" "Thu Jun 13 08:08:35 2013"
```

```r
dim(ccleExpression)
```

```
## [1] 54675   917
```

```r
ccleExpression[1:3, 1:3]
```

```
##           GSM886835.CEL.gz GSM886836.CEL.gz GSM886837.CEL.gz
## 1007_s_at            8.400            7.699           10.638
## 1053_at             10.062            9.331           10.577
## 117_at               4.257            3.966            3.905
```

The run shown above took about 15 minutes on the Linux machine described in the appendix; the same justRMA computation takes about 40 minutes on my MacBook Pro. The sheer volume of the data makes this challenging.

## 6 Mapping CEL Names to Sample IDs

We now use the clinical information to replace the GEO GSM ids with the sample ids in the column names.

```r
# The first 9 characters of each column name are the GSM id
tempClinRows <- match(substr(colnames(ccleExpression), 1, 9),
                      as.character(ccleClinical[, "GEO.ID"]))
tempNames <- rownames(ccleClinical)[tempClinRows]
ccleClinical[tempNames[1:3], ]
```

```
##           GEO.ID sourceName            primarySite    histology
## 1321N1 GSM886835      ECACC central_nervous_system       glioma
## 143B   GSM886836       ATCC                   bone osteosarcoma
## 22Rv1  GSM886837       ATCC               prostate    carcinoma
##            subtype
## 1321N1 astrocytoma
## 143B
## 22Rv1
```

```r
colnames(ccleExpression)[1:3]
```

```
## [1] "GSM886835.CEL.gz" "GSM886836.CEL.gz" "GSM886837.CEL.gz"
```

```r
colnames(ccleExpression) <- tempNames
ccleExpression[1:3, 1:3]
```

```
##           1321N1   143B  22Rv1
## 1007_s_at  8.400  7.699 10.638
## 1053_at   10.062  9.331 10.577
## 117_at     4.257  3.966  3.905
```

## 7 Saving RData

Now we save the relevant information to an RData object.
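Before saving, it may be worth confirming that the renaming step matched every column, since `match()` returns `NA` for any id it cannot find and `substr(., 1, 9)` assumes 9-character GSM ids. The following is a minimal sketch on toy data, not part of the original analysis: `toyClinical`, `toyExpression`, `toyRows`, `toySampleNames`, and `mappingOK` are hypothetical stand-ins for the real objects.

```r
# Toy stand-ins (hypothetical, not the real CCLE objects)
toyClinical <- data.frame(
  GEO.ID = c("GSM886835", "GSM886836", "GSM886837"),
  row.names = c("1321N1", "143B", "22Rv1"),
  stringsAsFactors = FALSE
)
toyExpression <- matrix(
  c(8.400, 10.062, 7.699, 9.331, 10.638, 10.577),
  nrow = 2,
  dimnames = list(
    c("1007_s_at", "1053_at"),
    c("GSM886835.CEL.gz", "GSM886836.CEL.gz", "GSM886837.CEL.gz")
  )
)

# Same mapping as in the report: first 9 characters of each column name
toyRows <- match(substr(colnames(toyExpression), 1, 9),
                 as.character(toyClinical[, "GEO.ID"]))
toySampleNames <- rownames(toyClinical)[toyRows]

# Checks: no unmatched ids, and no two columns mapped to the same sample
mappingOK <- !any(is.na(toyRows)) && !any(duplicated(toySampleNames))
stopifnot(mappingOK)
colnames(toyExpression) <- toySampleNames
```

The same two checks could be applied to `tempClinRows` and `tempNames` from the previous section. With the mapping confirmed, the matrix can be written out.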
```r
save(ccleExpression, file = file.path("RDataObjects", "ccleExpression.RData"))
```

## 8 Appendix

### 8.1 File Location

```r
getwd()
```

```
## [1] "/workspace/kabagg/RDPaper/Webpage/ResidualDisease"
```

### 8.2 SessionInfo

```r
sessionInfo()
```

```
## R version 2.15.1 (2012-06-22)
## Platform: x86_64-unknown-linux-gnu (64-bit)
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=C                 LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
## [1] hgu133plus2cdf_2.10.0 AnnotationDbi_1.22.5  affy_1.34.0
## [4] Biobase_2.16.0        BiocGenerics_0.6.0    markdown_0.5.3
## [7] knitr_0.9
##
## loaded via a namespace (and not attached):
##  [1] affyio_1.24.0         BiocInstaller_1.4.9   DBI_0.2-6
##  [4] digest_0.6.3          evaluate_0.4.3        formatR_0.7
##  [7] IRanges_1.18.0        preprocessCore_1.18.0 RSQLite_0.11.3
## [10] stats4_2.15.1         stringr_0.6.2         tools_2.15.1
## [13] zlibbioc_1.2.0
```

## 9 References

[1] Barretina J, Caponigro G, Stransky N, Venkatesan K et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603-7, 2012. PMID: 22460905.