by Keith A. Baggerly
We want to produce an RData file with the clinical (annotation) information for the cancer cell lines profiled as part of the Cancer Cell Line Encylcopedia (CCLE).
We use GEOquery to parse the annotation information for the 917 cell lines posted at the Gene Expression Omnibus (GEO) as part of GSE36133: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36133. We use GEOquery to extract the annotation information contained in the individual GSM files, including cell line name, GSM sample id, site of primary tumor, histology, and histological subtype (when applicable).
We save these results both as a data frame and a csv file.
We save ccleClinical to the RData file “ccleClinical.RData”, and also export the table to ccleClinical.csv in RawData.
We first load the options and libraries we will use in this report.
library(GEOquery)
Here we simply use the GEOquery package to download the annotation information (and posted quantifications) directly from GEO. Since the quantifications are based on a nonstandard CDF file, we prefer to build our own from the CEL files. Since the number of CEL files is large, GEO partitions the results into component series files – each contains info on at most 255 entries, so there are 4 files for the CCLE data.
d1 <- date()
ccleFromGEO <- getGEO("GSE36133")
d2 <- date()
c(d1, d2)
## [1] "Wed Jun 12 14:54:36 2013" "Wed Jun 12 14:55:22 2013"
length(ccleFromGEO)
## [1] 4
names(ccleFromGEO)
## [1] "GSE36133_series_matrix-1.txt.gz" "GSE36133_series_matrix-2.txt.gz"
## [3] "GSE36133_series_matrix-3.txt.gz" "GSE36133_series_matrix-4.txt.gz"
class(ccleFromGEO)
## [1] "list"
class(ccleFromGEO[[1]])
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
Obtaining the data takes about 30 seconds on my MacBook Pro using a high-speed home DSL connection. Judging timing here is a bit tricky, in that it relies on the speed of your internet connection as well as your computer's processing power. We now have a list of ExpressionSet objects to work with.
Since what we really want is the annotation, we need to extract the phenoData from each ExpressionSet and look at the pData from each phenoData object.
Before simply bundling the annotation across all files, we examine the results for a few files to see which fields are actually informative.
We first look at the information supplied for a single file.
annotBlock1 <- pData(phenoData(ccleFromGEO[[1]]))
dim(annotBlock1)
## [1] 255 37
colnames(annotBlock1)
## [1] "title" "geo_accession"
## [3] "status" "submission_date"
## [5] "last_update_date" "type"
## [7] "channel_count" "source_name_ch1"
## [9] "organism_ch1" "characteristics_ch1"
## [11] "characteristics_ch1.1" "characteristics_ch1.2"
## [13] "treatment_protocol_ch1" "growth_protocol_ch1"
## [15] "molecule_ch1" "extract_protocol_ch1"
## [17] "label_ch1" "label_protocol_ch1"
## [19] "taxid_ch1" "hyb_protocol"
## [21] "scan_protocol" "description"
## [23] "data_processing" "platform_id"
## [25] "contact_name" "contact_email"
## [27] "contact_laboratory" "contact_department"
## [29] "contact_institute" "contact_address"
## [31] "contact_city" "contact_state"
## [33] "contact_zip/postal_code" "contact_country"
## [35] "contact_web_link" "supplementary_file"
## [37] "data_row_count"
annotBlock1[1, ]
## title geo_accession status submission_date
## GSM886835 1321N1 GSM886835 Public on Mar 20 2012 Mar 06 2012
## last_update_date type channel_count source_name_ch1 organism_ch1
## GSM886835 Mar 20 2012 RNA 1 ECACC Homo sapiens
## characteristics_ch1 characteristics_ch1.1
## GSM886835 primary site: central_nervous_system histology: glioma
## characteristics_ch1.2 treatment_protocol_ch1
## GSM886835 histology subtype1: astrocytoma None
## growth_protocol_ch1
## GSM886835 Cells lines were cultured following growth recommendations from the cell line source vendors. Cells were incubated at 37 ºC at 5% CO2 until 70% confluency was reached. Pellets were harvested 48 hours post media change, flash frozen, and stored at -80 ºC until nucleic acid extraction.
## molecule_ch1
## GSM886835 total RNA
## extract_protocol_ch1
## GSM886835 RNA was isolated from frozen cell pellets, containing an average cell count of 4.5 million cells, via Trizol (Invitrogen) extraction following the manufacturers instructions. Samples were quantified using both the Nanodrop 8000 spectrophotometer (ThermoScientific) and via Agilent 2100 Bioanalyzer.
## label_ch1
## GSM886835 Biotin
## label_protocol_ch1
## GSM886835 The Affymetrix One-Cycle Target labeling assay was used to synthesize hybridization target using 1 mg of total RNA. Labeled cRNA hybridization target was produced using a 3-day, semi-automated workflow
## taxid_ch1
## GSM886835 9606
## hyb_protocol
## GSM886835 Fragmentation of the biotin-labeled cRNA prepared the sample for hybridization on the GeneChip Human U133 2.0 (PN 900467). The washing and staining was performed on the GeneChip Fluidics station using standard Affymetrix protocols outlined in the Expression Analysis Technical Manual, 2001, Affymetrix.
## scan_protocol
## GSM886835 The GeneChip® Scanner 3000 was used for scanning the arrays.
## description
## GSM886835 Gene expression data from the CCLE
## data_processing
## GSM886835 The data were analyzed with 2.14.0, using the packages affyio_1.22.0 and affy_1.32.0, and RMA as normalization method. rma() was used with default options (normalize=TRUE, background=TRUE) and with the package hgu133plus2hsentrezgcdf_15.0.0 from Brainarray for the CDF file.
## platform_id contact_name contact_email
## GSM886835 GPL15308 Nicolas,,Stransky stransky@broadinstitute.org
## contact_laboratory contact_department contact_institute
## GSM886835 Levi Garraway Cancer Program Broad Institute
## contact_address contact_city contact_state
## GSM886835 7 Cambridge center Cambridge MA
## contact_zip/postal_code contact_country
## GSM886835 02118 USA
## contact_web_link
## GSM886835 www.broadinstitute.org/ccle
## supplementary_file
## GSM886835 ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM886nnn/GSM886835/GSM886835.CEL.gz
## data_row_count
## GSM886835 18926
There's quite a bit of annotation here, but most of it isn't unique to the given cell line, and is thus of less interest to us. We compare annotations for the first two files to see which bits change.
annotBlock1[1, ] == annotBlock1[2, ]
## title geo_accession status submission_date last_update_date type
## GSM886835 FALSE FALSE TRUE TRUE TRUE TRUE
## channel_count source_name_ch1 organism_ch1 characteristics_ch1
## GSM886835 TRUE FALSE TRUE FALSE
## characteristics_ch1.1 characteristics_ch1.2
## GSM886835 FALSE FALSE
## treatment_protocol_ch1 growth_protocol_ch1 molecule_ch1
## GSM886835 TRUE TRUE TRUE
## extract_protocol_ch1 label_ch1 label_protocol_ch1 taxid_ch1
## GSM886835 TRUE TRUE TRUE TRUE
## hyb_protocol scan_protocol description data_processing
## GSM886835 TRUE TRUE TRUE TRUE
## platform_id contact_name contact_email contact_laboratory
## GSM886835 TRUE TRUE TRUE TRUE
## contact_department contact_institute contact_address
## GSM886835 TRUE TRUE TRUE
## contact_city contact_state contact_zip/postal_code
## GSM886835 TRUE TRUE TRUE
## contact_country contact_web_link supplementary_file
## GSM886835 TRUE TRUE FALSE
## data_row_count
## GSM886835 TRUE
sum(annotBlock1[1, ] != annotBlock1[2, ])
## [1] 7
There are 7 fields whose values change, but two of these (geo_accession and supplementary_file) reflect the fact that the GSM number is different, and this information is already in the row names. This leaves title (the cell line name), source_name_ch1 (where the cell line came from), characteristics_ch1 (the organ location of the primary tumor), characteristics_ch1.1 (the tumor histology), and characteristics_ch1.2 (the histologic subtype, if applicable). We extract these fields for our annotation table.
Now we grab the columns of interest from each ExpressionSet, convert them to character matrices, and bind them together into a single object.
annotBlock2 <- pData(phenoData(ccleFromGEO[[2]]))
annotBlock3 <- pData(phenoData(ccleFromGEO[[3]]))
annotBlock4 <- pData(phenoData(ccleFromGEO[[4]]))
keyColumns <- c("title", "source_name_ch1", "characteristics_ch1", "characteristics_ch1.1",
"characteristics_ch1.2")
allAnnot <- rbind(as.matrix(annotBlock1[, keyColumns]), as.matrix(annotBlock2[,
keyColumns]), as.matrix(annotBlock3[, keyColumns]), as.matrix(annotBlock4[,
keyColumns]))
dim(allAnnot)
## [1] 917 5
allAnnot[1:3, ]
## title source_name_ch1 characteristics_ch1
## GSM886835 "1321N1" "ECACC" "primary site: central_nervous_system"
## GSM886836 "143B" "ATCC" "primary site: bone"
## GSM886837 "22Rv1" "ATCC" "primary site: prostate"
## characteristics_ch1.1 characteristics_ch1.2
## GSM886835 "histology: glioma" "histology subtype1: astrocytoma"
## GSM886836 "histology: osteosarcoma" ""
## GSM886837 "histology: carcinoma" ""
We have extracted the information desired.
While we have all of the information we want, it's not yet arranged the way we want it. We'd prefer to use the cell line names as row names, as opposed to the GEO ids, and several parts of the text strings (e.g., “primary site:”) appear redundant.
Here we clean up the data and reorder the columns.
GEO.ID <- rownames(allAnnot)
cellLineNames <- allAnnot[, "title"]
sourceName <- allAnnot[, "source_name_ch1"]
primarySite <- allAnnot[, "characteristics_ch1"]
histology <- allAnnot[, "characteristics_ch1.1"]
subtype <- allAnnot[, "characteristics_ch1.2"]
table(sourceName)
## sourceName
## Academic Lab ATCC DSMZ ECACC HSRRB
## 10 423 209 48 112
## ICLC KCLB NCI/DCTD RIKEN
## 6 43 7 59
table(substr(primarySite, 1, 14))
##
## primary site:
## 917
primarySite <- substr(primarySite, 15, nchar(primarySite))
table(substr(histology, 1, 11))
##
## histology:
## 917
histology <- substr(histology, 12, nchar(histology))
table(substr(subtype, 1, 20))
##
## histology subtype1:
## 237 680
subtype <- substr(subtype, 21, nchar(subtype))
ccleClinical <- data.frame(GEO.ID = GEO.ID, sourceName = sourceName, primarySite = primarySite,
histology = histology, subtype = subtype, row.names = cellLineNames)
ccleClinical[1:3, ]
## GEO.ID sourceName primarySite histology
## 1321N1 GSM886835 ECACC central_nervous_system glioma
## 143B GSM886836 ATCC bone osteosarcoma
## 22Rv1 GSM886837 ATCC prostate carcinoma
## subtype
## 1321N1 astrocytoma
## 143B
## 22Rv1
Now we save the relevant information to an RData object and to a csv file; the latter for use when we don't trust our internet connection.
save(ccleClinical, file = file.path("RDataObjects", "ccleClinical.RData"))
write.csv(ccleClinical, file = file.path("RawData", "CCLE", "Clinical", "ccleClinical.csv"))
getwd()
## [1] "\\\\mdadqsfs02/workspace/kabagg/RDPaper/Webpage/ResidualDisease"
sessionInfo()
## R version 2.15.3 (2013-03-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GEOquery_2.24.1 Biobase_2.18.0 BiocGenerics_0.4.0
## [4] knitr_1.2
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.3 evaluate_0.4.3 formatR_0.7 RCurl_1.95-4.1
## [5] stringr_0.6.2 tools_2.15.3 XML_3.96-1.1
[1] Barretina J, Caponigro G, Stransky N, Venkatesan K et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603-7, 2012. PMID: 22460905.