Assembling Clinical Information for the TCGA Ovarian Data
=========================================================

by Keith A. Baggerly

## 1 Executive Summary

### 1.1 Introduction

We want to produce an RData file with the clinical information for the ovarian cancer samples profiled by TCGA.

### 1.2 Methods

We acquired the gzipped tarball containing the biotab clinical information from the open access TCGA http page, [https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/bcr/biotab/clin/](https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/bcr/biotab/clin/), on Sep 14, 2012.

We load the "clinical\_patient\_ov" information into a data frame, and construct an R "Surv" object for overall survival. We also construct a binary indicator vector for the presence or absence of residual disease (RD).

### 1.3 Results

We save tcgaClinical, tcgaOSYrs, and tcgaRD to the RData file "tcgaClinical.RData".

## 2 Libraries

We first load the libraries we will use in this report.

```r
library(survival)
```

## 3 Specifying the Raw Data Location

Here, we specify the location of the data we acquired from TCGA on our local system. You will need to acquire these files and adjust this path before running this report yourself.

```r
pathToTCGAData <- file.path("RawData", "TCGA", "Clinical")
```

## 4 Loading the Data

Here we simply load the table of clinical information.
```r
tcgaClinical <- read.table(file.path(pathToTCGAData, "clinical_patient_ov.txt"),
    header = TRUE, sep = "\t", row.names = 1)
dim(tcgaClinical)
```

```
## [1] 576  36
```

```r
tcgaClinical[1, ]
```

```
##              age_at_initial_pathologic_diagnosis
## TCGA-04-1331                                  78
##              anatomic_organ_subdivision
## TCGA-04-1331            [Not Available]
##                                  bcr_patient_uuid date_of_form_completion
## TCGA-04-1331 6d10d4ee-6331-4bba-93bc-a7b64cc0b22a              2009-03-26
##              date_of_initial_pathologic_diagnosis days_to_birth
## TCGA-04-1331                           2004-00-00        -28848
##              days_to_death days_to_initial_pathologic_diagnosis
## TCGA-04-1331          1336                                    0
##              days_to_last_followup eastern_cancer_oncology_group
## TCGA-04-1331                  1224               [Not Available]
##                           ethnicity gender gynecologic_figo_staging_system
## TCGA-04-1331 NOT HISPANIC OR LATINO FEMALE                 [Not Available]
##                      histological_type          icd_10 icd_o_3_histology
## TCGA-04-1331 Serous Cystadenocarcinoma [Not Available]            8441/3
##              icd_o_3_site informed_consent_verified
## TCGA-04-1331        C56.9                       YES
##              initial_pathologic_diagnosis_method   jewish_origin
## TCGA-04-1331                     [Not Available] [Not Available]
##              karnofsky_performance_score lymphatic_invasion
## TCGA-04-1331             [Not Available]                YES
##              neoplasm_histologic_grade patient_id
## TCGA-04-1331                        G3       1331
##              performance_status_scale_timing person_neoplasm_cancer_status
## TCGA-04-1331                 [Not Available]                    WITH TUMOR
##              pretreatment_history  race  residual_tumor tissue_source_site
## TCGA-04-1331                   NO WHITE [Not Available]                  4
##              tumor_histologic_subtype tumor_residual_disease tumor_stage
## TCGA-04-1331       Cystadenocarcinoma                1-10 mm        IIIC
##              tumor_tissue_site venous_invasion vital_status
## TCGA-04-1331             OVARY              NO     DECEASED
```

## 5 Defining Overall Survival

Next, we define an R "Surv" object for overall survival (OS). We begin by looking at the recorded values for patient status.
```r
table(tcgaClinical[, "vital_status"])
```

```
##
## [Not Available]        DECEASED          LIVING
##               4             297             275
```

```r
tcgaClinical[1:15, "days_to_death"]
```

```
##  [1] 1336             1247             55               [Not Applicable]
##  [5] 61               [Not Applicable] [Not Applicable] 563
##  [9] 361              [Not Applicable] [Not Applicable] 1483
## [13] 656              1946             [Not Applicable]
## 275 Levels: [Not Applicable] [Not Available] 1000 1003 1007 1013 ... 976
```

```r
tcgaClinical[1:15, "days_to_last_followup"]
```

```
##  [1] 1224            1247            55              1495
##  [5] 61              1418            [Not Available] 563
##  [9] 361             1992            1918            1483
## [13] 656             1946            1991
## 488 Levels: [Not Available] 0 1004 1007 1011 1013 1018 1024 1025 ... 999
```

```r
tcgaClinical[1:15, "vital_status"]
```

```
##  [1] DECEASED DECEASED DECEASED LIVING   DECEASED LIVING   LIVING
##  [8] DECEASED DECEASED LIVING   LIVING   DECEASED DECEASED DECEASED
## [15] LIVING
## Levels: [Not Available] DECEASED LIVING
```

Vital status is available for all but 4 of the 576 patients. Checking the times available shows that days to death can exceed days to last followup (entry 1), and that days to death is not available for patients still living, so we should use days to death for deceased patients and days to last followup for living ones.

These conclusions are based on a small sample of the data. We now perform more extensive sanity checks to verify them.

```r
# The columns were read as factors, so convert via character.
# The bracketed placeholders become NA, which triggers the warnings below.
daysToDeath <- as.numeric(as.character(tcgaClinical[, "days_to_death"]))
```

```
## Warning: NAs introduced by coercion
```

```r
daysToLastFollowup <- as.numeric(as.character(tcgaClinical[, "days_to_last_followup"]))
```

```
## Warning: NAs introduced by coercion
```

```r
vitalStatus <- as.character(tcgaClinical[, "vital_status"])
summary(daysToDeath - daysToLastFollowup)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     0.0     0.0     0.0    24.2     0.0  1200.0     279
```

```r
summary(daysToDeath[vitalStatus == "LIVING"])
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##      NA      NA      NA     NaN      NA      NA     275
```

```r
daysToDeath[vitalStatus == "[Not Available]"]
```

```
## [1] NA NA NA NA
```

```r
daysToLastFollowup[vitalStatus == "[Not Available]"]
```

```
## [1] NA NA  0 NA
```

The sanity checks pass: where both times are recorded, days to death is never less than days to last followup; no living patient has a recorded death time; and the 4 patients with unknown vital status have no death time and at most a followup time of 0. We now assemble the Surv object.

```r
# Time to event: last followup for living patients, death for deceased ones.
# Patients with unknown vital status keep NA.
daysToEvent <- rep(NA, nrow(tcgaClinical))
daysToEvent[vitalStatus == "LIVING"] <- daysToLastFollowup[vitalStatus == "LIVING"]
daysToEvent[vitalStatus == "DECEASED"] <- daysToDeath[vitalStatus == "DECEASED"]

# Event status: living patients are censored, deaths are observed events.
eventStatus <- rep(NA, nrow(tcgaClinical))
eventStatus[vitalStatus == "LIVING"] <- "Censored"
eventStatus[vitalStatus == "DECEASED"] <- "Uncensored"

# Overall survival in years.
tcgaOSYrs <- Surv(daysToEvent/365, eventStatus == "Uncensored")
rownames(tcgaOSYrs) <- rownames(tcgaClinical)
```

## 6 Defining a Residual Disease Indicator

Now we summarize the Residual Disease (RD) information into a single indicator vector specifying if there is any RD ("RD") or no RD ("No RD"). We begin by tabulating the information we have.

```r
table(tcgaClinical[, "tumor_residual_disease"])
```

```
##
##        [Not Available]                 >20 mm                1-10 mm
##                     62                    104                    254
##               11-20 mm No Macroscopic disease
##                     38                    118
```

We now define the indicator.

```r
# Any measurable residual tumor counts as "RD"; "[Not Available]" stays NA.
tcgaRD <- rep(NA, nrow(tcgaClinical))
tcgaRD[tcgaClinical[, "tumor_residual_disease"] == ">20 mm"] <- "RD"
tcgaRD[tcgaClinical[, "tumor_residual_disease"] == "1-10 mm"] <- "RD"
tcgaRD[tcgaClinical[, "tumor_residual_disease"] == "11-20 mm"] <- "RD"
tcgaRD[tcgaClinical[, "tumor_residual_disease"] == "No Macroscopic disease"] <- "No RD"
table(tcgaRD)
```

```
## tcgaRD
## No RD    RD
##   118   396
```

```r
names(tcgaRD) <- rownames(tcgaClinical)
```

## 7 Saving RData

Now we save the relevant information to an RData object.
```r
save(tcgaClinical, tcgaOSYrs, tcgaRD, file = file.path("RDataObjects", "tcgaClinical.RData"))
```

## 8 Appendix

### 8.1 File Location

```r
getwd()
```

```
## [1] "\\\\mdadqsfs02/workspace/kabagg/RDPaper/Webpage/ResidualDisease"
```

### 8.2 Session Info

```r
sessionInfo()
```

```
## R version 2.15.3 (2013-03-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] splines   stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
## [1] survival_2.37-4 knitr_1.2
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.3   evaluate_0.4.3 formatR_0.7    stringr_0.6.2
## [5] tools_2.15.3
```
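
### 8.3 Toy Example of the Assembly Logic

For readers without access to the TCGA files, the rules used in Sections 5 and 6 can be tried on a small made-up table. This is only a sketch: the data frame `toy`, its four patients, and the helper `toNumeric` are hypothetical and not part of the report's pipeline, and the vectorized `ifelse` form is an alternative way of writing the same rules rather than the code used above.

```r
library(survival)

# Hypothetical stand-in for the TCGA table; every value here is invented.
toy <- data.frame(
  vital_status = c("DECEASED", "LIVING", "[Not Available]", "DECEASED"),
  days_to_death = c("730", "[Not Applicable]", "[Not Available]", "365"),
  days_to_last_followup = c("700", "1095", "[Not Available]", "365"),
  tumor_residual_disease = c("1-10 mm", "No Macroscopic disease",
                             "[Not Available]", ">20 mm"),
  row.names = c("P1", "P2", "P3", "P4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: bracketed placeholders become NA; the coercion
# warning is suppressed deliberately because those NAs are expected.
toNumeric <- function(x) suppressWarnings(as.numeric(as.character(x)))
daysToDeath <- toNumeric(toy$days_to_death)
daysToLastFollowup <- toNumeric(toy$days_to_last_followup)

# Death time for deceased patients, last followup for living ones, NA otherwise.
daysToEvent <- ifelse(toy$vital_status == "DECEASED", daysToDeath,
               ifelse(toy$vital_status == "LIVING", daysToLastFollowup, NA))

# TRUE = death observed, FALSE = censored, NA = vital status unknown.
died <- toy$vital_status == "DECEASED"
died[toy$vital_status == "[Not Available]"] <- NA

toyOSYrs <- Surv(daysToEvent / 365, died)
rownames(toyOSYrs) <- rownames(toy)

# Any measurable residual tumor is "RD"; missing values stay NA.
rdLevels <- c(">20 mm", "1-10 mm", "11-20 mm")
toyRD <- ifelse(toy$tumor_residual_disease %in% rdLevels, "RD",
         ifelse(toy$tumor_residual_disease == "No Macroscopic disease",
                "No RD", NA))
names(toyRD) <- rownames(toy)
```

Patient P1 mirrors entry 1 of the real data, with a death time later than the last followup, so the death time is the one used. Objects built this way can be passed to functions in the survival package such as `survfit()`.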