Assembling Clinical Information for the TCGA Ovarian Data
=========================================================

by Keith A. Baggerly

## 1 Executive Summary

### 1.1 Introduction

We want to produce an RData file with the clinical information for the ovarian cancer samples profiled by TCGA.

### 1.2 Methods

We acquired the gzipped tarball containing the biotab clinical information from the open access TCGA http page, [https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/bcr/biotab/clin/](https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/bcr/biotab/clin/), on Sep 14, 2012.

We load the "clinical\_patient\_ov" information into a data frame, and construct an R "Surv" object for overall survival. We also construct a binary indicator vector for the presence or absence of residual disease (RD).

### 1.3 Results

We save tcgaClinical, tcgaOSYrs, and tcgaRD to the RData file "tcgaClinical.RData".

## 2 Libraries

We first load the libraries we will use in this report.

```{r libraries, message=FALSE}
library(survival)
```

## 3 Specifying the Raw Data Location

Here, we specify the location of the data we acquired from TCGA on our local system. You will need to acquire these files and adjust this path before running this report yourself.

```{r pathToTCGAData}
pathToTCGAData <- file.path("RawData","TCGA","Clinical")
```

## 4 Loading the Data

Here we simply load the table of clinical information.

```{r loadTCGAClinical}
tcgaClinical <-
  read.table(file.path(pathToTCGAData,"clinical_patient_ov.txt"),
             header=TRUE, sep="\t", row.names=1)
dim(tcgaClinical)
tcgaClinical[1,]
```

## 5 Defining Overall Survival

Next, we define an R "Surv" object for overall survival (OS). We begin by looking at the recorded values for patient status.

```{r examineStatus}
table(tcgaClinical[,"vital_status"])
tcgaClinical[1:15,"days_to_death"]
tcgaClinical[1:15,"days_to_last_followup"]
tcgaClinical[1:15,"vital_status"]
```

Vital status is available for almost all of the 576 patients. Checking the available times shows that days to death can exceed days to last followup (entry 1), and that days to death is not available for patients still living. We therefore use days to death for deceased patients and days to last followup for living ones.

The above conclusions are based on a small sampling of the data. We perform more extensive sanity checks here to verify them.

```{r sanityChecks}
# Entries that are not numeric become NA when coerced
daysToDeath <-
  as.numeric(as.character(tcgaClinical[,"days_to_death"]))
daysToLastFollowup <-
  as.numeric(as.character(tcgaClinical[,"days_to_last_followup"]))
vitalStatus <- as.character(tcgaClinical[,"vital_status"])

# How does days to death compare with days to last followup?
summary(daysToDeath-daysToLastFollowup)
# Is days to death ever recorded for living patients?
summary(daysToDeath[vitalStatus=="LIVING"])
# Are any times recorded for patients with unknown vital status?
daysToDeath[vitalStatus=="[Not Available]"]
daysToLastFollowup[vitalStatus=="[Not Available]"]
```

The sanity checks pass. We now assemble the Surv object.

```{r assembleSurv}
# Time to event: last followup if living, death if deceased, NA otherwise
daysToEvent <- rep(NA, nrow(tcgaClinical))
daysToEvent[vitalStatus=="LIVING"] <-
  daysToLastFollowup[vitalStatus=="LIVING"]
daysToEvent[vitalStatus=="DECEASED"] <-
  daysToDeath[vitalStatus=="DECEASED"]

# Event status: living patients are censored, deaths are observed events
eventStatus <- rep(NA, nrow(tcgaClinical))
eventStatus[vitalStatus=="LIVING"] <- "Censored"
eventStatus[vitalStatus=="DECEASED"] <- "Uncensored"

# Convert days to years when building the Surv object
tcgaOSYrs <- Surv(daysToEvent/365,eventStatus=="Uncensored")
rownames(tcgaOSYrs) <- rownames(tcgaClinical)
```

## 6 Defining a Residual Disease Indicator

Now we summarize the Residual Disease (RD) information into a single indicator vector specifying whether there is any RD ("RD") or no RD ("No RD"). We begin by tabulating the information we have.

```{r tableRDStatus}
table(tcgaClinical[,"tumor_residual_disease"])
```

We now define the indicator.

```{r specifyRDIndicator}
# Any of the three size categories counts as "RD";
# values not matched below remain NA
tcgaRD <- rep(NA,nrow(tcgaClinical))
tcgaRD[tcgaClinical[,"tumor_residual_disease"]==">20 mm"] <- "RD"
tcgaRD[tcgaClinical[,"tumor_residual_disease"]=="1-10 mm"] <- "RD"
tcgaRD[tcgaClinical[,"tumor_residual_disease"]=="11-20 mm"] <- "RD"
tcgaRD[tcgaClinical[,"tumor_residual_disease"]=="No Macroscopic disease"] <- "No RD"
table(tcgaRD)
names(tcgaRD) <- rownames(tcgaClinical)
```

## 7 Saving RData

Now we save the relevant information to an RData file.

```{r saveTCGAClinical}
save(tcgaClinical,
     tcgaOSYrs,
     tcgaRD,
     file=file.path("RDataObjects","tcgaClinical.RData"))
```

## 8 Appendix

### 8.1 File Location

```{r getLocation}
getwd()
```

### 8.2 Session Info

```{r sessionInfo}
sessionInfo()
```
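
### 8.3 Toy Check of the Survival Coding

The rule used in Section 5 (days to death for deceased patients, days to last followup for living ones, and NA when vital status is unknown) can be tried on a few invented records without the TCGA files. The block below is only a minimal sketch under stated assumptions: every `toy*` name and every value in it is hypothetical, it is not TCGA data, and it assumes missing entries are coded as the string "[Not Available]", as seen for vital status above.

```r
library(survival)

# Hypothetical records mimicking the three columns used in Section 5.
# All values are invented for illustration; none come from TCGA.
toyClinical <- data.frame(
  vital_status          = c("DECEASED", "LIVING", "[Not Available]", "DECEASED"),
  days_to_death         = c("730", "[Not Available]", "[Not Available]", "365"),
  days_to_last_followup = c("700", "1095", "[Not Available]", "365"),
  row.names             = c("toy-01", "toy-02", "toy-03", "toy-04"),
  stringsAsFactors      = FALSE
)

# Non-numeric strings become NA on coercion; the resulting warning is
# expected here, so it is suppressed.
toyDeath    <- suppressWarnings(as.numeric(toyClinical[, "days_to_death"]))
toyFollowup <- suppressWarnings(as.numeric(toyClinical[, "days_to_last_followup"]))
toyStatus   <- toyClinical[, "vital_status"]

# Time to event: followup time if living, time of death if deceased.
toyDays <- rep(NA_real_, nrow(toyClinical))
toyDays[toyStatus == "LIVING"]   <- toyFollowup[toyStatus == "LIVING"]
toyDays[toyStatus == "DECEASED"] <- toyDeath[toyStatus == "DECEASED"]

# Event indicator: TRUE for an observed death, FALSE for censoring,
# NA when vital status is unknown.
toyEvent <- rep(NA, nrow(toyClinical))
toyEvent[toyStatus == "LIVING"]   <- FALSE
toyEvent[toyStatus == "DECEASED"] <- TRUE

# Survival times are expressed in years, as for tcgaOSYrs.
toyOSYrs <- Surv(toyDays / 365, toyEvent)
print(toyOSYrs)
```

In this toy input, the first record deliberately has a day of death later than its last followup, mirroring the pattern noted for entry 1 of the real table, and the third record shows that a patient with unknown vital status ends up with a missing survival entry rather than being silently treated as censored.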