Assembling Clinical Information for the TCGA Ovarian Data

by Keith A. Baggerly

1 Executive Summary

1.1 Introduction

We want to produce an RData file with the clinical information for the ovarian cancer samples profiled by The Cancer Genome Atlas (TCGA).

1.2 Methods

We downloaded the gzipped tarball containing the biotab clinical information from the open-access TCGA web page, https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/bcr/biotab/clin/, on Sep 14, 2012.

We load the “clinical_patient_ov” information into a data frame, and construct an R “Surv” object for overall survival.

We also construct a binary indicator vector for the presence or absence of residual disease (RD).

1.3 Results

We save tcgaClinical, tcgaOSYrs, and tcgaRD to the RData file “tcgaClinical.RData”.

2 Libraries

We first load the libraries we will use in this report.


library(survival)

3 Specifying the Raw Data Location

Here we specify where the data acquired from TCGA are stored on our local system. To run this report yourself, you will need to acquire these files and adjust this path.


pathToTCGAData <- file.path("RawData", "TCGA", "Clinical")
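
If the path is wrong, the problem otherwise surfaces only as a read.table error in the next section. As a minimal sketch, one could check for the input file up front. The helper checkInputFile is a hypothetical name and not part of the original analysis; the demonstration uses a temporary file so that it runs without the TCGA download.

```r
# Hypothetical guard (not in the original report): stop with a clear
# message if an expected input file is missing.
checkInputFile <- function(path) {
  if (!file.exists(path)) {
    stop("Cannot find ", path,
         "; download the biotab archive and adjust pathToTCGAData.")
  }
  invisible(TRUE)
}

# Demonstration on a temporary file standing in for the clinical table.
demoFile <- tempfile(fileext = ".txt")
writeLines("placeholder", demoFile)
checkInputFile(demoFile)

# In the report itself, the call would look like this:
# checkInputFile(file.path(pathToTCGAData, "clinical_patient_ov.txt"))
```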

4 Loading the Data

Here we simply load the table of clinical information.


tcgaClinical <- read.table(file.path(pathToTCGAData, "clinical_patient_ov.txt"), 
    header = TRUE, sep = "\t", row.names = 1)
dim(tcgaClinical)
## [1] 576  36
tcgaClinical[1, ]
##              age_at_initial_pathologic_diagnosis
## TCGA-04-1331                                  78
##              anatomic_organ_subdivision
## TCGA-04-1331            [Not Available]
##                                  bcr_patient_uuid date_of_form_completion
## TCGA-04-1331 6d10d4ee-6331-4bba-93bc-a7b64cc0b22a              2009-03-26
##              date_of_initial_pathologic_diagnosis days_to_birth
## TCGA-04-1331                           2004-00-00        -28848
##              days_to_death days_to_initial_pathologic_diagnosis
## TCGA-04-1331          1336                                    0
##              days_to_last_followup eastern_cancer_oncology_group
## TCGA-04-1331                  1224               [Not Available]
##                           ethnicity gender gynecologic_figo_staging_system
## TCGA-04-1331 NOT HISPANIC OR LATINO FEMALE                 [Not Available]
##                      histological_type          icd_10 icd_o_3_histology
## TCGA-04-1331 Serous Cystadenocarcinoma [Not Available]            8441/3
##              icd_o_3_site informed_consent_verified
## TCGA-04-1331        C56.9                       YES
##              initial_pathologic_diagnosis_method   jewish_origin
## TCGA-04-1331                     [Not Available] [Not Available]
##              karnofsky_performance_score lymphatic_invasion
## TCGA-04-1331             [Not Available]                YES
##              neoplasm_histologic_grade patient_id
## TCGA-04-1331                        G3       1331
##              performance_status_scale_timing person_neoplasm_cancer_status
## TCGA-04-1331                 [Not Available]                    WITH TUMOR
##              pretreatment_history  race  residual_tumor tissue_source_site
## TCGA-04-1331                   NO WHITE [Not Available]                  4
##              tumor_histologic_subtype tumor_residual_disease tumor_stage
## TCGA-04-1331       Cystadenocarcinoma                1-10 mm        IIIC
##              tumor_tissue_site venous_invasion vital_status
## TCGA-04-1331             OVARY              NO     DECEASED
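
The printed row shows that missing values are coded as bracketed strings such as “[Not Available]”, so columns containing them load as text rather than numbers. This report keeps the strings and converts columns explicitly later on. As an alternative sketch, not used in this report, read.table can map such placeholders to NA at load time through its na.strings argument. The toy file, its values, and the header name bcr_patient_barcode are made up for illustration.

```r
# Toy stand-in for the clinical table; all values are invented.
toyFile <- tempfile(fileext = ".txt")
writeLines(c(
  "bcr_patient_barcode\tdays_to_death\tdays_to_last_followup\tvital_status",
  "TOY-01\t1336\t1224\tDECEASED",
  "TOY-02\t[Not Applicable]\t1495\tLIVING",
  "TOY-03\t[Not Available]\t[Not Available]\t[Not Available]"
), toyFile)

# Treat both placeholder strings as missing values while reading.
toyClinical <- read.table(toyFile, header = TRUE, sep = "\t", row.names = 1,
    na.strings = c("[Not Available]", "[Not Applicable]"),
    stringsAsFactors = FALSE)

# The day columns are now numeric, with NA where a placeholder was.
str(toyClinical)
```

With this approach, later comparisons against the literal string “[Not Available]” would need to be replaced by is.na() tests.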

5 Defining Overall Survival

Next, we define an R “Surv” object for overall survival (OS). We begin by looking at the recorded values for patient status.


table(tcgaClinical[, "vital_status"])
## 
## [Not Available]        DECEASED          LIVING 
##               4             297             275
tcgaClinical[1:15, "days_to_death"]
##  [1] 1336             1247             55               [Not Applicable]
##  [5] 61               [Not Applicable] [Not Applicable] 563             
##  [9] 361              [Not Applicable] [Not Applicable] 1483            
## [13] 656              1946             [Not Applicable]
## 275 Levels: [Not Applicable] [Not Available] 1000 1003 1007 1013 ... 976
tcgaClinical[1:15, "days_to_last_followup"]
##  [1] 1224            1247            55              1495           
##  [5] 61              1418            [Not Available] 563            
##  [9] 361             1992            1918            1483           
## [13] 656             1946            1991           
## 488 Levels: [Not Available] 0 1004 1007 1011 1013 1018 1024 1025 ... 999
tcgaClinical[1:15, "vital_status"]
##  [1] DECEASED DECEASED DECEASED LIVING   DECEASED LIVING   LIVING  
##  [8] DECEASED DECEASED LIVING   LIVING   DECEASED DECEASED DECEASED
## [15] LIVING  
## Levels: [Not Available] DECEASED LIVING

Vital status is available for all but 4 of the 576 patients. The first 15 entries show that days to death can exceed days to last followup (entry 1: 1336 vs. 1224), and that days to death is recorded as “[Not Applicable]” for patients still living. We therefore use days to death for deceased patients and days to last followup for living ones.

These conclusions are based on a small sample of the data, so we now perform more extensive sanity checks across all patients.


daysToDeath <- as.numeric(as.character(tcgaClinical[, "days_to_death"]))
## Warning: NAs introduced by coercion
daysToLastFollowup <- as.numeric(as.character(tcgaClinical[, "days_to_last_followup"]))
## Warning: NAs introduced by coercion
vitalStatus <- as.character(tcgaClinical[, "vital_status"])

summary(daysToDeath - daysToLastFollowup)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0     0.0    24.2     0.0  1200.0     279
summary(daysToDeath[vitalStatus == "LIVING"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     275
daysToDeath[vitalStatus == "[Not Available]"]
## [1] NA NA NA NA
daysToLastFollowup[vitalStatus == "[Not Available]"]
## [1] NA NA  0 NA
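
The same checks can be written as assertions that halt the report if the data ever change. The following is a minimal sketch on made-up vectors, not the TCGA values; the variable names mirror those used above.

```r
# Toy vectors standing in for the converted TCGA columns (values invented).
daysToDeath <- c(1336, 1247, NA, 61, NA)
daysToLastFollowup <- c(1224, 1247, 1495, 61, NA)
vitalStatus <- c("DECEASED", "DECEASED", "LIVING", "DECEASED", "[Not Available]")

# 1. Where both times are known, death is never before last followup.
gap <- daysToDeath - daysToLastFollowup
stopifnot(all(gap >= 0, na.rm = TRUE))

# 2. No living patient has a recorded time of death.
stopifnot(all(is.na(daysToDeath[vitalStatus == "LIVING"])))

# 3. Patients with unknown vital status have no recorded time of death.
stopifnot(all(is.na(daysToDeath[vitalStatus == "[Not Available]"])))
```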

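The sanity checks pass: days to death is never less than days to last followup, no living patient has a recorded time of death, and the 4 patients with unknown vital status have no recorded time of death. We now assemble the Surv object, leaving those 4 patients as NA.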


# Time to event: last followup for living patients, death for deceased ones.
# Patients with unknown vital status keep NA.
daysToEvent <- rep(NA, nrow(tcgaClinical))
daysToEvent[vitalStatus == "LIVING"] <- daysToLastFollowup[vitalStatus == "LIVING"]
daysToEvent[vitalStatus == "DECEASED"] <- daysToDeath[vitalStatus == "DECEASED"]
eventStatus <- rep(NA, nrow(tcgaClinical))
eventStatus[vitalStatus == "LIVING"] <- "Censored"
eventStatus[vitalStatus == "DECEASED"] <- "Uncensored"

# Convert days to years (365 days per year); TRUE marks an observed death.
tcgaOSYrs <- Surv(daysToEvent/365, eventStatus == "Uncensored")
rownames(tcgaOSYrs) <- rownames(tcgaClinical)
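
To illustrate how a Surv object of this form is typically used, here is a minimal sketch on made-up data (the toy vectors are not TCGA values). It builds the object the same way and passes it to survfit for a Kaplan-Meier estimate.

```r
library(survival)

# Toy follow-up times in days and vital status (values invented).
toyDays <- c(1336, 1247, 55, 1495, 61, 1418)
toyStatus <- c("DECEASED", "DECEASED", "DECEASED", "LIVING", "DECEASED", "LIVING")

# Same construction as above: time in years, TRUE for an observed death.
toyOSYrs <- Surv(toyDays / 365, toyStatus == "DECEASED")

# Censored observations print with a trailing "+".
print(toyOSYrs)

# Kaplan-Meier estimate of overall survival for the whole toy cohort.
kmFit <- survfit(toyOSYrs ~ 1)
print(kmFit)
```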

6 Defining a Residual Disease Indicator

Now we summarize the Residual Disease (RD) information into a single indicator vector specifying whether there is any RD (“RD”) or no RD (“No RD”). We begin by tabulating the information we have.


table(tcgaClinical[, "tumor_residual_disease"])
## 
##        [Not Available]                 >20 mm                1-10 mm 
##                     62                    104                    254 
##               11-20 mm No Macroscopic disease 
##                     38                    118

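We now define the indicator. Any macroscopic residual disease, regardless of size, is coded as “RD”; the 62 patients with no RD information are left as NA.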


tcgaRD <- rep(NA, nrow(tcgaClinical))
tcgaRD[tcgaClinical[, "tumor_residual_disease"] == ">20 mm"] <- "RD"
tcgaRD[tcgaClinical[, "tumor_residual_disease"] == "1-10 mm"] <- "RD"
tcgaRD[tcgaClinical[, "tumor_residual_disease"] == "11-20 mm"] <- "RD"
tcgaRD[tcgaClinical[, "tumor_residual_disease"] == "No Macroscopic disease"] <- "No RD"
table(tcgaRD)
## tcgaRD
## No RD    RD 
##   118   396
names(tcgaRD) <- rownames(tcgaClinical)
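
The same mapping can be written more compactly with %in%. The following is an equivalent sketch on a toy vector; residual, rdLevels, and toyRD are hypothetical names introduced only for this example.

```r
# One toy entry per category seen in the tabulation above.
residual <- c(">20 mm", "1-10 mm", "11-20 mm",
              "No Macroscopic disease", "[Not Available]")

# Categories that count as residual disease.
rdLevels <- c(">20 mm", "1-10 mm", "11-20 mm")

# Start from NA so that unlisted categories stay missing.
toyRD <- rep(NA_character_, length(residual))
toyRD[residual %in% rdLevels] <- "RD"
toyRD[residual == "No Macroscopic disease"] <- "No RD"

# Show the missing entries explicitly in the tabulation.
table(toyRD, useNA = "ifany")
```

Listing the RD categories in one vector keeps the definition in a single place if the category labels change.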

7 Saving RData

Now we save the relevant objects to an RData file.


save(tcgaClinical, tcgaOSYrs, tcgaRD, file = file.path("RDataObjects", "tcgaClinical.RData"))
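
Note that save fails if the target directory does not exist. As a minimal sketch, using a temporary directory and a toy object rather than the real results, one could create the directory first and then reload the file into a separate environment to confirm the round trip.

```r
# Toy object standing in for the saved results (values invented).
toyValues <- c(a = 1, b = 2)

# Create the output directory if needed; a temporary location is used here.
outDir <- file.path(tempdir(), "RDataObjects")
dir.create(outDir, showWarnings = FALSE, recursive = TRUE)

rdataFile <- file.path(outDir, "toy.RData")
save(toyValues, file = rdataFile)

# Reload into a fresh environment so nothing in the workspace is overwritten.
checkEnv <- new.env()
loadedNames <- load(rdataFile, envir = checkEnv)
stopifnot(identical(checkEnv$toyValues, toyValues))
```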

8 Appendix

8.1 File Location


getwd()
## [1] "\\\\mdadqsfs02/workspace/kabagg/RDPaper/Webpage/ResidualDisease"

8.2 Session Info


sessionInfo()
## R version 2.15.3 (2013-03-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] splines   stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] survival_2.37-4 knitr_1.2      
## 
## loaded via a namespace (and not attached):
## [1] digest_0.6.3   evaluate_0.4.3 formatR_0.7    stringr_0.6.2 
## [5] tools_2.15.3