Filtering Samples from the TCGA Data to Focus on RD

by Keith A. Baggerly

1 Executive Summary

1.1 Introduction

We have data from TCGA for 594 ovarian samples, but these include normal samples, recurrences, and cell lines, which we do not want in our comparisons. To focus the question more precisely, we identify the high-grade serous ovarian tumors for which residual disease (RD) information is available.

1.2 Methods

Starting with the previously assembled tables of clinical information and expression (for the sample names), we examine the various columns and see which clinical features would justify exclusion from the set being examined.

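We consider sample type, residual disease status, tissue site, neoadjuvant chemotherapy, grade, and duplicate samples from the same patient.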

We use these rules to build up a data frame with two columns: sampleUse (Used or Unused), and whyExcluded. We also construct vectors mapping the samples assayed to the clinical information, and the RD status of the samples (as opposed to patients).

1.3 Results

We exclude 103 of the 594 samples for various reasons. Of the 491 that remain, 378 are RD and 113 are No RD.

We save tcgaFilteredSamples, tcgaSampleClinicalMapping, and tcgaSampleRD to the RData file “tcgaFilteredSamples.RData”.

2 Libraries

We first load the libraries we will use in this report.

3 Loading the Data

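Here we simply load the previously assembled clinical information and expression matrices, and skim the first row of the clinical information to see what variables exist for filtering the samples. The objects tcgaSampleInfo and tcgaRD used below are not created in this report, so they are presumably restored by these load() calls as well.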


load(file.path("RDataObjects", "tcgaClinical.RData"))
load(file.path("RDataObjects", "tcgaExpression.RData"))
tcgaClinical[1, ]
##              age_at_initial_pathologic_diagnosis
## TCGA-04-1331                                  78
##              anatomic_organ_subdivision
## TCGA-04-1331            [Not Available]
##                                  bcr_patient_uuid date_of_form_completion
## TCGA-04-1331 6d10d4ee-6331-4bba-93bc-a7b64cc0b22a              2009-03-26
##              date_of_initial_pathologic_diagnosis days_to_birth
## TCGA-04-1331                           2004-00-00        -28848
##              days_to_death days_to_initial_pathologic_diagnosis
## TCGA-04-1331          1336                                    0
##              days_to_last_followup eastern_cancer_oncology_group
## TCGA-04-1331                  1224               [Not Available]
##                           ethnicity gender gynecologic_figo_staging_system
## TCGA-04-1331 NOT HISPANIC OR LATINO FEMALE                 [Not Available]
##                      histological_type          icd_10 icd_o_3_histology
## TCGA-04-1331 Serous Cystadenocarcinoma [Not Available]            8441/3
##              icd_o_3_site informed_consent_verified
## TCGA-04-1331        C56.9                       YES
##              initial_pathologic_diagnosis_method   jewish_origin
## TCGA-04-1331                     [Not Available] [Not Available]
##              karnofsky_performance_score lymphatic_invasion
## TCGA-04-1331             [Not Available]                YES
##              neoplasm_histologic_grade patient_id
## TCGA-04-1331                        G3       1331
##              performance_status_scale_timing person_neoplasm_cancer_status
## TCGA-04-1331                 [Not Available]                    WITH TUMOR
##              pretreatment_history  race  residual_tumor tissue_source_site
## TCGA-04-1331                   NO WHITE [Not Available]                  4
##              tumor_histologic_subtype tumor_residual_disease tumor_stage
## TCGA-04-1331       Cystadenocarcinoma                1-10 mm        IIIC
##              tumor_tissue_site venous_invasion vital_status
## TCGA-04-1331             OVARY              NO     DECEASED

4 Filtering Samples Used

We now walk through the various criteria and see what each implies for inclusion of the various samples. Our default assumption is that all samples are used.


sampleUse <- rep("Used", ncol(tcgaExpression))
names(sampleUse) <- colnames(tcgaExpression)

whyExcluded <- rep("", ncol(tcgaExpression))
names(whyExcluded) <- colnames(tcgaExpression)

We also define a mapping between the samples run and the patients from which they were derived, to let us go from the expression data (on samples) to the clinical data (on patients) and vice-versa.


sampleClinicalMapping <- match(substr(colnames(tcgaExpression), 1, 12), rownames(tcgaClinical))
names(sampleClinicalMapping) <- names(sampleUse)
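
As a small illustration of how this mapping behaves, the sketch below applies the same substr() and match() steps to a toy set of barcodes. The vectors sampleBarcodes and patientIds are made up for the example (the last barcode is invented so that one sample has no clinical record); they are not the real column or row names.

```r
# Toy barcodes: the first three follow identifiers seen in this report,
# the last one is invented to show what happens without a clinical record.
sampleBarcodes <- c("TCGA-04-1331-01A-01R-0434-01",
                    "TCGA-23-1023-01A-02R-0434-01",
                    "TCGA-23-1023-01R-01R-0808-01",
                    "TCGA-99-9999-01A-01R-0000-01")
patientIds <- c("TCGA-04-1331", "TCGA-23-1023")

# The first 12 characters of a sample barcode identify the patient.
samplePatient <- substr(sampleBarcodes, 1, 12)

# match() gives the position of each patient in patientIds, or NA if absent.
toyMapping <- match(samplePatient, patientIds)
names(toyMapping) <- sampleBarcodes
toyMapping
```

Two samples map to the same patient row, and the unmatched sample maps to NA, so any clinical lookup through this mapping returns NA for that sample.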

4.1 Type of Sample

First, we check the type of sample. We want to focus on primary tumors, not normal samples or recurrences.


table(tcgaSampleInfo[, "sampleTypeText"])
## 
##   normalTissue   primaryTumor recurrentTumor 
##              8            569             17

sampleUse[tcgaSampleInfo[, "sampleTypeText"] == "normalTissue"] <- "Unused"
sampleUse[tcgaSampleInfo[, "sampleTypeText"] == "recurrentTumor"] <- "Unused"

whyExcluded[tcgaSampleInfo[, "sampleTypeText"] == "normalTissue"] <- paste(whyExcluded[tcgaSampleInfo[, 
    "sampleTypeText"] == "normalTissue"], "-normalTissue-", sep = "")
whyExcluded[tcgaSampleInfo[, "sampleTypeText"] == "recurrentTumor"] <- paste(whyExcluded[tcgaSampleInfo[, 
    "sampleTypeText"] == "recurrentTumor"], "-recurrentTumor-", sep = "")

table(sampleUse)
## sampleUse
## Unused   Used 
##     25    569
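
The report does not show how tcgaSampleInfo was assembled. One plausible route, offered here only as an assumption, is to decode the sample-type field of the TCGA barcode (characters 14 and 15), where 01 denotes a primary solid tumor, 02 a recurrent solid tumor, and 11 solid tissue normal. The helper sampleTypeFromBarcode below is hypothetical and simply reuses the three labels seen in the table above.

```r
# Hypothetical helper (not from the original analysis): decode the
# sample-type code in characters 14-15 of a TCGA barcode.
sampleTypeFromBarcode <- function(barcodes) {
    codes <- substr(barcodes, 14, 15)
    lookup <- c("01" = "primaryTumor",
                "02" = "recurrentTumor",
                "11" = "normalTissue")
    # Codes not listed in lookup come back as NA.
    unname(lookup[codes])
}

# Toy barcodes: only the first appears in this report.
toyBarcodes <- c("TCGA-04-1331-01A-01R-0434-01",
                 "TCGA-00-0000-11A-01R-0000-01",
                 "TCGA-00-0000-02A-01R-0000-01")
toyTypes <- sampleTypeFromBarcode(toyBarcodes)
table(toyTypes)
```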

4.2 Residual Disease

Now we check residual disease status, and exclude samples with no information.


# sampleClinicalMapping (defined above) gives each sample's row in tcgaClinical.
# Index the patient-level RD calls by it (tcgaRD is assumed to follow the same
# row order as tcgaClinical) to get one RD call per sample.
tcgaSampleRD <- tcgaRD[sampleClinicalMapping]
names(tcgaSampleRD) <- names(sampleClinicalMapping)

table(tcgaSampleRD, useNA = "ifany")
## tcgaSampleRD
## No RD    RD  <NA> 
##   121   401    72

sampleUse[is.na(tcgaSampleRD)] <- "Unused"
whyExcluded[is.na(tcgaSampleRD)] <- paste(whyExcluded[is.na(tcgaSampleRD)], 
    "-No RD Info-", sep = "")

table(sampleUse)
## sampleUse
## Unused   Used 
##     88    506
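
A sample can end up with a missing RD call in two ways under this lookup: the patient's RD status was never recorded, or the sample does not match any row of the clinical table. The toy sketch below shows both cases; toyRD and toySampleMap are invented for illustration, and the report does not say which case accounts for the 72 samples above.

```r
# Toy patient-level RD calls; NA means RD was not recorded for that patient.
toyRD <- c("RD", "No RD", NA)
names(toyRD) <- c("P1", "P2", "P3")

# Toy sample-to-patient row mapping; sample S4 has no clinical record.
toySampleMap <- c(S1 = 1L, S2 = 2L, S3 = 3L, S4 = NA)

# Indexing by NA yields NA, so both kinds of gap look the same afterwards.
toySampleRD <- toyRD[toySampleMap]
names(toySampleRD) <- names(toySampleMap)

table(toySampleRD, useNA = "ifany")
is.na(toySampleRD)
```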

4.3 Tissue Site

Next, we look at the site the sample was taken from (the “tissue site”). We want tumors from the ovary or the peritoneum.


table(tcgaClinical[, "tumor_tissue_site"])
## 
##            OMENTUM              OVARY PERITONEUM (OVARY) 
##                  2                572                  2

badSamples <- which(is.element(sampleClinicalMapping, which(tcgaClinical[, "tumor_tissue_site"] == 
    "OMENTUM")))

sampleUse[badSamples] <- "Unused"

whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-Not OV or PE-", 
    sep = "")

table(sampleUse)
## sampleUse
## Unused   Used 
##     90    504

4.4 Neoadjuvant Chemo

Next, we look at whether the patients received neoadjuvant chemotherapy. We want to focus on chemo-naive tumors.


table(tcgaClinical[, "pretreatment_history"])
## 
##  NO YES 
## 574   2

badSamples <- which(is.element(sampleClinicalMapping, which(tcgaClinical[, "pretreatment_history"] == 
    "YES")))

sampleUse[badSamples] <- "Unused"

whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-NeoAdj Chemo-", 
    sep = "")

table(sampleUse)
## sampleUse
## Unused   Used 
##     90    504

4.5 Grade

Next, we look at grade. We want only samples of Grade 2 or higher, so we exclude G1, GB, and GX, along with samples whose grade is not available.


table(tcgaClinical[, "neoplasm_histologic_grade"])
## 
## [Not Available]              G1              G2              G3 
##               4               6              69             486 
##              G4              GB              GX 
##               1               1               9

badSamples <- which(is.element(sampleClinicalMapping, which(tcgaClinical[, "neoplasm_histologic_grade"] == 
    "[Not Available]")))

sampleUse[badSamples] <- "Unused"

whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-Grade NA-", sep = "")

badSamples <- which(is.element(sampleClinicalMapping, which(tcgaClinical[, "neoplasm_histologic_grade"] == 
    "G1")))

sampleUse[badSamples] <- "Unused"

whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-Grade 1-", sep = "")

badSamples <- which(is.element(sampleClinicalMapping, which(tcgaClinical[, "neoplasm_histologic_grade"] == 
    "GB")))

sampleUse[badSamples] <- "Unused"

whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-Grade GB-", sep = "")

badSamples <- which(is.element(sampleClinicalMapping, which(tcgaClinical[, "neoplasm_histologic_grade"] == 
    "GX")))

sampleUse[badSamples] <- "Unused"

whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-Grade GX-", sep = "")

table(sampleUse)
## sampleUse
## Unused   Used 
##    102    492
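
The tissue site, chemotherapy, and grade filters all repeat the same three steps: find the offending patients, find their samples, then update sampleUse and whyExcluded. As a sketch only, the same pattern could be wrapped in a helper. The function excludeByClinical and the toy data below are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: mark as "Unused" every sample whose patient has one of
# badValues in clinicalColumn, and append reason to its exclusion string.
excludeByClinical <- function(sampleUse, whyExcluded, mapping,
                              clinicalColumn, badValues, reason) {
    badPatients <- which(clinicalColumn %in% badValues)
    badSamples <- which(mapping %in% badPatients)
    sampleUse[badSamples] <- "Unused"
    whyExcluded[badSamples] <- paste(whyExcluded[badSamples], reason, sep = "")
    list(sampleUse = sampleUse, whyExcluded = whyExcluded)
}

# Toy data: four samples drawn from three patients.
toyGrade <- c("G3", "G1", "GX")
toyMap <- c(S1 = 1L, S2 = 2L, S3 = 3L, S4 = 1L)
toyUse <- rep("Used", 4)
names(toyUse) <- names(toyMap)
toyWhy <- rep("", 4)
names(toyWhy) <- names(toyMap)

res <- excludeByClinical(toyUse, toyWhy, toyMap, toyGrade, "G1", "-Grade 1-")
res <- excludeByClinical(res$sampleUse, res$whyExcluded, toyMap, toyGrade,
                         "GX", "-Grade GX-")
table(res$sampleUse)
```

Keeping one call per reason preserves the separate labels in whyExcluded, as the original code does.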

4.6 Check Samples Used for Duplicates

Finally, we check to see if any of the samples remaining in the “Used” category appear more than once, and if so, which ones.


namesOfSamplesUsed <- names(sampleUse)[sampleUse == "Used"]
which(duplicated(substr(namesOfSamplesUsed, 1, 12)))
## [1] 443
namesOfSamplesUsed[which(duplicated(substr(namesOfSamplesUsed, 1, 12)))]
## [1] "TCGA-23-1023-01R-01R-0808-01"
which(substr(namesOfSamplesUsed, 1, 12) == "TCGA-23-1023")
## [1]  98 443
namesOfSamplesUsed[which(substr(namesOfSamplesUsed, 1, 12) == "TCGA-23-1023")]
## [1] "TCGA-23-1023-01A-02R-0434-01" "TCGA-23-1023-01R-01R-0808-01"

sampleUse["TCGA-23-1023-01R-01R-0808-01"] <- "Unused"
whyExcluded["TCGA-23-1023-01R-01R-0808-01"] <- "-duplicate sample-"

table(sampleUse)
## sampleUse
## Unused   Used 
##    103    491

One patient, TCGA-23-1023, contributes two of the remaining samples. We arbitrarily keep the first and mark the second as unused.
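
Note that duplicated() flags the second and later occurrences of a value, so which sample counts as "the first" depends on the column order of the expression matrix. The toy sketch below illustrates this with three barcodes taken from this report; the ordering within toyUsed is chosen for the example.

```r
# Three barcodes seen in this report, in an order chosen for illustration.
toyUsed <- c("TCGA-23-1023-01A-02R-0434-01",
             "TCGA-04-1331-01A-01R-0434-01",
             "TCGA-23-1023-01R-01R-0808-01")

# TRUE for every sample whose patient has already appeared earlier.
isRepeat <- duplicated(substr(toyUsed, 1, 12))
toyUsed[isRepeat]

# Reversing the order changes which of the two samples is flagged.
revUsed <- rev(toyUsed)
revUsed[duplicated(substr(revUsed, 1, 12))]
```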

4.7 Final Tally

Now we see how many RD and No RD samples remain.


table(sampleUse, tcgaSampleRD, useNA = "ifany")
##          tcgaSampleRD
## sampleUse No RD  RD <NA>
##    Unused     8  23   72
##    Used     113 378    0

There are 491 samples left, 113 from patients with No RD, and 378 from patients with RD.
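
Because a sample can fail several criteria, its whyExcluded string may hold more than one reason. A per-reason tally can be sketched by splitting each string into its "-reason-" pieces; the vector toyWhy below is invented for illustration, and the pattern assumes that no reason label contains a hyphen internally, which holds for the labels used in this report.

```r
# Toy exclusion strings in the same format as whyExcluded.
toyWhy <- c("", "-normalTissue--No RD Info-", "-Grade 1-", "-No RD Info-")

# Pull out each "-reason-" piece; empty strings contribute nothing.
reasons <- unlist(regmatches(toyWhy, gregexpr("-[^-]+-", toyWhy)))
table(reasons)
```

The per-reason counts can sum to more than the number of excluded samples, since one sample may carry several reasons.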

5 Building the Data Frame

Now we bundle the assembled information into a data frame for later use.


tcgaFilteredSamples <- data.frame(sampleUse = sampleUse, whyExcluded = whyExcluded, 
    row.names = colnames(tcgaExpression))
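
As a sketch of how a later report might use this data frame, the example below restricts a toy expression matrix to the samples marked "Used". The objects toyExpression and toyFiltered are made up for illustration and only mimic the layout of tcgaExpression and tcgaFilteredSamples.

```r
# Toy expression matrix: 3 genes by 4 samples.
toyExpression <- matrix(1:12, nrow = 3,
                        dimnames = list(paste0("gene", 1:3), paste0("S", 1:4)))

# Toy filter table laid out like tcgaFilteredSamples.
toyFiltered <- data.frame(sampleUse = c("Used", "Unused", "Used", "Used"),
                          whyExcluded = c("", "-Grade 1-", "", ""),
                          row.names = colnames(toyExpression),
                          stringsAsFactors = FALSE)

# Select columns by sample name rather than by position.
keep <- rownames(toyFiltered)[toyFiltered$sampleUse == "Used"]
toyExpressionUsed <- toyExpression[, keep, drop = FALSE]
dim(toyExpressionUsed)
```

Selecting by name guards against the two objects being in different column orders, and the comparison with "Used" works whether sampleUse is stored as character or as a factor.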

6 Saving RData

Now we save the relevant information to an RData object.


tcgaSampleClinicalMapping <- sampleClinicalMapping

save(tcgaFilteredSamples, tcgaSampleRD, tcgaSampleClinicalMapping, file = file.path("RDataObjects", 
    "tcgaFilteredSamples.RData"))

7 Appendix

7.1 File Location


getwd()
## [1] "/Users/slt/SLT WORKSPACE/EXEMPT/OVARIAN/Ovarian residual disease study 2012/RD manuscript/Web page for paper/Webpage"

7.2 SessionInfo


sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.5
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.1 formatR_0.9    stringr_0.6.2  tools_3.0.2