Filtering Samples from the TCGA Data to Focus on RD
===================================================

by Keith A. Baggerly

## 1 Executive Summary

### 1.1 Introduction

We have data from TCGA for 594 ovarian samples, but these include normal samples, recurrences, and cell lines. We don't want to include these samples in our comparisons. We want to identify the high-grade serous ovarian tumors with residual disease (RD) information to focus the question more precisely.

### 1.2 Methods

Starting with the previously assembled tables of clinical information and expression (for the sample names), we examine the various columns to see which clinical features would justify excluding a sample from the set being examined. We consider

- Type of Sample, excluding normal and recurrent disease.
- RD status, excluding samples with no RD information.
- Array Site, excluding samples not coming from the ovary (OV) or peritoneum (PE).
- Neoadjuvant Treatment, excluding samples from patients who received chemotherapy before sample acquisition.
- Grade, excluding samples not Grade 2, 3, or 4.
- Duplication, excluding all but the first occurrence of any remaining samples derived from the same patient.

We use these rules to build up a data frame with two columns: sampleUse (Used or Unused) and whyExcluded. We also construct two vectors: one mapping the assayed samples to rows of the clinical information, and one giving the RD status of each sample (as opposed to each patient).

### 1.3 Results

We exclude 103 of the 594 samples for various reasons. Of the 491 that remain, 378 are RD and 113 are No RD.

We save tcgaFilteredSamples, tcgaSampleClinicalMapping, and tcgaSampleRD to the RData file "tcgaFilteredSamples.RData".

## 2 Libraries

We first load the libraries we will use in this report.

## 3 Loading the Data

Here we simply load the previously assembled clinical information and expression matrices, and skim the first line of the clinical information to see what variables exist for filtering the samples.

```r
load(file.path("RDataObjects", "tcgaClinical.RData"))
load(file.path("RDataObjects", "tcgaExpression.RData"))
tcgaClinical[1, ]
```

```
##              age_at_initial_pathologic_diagnosis
## TCGA-04-1331                                  78
##              anatomic_organ_subdivision
## TCGA-04-1331            [Not Available]
##                                  bcr_patient_uuid date_of_form_completion
## TCGA-04-1331 6d10d4ee-6331-4bba-93bc-a7b64cc0b22a              2009-03-26
##              date_of_initial_pathologic_diagnosis days_to_birth
## TCGA-04-1331                           2004-00-00        -28848
##              days_to_death days_to_initial_pathologic_diagnosis
## TCGA-04-1331          1336                                    0
##              days_to_last_followup eastern_cancer_oncology_group
## TCGA-04-1331                  1224               [Not Available]
##                           ethnicity gender gynecologic_figo_staging_system
## TCGA-04-1331 NOT HISPANIC OR LATINO FEMALE                 [Not Available]
##                      histological_type          icd_10 icd_o_3_histology
## TCGA-04-1331 Serous Cystadenocarcinoma [Not Available]            8441/3
##              icd_o_3_site informed_consent_verified
## TCGA-04-1331        C56.9                       YES
##              initial_pathologic_diagnosis_method   jewish_origin
## TCGA-04-1331                     [Not Available] [Not Available]
##              karnofsky_performance_score lymphatic_invasion
## TCGA-04-1331             [Not Available]                YES
##              neoplasm_histologic_grade patient_id
## TCGA-04-1331                        G3       1331
##              performance_status_scale_timing person_neoplasm_cancer_status
## TCGA-04-1331                 [Not Available]                    WITH TUMOR
##              pretreatment_history  race  residual_tumor tissue_source_site
## TCGA-04-1331                   NO WHITE [Not Available]                  4
##              tumor_histologic_subtype tumor_residual_disease tumor_stage
## TCGA-04-1331       Cystadenocarcinoma                1-10 mm        IIIC
##              tumor_tissue_site venous_invasion vital_status
## TCGA-04-1331             OVARY              NO     DECEASED
```

## 4 Filtering Samples Used

We now walk through the various criteria and see what they imply for inclusion of the various samples. Our default assumption is that all samples are used.

```r
sampleUse <- rep("Used", ncol(tcgaExpression))
names(sampleUse) <- colnames(tcgaExpression)
whyExcluded <- rep("", ncol(tcgaExpression))
names(whyExcluded) <- colnames(tcgaExpression)
```

We also define a mapping between the samples run and the patients from which they were derived, to let us go from the expression data (on samples) to the clinical data (on patients) and vice-versa.

```r
sampleClinicalMapping <- match(substr(colnames(tcgaExpression), 1, 12),
    rownames(tcgaClinical))
names(sampleClinicalMapping) <- names(sampleUse)
```

### 4.1 Type of Sample

First, we check the type of sample. We want to focus on primary tumors, not normal samples or recurrences.

```r
table(tcgaSampleInfo[, "sampleTypeText"])
```

```
## 
##   normalTissue   primaryTumor recurrentTumor 
##              8            569             17
```

```r
sampleUse[tcgaSampleInfo[, "sampleTypeText"] == "normalTissue"] <- "Unused"
sampleUse[tcgaSampleInfo[, "sampleTypeText"] == "recurrentTumor"] <- "Unused"
whyExcluded[tcgaSampleInfo[, "sampleTypeText"] == "normalTissue"] <-
    paste(whyExcluded[tcgaSampleInfo[, "sampleTypeText"] == "normalTissue"],
        "-normalTissue-", sep = "")
whyExcluded[tcgaSampleInfo[, "sampleTypeText"] == "recurrentTumor"] <-
    paste(whyExcluded[tcgaSampleInfo[, "sampleTypeText"] == "recurrentTumor"],
        "-recurrentTumor-", sep = "")
table(sampleUse)
```

```
## sampleUse
## Unused   Used 
##     25    569
```

### 4.2 Residual Disease

Now we check residual disease status, and exclude samples with no information.

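
Residual disease is recorded per patient, so the status has to be expanded to one value per sample before samples can be filtered. Indexing a patient-level vector by the sample-to-patient mapping does this, and a sample with no matching status comes out as `NA`. A minimal sketch with made-up values follows; all `toy*` names and barcodes are hypothetical, and naming the status vector by patient barcode is an assumption made only for this illustration.

```r
# Hypothetical patient-level RD status, named by (made-up) patient barcode.
toyRD <- c("TCGA-00-0001" = "RD", "TCGA-00-0002" = "No RD")

# Hypothetical sample barcodes; the third has no matching patient.
toySamples <- c("TCGA-00-0001-01A-01R-0000-01",
                "TCGA-00-0002-01A-01R-0000-01",
                "TCGA-00-0003-01A-01R-0000-01")

# The first 12 characters of a sample barcode give the patient barcode.
toyMap <- match(substr(toySamples, 1, 12), names(toyRD))

# Expand the patient-level status to one value per sample.
toySampleRD <- toyRD[toyMap]
names(toySampleRD) <- toySamples

# useNA = "ifany" keeps the unmatched sample visible in the tally.
table(toySampleRD, useNA = "ifany")
```

The `useNA = "ifany"` argument matters here: without it, `table()` silently drops the samples that the filter is meant to catch.
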
```r
tcgaSampleRD <- rep("", ncol(tcgaExpression))
names(tcgaSampleRD) <- colnames(tcgaExpression)
sampleClinicalMapping <- match(substr(colnames(tcgaExpression), 1, 12),
    rownames(tcgaClinical))
names(sampleClinicalMapping) <- names(tcgaSampleRD)
tcgaSampleRD <- tcgaRD[sampleClinicalMapping]
names(tcgaSampleRD) <- names(sampleClinicalMapping)
table(tcgaSampleRD, useNA = "ifany")
```

```
## tcgaSampleRD
## No RD    RD  <NA> 
##   121   401    72
```

```r
sampleUse[is.na(tcgaSampleRD)] <- "Unused"
whyExcluded[is.na(tcgaSampleRD)] <- paste(whyExcluded[is.na(tcgaSampleRD)],
    "-No RD Info-", sep = "")
table(sampleUse)
```

```
## sampleUse
## Unused   Used 
##     88    506
```

### 4.3 Array Site

Next, we look at the site the sample was taken from (the "tissue site"). We want tumors from the ovary or the peritoneum.

```r
table(tcgaClinical[, "tumor_tissue_site"])
```

```
## 
##            OMENTUM              OVARY PERITONEUM (OVARY) 
##                  2                572                  2
```

```r
badSamples <- which(is.element(sampleClinicalMapping,
    which(tcgaClinical[, "tumor_tissue_site"] == "OMENTUM")))
sampleUse[badSamples] <- "Unused"
whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-Not OV or PE-",
    sep = "")
table(sampleUse)
```

```
## sampleUse
## Unused   Used 
##     90    504
```

### 4.4 Neoadjuvant Chemo

Next, we look at whether the patients received neoadjuvant chemotherapy. We want to focus on chemo-naive tumors.

```r
table(tcgaClinical[, "pretreatment_history"])
```

```
## 
##  NO YES 
## 574   2
```

```r
badSamples <- which(is.element(sampleClinicalMapping,
    which(tcgaClinical[, "pretreatment_history"] == "YES")))
sampleUse[badSamples] <- "Unused"
whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-NeoAdj Chemo-",
    sep = "")
table(sampleUse)
```

```
## sampleUse
## Unused   Used 
##     90    504
```

The tally is unchanged, so this rule excludes no samples beyond those already marked Unused.

### 4.5 Grade

Next, we look at grade. We want only Grade 2 or higher samples.

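
The grade filter repeats one pattern for each excluded value: find the patients with that value, find the samples mapped to those patients, mark those samples Unused, and append a reason. As a minimal sketch of that pattern on toy data, the helper `excludeByClinical` and every `toy*` object below are hypothetical names introduced for illustration; they are not part of the original analysis.

```r
# Hypothetical helper (not in the original report): mark as "Unused" every
# sample whose patient has one of the given values in a clinical column,
# and append the reason to any reasons already recorded.
excludeByClinical <- function(sampleUse, whyExcluded, mapping,
                              clinicalColumn, badValues, reason) {
  badPatients <- which(clinicalColumn %in% badValues)
  badSamples <- which(mapping %in% badPatients)
  sampleUse[badSamples] <- "Unused"
  whyExcluded[badSamples] <- paste(whyExcluded[badSamples], reason, sep = "")
  list(sampleUse = sampleUse, whyExcluded = whyExcluded)
}

# Toy data: three patients and four samples (sample 4 maps to patient 1).
toyGrade <- c("G3", "G1", "GX")
toyMapping <- c(1, 2, 3, 1)
toyUse <- rep("Used", 4)
toyWhy <- rep("", 4)

# Exclude samples from Grade 1 patients only.
toyResult <- excludeByClinical(toyUse, toyWhy, toyMapping, toyGrade,
                               badValues = "G1", reason = "-Grade 1-")
toyResult$sampleUse
toyResult$whyExcluded
```

Returning both vectors in a list keeps the helper free of side effects, at the cost of reassigning `sampleUse` and `whyExcluded` after each call.
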
```r
table(tcgaClinical[, "neoplasm_histologic_grade"])
```

```
## 
## [Not Available]              G1              G2              G3 
##               4               6              69             486 
##              G4              GB              GX 
##               1               1               9
```

```r
badSamples <- which(is.element(sampleClinicalMapping,
    which(tcgaClinical[, "neoplasm_histologic_grade"] == "[Not Available]")))
sampleUse[badSamples] <- "Unused"
whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-Grade NA-", sep = "")

badSamples <- which(is.element(sampleClinicalMapping,
    which(tcgaClinical[, "neoplasm_histologic_grade"] == "G1")))
sampleUse[badSamples] <- "Unused"
whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-Grade 1-", sep = "")

badSamples <- which(is.element(sampleClinicalMapping,
    which(tcgaClinical[, "neoplasm_histologic_grade"] == "GB")))
sampleUse[badSamples] <- "Unused"
whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-Grade GB-", sep = "")

badSamples <- which(is.element(sampleClinicalMapping,
    which(tcgaClinical[, "neoplasm_histologic_grade"] == "GX")))
sampleUse[badSamples] <- "Unused"
whyExcluded[badSamples] <- paste(whyExcluded[badSamples], "-Grade GX-", sep = "")
table(sampleUse)
```

```
## sampleUse
## Unused   Used 
##    102    492
```

### 4.6 Check Samples Used for Duplicates

Finally, we check to see if any of the samples remaining in the "Used" category appear more than once, and if so, which ones.

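
The check relies on the first 12 characters of a sample barcode identifying the patient, so two samples from one patient share that prefix. As a minimal sketch on made-up barcodes (the `toy*` vectors are hypothetical and not part of the TCGA data), `duplicated()` flags the second and any later occurrence of each prefix while leaving the first unflagged:

```r
# Hypothetical barcodes, made up for illustration only.
toyBarcodes <- c("TCGA-00-0001-01A-01R-0000-01",
                 "TCGA-00-0002-01A-01R-0000-01",
                 "TCGA-00-0001-01B-01R-0000-01")

# The first 12 characters give the patient identifier.
toyPatients <- substr(toyBarcodes, 1, 12)

# duplicated() is FALSE for a first occurrence and TRUE for each repeat.
toyIsRepeat <- duplicated(toyPatients)

# Barcodes that would be dropped if only first occurrences are kept.
toyBarcodes[toyIsRepeat]
```

Because only repeats are flagged, dropping the flagged samples keeps exactly one sample per patient, namely whichever comes first in the column order.
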
```r
namesOfSamplesUsed <- names(sampleUse)[sampleUse == "Used"]
which(duplicated(substr(namesOfSamplesUsed, 1, 12)))
```

```
## [1] 443
```

```r
namesOfSamplesUsed[which(duplicated(substr(namesOfSamplesUsed, 1, 12)))]
```

```
## [1] "TCGA-23-1023-01R-01R-0808-01"
```

```r
which(substr(namesOfSamplesUsed, 1, 12) == "TCGA-23-1023")
```

```
## [1]  98 443
```

```r
namesOfSamplesUsed[which(substr(namesOfSamplesUsed, 1, 12) == "TCGA-23-1023")]
```

```
## [1] "TCGA-23-1023-01A-02R-0434-01" "TCGA-23-1023-01R-01R-0808-01"
```

```r
sampleUse["TCGA-23-1023-01R-01R-0808-01"] <- "Unused"
whyExcluded["TCGA-23-1023-01R-01R-0808-01"] <- "-duplicate sample-"
table(sampleUse)
```

```
## sampleUse
## Unused   Used 
##    103    491
```

One patient (TCGA-23-1023) contributes two of the remaining samples. We arbitrarily keep just the first one.

### 4.7 Final Tally

Now we see how many RD and No RD samples remain.

```r
table(sampleUse, tcgaSampleRD, useNA = "ifany")
```

```
##          tcgaSampleRD
## sampleUse No RD  RD <NA>
##    Unused     8  23   72
##    Used     113 378    0
```

There are 491 samples left, 113 from patients with No RD, and 378 from patients with RD.

## 5 Building the Data Frame

Now we bundle the assembled information into a data frame for later use.

```r
tcgaFilteredSamples <- data.frame(sampleUse = sampleUse, whyExcluded = whyExcluded,
    row.names = colnames(tcgaExpression))
```

## 6 Saving RData

Now we save the relevant information to an RData object.

```r
tcgaSampleClinicalMapping <- sampleClinicalMapping
save(tcgaFilteredSamples, tcgaSampleRD, tcgaSampleClinicalMapping,
    file = file.path("RDataObjects", "tcgaFilteredSamples.RData"))
```

## 7 Appendix

### 7.1 File Location

```r
getwd()
```

```
## [1] "/Users/slt/SLT WORKSPACE/EXEMPT/OVARIAN/Ovarian residual disease study 2012/RD manuscript/Web page for paper/Webpage"
```

### 7.2 SessionInfo

```r
sessionInfo()
```

```
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.5
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.1 formatR_0.9    stringr_0.6.2  tools_3.0.2
```