Filtering Samples from the TCGA Data to Focus on RD
==================================================

by Keith A. Baggerly

## 1 Executive Summary

### 1.1 Introduction

We have data from TCGA for 594 ovarian samples, but these include normal samples, recurrences, and cell lines, which we do not want to include in our comparisons. We want to identify the high-grade serous ovarian tumors with residual disease (RD) information so that we can focus the question more precisely.

### 1.2 Methods

Starting with the previously assembled tables of clinical information and expression (the latter for the sample names), we examine the various columns to see which clinical features would justify excluding a sample from the set being examined. We consider

- Type of Sample, excluding normal and recurrent disease.
- RD status, excluding samples with no RD information.
- Array Site, excluding samples not coming from the ovary (OV) or peritoneum (PE).
- Neoadjuvant Treatment, excluding samples from patients who received chemotherapy before sample acquisition.
- Grade, excluding samples not Grade 2, 3, or 4.
- Duplication, excluding all but the first occurrence among any remaining samples derived from the same patient.

We use these rules to build up a data frame with two columns: sampleUse (Used or Unused) and whyExcluded. We also construct two vectors: one mapping the samples assayed to the clinical information, and one giving the RD status of the samples (as opposed to patients).

### 1.3 Results

We exclude 103 of the 594 samples for various reasons. Of the 491 that remain, 378 are RD and 113 are No RD.

We save tcgaFilteredSamples, tcgaSampleClinicalMapping, and tcgaSampleRD to the RData file "tcgaFilteredSamples.RData".

## 2 Libraries

We first load the libraries we will use in this report.

```{r libraries}
```

## 3 Loading the Data

Here we simply load the previously assembled clinical information and expression matrices, and skim the first row of the clinical information to see what variables exist for filtering the samples.

```{r loadTCGARData}
load(file.path("RDataObjects","tcgaClinical.RData"))
load(file.path("RDataObjects","tcgaExpression.RData"))
tcgaClinical[1,]
```

## 4 Filtering Samples Used

We now walk through the various criteria and see what these imply for inclusion of the various samples. Our default assumption is that all samples are used.

```{r setDefaults}
# One entry per expression column (sample), named by sample
sampleUse <- rep("Used", ncol(tcgaExpression))
names(sampleUse) <- colnames(tcgaExpression)
whyExcluded <- rep("", ncol(tcgaExpression))
names(whyExcluded) <- colnames(tcgaExpression)
```

We also define a mapping between the samples run and the patients from which they were derived, to let us go from the expression data (on samples) to the clinical data (on patients) and vice versa.

```{r defineSampleMapping}
# Match the first 12 characters of each sample name (the patient ID)
# to the row names of the clinical table
sampleClinicalMapping <- match(substr(colnames(tcgaExpression),1,12),
                               rownames(tcgaClinical))
names(sampleClinicalMapping) <- names(sampleUse)
```

### 4.1 Type of Sample

First, we check the type of sample. We want to focus on primary tumors, not normal samples or recurrences.

```{r checkSampleType}
# tcgaSampleInfo is not created in this report; it is assumed to come
# from one of the RData files loaded above
table(tcgaSampleInfo[,"sampleTypeText"])
isNormal <- tcgaSampleInfo[,"sampleTypeText"]=="normalTissue"
isRecurrent <- tcgaSampleInfo[,"sampleTypeText"]=="recurrentTumor"
sampleUse[isNormal] <- "Unused"
sampleUse[isRecurrent] <- "Unused"
# Append the reason, so a sample can accumulate several
whyExcluded[isNormal] <- paste(whyExcluded[isNormal],
                               "-normalTissue-",sep="")
whyExcluded[isRecurrent] <- paste(whyExcluded[isRecurrent],
                                  "-recurrentTumor-",sep="")
table(sampleUse)
```

### 4.2 Residual Disease

Now we check residual disease status, and exclude samples with no information.

```{r checkRD}
tcgaSampleRD <- rep("",ncol(tcgaExpression))
names(tcgaSampleRD) <- colnames(tcgaExpression)
# Recomputes the same sample-to-patient mapping defined earlier
sampleClinicalMapping <- match(substr(colnames(tcgaExpression),1,12),
                               rownames(tcgaClinical))
names(sampleClinicalMapping) <- names(tcgaSampleRD)
# tcgaRD (RD status per patient) is not created in this report; it is
# assumed to come from one of the RData files loaded above
tcgaSampleRD <- tcgaRD[sampleClinicalMapping]
names(tcgaSampleRD) <- names(sampleClinicalMapping)
table(tcgaSampleRD,useNA="ifany")
sampleUse[is.na(tcgaSampleRD)] <- "Unused"
whyExcluded[is.na(tcgaSampleRD)] <- paste(whyExcluded[is.na(tcgaSampleRD)],
                                          "-No RD Info-",sep="")
table(sampleUse)
```

### 4.3 Array Site

Next, we look at the site the sample was taken from (the "tissue site"). We want tumors from the ovary or the peritoneum.

```{r checkArraySite}
table(tcgaClinical[,"tumor_tissue_site"])
# Flag samples from patients whose recorded site is the omentum
badSamples <- which(is.element(sampleClinicalMapping,
  which(tcgaClinical[,"tumor_tissue_site"]=="OMENTUM")))
sampleUse[badSamples] <- "Unused"
whyExcluded[badSamples] <- paste(whyExcluded[badSamples],"-Not OV or PE-",sep="")
table(sampleUse)
```

### 4.4 Neoadjuvant Chemo

Next, we look at whether the patients received neoadjuvant chemotherapy. We want to focus on chemo-naive tumors.

```{r checkNeoadjuvant}
table(tcgaClinical[,"pretreatment_history"])
# Flag samples from patients with a recorded pretreatment history
badSamples <- which(is.element(sampleClinicalMapping,
  which(tcgaClinical[,"pretreatment_history"]=="YES")))
sampleUse[badSamples] <- "Unused"
whyExcluded[badSamples] <- paste(whyExcluded[badSamples],"-NeoAdj Chemo-",sep="")
table(sampleUse)
```

### 4.5 Grade

Next, we look at grade. We want only samples of Grade 2 or higher.

```{r checkGrade}
table(tcgaClinical[,"neoplasm_histologic_grade"])
# Grade values to exclude, each paired with the reason recorded in whyExcluded
gradeExclusions <- c("[Not Available]"="-Grade NA-", "G1"="-Grade 1-",
                     "GB"="-Grade GB-", "GX"="-Grade GX-")
for(badGrade in names(gradeExclusions)){
  badSamples <- which(is.element(sampleClinicalMapping,
    which(tcgaClinical[,"neoplasm_histologic_grade"]==badGrade)))
  sampleUse[badSamples] <- "Unused"
  whyExcluded[badSamples] <- paste(whyExcluded[badSamples],
                                   gradeExclusions[[badGrade]],sep="")
}
table(sampleUse)
```

### 4.6 Check Samples Used for Duplicates

Finally, we check to see if any of the samples remaining in the "Used" category come from the same patient, and if so, which ones.

```{r checkDuplicates}
namesOfSamplesUsed <- names(sampleUse)[sampleUse=="Used"]
# Positions and names of samples whose patient ID has already appeared
which(duplicated(substr(namesOfSamplesUsed,1,12)))
namesOfSamplesUsed[which(duplicated(substr(namesOfSamplesUsed,1,12)))]
# All samples still in use from the affected patient
which(substr(namesOfSamplesUsed,1,12)=="TCGA-23-1023")
namesOfSamplesUsed[which(substr(namesOfSamplesUsed,1,12)=="TCGA-23-1023")]
sampleUse["TCGA-23-1023-01R-01R-0808-01"] <- "Unused"
whyExcluded["TCGA-23-1023-01R-01R-0808-01"] <- "-duplicate sample-"
table(sampleUse)
```

One patient, TCGA-23-1023, still has a duplicate sample among those marked as used. We arbitrarily keep just the first one.

### 4.7 Final Tally

Now we see how many RD and No RD samples remain.

```{r checkTally}
table(sampleUse,tcgaSampleRD,useNA="ifany")
```

There are 491 samples left: 113 from patients with No RD and 378 from patients with RD.

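
As a cross-check on a tally like this one, it can help to confirm that every unused sample carries a reason and to count how often each reason occurs. The block below is a minimal sketch on made-up data and is not part of the analysis: `toyUse`, `toyWhy`, and `toyRD` are hypothetical stand-ins for sampleUse, whyExcluded, and tcgaSampleRD, and their values are invented. It assumes, as in this report, that each reason is wrapped in hyphens and contains no hyphen itself.

```r
# Toy stand-ins (hypothetical values) for sampleUse, whyExcluded, tcgaSampleRD
toySamples <- c("S1", "S2", "S3", "S4", "S5", "S6")
toyUse <- c("Used", "Unused", "Used", "Unused", "Unused", "Used")
toyWhy <- c("", "-normalTissue-", "", "-No RD Info--Grade 1-", "-Grade 1-", "")
toyRD  <- c("RD", "RD", "No RD", NA, "RD", "RD")
names(toyUse) <- toySamples
names(toyWhy) <- toySamples
names(toyRD)  <- toySamples

# Every unused sample should carry a reason, and every used sample none
stopifnot(all((toyUse == "Unused") == (toyWhy != "")))

# A sample excluded for several reasons carries them concatenated,
# so split the strings into individual "-reason-" pieces before counting
reasonList <- regmatches(toyWhy, gregexpr("-[^-]+-", toyWhy))
toyReasonCounts <- table(unlist(reasonList))
print(toyReasonCounts)

# RD status among the samples still in use
toyTally <- table(toyRD[toyUse == "Used"], useNA = "ifany")
print(toyTally)
```

Because reasons are appended rather than overwritten, the per-reason counts can sum to more than the number of excluded samples; the consistency check on the use flags is what ties the two together.
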

## 5 Building the Data Frame

Now we bundle the assembled information into a data frame for later use.

```{r buildDataFrame}
tcgaFilteredSamples <- data.frame(sampleUse=sampleUse,
                                  whyExcluded=whyExcluded,
                                  row.names=colnames(tcgaExpression))
```

## 6 Saving RData

Now we save the relevant information to an RData file.

```{r saveTcgaClinical}
tcgaSampleClinicalMapping <- sampleClinicalMapping
save(tcgaFilteredSamples, tcgaSampleRD, tcgaSampleClinicalMapping,
     file=file.path("RDataObjects","tcgaFilteredSamples.RData"))
```

## 7 Appendix

### 7.1 File Location

```{r getLocation}
getwd()
```

### 7.2 SessionInfo

```{r sessionInfo}
sessionInfo()
```