by Keith A. Baggerly
We want to identify genes whose expression shows a strong and similarly directed association with residual disease (RD) status in both the TCGA and Tothill datasets.
We load our previously assembled RData files for tcgaExpression, tcgaFilteredData, tothillExpression, tothillFilteredData, and tothillClinical.
We restrict our attention to probesets on both the TCGA and Tothill array platforms.
Then, using just the filtered sets of samples, we contrast RD and No RD samples within each dataset using two sample t-tests. For each probeset in each dataset, we identify the associated gene and record the mean expression level in the RD and No RD groups, the t-statistic, the raw p-value, and the false discovery rate (FDR) adjusted p-value.
We flag probsets that are significantly different in both TCGA and Tothill using (a) a 5% FDR cutoff and (b) a 10% FDR cutoff.
We plot the bivariate t-tests to look for structure, expression heatmaps for the selected probes to look for patterns, correlations between the probes chosen to look for coordinated behavior, and dot and density plots for individual probesets to identify other features. We write a convenience function to make generation of the dot and density plots easier.
There are 22277 probesets common to the two platforms.
We flag 8 probesets using a 5% FDR cutoff in both datasets, and 47 probesets using a 10% FDR cutoff.
The bivariate plot of the t-statistics found is shown in Figure 1; a zoomed version highlighting the probesets overexpressed in RD samples is shown in Figure 2.
The expression heatmaps for TCGA and Tothill using the 5% FDR cutoffs are in Figures 3 and 4, respectively. The expression heatmaps for TCGA and Tothill using the 10% FDR cutoffs are in Figures 5 and 6, respectively.
Heatmaps of the pairwise 10% FDR probe correlations for TCGA and Tothill are shown in Figures 7 and 8.
Dot and density plots for all 47 probesets passing the 10% FDR filter are saved as “plotsOfTop47Probesets.pdf” in the Reports folder.
Dot and density plots for 6 probesets, corresponding to LUM, DCN, GADD45B, FABP4, ADH1B, and ADIPOQ are shown in Figures 9, 10, 11, 12, 13, and 14, respectively.
We save tcgaCommonUsed, tothillCommonUsed, keyProbesets05pct, keyGenes05pct, keyProbesets10pct, keyGenes10pct, nTCGANoRD, nTCGARD, nTothillNoRD, nTothillRD, plotProbesetResults, and rdTTestResults to the RData file “rdFlaggedGenes.RData”.
In both sets of expression heatmaps, there is evidence of a molecularly distinct subset of patients (about a third) with a higher chance of having RD. Expression levels for most of the genes identified are consistently higher in these patients.
For LUM, DCN, and GADD45B, which represent the bulk of the probsets showing elevation, what we see is an overall mean shift (values are trending higher) without a clear division point (above here, something's changed). For FABP4, ADH1B, and (to a lesser extent) ADIPOQ, we see a qualitative shift in a smaller subset – values for most samples are very low (effectively “off”“), but values for a subset of patients are very high ("on”“).
A qualitative difference strikes us as more likely to survive a shift across assays than a mean offset, so we preferentially pursue FABP4 and ADH1B.
We first load the libraries we will use in this report.
library(affy)
library(hthgu133a.db)
library(gplots)
Here we simply load the previously assembled RData files.
clinical information and expression matrices, and skim the first line of the clinical information to see what variables exist for filtering the samples.
load(file.path("RDataObjects", "tcgaExpression.RData"))
load(file.path("RDataObjects", "tcgaFilteredSamples.RData"))
load(file.path("RDataObjects", "tothillExpression.RData"))
load(file.path("RDataObjects", "tothillFilteredSamples.RData"))
load(file.path("RDataObjects", "tothillClinical.RData"))
We only want to examine probesets evaluated in both datasets.
commonProbesets <- intersect(rownames(tcgaExpression), rownames(tothillExpression))
Given the common probesets, we next get matrices of data for RD and No RD measurements for both TCGA and Tothill.
We begin with TCGA
tcgaCommonRD <- tcgaExpression[commonProbesets, names(tcgaSampleRD)[which((tcgaSampleRD ==
"RD") & (tcgaFilteredSamples[, "sampleUse"] == "Used"))]]
dim(tcgaCommonRD)
## [1] 22277 378
tcgaCommonNoRD <- tcgaExpression[commonProbesets, names(tcgaSampleRD)[which((tcgaSampleRD ==
"No RD") & (tcgaFilteredSamples[, "sampleUse"] == "Used"))]]
dim(tcgaCommonNoRD)
## [1] 22277 113
Next, we repeat the process for Tothill.
tothillCommonRD <- tothillExpression[commonProbesets, names(tothillRD)[which((tothillRD ==
"RD") & (tothillFilteredSamples[, "sampleUse"] == "Used"))]]
dim(tothillCommonRD)
## [1] 22277 139
tothillCommonNoRD <- tothillExpression[commonProbesets, names(tothillRD)[which((tothillRD ==
"No RD") & (tothillFilteredSamples[, "sampleUse"] == "Used"))]]
dim(tothillCommonNoRD)
## [1] 22277 50
For later plots, it can be easier to rearrange things yet again.
tcgaCommonUsed <- cbind(tcgaCommonNoRD, tcgaCommonRD)
tothillCommonUsed <- cbind(tothillCommonNoRD, tothillCommonRD)
Our first comparisons involve simple two-sample t-tests. We perform these for TCGA first.
d1 <- date()
tcgaTVals <- rep(0, length(commonProbesets))
names(tcgaTVals) <- commonProbesets
tcgaPVals <- tcgaTVals
for (i1 in 1:length(commonProbesets)) {
tempT <- t.test(tcgaCommonRD[i1, ], tcgaCommonNoRD[i1, ], var.equal = TRUE)
tcgaTVals[i1] <- tempT[["statistic"]]
tcgaPVals[i1] <- tempT[["p.value"]]
}
d2 <- date()
c(d1, d2)
## [1] "Wed Nov 20 11:29:41 2013" "Wed Nov 20 11:29:50 2013"
tcgaPValsAdj <- p.adjust(tcgaPVals, method = "fdr")
names(tcgaPValsAdj) <- commonProbesets
Then we repeat the process with Tothill.
d1 <- date()
tothillTVals <- rep(0, length(commonProbesets))
names(tothillTVals) <- commonProbesets
tothillPVals <- tothillTVals
for (i1 in 1:length(commonProbesets)) {
tempT <- t.test(tothillCommonRD[i1, ], tothillCommonNoRD[i1, ], var.equal = TRUE)
tothillTVals[i1] <- tempT[["statistic"]]
tothillPVals[i1] <- tempT[["p.value"]]
}
d2 <- date()
c(d1, d2)
## [1] "Wed Nov 20 11:29:50 2013" "Wed Nov 20 11:29:59 2013"
tothillPValsAdj <- p.adjust(tothillPVals, method = "fdr")
names(tothillPValsAdj) <- commonProbesets
We now see which genes (if any) appear significant at an FDR of 5% in both datasets.
sum(tcgaPValsAdj < 0.05)
## [1] 149
sum(tothillPValsAdj < 0.05)
## [1] 81
sum((tothillPValsAdj < 0.05) & (tcgaPValsAdj < 0.05))
## [1] 8
sum((tothillPValsAdj < 0.1) & (tcgaPValsAdj < 0.1))
## [1] 47
keyProbesets05pct <- names(which((tothillPValsAdj < 0.05) & (tcgaPValsAdj <
0.05)))
keyProbesets10pct <- names(which((tothillPValsAdj < 0.1) & (tcgaPValsAdj < 0.1)))
keyGenes05pct <- unlist(mget(keyProbesets05pct, hthgu133aSYMBOL))
keyGenes10pct <- unlist(mget(keyProbesets10pct, hthgu133aSYMBOL))
keyGenes05pct
## 201744_s_at 203666_at 203980_at 207574_s_at 209335_at 209613_s_at
## "LUM" "CXCL12" "FABP4" "GADD45B" "DCN" "ADH1B"
## 209687_at 221541_at
## "CXCL12" "CRISPLD2"
There are 8 probesets flagged at a common FDR of 5% (listed above), and 47 probesets flagged at a common FDR of 10%.
Now we bundle our t-test results into a data frame for later reference, sorting the entries by mean fdr-adjusted p-value.
tcgaMeanRD <- apply(tcgaCommonRD, 1, mean)
tcgaMeanNoRD <- apply(tcgaCommonNoRD, 1, mean)
tothillMeanRD <- apply(tothillCommonRD, 1, mean)
tothillMeanNoRD <- apply(tothillCommonNoRD, 1, mean)
commonGeneSymbols <- unlist(mget(commonProbesets, hthgu133aSYMBOL))
rdTTestResults <- data.frame(row.names = rownames(tcgaCommonUsed), geneSymbol = commonGeneSymbols,
tcgaMeanRD = tcgaMeanRD, tcgaMeanNoRD = tcgaMeanNoRD, tcgaTVals = tcgaTVals,
tcgaPVals = tcgaPVals, tcgaPValsAdj = tcgaPValsAdj, tothillMeanRD = tothillMeanRD,
tothillMeanNoRD = tothillMeanNoRD, tothillTVals = tothillTVals, tothillPVals = tothillPVals,
tothillPValsAdj = tothillPValsAdj)
rdTTestResults <- rdTTestResults[order(tcgaPValsAdj + tothillPValsAdj), ]
As a check, we look at the results for the top 10 probesets by this ordering.
rdTTestResults[1:10, ]
## geneSymbol tcgaMeanRD tcgaMeanNoRD tcgaTVals tcgaPVals
## 201744_s_at LUM 9.057 8.084 4.992 8.318e-07
## 203666_at CXCL12 4.400 3.921 4.134 4.202e-05
## 203980_at FABP4 4.278 3.463 3.863 1.272e-04
## 209335_at DCN 7.376 6.746 3.853 1.324e-04
## 209613_s_at ADH1B 3.514 2.944 3.681 2.586e-04
## 221541_at CRISPLD2 6.017 5.444 3.858 1.295e-04
## 209612_s_at ADH1B 3.694 3.196 3.507 4.947e-04
## 207574_s_at GADD45B 6.795 6.392 3.741 2.049e-04
## 209687_at CXCL12 6.885 6.228 3.928 9.811e-05
## 211813_x_at DCN 8.602 8.003 3.525 4.635e-04
## tcgaPValsAdj tothillMeanRD tothillMeanNoRD tothillTVals
## 201744_s_at 0.002885 10.325 9.036 4.536
## 203666_at 0.020107 7.778 6.950 4.160
## 203980_at 0.034264 6.422 4.371 4.533
## 209335_at 0.034609 8.687 7.527 4.560
## 209613_s_at 0.044321 4.912 3.091 5.045
## 221541_at 0.034340 7.523 6.433 4.344
## 209612_s_at 0.057726 5.556 3.781 4.968
## 207574_s_at 0.039855 8.276 7.580 4.155
## 209687_at 0.028758 8.178 7.162 3.986
## 211813_x_at 0.056505 10.921 9.857 4.509
## tothillPVals tothillPValsAdj
## 201744_s_at 1.024e-05 0.010629
## 203666_at 4.842e-05 0.022975
## 203980_at 1.036e-05 0.010629
## 209335_at 9.243e-06 0.010629
## 209613_s_at 1.067e-06 0.002970
## 221541_at 2.289e-05 0.016447
## 209612_s_at 1.521e-06 0.003765
## 207574_s_at 4.937e-05 0.022975
## 209687_at 9.644e-05 0.035837
## 211813_x_at 1.145e-05 0.010629
Given the contrast results, we now plot the data in several ways to see if this clarifies aspects of the structure.
Our first check involves plotting the TCGA and Tothill t-statistics against each other, to see if there are clear outliers or disagreement with respect to sign. The initial plot, Figure 1, shows the vast majority of the probesets selected are more strongly expressed in RD samples. A zoom on the upper quadrant of this plot is shown in Figure 2.
We want to see if the most clearly chosen probesets are flagging the same samples, and how clearly they divide RD from No RD cases. We check this first for the 8 probesets passing the 5% FDR filters. The TCGA heatmap is shown in Figure 3, and the Tothill heatmap is shown in Figure 4. The general story is the same in both; we see a clear cluster in which most of the probesets are concurrently elevated, and the patients in these clusters have much higher rates of RD. Of the 8 probesets examined, those for FABP4 and ADH1B stand out as telling the story most starkly, though the enrichment rates may not be much higher. We do not see a tight grouping of the No RD cases, but this is to be expected since some cases of RD will not be driven at the molecular level but rather by spatial positioning in the abdomen.
Having examined the 5% FDR probesets, we now expand our view to encompass probesets passing a 10% FDR filter in both TCGA and Tothill. The TCGA heatmap is shown in Figure 5, and the Tothill heatmap is shown in Figure 6. One factor that becomes more apparent here is the broadly parallel pattern of overexpression seen for most of the probesets (FABP4 and ADH1B again stand out). This suggests there may be a common driver for many of them; possibly a “pathway” of some type.
Given the broad parallelism of expression seen in the heatmaps of probesets passing the 10% FDR filters, we want to check the correlation patterns between these probes. The TCGA heatmap is shown in Figure 7, and the Tothill heatmap is shown in Figure 8.
We know some of the probesets are of interest. Now we look at aspects of behavior of the individual probesets, specifically dotplots and density plots for each probeset by dataset.
First, we construct a generic function for plotting these results for a given probeset.
plotProbesetResults <- function(probesetID) {
par(mfrow = c(2, 2))
geneName <- unlist(mget(probesetID, hthgu133aSYMBOL))
plot(tcgaCommonUsed[probesetID, ], col = c(rep("blue", nTCGANoRD), rep("red",
nTCGARD)), xlab = "TCGA Samples", ylab = "Expression", main = paste("Expression of",
geneName, "in TCGA"))
abline(v = nTCGANoRD + 0.5)
plot(tothillCommonUsed[probesetID, ], col = c(rep("blue", nTothillNoRD),
rep("red", nTothillRD)), xlab = "Tothill Samples", ylab = "Expression",
main = paste("Expression of", geneName, "in Tothill"))
abline(v = nTothillNoRD + 0.5)
tempDensTCGA <- density(tcgaCommonUsed[probesetID, ])
tempDensTCGANoRD <- density(tcgaCommonNoRD[probesetID, ])
tempDensTCGARD <- density(tcgaCommonRD[probesetID, ])
plot(tempDensTCGA[["x"]], tempDensTCGA[["y"]], xlab = paste("Expression of",
probesetID, "in TCGA"), ylab = "Density", type = "l", main = paste("Density of",
probesetID, "in TCGA"))
lines(tempDensTCGANoRD[["x"]], (nTCGANoRD/(nTCGANoRD + nTCGARD)) * tempDensTCGANoRD[["y"]],
col = "blue")
lines(tempDensTCGARD[["x"]], (nTCGARD/(nTCGANoRD + nTCGARD)) * tempDensTCGARD[["y"]],
col = "red")
tempDensTothill <- density(tothillCommonUsed[probesetID, ])
tempDensTothillNoRD <- density(tothillCommonNoRD[probesetID, ])
tempDensTothillRD <- density(tothillCommonRD[probesetID, ])
plot(tempDensTothill[["x"]], tempDensTothill[["y"]], xlab = paste("Expression of",
probesetID, "in Tothill"), ylab = "Density", type = "l", main = paste("Density of",
probesetID, "in Tothill"))
lines(tempDensTothillNoRD[["x"]], (nTothillNoRD/(nTothillNoRD + nTothillRD)) *
tempDensTothillNoRD[["y"]], col = "blue")
lines(tempDensTothillRD[["x"]], (nTothillRD/(nTothillNoRD + nTothillRD)) *
tempDensTothillRD[["y"]], col = "red")
par(mfrow = c(1, 1))
}
For reference, we produce a pdf file containing the results for all 47 probesets in our top list.
pdf(file = file.path("Reports", "plotsOfTop47Probesets.pdf"))
for (i1 in 1:length(keyProbesets10pct)) {
plotProbesetResults(keyProbesets10pct[i1])
}
dev.off()
## pdf
## 2
We include results for a few selected genes here: LUM (201744_s_at), Figure 9, DCN (211896_s_at), Figure 10, GADD45B (207574_s_at), Figure 11, FABP4 (203980_at), Figure 12, ADH1B (209613_s_at), Figure 13, and ADIPOQ (207175_at), Figure 14. For some of the genes (ADH1B, DCN), mutliple probesets are available but the results appear qualitatively similar to the representative ones chosen.
For LUM, DCN, and GADD45B, which represent the bulk of the probsets showing elevation, what we see is an overall mean shift (values are trending higher) without a clear division point (above here, something's changed). For FABP4, ADH1B, and (to a lesser extent) ADIPOQ, we see a qualitative shift – values for most samples are very low (effectively “off”“), but values for a subset of patients are very high ("on”“). This type of qualitative difference strikes us as more likely to survive a shift across assays than a mean offset, so we will preferentially pursue FABP4 and ADH1B.
Now we save the relevant information to an RData object.
save(tcgaCommonUsed, tothillCommonUsed, keyProbesets05pct, keyGenes05pct, keyProbesets10pct,
keyGenes10pct, nTCGANoRD, nTCGARD, nTothillNoRD, nTothillRD, plotProbesetResults,
rdTTestResults, file = file.path("RDataObjects", "rdFlaggedGenes.RData"))
getwd()
## [1] "/Users/slt/SLT WORKSPACE/EXEMPT/OVARIAN/Ovarian residual disease study 2012/RD manuscript/Web page for paper/Webpage"
sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gplots_2.12.1 hthgu133a.db_2.9.0 org.Hs.eg.db_2.9.0
## [4] RSQLite_0.11.4 DBI_0.2-7 AnnotationDbi_1.22.6
## [7] affy_1.38.1 Biobase_2.20.1 BiocGenerics_0.6.0
## [10] knitr_1.5
##
## loaded via a namespace (and not attached):
## [1] affyio_1.28.0 BiocInstaller_1.10.4 bitops_1.0-6
## [4] caTools_1.14 evaluate_0.5.1 formatR_0.9
## [7] gdata_2.13.2 gtools_3.1.0 IRanges_1.18.4
## [10] KernSmooth_2.23-10 preprocessCore_1.22.0 stats4_3.0.2
## [13] stringr_0.6.2 tools_3.0.2 zlibbioc_1.6.0