\documentclass[11pt]{article} \usepackage{graphicx} \usepackage{cite} \usepackage{hyperref} \pagestyle{myheadings} \markright{microfluidicsModels04; revised: 1 Sept 2010} \setlength{\topmargin}{0in} \setlength{\textheight}{8in} \setlength{\textwidth}{6.5in} \setlength{\oddsidemargin}{0in} \setlength{\evensidemargin}{0in} \def\rcode#1{\texttt{#1}} \def\fref#1{Figure~\ref{#1}} \def\tref#1{Table~\ref{#1}} \title{Feature Selection: Gene Markers of Prognosis in CLL} \author{Kevin R. Coombes} \date{12 July 2010; REVISED: 1 Sept 2010} \SweaveOpts{prefix.string=Figures/microfluidicsModels04,eps=FALSE} <>= options(width=92) options(SweaveHooks = list(fig = function() par(bg='white')),eps=FALSE) @ <>= if (!file.exists("Figures")) { dir.create("Figures") } @ \begin{document} \maketitle \tableofcontents \section{Executive Summary} \subsection{Introduction} The main goal of the study is to identify changes in gene expression in CLL that are associated with clinical outcome (including overall survival and time-to-treatment). \subsubsection{Aims/Objectives} We want to find individual genes and combinations of genes that are related to prognosis in CLL. Here ``prognosis'' refers to the ability to predict time-to-treatment or overall survival. The current analysis look \textbf{only at the training set}. \subsection{(Statistical) Methods} Exploratory data analyses and summaries. Cox proportional hazards models. Stepwise selection of predictive variables using the Akaike Information Criterion. Bayesian model averaging. \subsection{Results} \begin{itemize} \item Time-to-treatment \begin{itemize} \item We found (at least) fifteen genes that were significant in univariate models to predict time-to-treatment (\tref{top15.sig}) and worked well under cross-validation (\tref{tbl:sig.pred}) or as part of multivariate models (\fref{sig.pred}). \end{itemize} \item Overall survival \begin{itemize} \item We found about five to ten genes that were significant in univariate models to predict overall survival (\tref{top15.os}) and worked well under cross-validation (\tref{tbl:os.pred}) or as part of multivariate models (\fref{os.pred}). \end{itemize} \item Time-to-treatment, including clinical variables \begin{itemize} \item It is not clear that any \textbf{individual} genes add to our ability to predict time-to-treatment after accounting for known clinical factors. However, some of the individual factrors above are at least bordeline significant in the training set. \end{itemize} \item Overall survival, including clinical variables \begin{itemize} \item The results are similar for trying to predict overall survival after accounting for clinical factors. \end{itemize} \item When these genes worked, they could be used as binary or as continuous predictors, and they worked whether we started at diganosis or at sample collection. \end{itemize} \subsection{Conclusion} The most promising strategy appears to be: \begin{enumerate} \item Focus on time-to-treatment as the primary outcome. Specifically, build a small number (probably about three) of different multivariate models to predict time-to-treatment while ignoring the clinical variables. \item Compute composite scores from each of the multivariate models. Test whether the composite scores are useful to predict time-to-treatment or overall survival in the presence of the known clinical factors in the training set. \item Validate the (univariate) ability of individual genes to predict time-to-treatment in the Card~C validation set. \item Validate the multivariate models on Card~C. \item Test whether the composite scores are useful to predict time-to-treatment or overall survival in the presence of the known clinical factors in the Card~C validation set. \end{enumerate} \section{Loading the Data} In the previous report (\rcode{microfluidicsModels03.pdf}), we stored all of the relevant data in a binary R file. We begin by loading that data. <>= load("micro03.rda") ls() @ \subsection{Preliminaries} As mentioned previously, we want to perform a variety of time-to-event (``survival'') analyses, with different starting and ending points. The basic code for these analyses is in the following package. <>= library(survival) @ We want to perform univariate analyses (one gene at a time) using the microfluidics QRT-PCR data as a continuous predictor of the outcome. The \rcode{continuous.univariate} function will perform these analyses and record the results. <>= continuous.univariate <- function(timecol, statcol, dset=cardAB.data, clindata=cardAB.clinical, base.model=NULL) { GeneUnivariate <- matrix(NA, ncol(dset), 5) colnames(GeneUnivariate) <- c("coef", "HR", "SE", "Z", "P") rownames(GeneUnivariate) <- colnames(dset) for (i in 1:ncol(dset)) { mygene <- colnames(dset)[i] dataset <- data.frame(clindata[, c(timecol, statcol)], Y=dset[,i]) colnames(dataset)[3] <- mygene myform <- formula(paste("Surv(", timecol, ", ", statcol, ") ~", mygene)) if(!is.null(base.model)) { dataset <- data.frame(dataset, clindata[,attr(terms(base.model), "term.labels")]) myform <- update(formula(base.model), paste("~ . + ", mygene)) } model <- coxph(myform, data=dataset) GeneUnivariate[i,] <- summary(model)$coef[mygene,] } GeneUnivariate } @ We also want to convert each continuous gene expression variable into a dichotomous variable. (Note that in the vast majority of cases, we split at the median. However, if there is a large natural gap in the values, as detected by the k-means algorithm, then we split at the gap instead of the median.) The \rcode{hilo} and \rcode{binary.univariate} functions will perform this analysis. Details on the algorithms and methods are provided at the end of this report. <>= hilo <- function(x) { y <- kmeans(x, centers=2)$cluster m1 <- median(x[y==1]) m2 <- median(x[y==2]) if (m1 < m2) y <- 3-y gap <- min(x[y==1]) - max(x[y==2]) if (gap > sd(x)/2) { # use kmeans z <- y } else { z <- 1 + as.numeric(x < median(x)) } factor(ifelse(z==2, "LowCycle", "HighCycle"), levels=c("HighCycle", "LowCycle")) } @ <>= binary.univariate <- function(timecol, statcol, dset=cardAB.data, clindata=cardAB.clinical, base.model=NULL) { GeneUnivariate <- matrix(NA, ncol(dset), 5) colnames(GeneUnivariate) <- c("coef", "HR", "SE", "Z", "P") rownames(GeneUnivariate) <- colnames(dset) for (i in 1:ncol(dset)) { mygene <- colnames(dset)[i] dataset <- data.frame(clindata[, c(timecol, statcol)], Y=hilo(dset[,i])) colnames(dataset)[3] <- mygene myform <- formula(paste("Surv(", timecol, ", ", statcol, ") ~", mygene)) if(!is.null(base.model)) { dataset <- data.frame(dataset, clindata[,attr(terms(base.model), "term.labels")]) myform <- update(formula(base.model), paste("~ . + ", mygene)) } model <- coxph(myform, data=dataset) GeneUnivariate[i,] <- summary(model)$coef[paste(mygene, "LowCycle", sep=''),] } GeneUnivariate } @ \section{Time to treatment} In order to find robust prognostic markers of time to treatment, we perform four different analyses on the training data from the \Sexpr{nrow(cardAB.data)} patients assayed on Cards~A and~B. Specifically, we use every combination of two starting points: \begin{itemize} \item Diagnosis \item Sample Collection \end{itemize} one end point \begin{itemize} \item First Significant Treatment \end{itemize} and two ways to incorporate gene expression: \begin{itemize} \item Continuous \item Binary \end{itemize} Because the computations are time-consuming, we apply these methods once and cache the results, allowing us to concentrate on the structure of the report. A list of the top fifteen genes with the most significant average $p$-values (over the four methods used to predict time-to-treatment) is contained in \tref{top15.sig}. <>= UC.dx2sig <- continuous.univariate("TimeDiagnosis2SigTreat", "NumericSigTreatment", dataset, clindata, dx.model) UB.dx2sig <- binary.univariate("TimeDiagnosis2SigTreat", "NumericSigTreatment", dataset, clindata, dx.model) UC.sample2sig <- continuous.univariate("TimeSample2SigTreat", "NumericSigTreatment", dataset, clindata, sam.model) UB.sample2sig <- binary.univariate("TimeSample2SigTreat", "NumericSigTreatment", dataset,clindata, sam.model) @ <>= fourp <- data.frame(b.samsig=UB.sample2sig[,"P"], c.samsig=UC.sample2sig[,"P"], b.dxsig=UB.dx2sig[,"P"], c.dxsig=UC.dx2sig[,"P"]) @ <>= geo.mean <- apply(fourp, 1, function(x) exp(mean(log(x)))) @ <>= rg <- apply(dataset, 2, function(x) diff(range(x))) sigma <- apply(dataset, 2, sd) fourp <- data.frame(fourp, geo.mean.p=geo.mean, rg, sigma) @ <>= fourFoldWay <- function(dataset, clindata, dx.model=NULL, sam.model=NULL) { <> <> <> <> fourp } @ <>= f <- "fourp.rda" if (file.exists(f)) { load(f) } else { fourp <- fourFoldWay(cardAB.data, cardAB.clinical) fourp <- data.frame(fourp, onC=as.numeric(onC)) save(fourp, file=f) } rm(f) @ <>= library(xtable) xtable(round(fourp[order(fourp$geo.mean.p),][1:15,], 4), digits=c(0, rep(4, 5), 1, 3, 0), align="|l|rrrrrrrr|", caption="Highest ranked genes for univariate ability to predict time-to-treatment.", label="top15.sig") @ \subsection{Cross-validation of feature selection} To get an even more robust measure of which genes are important, we repeat this analysis with various subsets of the training data. Specifically, we repeatedly select $50$ out of the $67$ patients who were assayed on Cards~A and~B. We fit the same four Cox proportional hazards models and compute the geometric mean of the $p$-values. Selecting the genes whose geometric mean $p < 0.03$ (and always requiring at least two genes), we use the Akaike Information Criterion (AIC) to construct an optimal prognostic model on the set of $50$ patients. We record the percentage of times that each gene passes the first filter (on the geometric mean $p$) and is retained in the optimal model by AIC. All of these computations are performed inside the \rcode{myLoop} function, which will allow us to reuse the same code later. Since these computations are more time-consuming than the previous ones, we also cache the results while writing the report. <>= myLoop <- function(dxcol, samcol, statcol, nLoop=300, dx.model=NULL, sam.model=NULL) { dagger <- as.data.frame(matrix(0, nrow=nLoop, ncol=ncol(cardAB.data))) colnames(dagger) <- colnames(cardAB.data) keeper <- caller <- dagger for (i in 1:nLoop) { selector <- sample(nrow(cardAB.data), 50) dataset <- cardAB.data[selector,] clindata <- cardAB.clinical[selector,] c.dx2sig <- continuous.univariate(dxcol, statcol, dataset, clindata, dx.model) b.dx2sig <- binary.univariate(dxcol, statcol, dataset, clindata, dx.model) c.sample2sig <- continuous.univariate(samcol, statcol, dataset, clindata, sam.model) b.sample2sig <- binary.univariate(samcol, statcol, dataset, clindata, sam.model) fourple <- data.frame(b.samsig=b.sample2sig[,"P"], c.samsig=c.sample2sig[,"P"], b.dxsig=b.dx2sig[,"P"], c.dxsig=c.dx2sig[,"P"]) geo.mean <- apply(fourple, 1, function(x) exp(mean(log(x)))) topper <- names(geo.mean)[geo.mean < 0.03 & !is.na(geo.mean)] if (length(topper)<2) { topper <- names(geo.mean)[order(geo.mean)][1:2] } caller[i, topper] <- 1 myf <- formula(paste("Surv(", samcol, ",", statcol, ") ~", paste(topper ,collapse="+"))) mod0 <- coxph(myf, data.frame(dataset, clindata)) mod1 <- step(mod0, trace=0) keeper[i,names(coef(mod1))] <- 1 myf <- formula(paste("Surv(", dxcol, ",", statcol, ") ~", paste(topper ,collapse="+"))) mod0 <- coxph(myf, data.frame(dataset, clindata)) mod1 <- step(mod0, trace=0) dagger[i,names(coef(mod1))] <- 1 } list(nLoop=nLoop, dxAIC=dagger, samAIC=keeper, firstPass=caller) } @ <>= loop2df <- function(loop) { first <- apply(loop$firstPass, 2, mean) dx <- apply(loop$dxAIC, 2, mean) sam <- apply(loop$samAIC, 2, mean) data.frame(FirstPass=first, dxAIC=dx, samAIC=sam) } @ <>= f <- "loopSig.rda" if (file.exists(f)) { load(f) } else { loopSig <- myLoop("TimeDiagnosis2SigTreat", "TimeSample2SigTreat", "NumericSigTreatment", 300) save(loopSig, file=f) } rm(f) temp <- loop2df(loopSig) sig.pred <- data.frame(fourp, temp) @ The fifteen genes that are selected most often by the univariate models during cross-validation are shown in \tref{tbl:sig.pred}. While the order of the genes has changed slightly from \tref{top15.sig}, we get exactly the same set of top fifteen genes. We also note that 13 out of 15 were chosen for inclusion on Card~C (i.e., to be measured on the validation dataset). In \fref{sig.pred}, we provide barplots of the number of times during cross-validation that a gene was significant in a univariate model (top panel) or was retained in a multivariate model (bottom panel). (note that the barplots only show genes that are present on both Cards~AB and Card~C. Colors in the bottom panel indicate whether the multivariate model was predicting time-to-treatment from diagnosis (purple) or from sample collection (orange). <>= xtable(round(sig.pred[rev(order(sig.pred$FirstPass)),8:11][1:15,], digits=3), digits=c(0,0, 3, 3, 3), align="|l|rrrr|", label="tbl:sig.pred", caption="Highest ranked genes by cross-validation for predicting time-to-treatment.") @ <>= par(mfrow=c(2,1)) temp <- sig.pred[fourp$onC==1,] pts <- barplot(300*t(temp[, c("FirstPass")]), names.arg=rep('',nrow(temp))) mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5) pts <- barplot(300*t(temp[, c("dxAIC", "samAIC")]), col=colorcode, names.arg=rep('',nrow(temp)), legend.text=c("From Diagnosis", "From Sample")) mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5) par(mfrow=c(1,1)) @ <>= png(file="SKI-SLAM-paper/mfc-figure1ab.png", width=800, height=1200, bg="white", pointsize=20) par(bg="white", cex=1.5) colorcode <- c("black", "gray") <> dev.off() pdf(file="SKI-SLAM-paper/mfc-figure1ab.pdf", width=8, height=12, bg="white", pointsize=16) par(bg="white") colorcode <- c("black", "gray") <> dev.off() @ \begin{figure} <>= colorcode <- c("purple", "orange") <> @ \caption{\textbf{(Top)} Number of times, during cross-validation, that a marker is significant in a univariate model to predict time-to-treatment. \textbf{(Bottom)} Number of times a marker is retained in a multivariate model to predict time-to-treatment.} \label{sig.pred} \end{figure} \section{Overall Survival} We now want to repeat the same analysis for the ``overall survival'' outcome in place of ``time-to-treatment''. The top fifteen genes from this analysis are listed in \tref{top15.os}. When we compare this result to \tref{top15.sig}, we see that the $p$-values are much less significant. This almost has to be the case, since the number of events is much smaller in the overall survival analysis (\Sexpr{sum(cardAB.clinical[,"VitalStatus"]=="Dead")} deaths) than in the time-to-treatment analysis (\Sexpr{sum(cardAB.clinical[,"SigTreatment"]=="Treated")} treated patients). Interestingly, we also see that only two genes (CD14 and EGR3) make it into the top fifteen for both the time-to-treatment analysis and the overall survival analysis. Finally, we note that only seven of the top fifteen geens were among the ones we selected for Card~C in our earlier analysis. <>= UC.dx2sig <- continuous.univariate("OSAfterDiagnosis", "NumericVitalStatus", dataset, clindata, dx.model) UB.dx2sig <- binary.univariate("OSAfterDiagnosis", "NumericVitalStatus", dataset, clindata, dx.model) UC.sample2sig <- continuous.univariate("OSAfterSample", "NumericVitalStatus", dataset, clindata, sam.model) UB.sample2sig <- binary.univariate("OSAfterSample", "NumericVitalStatus", dataset,clindata, sam.model) @ <>= fourFoldWayOS <- function(dataset, clindata, dx.model=NULL, sam.model=NULL) { <> <> <> <> fourp } @ <>= f <- "oster.rda" if (file.exists(f)) { load(f) } else { oster <- fourFoldWayOS(cardAB.data, cardAB.clinical) oster <- data.frame(oster, onC=as.numeric(onC)) save(oster, file=f) } rm(f) @ <>= xtable(round(oster[order(oster$geo.mean.p),][1:15,], 4), digits=c(0, rep(4, 5), 1, 3, 0), align="|l|rrrrrrrr|", caption="Highest ranked genes for univariate ability to predict overall survival.", label="top15.os") @ As before, we cross-validate the feature and model selection step. The top fifteen cross-validated genes are listed in \tref{tbl:os.pred}. Unlike the case of time-to-treatment, two genes from the initial analysis (SLC2A5 and ZBTB20) fell out of the top fifteen, to be replaced by NPM1 and CCL5. We have also prepared box plots (\fref{os.pred}) to show the number of times that genes were significant in univariate models or retained in multivariate models to predict overall survival. Overall, it is unclear if anything other than the top five to seven genes (at most) are really meaningful. <>= set.seed(394267) f <- "loopOS.rda" if (file.exists(f)) { load(f) } else { loopOS <- myLoop("OSAfterDiagnosis", "OSAfterSample", "NumericVitalStatus", 300) save(loopOS, file=f) } rm(f) @ <>= temp <- loop2df(loopOS) os.pred <- data.frame(oster, temp) @ <>= xtable(round(os.pred[rev(order(os.pred$FirstPass)),8:11][1:15,], digits=3), digits=c(0,0, 3, 3, 3), align="|l|rrrr|", label="tbl:os.pred", caption="Highest ranked genes by cross-validation for predicting overall survival.") @ \begin{figure} <>= par(mfrow=c(2,1)) temp <- os.pred[oster$onC==1,] pts <- barplot(300*t(temp[, c("FirstPass")]), names.arg=rep('',nrow(temp))) mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5) pts <- barplot(300*t(temp[, c("dxAIC", "samAIC")]), col=c("purple", "orange"), names.arg=rep('',nrow(temp)), legend.text=c("From Diagnosis", "From Sample"), args.legend=list(x="topleft")) mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5) par(mfrow=c(1,1)) @ \caption{\textbf{(Top)} Number of times, during cross-validation, that a marker is significant in a univariate model to predict overall survival. \textbf{(Bottom)} Number of times a marker is retained in a multivariate model to predict overall survival.} \label{os.pred} \end{figure} \section{Adding to Clinical Predictors} All of the analyses to this point have been performed ignoring the large number of clinical covariates that we already have available to predict the prognosis of CLL patients. However, in a previous report (\rcode{microfluidicsModels02.pdf}), we developed and validated the best clinical predictors for both time-to-treatment and overall survival. In this section, we will repeat the previous analysis in order to find out which genes are most useful in adding predictive power to the clinical variables we already know. \subsection{Time to treatment} We start by seeing which individual genes can add to our ability to predict time-to-treatment. The top fifteen genes are listed in \tref{top15.sig.add}. This list has very little overlap with the genes selected above (while ignoring the clinical variables). More importantly, the average $p$-values clearly indicate that these individual genes do not seem to add very much to the existing clinical variables. (Only one $p$-value is less than $0.05$, and that is without accounting for multiple testing.) We also note that ECE1 is listed twice---because it was one of the housekeeping genes included on both Card~A and Card~B. This observation strengthens our conclusion that these genes are not really significant. <>= f <- "fourp-add.rda" if (file.exists(f)) { load(f) } else { fourp.add <- fourFoldWay(cardAB.data, cardAB.clinical, dx2sig.model, sam2sig.model) fourp.add <- data.frame(fourp.add, onC=as.numeric(onC)) save(fourp.add, file=f) } rm(f) @ <>= xtable(round(fourp.add[order(fourp.add$geo.mean.p),][1:15,], 4), digits=c(0, rep(4, 5), 1, 3, 0), align="|l|rrrrrrrr|", caption=paste("Highest ranked genes for univariate ability", "to predict time-to-treatment after", "accounting for clinical variables."), label="top15.sig.add") @ Next, we cross-validate the feature and model selection process for predicting time-to-treatment when we incorporate the clinical variables. As usual, we list the top fifteen genes in \tref{tbl:sig.clin.pred}. While this list mostly matches the previous one, it still is not very impressive. The corresponding barplot (\fref{sig.clin.pred}) does not change our conclusions. <>= set.seed(192604) f <- "loopSigClin.rda" if(file.exists(f)) { load(f) } else { loopSigClin <- myLoop("TimeDiagnosis2SigTreat", "TimeSample2SigTreat", "NumericSigTreatment", 300, dx2sig.model, sam2sig.model) save(loopSigClin, file=f) } rm(f) @ <>= temp <- loop2df(loopSigClin) sig.clin.pred <- data.frame(fourp.add, temp) @ <>= xtable(round(sig.clin.pred[rev(order(sig.clin.pred$FirstPass)),8:11][1:15,], digits=3), digits=c(0,0, 3, 3, 3), align="|l|rrrr|", label="tbl:sig.clin.pred", caption=paste("Highest ranked genes by cross-validation", "for predicting time-to-treatment after", "accounting for clinical variables.")) @ \begin{figure} <>= par(mfrow=c(2,1)) temp <- sig.clin.pred[fourp$onC==1,] pts <- barplot(300*t(temp[, c("FirstPass")]), names.arg=rep('',nrow(temp))) mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5) pts <- barplot(300*t(temp[, c("dxAIC", "samAIC")]), col=c("purple", "orange"), names.arg=rep('',nrow(temp)), legend.text=c("From Diagnosis", "From Sample"), args.legend=list(x="topleft")) mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5) par(mfrow=c(1,1)) @ \caption{\textbf{(Top)} Number of times, during cross-validation, that a marker is significant in a univariate model to predict time-to-treatment when added to the best clinical model. \textbf{(Bottom)} Number of times a marker is retained in a multivariate model to predict time-to-treatment.} \label{sig.clin.pred} \end{figure} \subsection{Overall survival} In the final set of analyses, we look at our ability to predict ovreall survival from gene expression when added to the existing clinical model. \tref{top15.os.add} shows the top fifteen genes; only two of them have $p$-values =smaller than $0.05$, and only three of them were previously selected for inclusion on Card~C. <>= f <- "oster-add.rda" if (file.exists(f)) { load(f) } else { oster.add <- fourFoldWayOS(cardAB.data, cardAB.clinical, dx2os.model, sam2os.model) oster.add <- data.frame(oster.add, onC=as.numeric(onC)) save(oster.add, file=f) } rm(f) @ <>= xtable(round(oster.add[order(oster.add$geo.mean.p),][1:15,], 4), digits=c(0, rep(4, 5), 1, 3, 0), align="|l|rrrrrrrr|", caption=paste("Highest ranked genes for univariate ability", "to predict overall survival after", "accounting for clinical variables."), label="top15.os.add") @ For the sake of completeness, we cross-validate the feature abnd model selection process. The top fifteen genes are listed in \tref{tbl:os.clin.pred}; only ten of the fifteen genes from \tref{top15.os.add} are included in the cross-validated list. Interestingly, however, the five replacement genes are more likely to come from the set of genes selected for inclusion on Card~C. The barplot (\fref{os.clin.pred}) suggests that at most five of the genes that were also assayed on Card~C can enhance our ability to predict overall survivial in the presence of known clinical factors. <>= set.seed(564316) f <- "loopOSClin.rda" if (file.exists(f)) { load(f) } else { loopOSClin <- myLoop("OSAfterDiagnosis", "OSAfterSample", "NumericVitalStatus", 300, dx2os.model, sam2os.model) save(loopOSClin, file=f) } rm(f) @ <>= temp <- loop2df(loopOSClin) os.clin.pred <- data.frame(oster.add, temp) @ <>= xtable(round(os.clin.pred[rev(order(os.clin.pred$FirstPass)),8:11][1:15,], digits=3), digits=c(0,0, 3, 3, 3), align="|l|rrrr|", label="tbl:os.clin.pred", caption=paste("Highest ranked genes by cross-validation", "for predicting overall survival after", "accounting for clinical variables.")) @ \begin{figure} <>= par(mfrow=c(2,1)) temp <- os.clin.pred[oster$onC==1,] pts <- barplot(300*t(temp[, c("FirstPass")]), names.arg=rep('',nrow(temp))) mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5) pts <- barplot(300*t(temp[, c("dxAIC", "samAIC")]), col=c("purple", "orange"), names.arg=rep('',nrow(temp)), legend.text=c("From Diagnosis", "From Sample"), args.legend=list(x="topleft")) mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5) par(mfrow=c(1,1)) @ \caption{\textbf{(Top)} Number of times, during cross-validation, that a marker is significant in a univariate model to predict overall survival when added to te best clinical model. \textbf{(Bottom)} Number of times a marker is retained in a multivariate model to predict overall survival.} \label{os.clin.pred} \end{figure} \section{Overall Predictive Ability} \label{sec:overall} In this section, we want to pool the data from all four of the analyses (predicting time-to-treatment or overall survival, both with or without including clinical variables) that we have performed so far. We create separate objects to count the number of times genes wer significant in univariate models and the number of times they were retained in multivariate models. We then create barplot displaying this data (\fref{overall}). We make two observations about these plots. First, there is little difference between being selected as significant in a univariate model or surviving in a multivariate model, so we can probably perform feature selection just oin the cross-validated univariate results. Second, the results from predicting time-to-treatment without the clinical variables (red part of the bars in the top panel) dominate the selection procedure. There are only three genes (WSB@, TNFRSF8, CRY1) that we might add to the ``useful'' list based solely on their ability to predict overall survival instead of time-to-treatment. <>= fp <- data.frame(Treat=sig.pred$FirstPass, OS=os.pred$FirstPass, ClinTreat=sig.clin.pred$FirstPass, OSTreat=os.clin.pred$FirstPass) rownames(fp) <- rownames(sig.pred) aic <- data.frame(sig.pred[,10:11,], os.pred[,10:11,], sig.clin.pred[,10:11,], os.clin.pred[,10:11,]) rownames(aic) <- rownames(sig.pred) @ \begin{figure} <>= par(mfrow=c(2,1)) temp <- fp[oster$onC==1,] #temp <- fp pts <- barplot(300*t(temp), col=c("red","blue","pink", "skyblue"), names.arg=rep('',nrow(temp))) mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5) temp <- aic[oster$onC==1,] pts <- barplot(300*t(temp), col=c("purple", "orange", "blue", "red", "#ff00ff", "#ffaa77", "skyblue", "pink"), names.arg=rep('',nrow(temp)), legend.text=c("Dx, treat", "Sample, treat", "Dx OS", "Sample OS"), args.legend=list(x="topleft", cex=0.5)) mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5) par(mfrow=c(1,1)) @ \caption{\textbf{(Top)} Number of times, during cross-validation, that a marker is significant in a univariate model to predict time-to-treatment or overall survival. \textbf{(Bottom)} Number of times a marker is retained in a multivariate model.} \label{overall} \end{figure} \section{Algorithm Details} In this section, we collect details about the algorithms used in this analysis. \subsection{Continuous Univariate Models} We want to perform univariate analyses (one gene at a time) using the microfluidics QRT-PCR data as a continuous predictor of the outcome. The following function will perform these analyses and record the results. <>= <> @ \subsection{Dichotomous Univariate Models} We also want to convert each continuous gene expression variable into a dichotomous variable. In the vast majority of cases, we split at the median. However, if there is a large natural gap in the values, as detected by the k-means algorithm, then we split at the gap instead of the median. Dichotomization is performed by this function: <>= <> @ Fitting the time-to-event model using the binary predictor is then performed by this function: <>= <> @ \subsection{Four Analyses of Time to Treatment} In order to find robust prognostic markers of time to treatment, we perform four different analyses on the training data. This block of code explains how we \rcode{apply.four.methods}. <>= <> @ Next, in order to \rcode{pool.p.values}, we assemble the four $p$-values from these four different analyses into a single data frame. <>= <> @ We take the geometric mean of the four different p-values computed for each gene. This scale appears to be the natural way to combine $p$-values, in light of Fisher's computation showing that summing the logs of independent $p$-values yields (with appropriate weights) a chi-squared statistic. <>= geo.mean <- apply(fourp, 1, function(x) exp(mean(log(x)))) @ We also \rcode{compute.other.statistics} (specifically, range and standard deviation) for each gene in datasets~A and~B, and combine everything into a data frame. <>= rg <- apply(dataset, 2, function(x) diff(range(x))) sigma <- apply(dataset, 2, sd) fourp <- data.frame(fourp, geo.mean.p=geo.mean, rg, sigma) @ To make it simpler to apply this algorithm to other datasets, we turn it into a function: <>= fourFoldWay <- function(dataset, clindata, dx.model=NULL, sam.model=NULL) { <> <> <> <> fourp } @ \subsection{Cross-Validation of Feature and Model Selection} In order to cross-validate our emthods, we repeatedly select $50$ out of the $67$ patients who were assayed on Cards~A and~B. We fit the same four Cox proportional hazards models and compute the geometric mean of the $p$-values. Selecting the genes whose geometric mean $p < 0.03$ (and always requiring at least two genes), we use the Akaike Information Criterion (AIC) to construct an optimal prognostic model on the set of $50$ patients. We record the percentage of times that each gene passes the first filter (on the geometric mean $p$) and is retained in the optimal model by AIC. All of these computations are performed inside the \rcode{myLoop} function. <>= <> @ We use the next function to summarize the cross-validation results. <>= <> @ \subsection{Four Analyses of Overall Survival} As with predicting time-to-treatment, we also apply four methods (by combining either binary or continuos predictors with starting points at either diagnosis or sample collection) to rpedict overall survival. <>= <> @ We combine these methods with the pooled $p$-values and geometric mean into another function. <>= fourFoldWayOS <- function(dataset, clindata, dx.model=NULL, sam.model=NULL) { <> <> <> <> fourp } @ \section{Appendix} We save the current workspace. <>= rm(myLoop, temp, pts, loop2df, aic, fp) save.image(file="micro04.rda") @ This analysis was run in the following directory: <>= getwd() @ <>= sessionInfo() @ \end{document}