\documentclass[11pt]{article}
\usepackage{graphicx}
\usepackage{cite}
\usepackage{hyperref}
\pagestyle{myheadings}
\markright{microfluidicsModels04; revised: 1 Sept 2010}

\setlength{\topmargin}{0in}
\setlength{\textheight}{8in}
\setlength{\textwidth}{6.5in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}

\def\rcode#1{\texttt{#1}}
\def\fref#1{Figure~\ref{#1}}
\def\tref#1{Table~\ref{#1}}

\title{Feature Selection: Gene Markers of Prognosis in CLL}
\author{Kevin R. Coombes}
\date{12 July 2010; REVISED: 1 Sept 2010}

\SweaveOpts{prefix.string=Figures/microfluidicsModels04,eps=FALSE}
<<options,echo=FALSE>>=
options(width=92)
options(SweaveHooks = list(fig = function() par(bg='white')),eps=FALSE)
@ 

<<makeFiguresDirectory,echo=FALSE>>=
if (!file.exists("Figures")) {
  dir.create("Figures")
}
@ 

\begin{document}
\maketitle
\tableofcontents

\section{Executive Summary}
\subsection{Introduction}
The main goal of the study is to identify changes in gene expression
in CLL that are associated with clinical outcome (including overall
survival and time-to-treatment).

\subsubsection{Aims/Objectives}
We want to find individual genes and combinations of genes that are
related to prognosis in CLL.  Here ``prognosis'' refers to the ability
to predict time-to-treatment or overall survival.

The current analysis look \textbf{only at the training set}.

\subsection{(Statistical) Methods}
Exploratory data analyses and summaries.  Cox proportional hazards
models. Stepwise selection of predictive variables using the Akaike
Information Criterion.  Bayesian model averaging.

\subsection{Results}

\begin{itemize}
\item Time-to-treatment
  \begin{itemize}
  \item We found (at least) fifteen genes that were significant in
    univariate models to predict time-to-treatment (\tref{top15.sig})
    and worked well under cross-validation (\tref{tbl:sig.pred}) or as
    part of multivariate models (\fref{sig.pred}).
  \end{itemize}
\item Overall survival
  \begin{itemize}
  \item We found about five to ten genes that were significant in
    univariate models to predict overall survival (\tref{top15.os})
    and worked well under cross-validation (\tref{tbl:os.pred}) or as
    part of multivariate models (\fref{os.pred}).
  \end{itemize}
\item Time-to-treatment, including clinical variables
  \begin{itemize}
  \item It is not clear that any \textbf{individual} genes add to our
    ability to predict time-to-treatment after accounting for known
    clinical factors.  However, some of the individual factrors above
    are at least bordeline significant in the training set.
  \end{itemize}
\item Overall survival, including clinical variables
  \begin{itemize}
  \item The results are similar for trying to predict overall
    survival after accounting for clinical factors.
  \end{itemize}
\item When these genes worked, they could be used as binary or as
  continuous predictors, and they worked whether we started at
  diganosis or at sample collection.
\end{itemize}

\subsection{Conclusion}

The most promising strategy appears to be:
\begin{enumerate}
\item Focus on time-to-treatment as the primary outcome.
  Specifically, build a small number (probably about three) of
  different multivariate models to predict time-to-treatment while
  ignoring the clinical variables.
\item Compute composite scores from each of the multivariate models.
  Test whether the composite scores are useful to predict
  time-to-treatment or overall survival in the presence of the known
  clinical factors in the training set.
\item Validate the (univariate) ability of individual genes to predict
  time-to-treatment in the Card~C validation set.
\item Validate the multivariate models on Card~C.
\item Test whether the composite scores are useful to predict
  time-to-treatment or overall survival in the presence of the known
  clinical factors in the Card~C validation set.
\end{enumerate}

\section{Loading the Data}

In the previous report (\rcode{microfluidicsModels03.pdf}), we stored
all of the relevant data in a binary R file.  We begin by loading that
data.
<<load.data>>=
load("micro03.rda")
ls()
@ 

\subsection{Preliminaries}

As mentioned previously, we want to perform a variety of time-to-event
(``survival'') analyses, with different starting and ending points.
The basic code for these analyses is in the following package.
<<lib>>=
library(survival)
@ 


We want to perform univariate analyses (one gene at a time) using the
microfluidics QRT-PCR data as a continuous predictor of the outcome.
The \rcode{continuous.univariate} function will perform these analyses
and record the results.
<<cont.uni,echo=FALSE>>=
continuous.univariate <- function(timecol, statcol, 
                                  dset=cardAB.data, 
                                  clindata=cardAB.clinical,
                                  base.model=NULL) {
  GeneUnivariate <- matrix(NA, ncol(dset), 5)
  colnames(GeneUnivariate) <- c("coef", "HR", "SE", "Z", "P")
  rownames(GeneUnivariate) <- colnames(dset)
  for (i in 1:ncol(dset)) {
    mygene <- colnames(dset)[i]
    dataset <- data.frame(clindata[, c(timecol, statcol)], Y=dset[,i])
    colnames(dataset)[3] <- mygene
    myform <- formula(paste("Surv(", timecol, ", ", statcol, ") ~", mygene))
    if(!is.null(base.model)) {
      dataset <- data.frame(dataset, clindata[,attr(terms(base.model), "term.labels")])
      myform <- update(formula(base.model), paste("~ . + ", mygene))
    }
    model <- coxph(myform, data=dataset)
    GeneUnivariate[i,] <- summary(model)$coef[mygene,]
  }
  GeneUnivariate
}
@ 
We also want to convert each continuous gene expression variable into
a dichotomous variable.  (Note that in the vast majority of cases, we
split at the median.  However, if there is a large natural gap in the
values, as detected by the k-means algorithm, then we split at the gap
instead of the median.)  The \rcode{hilo} and
\rcode{binary.univariate} functions will perform this analysis.
Details on the algorithms and methods are provided at the end of this
report.
<<hilo,echo=FALSE>>=
hilo <- function(x) {
  y <- kmeans(x, centers=2)$cluster
  m1 <- median(x[y==1]) 
  m2 <- median(x[y==2]) 
  if (m1 < m2) y <- 3-y
  gap <- min(x[y==1]) - max(x[y==2])
  if (gap > sd(x)/2) { # use kmeans
    z <- y
  } else {
    z <- 1 + as.numeric(x < median(x))
  }
  factor(ifelse(z==2, "LowCycle", "HighCycle"), 
         levels=c("HighCycle", "LowCycle"))
}
@ 
<<bin.uni,echo=FALSE>>=
binary.univariate <- function(timecol, statcol, dset=cardAB.data, 
                              clindata=cardAB.clinical, base.model=NULL) {
  GeneUnivariate <- matrix(NA, ncol(dset), 5)
  colnames(GeneUnivariate) <- c("coef", "HR", "SE", "Z", "P")
  rownames(GeneUnivariate) <- colnames(dset)
  for (i in 1:ncol(dset)) {
    mygene <- colnames(dset)[i]
    dataset <- data.frame(clindata[, c(timecol, statcol)], Y=hilo(dset[,i]))
    colnames(dataset)[3] <- mygene
    myform <- formula(paste("Surv(", timecol, ", ", statcol, ") ~", mygene))
    if(!is.null(base.model)) {
      dataset <- data.frame(dataset, clindata[,attr(terms(base.model), "term.labels")])
      myform <- update(formula(base.model), paste("~ . + ", mygene))
    }
    model <- coxph(myform, data=dataset)
    GeneUnivariate[i,] <- summary(model)$coef[paste(mygene, "LowCycle", sep=''),]
  }
  GeneUnivariate
}
@ 

\section{Time to treatment}

In order to find robust prognostic markers of time to treatment, we
perform four different analyses on the training data from the
\Sexpr{nrow(cardAB.data)} patients assayed on Cards~A and~B.
Specifically, we use every combination of two starting points:
\begin{itemize}
\item Diagnosis
\item Sample Collection
\end{itemize} 
one end point
\begin{itemize}
\item First Significant Treatment
\end{itemize} 
and two ways to incorporate gene expression:
\begin{itemize}
\item Continuous
\item Binary
\end{itemize} 
Because the computations are time-consuming, we apply these methods
once and cache the results, allowing us to concentrate on the
structure of the report.  A list of the top fifteen genes with the
most significant average $p$-values (over the four methods used to
predict time-to-treatment) is contained in \tref{top15.sig}.

<<apply.four.methods,eval=FALSE,echo=FALSE>>=
UC.dx2sig       <- continuous.univariate("TimeDiagnosis2SigTreat", 
                                         "NumericSigTreatment",
                                         dataset, clindata, dx.model)
UB.dx2sig       <- binary.univariate("TimeDiagnosis2SigTreat",
                                     "NumericSigTreatment",
                                     dataset, clindata, dx.model)
UC.sample2sig   <- continuous.univariate("TimeSample2SigTreat", 
                                         "NumericSigTreatment",
                                         dataset, clindata, sam.model)
UB.sample2sig   <- binary.univariate("TimeSample2SigTreat",
                                     "NumericSigTreatment",
                                     dataset,clindata, sam.model)
@ 
<<pool.p.values,eval=FALSE,echo=FALSE>>=
fourp <- data.frame(b.samsig=UB.sample2sig[,"P"], 
                    c.samsig=UC.sample2sig[,"P"], 
                    b.dxsig=UB.dx2sig[,"P"],
                    c.dxsig=UC.dx2sig[,"P"])
@ 
<<compute.geometric.mean,eval=FALSE,echo=FALSE>>=
geo.mean <- apply(fourp, 1, function(x) exp(mean(log(x))))
@ 
<<compute.other.statistics,eval=FALSE,echo=FALSE>>=
rg <- apply(dataset, 2, function(x) diff(range(x)))
sigma <- apply(dataset, 2, sd)

fourp <- data.frame(fourp, geo.mean.p=geo.mean, 
                     rg, sigma)
@ 
<<ffw,echo=FALSE>>=
fourFoldWay <- function(dataset, clindata, dx.model=NULL, sam.model=NULL) {
<<apply.four.methods>>
<<pool.p.values>>
<<compute.geometric.mean>>
<<compute.other.statistics>>
  fourp
}
@ 
<<fourp>>=
f <- "fourp.rda"
if (file.exists(f)) {
  load(f)
} else {
  fourp <- fourFoldWay(cardAB.data, cardAB.clinical)
  fourp <- data.frame(fourp, onC=as.numeric(onC))
  save(fourp, file=f)
}
rm(f)
@ 
<<results=tex,echo=FALSE>>=
library(xtable)
xtable(round(fourp[order(fourp$geo.mean.p),][1:15,], 4),
       digits=c(0, rep(4, 5), 1, 3, 0),
       align="|l|rrrrrrrr|",
       caption="Highest ranked genes for univariate ability to predict time-to-treatment.",
       label="top15.sig")
@ 

\subsection{Cross-validation of feature selection}

To get an even more robust measure of which genes are important, we
repeat this analysis with various subsets of the training data.
Specifically, we repeatedly select $50$ out of the $67$ patients who
were assayed on Cards~A and~B.  We fit the same four Cox proportional
hazards models and compute the geometric mean of the $p$-values.
Selecting the genes whose geometric mean $p < 0.03$ (and always
requiring at least two genes), we use the Akaike Information Criterion
(AIC) to construct an optimal prognostic model on the set of $50$
patients.  We record the percentage of times that each gene passes the
first filter (on the geometric mean $p$) and is retained in the
optimal model by AIC.  All of these computations are performed inside
the \rcode{myLoop} function, which will allow us to reuse the same
code later.  Since these computations are more time-consuming than the
previous ones, we also cache the results while writing the report.
<<myLoop,echo=FALSE>>=
myLoop <- function(dxcol, samcol, statcol, 
                   nLoop=300, dx.model=NULL, sam.model=NULL) {
  dagger <- as.data.frame(matrix(0, nrow=nLoop, ncol=ncol(cardAB.data)))
  colnames(dagger) <- colnames(cardAB.data)
  keeper <- caller <- dagger
  for (i in 1:nLoop) {
    selector <- sample(nrow(cardAB.data), 50)
    dataset  <- cardAB.data[selector,]
    clindata <- cardAB.clinical[selector,]
    c.dx2sig <- continuous.univariate(dxcol, statcol, 
                                      dataset, clindata,
                                      dx.model)
    b.dx2sig  <- binary.univariate(dxcol, statcol,
                                   dataset, clindata,
                                   dx.model)
    c.sample2sig <- continuous.univariate(samcol, statcol,
                                          dataset, clindata,
                                          sam.model)
    b.sample2sig <- binary.univariate(samcol, statcol,
                                      dataset, clindata,
                                      sam.model)
    fourple <- data.frame(b.samsig=b.sample2sig[,"P"], 
                          c.samsig=c.sample2sig[,"P"], 
                          b.dxsig=b.dx2sig[,"P"],
                          c.dxsig=c.dx2sig[,"P"])
    geo.mean <- apply(fourple, 1, function(x) exp(mean(log(x))))  
    topper <- names(geo.mean)[geo.mean < 0.03 & !is.na(geo.mean)]
    if (length(topper)<2) {
      topper <- names(geo.mean)[order(geo.mean)][1:2]
    }
    caller[i, topper] <- 1
    myf <- formula(paste("Surv(", samcol, ",", statcol, ") ~", 
                         paste(topper ,collapse="+")))
    mod0 <- coxph(myf, data.frame(dataset, clindata))
    mod1 <- step(mod0, trace=0)
    keeper[i,names(coef(mod1))] <- 1
    myf <- formula(paste("Surv(", dxcol, ",", statcol, ") ~", 
                         paste(topper ,collapse="+")))
    mod0 <- coxph(myf, data.frame(dataset, clindata))
    mod1 <- step(mod0, trace=0)
    dagger[i,names(coef(mod1))] <- 1
  }
  list(nLoop=nLoop, dxAIC=dagger, samAIC=keeper, firstPass=caller)
}
@ 
<<loop2df,echo=FALSE>>=
loop2df <- function(loop) {
  first <- apply(loop$firstPass, 2, mean)
  dx    <- apply(loop$dxAIC, 2, mean)
  sam   <- apply(loop$samAIC, 2, mean)
  data.frame(FirstPass=first, dxAIC=dx, samAIC=sam)
}
@ 
<<loopSig>>=
f <- "loopSig.rda"
if (file.exists(f)) {
  load(f)
} else {
  loopSig <- myLoop("TimeDiagnosis2SigTreat", "TimeSample2SigTreat",
                 "NumericSigTreatment", 300)
  save(loopSig, file=f)
}
rm(f)
temp <- loop2df(loopSig)
sig.pred <- data.frame(fourp, temp)
@ 
The fifteen genes that are selected most often by the univariate
models during cross-validation are shown in \tref{tbl:sig.pred}.
While the order of the genes has changed slightly from
\tref{top15.sig}, we get exactly the same set of top fifteen genes.
We also note that 13 out of 15 were chosen for inclusion on Card~C
(i.e., to be measured on the validation dataset).  

In \fref{sig.pred}, we provide barplots of the number of times during
cross-validation that a gene was significant in a univariate model
(top panel) or was retained in a multivariate model (bottom panel).
(note that the barplots only show genes that are present on both
Cards~AB and Card~C.  Colors in the bottom panel indicate whether the
multivariate model was predicting time-to-treatment from diagnosis
(purple) or from sample collection (orange).
<<echo=FALSE,results=tex>>=
xtable(round(sig.pred[rev(order(sig.pred$FirstPass)),8:11][1:15,], digits=3),
       digits=c(0,0, 3, 3, 3), align="|l|rrrr|",
       label="tbl:sig.pred",
       caption="Highest ranked genes by cross-validation for predicting time-to-treatment.")
@ 
<<fig1code,echo=FALSE,eval=FALSE>>=
par(mfrow=c(2,1))
temp <- sig.pred[fourp$onC==1,]
pts <- barplot(300*t(temp[, c("FirstPass")]), 
               names.arg=rep('',nrow(temp)))
mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5)
pts <- barplot(300*t(temp[, c("dxAIC", "samAIC")]), col=colorcode,

               names.arg=rep('',nrow(temp)), 
               legend.text=c("From Diagnosis", "From Sample"))
mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5)
par(mfrow=c(1,1))
@ 
<<make1png>>=
png(file="SKI-SLAM-paper/mfc-figure1ab.png",
    width=800, height=1200, bg="white", pointsize=20)
par(bg="white", cex=1.5)
colorcode <- c("black", "gray")
<<fig1code>>
dev.off()
pdf(file="SKI-SLAM-paper/mfc-figure1ab.pdf", 
    width=8, height=12, bg="white", pointsize=16)
par(bg="white")
colorcode <- c("black", "gray")
<<fig1code>>
dev.off()
@ 

\begin{figure}
<<fig=TRUE,echo=FALSE, width=6.5, height=8>>=
colorcode <- c("purple", "orange")
<<fig1code>>
@ 
\caption{\textbf{(Top)} Number of times, during cross-validation, that
  a marker is significant in a univariate model to predict
  time-to-treatment.  \textbf{(Bottom)} Number of times a marker is
  retained in a multivariate model to predict time-to-treatment.}
\label{sig.pred}
\end{figure}

\section{Overall Survival}

We now want to repeat the same analysis for the ``overall survival''
outcome in place of ``time-to-treatment''.  The top fifteen genes from
this analysis are listed in \tref{top15.os}.  When we compare this
result to \tref{top15.sig}, we see that the $p$-values are much less
significant.  This almost has to be the case, since the number of
events is much smaller in the overall survival analysis
(\Sexpr{sum(cardAB.clinical[,"VitalStatus"]=="Dead")} deaths) than in
the time-to-treatment analysis
(\Sexpr{sum(cardAB.clinical[,"SigTreatment"]=="Treated")} treated patients).
Interestingly, we also see that only two genes (CD14 and EGR3) make it
into the top fifteen for both the time-to-treatment analysis and the
overall survival analysis.  Finally, we note that only seven of the
top fifteen geens were among the ones we selected for Card~C in our
earlier analysis.

<<apply.four.survival.methods,eval=FALSE,echo=FALSE>>=
UC.dx2sig       <- continuous.univariate("OSAfterDiagnosis", 
                                         "NumericVitalStatus",
                                         dataset, clindata, dx.model)
UB.dx2sig       <- binary.univariate("OSAfterDiagnosis",
                                     "NumericVitalStatus",
                                     dataset, clindata, dx.model)
UC.sample2sig   <- continuous.univariate("OSAfterSample", 
                                         "NumericVitalStatus",
                                         dataset, clindata, sam.model)
UB.sample2sig   <- binary.univariate("OSAfterSample",
                                     "NumericVitalStatus",
                                     dataset,clindata, sam.model)  
@ 
<<ffw.os,echo=FALSE>>=
fourFoldWayOS <- function(dataset, clindata, dx.model=NULL, sam.model=NULL) {
<<apply.four.survival.methods>>
<<pool.p.values>>
<<compute.geometric.mean>>
<<compute.other.statistics>>
  fourp
}
@ 
<<oster>>=
f <- "oster.rda"
if (file.exists(f)) {
  load(f)
} else {
  oster <- fourFoldWayOS(cardAB.data, cardAB.clinical)
  oster <- data.frame(oster, onC=as.numeric(onC))
  save(oster, file=f)
}
rm(f)
@ 
<<results=tex,echo=FALSE>>=
xtable(round(oster[order(oster$geo.mean.p),][1:15,], 4),
       digits=c(0, rep(4, 5), 1, 3, 0),
       align="|l|rrrrrrrr|",
       caption="Highest ranked genes for univariate ability to predict overall survival.",
       label="top15.os")
@ 

As before, we cross-validate the feature and model selection step.
The top fifteen cross-validated genes are listed in
\tref{tbl:os.pred}.  Unlike the case of time-to-treatment, two genes
from the initial analysis (SLC2A5 and ZBTB20) fell out of the top
fifteen, to be replaced by NPM1 and CCL5.  We have also prepared box
plots (\fref{os.pred}) to show the number of times that genes were
significant in univariate models or retained in multivariate models to
predict overall survival.  Overall, it is unclear if anything other
than the top five to seven genes (at most) are really meaningful.
<<loop.os>>=
set.seed(394267)
f <- "loopOS.rda"
if (file.exists(f)) {
  load(f)
} else {
  loopOS <- myLoop("OSAfterDiagnosis", "OSAfterSample",
                   "NumericVitalStatus", 300)
  save(loopOS, file=f)
}
rm(f)
@ 
<<os.pred>>=
temp <- loop2df(loopOS)
os.pred <- data.frame(oster, temp)
@ 
<<echo=FALSE,results=tex>>=
xtable(round(os.pred[rev(order(os.pred$FirstPass)),8:11][1:15,], digits=3),
       digits=c(0,0, 3, 3, 3), align="|l|rrrr|",
       label="tbl:os.pred",
       caption="Highest ranked genes by cross-validation for predicting overall survival.")
@ 
\begin{figure}
<<fig=TRUE,echo=FALSE, width=6.5, height=8>>=
par(mfrow=c(2,1))
temp <- os.pred[oster$onC==1,]
pts <- barplot(300*t(temp[, c("FirstPass")]), 
               names.arg=rep('',nrow(temp)))
mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5)
pts <- barplot(300*t(temp[, c("dxAIC", "samAIC")]), col=c("purple", "orange"),
               names.arg=rep('',nrow(temp)), 
               legend.text=c("From Diagnosis", "From Sample"),
               args.legend=list(x="topleft"))
mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5)
par(mfrow=c(1,1))
@ 
\caption{\textbf{(Top)} Number of times, during cross-validation, that
  a marker is significant in a univariate model to predict
  overall survival.  \textbf{(Bottom)} Number of times a marker is
  retained in a multivariate model to predict overall survival.}
\label{os.pred}
\end{figure}


\section{Adding to Clinical Predictors}

All of the analyses to this point have been performed ignoring the
large number of clinical covariates that we already have available to
predict the prognosis of CLL patients.  However, in a previous report
(\rcode{microfluidicsModels02.pdf}), we developed and validated the
best clinical predictors for both time-to-treatment and overall
survival.  In this section, we will repeat the previous analysis in
order to find out which genes are most useful in adding predictive
power to the clinical variables we already know.

\subsection{Time to treatment}

We start by seeing which individual genes can add to our ability to
predict time-to-treatment.  The top fifteen genes are listed in
\tref{top15.sig.add}.  This list has very little overlap with the
genes selected above (while ignoring the clinical variables).  More
importantly, the average $p$-values clearly indicate that these
individual genes do not seem to add very much to the existing clinical
variables. (Only one $p$-value is less than $0.05$, and that is
without accounting for multiple testing.)  We also note that ECE1 is
listed twice---because it was one of the housekeeping genes included
on both Card~A and Card~B.  This observation strengthens our
conclusion that these genes are not really significant.
<<fourp.add>>=
f <- "fourp-add.rda"
if (file.exists(f)) {
  load(f)
} else {
  fourp.add <- fourFoldWay(cardAB.data, cardAB.clinical, dx2sig.model, sam2sig.model)
  fourp.add <- data.frame(fourp.add, onC=as.numeric(onC))
  save(fourp.add, file=f)
}
rm(f)
@ 
<<results=tex,echo=FALSE>>=
xtable(round(fourp.add[order(fourp.add$geo.mean.p),][1:15,], 4),
       digits=c(0, rep(4, 5), 1, 3, 0),
       align="|l|rrrrrrrr|",
       caption=paste("Highest ranked genes for univariate ability",
         "to predict time-to-treatment after",
         "accounting for clinical variables."),
       label="top15.sig.add")
@ 

Next, we cross-validate the feature and model selection process for
predicting time-to-treatment when we incorporate the clinical
variables.  As usual, we list the top fifteen genes in
\tref{tbl:sig.clin.pred}.  While this list mostly matches the previous
one, it still is not very impressive.  The corresponding barplot
(\fref{sig.clin.pred}) does not change our conclusions.
<<loopSigClin>>=
set.seed(192604)
f <- "loopSigClin.rda"
if(file.exists(f)) {
  load(f)
} else {
  loopSigClin <- myLoop("TimeDiagnosis2SigTreat", "TimeSample2SigTreat",
                        "NumericSigTreatment", 300,
                        dx2sig.model, sam2sig.model)
  save(loopSigClin, file=f)
}
rm(f)
@ 
<<sig.clin.pred>>=
temp <- loop2df(loopSigClin)
sig.clin.pred <- data.frame(fourp.add, temp)
@ 
<<echo=FALSE,results=tex>>=
xtable(round(sig.clin.pred[rev(order(sig.clin.pred$FirstPass)),8:11][1:15,], digits=3),
       digits=c(0,0, 3, 3, 3), align="|l|rrrr|",
       label="tbl:sig.clin.pred",
       caption=paste("Highest ranked genes by cross-validation", 
         "for predicting time-to-treatment after",
         "accounting for clinical variables."))
@ 
\begin{figure}
<<fig=TRUE,echo=FALSE, width=6.5, height=8>>=
par(mfrow=c(2,1))
temp <- sig.clin.pred[fourp$onC==1,]
pts <- barplot(300*t(temp[, c("FirstPass")]), 
               names.arg=rep('',nrow(temp)))
mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5)
pts <- barplot(300*t(temp[, c("dxAIC", "samAIC")]), col=c("purple", "orange"),
               names.arg=rep('',nrow(temp)), 
               legend.text=c("From Diagnosis", "From Sample"),
               args.legend=list(x="topleft"))
mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5)
par(mfrow=c(1,1))
@ 
\caption{\textbf{(Top)} Number of times, during cross-validation, that
  a marker is significant in a univariate model to predict
  time-to-treatment when added to the best clinical model.
  \textbf{(Bottom)} Number of times a marker is retained in a
  multivariate model to predict time-to-treatment.}
\label{sig.clin.pred}
\end{figure}

\subsection{Overall survival}

In the final set of analyses, we look at our ability to predict
ovreall survival from gene expression when added to the existing
clinical model. \tref{top15.os.add} shows the top fifteen genes; only
two of them have $p$-values =smaller than $0.05$, and only three of
them were previously selected for inclusion on Card~C.
<<oster.add>>=
f <- "oster-add.rda"
if (file.exists(f)) {
  load(f)
} else {
  oster.add <- fourFoldWayOS(cardAB.data, cardAB.clinical, dx2os.model, sam2os.model)
  oster.add <- data.frame(oster.add, onC=as.numeric(onC))
  save(oster.add, file=f)
}
rm(f)
@ 
<<results=tex,echo=FALSE>>=
xtable(round(oster.add[order(oster.add$geo.mean.p),][1:15,], 4),
       digits=c(0, rep(4, 5), 1, 3, 0),
       align="|l|rrrrrrrr|",
       caption=paste("Highest ranked genes for univariate ability",
         "to predict overall survival after",
         "accounting for clinical variables."),
       label="top15.os.add")
@ 

For the sake of completeness, we cross-validate the feature abnd model
selection process.  The top fifteen genes are listed in
\tref{tbl:os.clin.pred}; only ten of the fifteen genes from
\tref{top15.os.add} are included in the cross-validated list.
Interestingly, however, the five replacement genes are more likely to
come from the set of genes selected for inclusion on Card~C.  The
barplot (\fref{os.clin.pred}) suggests that at most five of the genes
that were also assayed on Card~C can enhance our ability to predict
overall survivial in the presence of known clinical factors.
<<loopOSClin>>=
set.seed(564316)
f <- "loopOSClin.rda"
if (file.exists(f)) {
  load(f)
} else {
  loopOSClin <- myLoop("OSAfterDiagnosis", "OSAfterSample",
                      "NumericVitalStatus", 300,
                       dx2os.model, sam2os.model)
  save(loopOSClin, file=f)
}
rm(f)
@ 
<<os.clin.pred>>=
temp <- loop2df(loopOSClin)
os.clin.pred <- data.frame(oster.add, temp)
@ 
<<echo=FALSE,results=tex>>=
xtable(round(os.clin.pred[rev(order(os.clin.pred$FirstPass)),8:11][1:15,], digits=3),
       digits=c(0,0, 3, 3, 3), align="|l|rrrr|",
       label="tbl:os.clin.pred",
       caption=paste("Highest ranked genes by cross-validation",
         "for predicting overall survival after",
         "accounting for clinical variables."))
@ 
\begin{figure}
<<fig=TRUE,echo=FALSE, width=6.5, height=8>>=
par(mfrow=c(2,1))
temp <- os.clin.pred[oster$onC==1,]
pts <- barplot(300*t(temp[, c("FirstPass")]), 
               names.arg=rep('',nrow(temp)))
mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5)
pts <- barplot(300*t(temp[, c("dxAIC", "samAIC")]), col=c("purple", "orange"),
               names.arg=rep('',nrow(temp)), 
               legend.text=c("From Diagnosis", "From Sample"),
               args.legend=list(x="topleft"))
mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5)
par(mfrow=c(1,1))
@ 
\caption{\textbf{(Top)} Number of times, during cross-validation, that
  a marker is significant in a univariate model to predict overall
  survival when added to te best clinical model.  \textbf{(Bottom)}
  Number of times a marker is retained in a multivariate model to
  predict overall survival.}
\label{os.clin.pred}
\end{figure}

\section{Overall Predictive Ability}
\label{sec:overall}

In this section, we want to pool the data from all four of the
analyses (predicting time-to-treatment or overall survival, both with
or without including clinical variables) that we have performed so
far.  We create separate objects to count the number of times genes
wer significant in univariate models and the number of times they were
retained in multivariate models.  We then create barplot displaying
this data (\fref{overall}).  We make two observations about these
plots.  First, there is little difference between being selected as
significant in a univariate model or surviving in a multivariate
model, so we can probably perform feature selection just oin the
cross-validated univariate results.  Second, the results from
predicting time-to-treatment without the clinical variables (red part
of the bars in the top panel) dominate the selection procedure.  There
are only three genes (WSB@, TNFRSF8, CRY1) that we might add to the
``useful'' list based solely on their ability to predict overall
survival instead of time-to-treatment.
<<fp.aic>>=
fp <- data.frame(Treat=sig.pred$FirstPass, 
                 OS=os.pred$FirstPass, 
                 ClinTreat=sig.clin.pred$FirstPass, 
                 OSTreat=os.clin.pred$FirstPass)
rownames(fp) <- rownames(sig.pred)

aic <- data.frame(sig.pred[,10:11,], 
                 os.pred[,10:11,], 
                 sig.clin.pred[,10:11,], 
                 os.clin.pred[,10:11,])
rownames(aic) <- rownames(sig.pred)
@ 
\begin{figure}
<<fig=TRUE,echo=FALSE, width=6.5, height=8>>=
par(mfrow=c(2,1))
temp <- fp[oster$onC==1,]
#temp <- fp
pts <- barplot(300*t(temp), col=c("red","blue","pink", "skyblue"),
               names.arg=rep('',nrow(temp)))
mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5)
temp <- aic[oster$onC==1,]
pts <- barplot(300*t(temp), col=c("purple", "orange", "blue", "red",
                              "#ff00ff", "#ffaa77", "skyblue", "pink"),
               names.arg=rep('',nrow(temp)), 
               legend.text=c("Dx, treat", "Sample, treat",
                 "Dx OS", "Sample OS"),
               args.legend=list(x="topleft", cex=0.5))
mtext(rownames(temp), side=1, line=0.5, at=pts, las=2, cex=0.5)
par(mfrow=c(1,1))
@ 
\caption{\textbf{(Top)} Number of times, during cross-validation, that
  a marker is significant in a univariate model to predict
  time-to-treatment or overall survival.  \textbf{(Bottom)} Number of
  times a marker is retained in a multivariate model.}
\label{overall}
\end{figure}

\section{Algorithm Details}

In this section, we collect details about the algorithms used in this
analysis. 

\subsection{Continuous Univariate Models}

We want to perform univariate analyses (one gene at a time) using the
microfluidics QRT-PCR data as a continuous predictor of the outcome.
The following function will perform these analyses and record the
results.
<<details,eval=FALSE>>=
<<cont.uni>>
@ 

\subsection{Dichotomous Univariate Models}

We also want to convert each continuous gene expression variable into
a dichotomous variable.  In the vast majority of cases, we split at
the median.  However, if there is a large natural gap in the values,
as detected by the k-means algorithm, then we split at the gap instead
of the median.  Dichotomization is performed by this function:
<<details,eval=FALSE>>=
<<hilo>>
@ 
Fitting the time-to-event model using the binary predictor is then
performed by this function:
<<details,eval=FALSE>>=
<<bin.uni>>
@ 

\subsection{Four Analyses of Time to Treatment}

In order to find robust prognostic markers of time to treatment, we
perform four different analyses on the training data.  This block of
code explains how we \rcode{apply.four.methods}.
<<details,eval=FALSE>>=
<<apply.four.methods>>
@ 

Next, in order to \rcode{pool.p.values}, we assemble the four
$p$-values from these four different analyses into a single data
frame.
<<details,eval=FALSE>>=
<<pool.p.values>>
@ 

We take the geometric mean of the four different p-values computed for
each gene.  This scale appears to be the natural way to combine
$p$-values, in light of Fisher's computation showing that summing the
logs of independent $p$-values yields (with appropriate weights) a
chi-squared statistic.
<<compute.geometric.mean,eval=FALSE>>=
geo.mean <- apply(fourp, 1, function(x) exp(mean(log(x))))
@ 

We also \rcode{compute.other.statistics} (specifically, range and
standard deviation) for each gene in datasets~A and~B, and combine
everything into a data frame.  
<<compute.other.statistics,eval=FALSE>>=
rg <- apply(dataset, 2, function(x) diff(range(x)))
sigma <- apply(dataset, 2, sd)

fourp <- data.frame(fourp, geo.mean.p=geo.mean, 
                     rg, sigma)
@ 

To make it simpler to apply this algorithm to other datasets, we turn
it into a function:
<<details,eval=FALSE,keep.source=TRUE,expand=FALSE>>=
fourFoldWay <- function(dataset, clindata, dx.model=NULL, sam.model=NULL) {
<<apply.four.methods>>
<<pool.p.values>>
<<compute.geometric.mean>>
<<compute.other.statistics>>
  fourp
}
@ 

\subsection{Cross-Validation of Feature and Model Selection}

In order to cross-validate our emthods, we repeatedly select $50$ out
of the $67$ patients who were assayed on Cards~A and~B.  We fit the
same four Cox proportional hazards models and compute the geometric
mean of the $p$-values.  Selecting the genes whose geometric mean $p <
0.03$ (and always requiring at least two genes), we use the Akaike
Information Criterion (AIC) to construct an optimal prognostic model
on the set of $50$ patients.  We record the percentage of times that
each gene passes the first filter (on the geometric mean $p$) and is
retained in the optimal model by AIC.  All of these computations are
performed inside the \rcode{myLoop} function.
<<details,eval=FALSE>>=
<<myLoop>>
@ 

We use the next function to summarize the cross-validation results.
<<details,eval=FALSE>>=
<<loop2df>>
@ 

\subsection{Four Analyses of Overall Survival}

As with predicting time-to-treatment, we also apply four methods (by
combining either binary or continuos predictors with starting points
at either diagnosis or sample collection) to rpedict overall survival.
<<details,eval=FALSE>>=
<<apply.four.survival.methods>>
@ 
We combine these methods with the pooled $p$-values and geometric mean
into another function.
<<ffw.os,eval=FALSE,keep.source=TRUE,expand=FALSE>>=
fourFoldWayOS <- function(dataset, clindata, dx.model=NULL, sam.model=NULL) {
<<apply.four.survival.methods>>
<<pool.p.values>>
<<compute.geometric.mean>>
<<compute.other.statistics>>
  fourp
}
@ 


\section{Appendix}

We save the current workspace.
<<write.out>>=
rm(myLoop, temp, pts, loop2df, aic, fp)
save.image(file="micro04.rda")
@ 

This analysis was run in the following directory:
<<getwd>>=
getwd()
@

<<lib2,echo=F>>=
sessionInfo()
@ 

\end{document}