\documentclass[11pt]{article}
\usepackage{graphicx}
\usepackage{cite}
\usepackage{hyperref}
\pagestyle{myheadings}
\markright{microfluidicsModels05; revised: 1 Sept 10}

\setlength{\topmargin}{0in}
\setlength{\textheight}{8in}
\setlength{\textwidth}{6.5in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}

\def\rcode#1{\texttt{#1}}
\def\fref#1{\textbf{Figure~\ref{#1}}}
\def\tref#1{\textbf{Table~\ref{#1}}}
\def\sref#1{\textbf{Section~\ref{#1}}}

\title{Gene-Based Models of Prognosis in CLL}
\author{Kevin R. Coombes}
\date{15 July 2010; REVISED: 1 September 2010}

\SweaveOpts{prefix.string=Figures/microfluidicsModels05,eps=FALSE}
<<options,echo=FALSE>>=
options(width=92)
options(SweaveHooks = list(fig = function() par(bg='white')),eps=FALSE)
@ 

<<makeFiguresDirectory,echo=FALSE>>=
if (!file.exists("Figures")) {
  dir.create("Figures")
}
@ 

\begin{document}
\maketitle
\tableofcontents

\section{Executive Summary}
\subsection{Introduction}
The main goal of the study is to identify changes in gene expression
in CLL that are associated with clinical outcome (including overall
survival and time-to-treatment).

\subsubsection{Aims/Objectives}
We want to find individual genes and combinations of genes that are
related to prognosis in CLL.  Here ``prognosis'' refers to the ability
to predict time-to-treatment or overall survival.

The current analysis looks \textbf{only at the training set}.

\subsection{(Statistical) Methods}
Exploratory data analyses and summaries.  Cox proportional hazards
models. Stepwise selection of predictive variables using the Akaike
Information Criterion.  Bayesian model averaging.

\subsection{Results}

\begin{itemize}
\item By looking at the percentage of times that genes were selected
  as significant in the cross-validation performed in the previous
  report, we identified four sets of genes (containing $5$, $6$, $12$,
  or $13$ genes) that could be used to construct prognostic models
  (\sref{sec:multi}, \fref{gaps}).
\item We constructed Cox proportional hazards to predict time from
  diagnosis to first signifcant treatment, starting with each of the
  four gene sets.  We then used AIC and BIC to reduce each of these
  models to ones that used fewer genes. In total, this process
  produced eleven different models (\tref{genemat}).  All the models
  were highly statistically significant on the training set from
  Cards~A and~B.
\item Each model was used to compute a prognostic score for each
  patient.  These scores tended to be strongly correlated from one
  model to another (\fref{pairs06}, \fref{pairs.sam}).
\item The prognostic scores were evaluated to see if they could also
  predict the time from sample collection (rather than diagnosis) to
  treatment.  All of the models were highly significant
  (\fref{samp06}, \fref{samp05}, \fref{samp12}, \fref{samp13}).
\item Next, the prognostic scores were evaluated to see if they could
  predict overall survival after diagnosis.  The models were
  significant as continuous variables, although they were not usually
  significant as binary variables dichotomized at the median
  (\sref{sec:surv}, \fref{os.km}, \fref{os.km.red}).
\item The prognostic scores were evaluated to determine if they were
  independent of (or better than) the existing clinical markers of
  prognosis.  (The best clinical model, which used Platelets,
  log-transofrmed LDH levels, and light chain subtype, was constructed
  in a previous report.)  The prognostic scores were independent of,
  and thus added to, the clinical variables (\sref{sec:prog4}).
\end{itemize}

\subsection{Conclusion}

We are now ready to try to validate the results on the Card~C dataset.

\section{Getting Started}

In the previous report (\rcode{microfluidicsModels04.pdf}), we stored
all of the relevant data in a binary R file.  We begin by loading that
data.
<<load.data>>=
load("micro04.rda")
ls()
@ 
We want to perform a variety of time-to-event (``survival'') analyses,
with different starting and ending points.  The basic code for these
analyses is in the following libraries.
<<lib>>=
library(survival)
library(BMA)
@ 

\section{Multivariate Gene Predictors}
\label{sec:multi}

We want to build multivariate models on the whole training set to
predict time-to-treatment (ignoring clinical variables).  First, we
review how often each gene was found to be significant in a univariate
model (during the cross-validation from the previous report).
<<uni>>=
tock <- sig.pred$FirstPass
names(tock) <- rownames(sig.pred)
firstPass <- sort(tock[fourp$onC==1])
rev(firstPass)[1:15]
@ 
Next, we review how often a gene was selected to remain in a
multivariate model.
<<multi>>=
tock <- apply(sig.pred[, 10:11], 1, sum) 
multiKept <- 0.5*sort(tock[fourp$onC==1]) #$
rev(multiKept)[1:15]
@ 
By plotting the sorted values (\fref{gaps}), we can identify natural
gaps that provide potential cutoffs for the number of genes to keep
for a (global) multivariate model.  Based  on this figures, we have
gene sets of size $5$, $6$, $12$, or $13$ to try to use to build
prognostic models. 
<<fig1cd.code,echo=FALSE,eval=FALSE>>=
#windows(width=6.5, height=8)
opar <- par(mfrow=c(2,1), mai=c(1, 1, 0.2, 0.2))
plot(rev(firstPass), pch=16, ylab="Probability Significant")
abline(v=0.5+c(6, 12), col='grey')
plot(rev(multiKept), pch=16, ylab="Probability Retained")
abline(v=0.5+c(5, 13), col='grey')
par(opar)
@ 
<<make.fig1cd>>=
png(file="SKi-SLAM-paper/mfc-figure1cd.png", width=650, height=800, 
    pointsize=16, bg="white")
par(bg="white")
<<fig1cd.code>>
dev.off()

pdf(file="SKI-SLAM-paper/mfc-figure1cd.pdf", width=6.50, height=8.00, 
    pointsize=12, bg="white")
par(bg="white")
<<fig1cd.code>>
dev.off()
@ 
\begin{figure}
<<fig=TRUE,echo=FALSE,width=6.5,height=8>>=

@ 
\caption{Plot of the number of times genes (top) are called
  significant in a univariate model or (bottom) retained in a
  multivariate model.  The genes have been sorted separately (and thus
  differently) in the two plots in order to identify natural gaps.}
\label{gaps}
\end{figure}

Here we store the names of the genes used in each of the four
potential sets of predictors into reusable variables.

<<rs>>=
nrs <- names(rev(firstPass))
set06 <- nrs[1:6]
set12 <- nrs[1:12]
nrs <- names(rev(multiKept))
set05 <- nrs[1:5]
set13 <- nrs[1:13]
rm(nrs, tock, firstPass, multiKept)
@ 

\subsection{Reducing the six-gene model}

In this section, we look at methods to (possibly) refine the
six-gene model.  First, we compute the full six-gene model to
predict time from diagnosis to first significant treatment.
<<mod6>>=
myform <- formula(paste("Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~",
                        paste(set06, collapse="+")))
dataset <- data.frame(cardAB.clinical, cardAB.data[, set06])
mod06 <- coxph(myform, dataset)
summary(mod06)
@ 
Here we see that the overall six-gene model is highly significant.
It also looks like SKI is the gene that contributes the most to the
model, while AICDA is the gene that contributes the least.  As we have
done throughout the feature selection process, we can try using the
Akaike Information Criterion (AIC) to remove unnecessary or redundant
genes from the model.  (The step-wise process works in two directions,
so it can change its mind and put factors back into the model after
removing them.)
<<mod06s>>=
mod06s <- step(mod06, trace=0)
summary(mod06s)
@ 
Based on all of the data, the optimal model by AIC removes four genes
and results in a more statistically significant model.

An alternative approach to decide which genes to retain in the model
is to use Bayesian model averaging (BMA).
<<bma06>>=
bma06 <- bic.surv(myform, data=dataset, factor.type=TRUE, factor.prior.adjust=TRUE)
summary(bma06)
@ 

There are three parts to this summary of the results of BMA.  The
first part gives the posterior probability of each term being included
in a final model.  Three genes (SKI, NT5C2, and SLAMF1) have posterior
probabilities greater than $40\%$, and so we think they should be
retained in the final model.  The results for CD14, AICDA, and FGL2
are less clear, since the posterior probability is less than $25\%$.
The second part gives the posterior estimates of the coefficients for
each term in the model.  The third part lists the five best
models. Based on the Bayes Information Criterion (BIC), the best model
only includes the two genes SKI and NT5C2. Significantly, AICDA snd
FGL2 are omitted from all five of the best models, which agrees with
the findings of the stepwise AIC model.  For future reference, we keep
the best BIC model with only three genes:
<<mod06b>>=
mod06b <- coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ 
                SKI + NT5C2 + SLAMF1, dataset)
summary(mod06b)
@ 
and with four genes:
<<mod06b>>=
mod06c <- coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ 
                SKI + NT5C2 + SLAMF1 + CD14, dataset)
summary(mod06c)
@ 


For both the six-gene model and the AIC and BIC reduced models, we
compute numerical ``prognostic scores'' from the combinations that
predict time-to-treatment.  A pairwise scatter-plot (\fref{pairs06})
of the scores shows that they are (not surprisingly, since they use
some of the same genes) highly correlated, which suggests that all
three of the models constructed so far are likely to have similar
performance when trying to predict the time from diagnosis to
treatment.
<<scores06>>=
scores06 <- predict(mod06)
scores06s <- predict(mod06s)
scores06b <- predict(mod06b)
scores06c <- predict(mod06c)
@ 
\begin{figure}
<<fig=TRUE,echo=FALSE>>=
pairs(data.frame(SixGene=scores06, 
                 AIC6.2=scores06s, 
                 BIC6.3=scores06b,
                 BIC6.4=scores06c), pch=16)
@ 
\caption{Pair-wise scatter plots of the prognostic scores from three
  different models}
\label{pairs06}
\end{figure}

\subsubsection{Six-gene model starting from sample collection}

Here we want to look at the model that results when we start at the
time of sample collection instead of the time of diagnosis.
<<n6>>=
myform2 <- formula(paste("Surv(TimeSample2SigTreat, NumericSigTreatment) ~",
                        paste(set06, collapse="+")))
dataset <- data.frame(cardAB.clinical, cardAB.data[, set06])
mod06.sam <- coxph(myform2, dataset)
mod06s.sam <- step(mod06.sam, trace=0)
bma06.sam <- bic.surv(myform2, data=dataset, factor.type=TRUE, factor.prior.adjust=TRUE)
@ 

The full six-gene model has coefficients that are quite similar to
the coefficients for the model that predicts time from diagnosis to
treatment. 
<<sample.mod>>=
summary(mod06.sam)
@ 

The model selected by AIC contains a different set of three genes,
with CD14 being the most significant.
<<sample.mod>>=
summary(mod06s.sam)
@ 

The model selected by BIC/BMA picks the same three genes as the AIC model.
<<sample.mod>>=
summary(bma06.sam)
@ 

We also compute ``prognostic scores'' using the sample-to-treatment
models.  A pairwise scatter plot (\fref{pairs.sam}) again shows that
the scores are very highly correlated.
<<scores.sam>>=
scores06.sam <- predict(mod06.sam)
scores06s.sam <- predict(mod06s.sam)
CatScore06 <- factor(ifelse(scores06 > median(scores06), "HighScore","LowScore"))
@ 
\begin{figure}
<<fig=TRUE,echo=FALSE>>=
pairs(data.frame(scores06, scores06s, scores06.sam, scores06s.sam), pch=16)
@ 
\caption{Pair-wise scatter plots of the prognostic scores from two
  models starting at diagnosis and two models starting at sample
  collection. }
\label{pairs.sam}
\end{figure}

Our basic conclusion is that it may not matter if we use all six genes
or only two or three of them, and that similar models will work
reasonably well to predict time from diagnosis or from sample
collection to treatment.  To test this idea more thoroughly, we can
use a Cox proportional hazards model to test how well the ``diagnosis
to treatment prognostic score'' predicts the time from sample to
treatment.  By splitting the prognostic score at the median, we can
also prepare Kaplan-Meier plots to see how well this distinguishes
between ptients needing early treatment and patients who will not need
treatment for a long time (\fref{samp06}).
<<score.cox>>=
temp <- data.frame(cardAB.clinical, Score06=scores06, CatScore06=CatScore06)
coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Score06, data=temp)
coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore06, data=temp)
@ 
\begin{figure}
<<fig=TRUE,echo=FALSE,width=5,height=8>>=
#windows(width=5,height=8)
par(mfrow=c(2,1))
plot(survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ CatScore06, 
             data=temp), col=c("red", "blue"), lwd=2,
     main="Time Diagnosis to Treatment (six genes)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("topright", levels(temp$CatScore06), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore06, 
             data=temp), col=c("red", "blue"), lwd=2,
     main="Time Sample to Treatment (six genes)", 
     xlab="Time (months)", ylab="Fraction Untreated")
legend("topright", levels(temp$CatScore06), col=c("red", "blue"), lwd=3)
par(mfrow=c(1,1))
@ 
\caption{Kaplan-Meier plot of time from (top) diagnosis or (bottom)
  sample collection to treatment, using a \textbf{six-gene}
  prognostic score developed to predict time from diagnosis to
  treatment. }
\label{samp06}
\end{figure}

Because the model starting from diagnosis works so well, we will not
(in the future) bother to construct separate models for the time from
sample collection to treatment.

\subsection{Reducing the five-gene model}

We now repeat the previous analysis using the five-gene model. We do
not construct separate models to predict sample-to-treatment; instead,
we simply check that the ``diagnostic-to-treatment'' model works for
this purpose (\fref{samp05}).
<<n5>>=
myform <- formula(paste("Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~",
                        paste(set05, collapse="+")))
dataset <- data.frame(cardAB.clinical, cardAB.data[, set05])
mod05 <- coxph(myform, dataset)
mod05s <- step(mod05, trace=0)
bma05 <- bic.surv(myform, data=dataset, factor.type=TRUE, factor.prior.adjust=TRUE)
summary(bma05)
summary(mod05s)
summary(mod05)

scores05 <- predict(mod05)
scores05s <- predict(mod05s)
CatScore05 <- factor(ifelse(scores05 > median(scores05), "HighScore","LowScore"))
@ 

<<x>>=
temp <- data.frame(cardAB.clinical, Score05=scores05, CatScore05=CatScore05)
coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Score05, data=temp)
coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore05, data=temp)
@ 

\begin{figure}
<<fig=TRUE,echo=FALSE,width=5,height=8>>=
par(mfrow=c(2,1))
plot(survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ CatScore05, 
             data=temp), col=c("red", "blue"), lwd=2,
     main="Time Diagnosis to Treatment (five genes)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("topright", levels(temp$CatScore05), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore05, 
             data=temp), col=c("red", "blue"), lwd=2,
     main="Time Sample to Treatment (five genes)", 
     xlab="Time (months)", ylab="Fraction Untreated")
legend("topright", levels(temp$CatScore05), col=c("red", "blue"), lwd=3)
par(mfrow=c(1,1))
@ 
\caption{Kaplan-Meier plot of time from (top) diagnosis or (bottom)
  sample collection to treatment, using a \textbf{five-gene}
  prognostic score developed to predict time from diagnosis to
  treatment. }
\label{samp05}
\end{figure}

\subsection{Reducing the twelve-gene model}

We now repeat the previous analysis using the twelve-gene model. We do
not construct separate models to predict sample-to-treatment; instead,
we simply check that the ``diagnostic-to-treatment'' model works for
this purpose (\fref{samp12}).
<<n12>>=
myform <- formula(paste("Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~",
                        paste(set12, collapse="+")))
dataset <- data.frame(cardAB.clinical, cardAB.data[, set12])
mod12 <- coxph(myform, dataset)
mod12s <- step(mod12, trace=0)
bma12 <- bic.surv(myform, data=dataset, factor.type=TRUE, factor.prior.adjust=TRUE)
summary(bma12)
summary(mod12s)
summary(mod12)

scores12 <- predict(mod12)
scores12s <- predict(mod12s)
CatScore12 <- factor(ifelse(scores12 > median(scores12), "HighScore","LowScore"))

@ 

<<x>>=
temp <- data.frame(cardAB.clinical, Score12=scores12, CatScore12=CatScore12)
coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Score12, data=temp)
coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore12, data=temp)
@ 

\begin{figure}
<<fig=TRUE,echo=FALSE,width=5,height=8>>=
par(mfrow=c(2,1))
plot(survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ CatScore12, 
             data=temp), col=c("red", "blue"), lwd=2,
     main="Time Diagnosis to Treatment (12 genes)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("topright", levels(temp$CatScore12), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore12, 
             data=temp), col=c("red", "blue"),, lwd=2,
     main="Time Sample to Treatment (12 genes)", 
     xlab="Time (months)", ylab="Fraction Untreated")
legend("topright", levels(temp$CatScore12), col=c("red", "blue"), lwd=3)
par(mfrow=c(1,1))
@ 
\caption{Kaplan-Meier plot of time from (top) diagnosis or (bottom)
  sample collection to treatment, using a \textbf{twelve-gene}
  prognostic score developed to predict time from diagnosis to
  treatment. }
\label{samp12}
\end{figure}

\subsection{Reducing the thirteen-gene model}

We now repeat the previous analysis using the thirteen-gene model. We
do not construct separate models to predict sample-to-treatment;
instead, we simply check that the ``diagnostic-to-treatment'' model
works for this purpose (\fref{samp13}).
<<n13>>=
myform <- formula(paste("Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~",
                        paste(set13, collapse="+")))
dataset <- data.frame(cardAB.clinical, cardAB.data[, set13])
mod13 <- coxph(myform, dataset)
mod13s <- step(mod13, trace=0)
bma13 <- bic.surv(myform, data=dataset, factor.type=TRUE, factor.prior.adjust=TRUE)
summary(bma13)
summary(mod13s)
summary(mod13)

scores13 <- predict(mod13)
scores13s <- predict(mod13s)
CatScore13 <- factor(ifelse(scores13 > median(scores13), "HighScore","LowScore"))

@ 

<<x>>=
temp <- data.frame(cardAB.clinical, Score13=scores13, CatScore13=CatScore13)
coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Score13, data=temp)
coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore13, data=temp)
@ 

\begin{figure}
<<fig=TRUE,echo=FALSE,width=5,height=8>>=
par(mfrow=c(2,1))
plot(survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ CatScore13, 
             data=temp), col=c("red", "blue"), lwd=2,
     main="Time Diagnosis to Treatment (13 genes)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("topright", levels(temp$CatScore13), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore13, 
             data=temp), col=c("red", "blue"), lwd=2,
     main="Time Sample to Treatment (13 genes)", 
     xlab="Time (months)", ylab="Fraction Untreated")
legend("topright", levels(temp$CatScore13), col=c("red", "blue"), lwd=3)
par(mfrow=c(1,1))
@ 
\caption{Kaplan-Meier plot of time from (top) diagnosis or (bottom)
  sample collection to treatment, using a \textbf{thirteen-gene}
  prognostic score developed to predict time from diagnosis to
  treatment. }
\label{samp13}
\end{figure}

\subsection{Listing all the prognostic models}

Here is a list of the models we have constructed to predict the time
from diagnosis to treatment:
<<modlist>>=
modlist <- c("mod06", "mod06s", "mod06b", "mod06c", "mod06s.sam",
             "mod05", "mod05s", 
             "mod12", "mod12s", 
             "mod13", "mod13s")
@ 
We now want to determine how many genes included in which model.
<<ngenes>>=
whichGenes <- lapply(modlist, function(x) {
  y <- get(x, .GlobalEnv)
  attr(terms(y), "term.labels")
})
ngenes <- unlist(lapply(whichGenes, length))
names(ngenes) <- modlist
ngenes
@ 
Finally, we can prepare a table that indicates whether a genes is or
is not included in each of the five models (\tref{genemat}). 
<<genemat>>=
geneset <- sort(unique(unlist(whichGenes)))
genemat <- matrix("--", 
                  nrow=length(geneset),
                  ncol=length(modlist))
dimnames(genemat) <- list(geneset, modlist)
for (i in 1:length(whichGenes)) {
  genemat[whichGenes[[i]],i] <- "Yes"
}
gcount <- apply(genemat, 1, function(x) sum(x=="Yes"))
genemat <- genemat[rev(order(gcount)),order(ngenes)]
rm(geneset, whichGenes, gcount, ngenes)
@ 

<<echo=FALSE,results=tex>>=
library(xtable)
tab <- xtable(genemat, label="genemat",
              caption=paste("List of which genes are included",
                "in which prognostic models."))
print(tab, type="html", file="modelset.html")
print(tab, size="\\tiny")
@ 


\section{Predicting Survival}

We now check whether any of the existing models are useful for
predicting overall survival from the time of dignosis.  

\subsection{Prognostic scores}

We start by computing the predicted ``prognostic scores'' for each
model.
<<modscores>>=
modscores <- lapply(modlist, function(x) {
  y <- get(x, .GlobalEnv)
  predict(y)
})
dd <- as.data.frame(modscores)
colnames(dd) <- modlist
@ 
Next, we split each set of scores at the median to convert it to a
categorical value.
<<catscores>>=
cutoffs <- apply(dd, 2, median)
catscores <- apply(dd, 2, function(x) {
  factor(ifelse(x > median(x), "HighScore","LowScore"))
})
colnames(catscores) <- paste("Cat", colnames(catscores), sep='.')
@ 
Now we add both the continuous and the categorical scores to the
Card~AB dataset.
<<dataset>>=
datasetAB <- data.frame(cardAB.clinical, dd, catscores)
@ 
We also pause to clean up some of the redundant objects that are
beginning to clutter the R workspace.
<<cleanup>>=
rm(dd, catscores, modscores, dataset, i)#, mod06.sam, mod06s.sam)
rm(opar, myform, myform2, temp)
rm(list=ls(pattern='bma'))
rm(list=ls(pattern='CatScore'))
rm(list=ls(pattern='scores'))
@ 

\subsection{The four non-reduced models}
\label{sec:surv}

\subsubsection{Six genes}
Now we text whether the six-gene model (which was derived to predict
time-to-treatment) is at all useful for predicting overall survival on
the training data. This is not significant as a binary variable, but
it is significant as a continuous variable.
<<os06>>=
coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06, data=datasetAB)
coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06, data=datasetAB)
@ 

\subsubsection{Five genes}
Next, we check the five-gene model.  This turns out to be borderline
significant as a binary variable but highly significant as a
continuous variable.
<<os05>>=
coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod05, data=datasetAB)
coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod05, data=datasetAB)
@ 

\subsubsection{Twelve genes}
Next, we check the twelve-gene model.  This is significant as either a
binary variable or as a continuous variable.
<<os12>>=
coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod12, data=datasetAB)
coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod12, data=datasetAB)
@ 

\subsubsection{Thirteen genes}
Next, we check the thirteen-gene model.  This is borderline signficant
as a bionary variable and significant as a continuous variable.  In
\fref{os.km}, we present Kaplan-Meier plots for all four models.
<<os13>>=
coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod13, data=datasetAB)
coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod13, data=datasetAB)
@ 

\begin{figure}
<<fig=TRUE,echo=FALSE>>=
par(mfrow=c(2,2))
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06, 
             data=datasetAB), col=c("red", "blue"),
     main="OS After Diagnosis (6 gene)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("bottomleft", levels(datasetAB$Cat.mod06), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod05, 
             data=datasetAB), col=c("red", "blue"),
     main="OS After Diagnosis (5 gene)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("bottomleft", levels(datasetAB$Cat.mod05), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod12, 
             data=datasetAB), col=c("red", "blue"),
     main="OS After Diagnosis (12 gene)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("bottomleft", levels(datasetAB$Cat.mod12), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod13, 
             data=datasetAB), col=c("red", "blue"),
     main="OS After Diagnosis (13 gene)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("bottomleft", levels(datasetAB$Cat.mod13), col=c("red", "blue"), lwd=3)
par(mfrow=c(1,1))
@ 
\caption{Kaplan-Meier plots using models that predict
  time-to-treatment to instead predict overall survival.}
\label{os.km}
\end{figure}

\subsubsection{Reduced models}
We also compute the results for various models obtained by reducing
the gene sets using AIC or BIC (\fref{os.km.red}).

<<whatever>>=
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06s, data=datasetAB))
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06b, data=datasetAB))
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06s.sam, data=datasetAB))
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod05s, data=datasetAB))
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06c, data=datasetAB))
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod05, data=datasetAB))
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod12s, data=datasetAB))
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06, data=datasetAB))
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod13s, data=datasetAB))
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod12, data=datasetAB))
summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod13, data=datasetAB))
@ 

\begin{figure}
<<fig=TRUE,echo=FALSE, width=6.5, height=8.5>>=
par(mfrow=c(3,2))
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06b, 
             data=datasetAB), col=c("red", "blue"),
     main="OS After Dx (mod06s; 2 genes)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("bottomleft", levels(datasetAB$Cat.mod06b), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod05s, 
             data=datasetAB), col=c("red", "blue"),
     main="OS After Dx (mod05s; 3 genes)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("bottomleft", levels(datasetAB$Cat.mod05s), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06b, 
             data=datasetAB), col=c("red", "blue"),
     main="OS After Dx (mod06b; 3 genes)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("bottomleft", levels(datasetAB$Cat.mod06b), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06s.sam, 
             data=datasetAB), col=c("red", "blue"),
     main="OS After Dx (mod06s.sam; 3 genes)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("bottomleft", levels(datasetAB$Cat.mod06s.sam), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06c, 
             data=datasetAB), col=c("red", "blue"),
     main="OS After Dx (mod06c; 4 genes)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("bottomleft", levels(datasetAB$Cat.mod06c), col=c("red", "blue"), lwd=3)
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod12s, 
             data=datasetAB), col=c("red", "blue"),
     main="OS After Dx (mod12s; 5 genes)", 
     xlab="Time (months)", ylab="Fraction Surviving")
legend("bottomleft", levels(datasetAB$Cat.mod12s), col=c("red", "blue"), lwd=3)
par(mfrow=c(1,1))
@ 
\caption{Kaplan-Meier plots using models that predict
  time-to-treatment to instead predict overall survival.}
\label{os.km.red}
\end{figure}

\section{Composite Scores and Clinical Variables}

We want to test if the composite scores perform better than---or at
least add to---the existing clinical variables for predicting the time
from diagnosis to first significant treatment.  In a previous report,
we found that the best model using exising clinical variables was:
<<clinical>>=
dx2sig.model
@ 
We update this model to use as many training samples as possible:
<<update>>=
dd <- na.omit(datasetAB[, all.vars(formula(dx2sig.model))])
dx2sig.model <- coxph(formula(dx2sig.model), dd)
@ 


\subsection{Prognostic score from the six-gene model}
\label{sec:prog4}

First, we construct a model that adds the prognostic score from the
reduced five-gene model as a continuous variable along with the
clinical variables.
<<test06>>=
timecol <- "TimeDiagnosis2SigTreat"
statcol <- "NumericSigTreatment"
tt <- attr(terms(dx2sig.model), "term.labels")
dx.set <- na.omit(datasetAB[, c(timecol, statcol, "mod05s", tt)])
myform <- update(formula(dx2sig.model), ~ . + mod05s)
dx2sig.score <- coxph(myform, data=dx.set)
summary(dx2sig.score)
@ 
The overall model is statistically significant, and the prognostic
score appears to be an important factor.  A chi-squared test
based on an analysis-of-variance-like computation shows that the
prognostic score adds independent information over-and-above that
provided by the clinical variables.
<<anova06>>=
anova(dx2sig.score,dx2sig.model)
@ 
Using AIC to select the optimal model, the prognostic score is
retained along with the clinical variables.
<<step06>>=
step(dx2sig.score,trace=0)
@ 
Using BIC and BMA yields the same result.
<<bma06>>=
bma <- bic.surv(myform, data=dx.set, factor.type=TRUE, factor.prior.adjust=TRUE)
summary(bma)
@ 


\subsection{Prognostic score from the five-gene model}

We repeat the first part of this analysis for the score from the
five-gene model.  This also adds to the clinical variables.
<<test05>>=
dx.set <- na.omit(datasetAB[, c(timecol, statcol, "mod05", tt)])
dx2sig.score <- coxph(update(formula(dx2sig.model), ~ . + mod05), data=dx.set)
summary(dx2sig.score)
anova(dx2sig.score,dx2sig.model)
step(dx2sig.score, trace=0)
@ 

\subsection{Prognostic score from the twelve-gene model}

We repeat the first part of this analysis for the score from the
twelve-gene model.  This also performs better than the clinical
variables. 

<<test12>>=
dx.set <- na.omit(datasetAB[, c(timecol, statcol, "mod12", tt)])
dx2sig.score <- coxph(update(formula(dx2sig.model), ~ . + mod12), data=dx.set)
summary(dx2sig.score)
anova(dx2sig.score,dx2sig.model)
step(dx2sig.score, trace=0)
@ 

\subsection{Prognostic score from the thirteen-gene model}

We repeat the first part of this analysis for the score from the
thirteen-gene model.  This also performs better than the clinical
variables.

<<test13>>=
dx.set <- na.omit(datasetAB[, c(timecol, statcol, "mod13", tt)])
dx2sig.score <- coxph(update(formula(dx2sig.model), ~ . + mod13), data=dx.set)
summary(dx2sig.score)
anova(dx2sig.score,dx2sig.model)
step(dx2sig.score, trace=0)
@ 

<<whatever>>=
ping <- paste("mod", c("06s", "06b", "06s.sam", "05s",
                       "06c", "05", "12s", "06",
                       "13s", "12", "13"), sep='')
dx.set <- na.omit(datasetAB[, c(timecol, statcol, ping, tt)])

for (p in ping) {
  newform <- update(formula(dx2sig.model), as.formula(paste("~ . +", p)))
  dx2sig.score <- coxph(newform, data=dx.set)
  summary(dx2sig.score)
  print(anova(dx2sig.score,dx2sig.model))
  step(dx2sig.score, trace=0)
}

@ 

\section{Individual Genes and Clinical Predictors}

Finally, we find out what happens if we use AIC on a model that keeps
the individual genes separate instead of pooling them as a prognostic
score. 
<<>>=
dx.set <- na.omit(data.frame(datasetAB[, c(timecol, statcol, tt)],
                             cardAB.data[, set05]))
myform <- formula(paste("Surv(", timecol, ",", statcol, ") ~ ."))
opto <- step(coxph(myform, data=dx.set), trace=0)
opto
@ 
We see that two of the four genes (SKI and SLAMF1) and none of the
clinical variables survive in the optimal model.  We get the same
result starting with mthe six-gene model:
<<>>=
dx.set <- na.omit(data.frame(datasetAB[, c(timecol, statcol, tt)],
                             cardAB.data[, set06]))
myform <- formula(paste("Surv(", timecol, ",", statcol, ") ~ ."))
opto <- step(coxph(myform, data=dx.set), trace=0)
opto
@ 
or the twelve-gene model:
<<>>=
dx.set <- na.omit(data.frame(datasetAB[, c(timecol, statcol, tt)],
                             cardAB.data[, set12]))
myform <- formula(paste("Surv(", timecol, ",", statcol, ") ~ ."))
opto <- step(coxph(myform, data=dx.set), trace=0)
opto
@ 

<<whatever>>=
ping <- paste("mod", c("06s", "06b", "06s.sam", "05s",
                       "06c", "05", "12s", "06",
                       "13s", "12", "13"), sep='')

tempd <- data.frame(datasetAB, cardAB.data)
for (p in ping) {
  varlist <- union(all.vars(formula(dx2sig.model)), 
                   all.vars(formula(get(p))))
  dx.set <- na.omit(tempd[, varlist])
  newform <- Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ .
  dx2sig.score <- coxph(newform, data=dx.set)
  modf <- step(dx2sig.score, trace=0)
  cat(paste("\n---------\n", p))
  print(modf)
}

@ 


\section{Appendix}

<<write.out>>=
rm(bma, myform, set06, set05, set13, set12)
rm(dx.set, dx2sig.score, timecol, statcol, tt)
save.image(file="predictors.rda")
@ 

This analysis was run in the following directory:
<<getwd>>=
getwd()
@


<<lib2,echo=F>>=
sessionInfo()
@ 

\end{document}