\documentclass[11pt]{article} \usepackage{graphicx} \usepackage{cite} \usepackage{hyperref} \pagestyle{myheadings} \markright{microfluidicsModels05; revised: 1 Sept 10} \setlength{\topmargin}{0in} \setlength{\textheight}{8in} \setlength{\textwidth}{6.5in} \setlength{\oddsidemargin}{0in} \setlength{\evensidemargin}{0in} \def\rcode#1{\texttt{#1}} \def\fref#1{\textbf{Figure~\ref{#1}}} \def\tref#1{\textbf{Table~\ref{#1}}} \def\sref#1{\textbf{Section~\ref{#1}}} \title{Gene-Based Models of Prognosis in CLL} \author{Kevin R. Coombes} \date{15 July 2010; REVISED: 1 September 2010} \SweaveOpts{prefix.string=Figures/microfluidicsModels05,eps=FALSE} <>= options(width=92) options(SweaveHooks = list(fig = function() par(bg='white')),eps=FALSE) @ <>= if (!file.exists("Figures")) { dir.create("Figures") } @ \begin{document} \maketitle \tableofcontents \section{Executive Summary} \subsection{Introduction} The main goal of the study is to identify changes in gene expression in CLL that are associated with clinical outcome (including overall survival and time-to-treatment). \subsubsection{Aims/Objectives} We want to find individual genes and combinations of genes that are related to prognosis in CLL. Here ``prognosis'' refers to the ability to predict time-to-treatment or overall survival. The current analysis looks \textbf{only at the training set}. \subsection{(Statistical) Methods} Exploratory data analyses and summaries. Cox proportional hazards models. Stepwise selection of predictive variables using the Akaike Information Criterion. Bayesian model averaging. \subsection{Results} \begin{itemize} \item By looking at the percentage of times that genes were selected as significant in the cross-validation performed in the previous report, we identified four sets of genes (containing $5$, $6$, $12$, or $13$ genes) that could be used to construct prognostic models (\sref{sec:multi}, \fref{gaps}). \item We constructed Cox proportional hazards to predict time from diagnosis to first signifcant treatment, starting with each of the four gene sets. We then used AIC and BIC to reduce each of these models to ones that used fewer genes. In total, this process produced eleven different models (\tref{genemat}). All the models were highly statistically significant on the training set from Cards~A and~B. \item Each model was used to compute a prognostic score for each patient. These scores tended to be strongly correlated from one model to another (\fref{pairs06}, \fref{pairs.sam}). \item The prognostic scores were evaluated to see if they could also predict the time from sample collection (rather than diagnosis) to treatment. All of the models were highly significant (\fref{samp06}, \fref{samp05}, \fref{samp12}, \fref{samp13}). \item Next, the prognostic scores were evaluated to see if they could predict overall survival after diagnosis. The models were significant as continuous variables, although they were not usually significant as binary variables dichotomized at the median (\sref{sec:surv}, \fref{os.km}, \fref{os.km.red}). \item The prognostic scores were evaluated to determine if they were independent of (or better than) the existing clinical markers of prognosis. (The best clinical model, which used Platelets, log-transofrmed LDH levels, and light chain subtype, was constructed in a previous report.) The prognostic scores were independent of, and thus added to, the clinical variables (\sref{sec:prog4}). \end{itemize} \subsection{Conclusion} We are now ready to try to validate the results on the Card~C dataset. \section{Getting Started} In the previous report (\rcode{microfluidicsModels04.pdf}), we stored all of the relevant data in a binary R file. We begin by loading that data. <>= load("micro04.rda") ls() @ We want to perform a variety of time-to-event (``survival'') analyses, with different starting and ending points. The basic code for these analyses is in the following libraries. <>= library(survival) library(BMA) @ \section{Multivariate Gene Predictors} \label{sec:multi} We want to build multivariate models on the whole training set to predict time-to-treatment (ignoring clinical variables). First, we review how often each gene was found to be significant in a univariate model (during the cross-validation from the previous report). <>= tock <- sig.pred$FirstPass names(tock) <- rownames(sig.pred) firstPass <- sort(tock[fourp$onC==1]) rev(firstPass)[1:15] @ Next, we review how often a gene was selected to remain in a multivariate model. <>= tock <- apply(sig.pred[, 10:11], 1, sum) multiKept <- 0.5*sort(tock[fourp$onC==1]) #$ rev(multiKept)[1:15] @ By plotting the sorted values (\fref{gaps}), we can identify natural gaps that provide potential cutoffs for the number of genes to keep for a (global) multivariate model. Based on this figures, we have gene sets of size $5$, $6$, $12$, or $13$ to try to use to build prognostic models. <>= #windows(width=6.5, height=8) opar <- par(mfrow=c(2,1), mai=c(1, 1, 0.2, 0.2)) plot(rev(firstPass), pch=16, ylab="Probability Significant") abline(v=0.5+c(6, 12), col='grey') plot(rev(multiKept), pch=16, ylab="Probability Retained") abline(v=0.5+c(5, 13), col='grey') par(opar) @ <>= png(file="SKi-SLAM-paper/mfc-figure1cd.png", width=650, height=800, pointsize=16, bg="white") par(bg="white") <> dev.off() pdf(file="SKI-SLAM-paper/mfc-figure1cd.pdf", width=6.50, height=8.00, pointsize=12, bg="white") par(bg="white") <> dev.off() @ \begin{figure} <>= @ \caption{Plot of the number of times genes (top) are called significant in a univariate model or (bottom) retained in a multivariate model. The genes have been sorted separately (and thus differently) in the two plots in order to identify natural gaps.} \label{gaps} \end{figure} Here we store the names of the genes used in each of the four potential sets of predictors into reusable variables. <>= nrs <- names(rev(firstPass)) set06 <- nrs[1:6] set12 <- nrs[1:12] nrs <- names(rev(multiKept)) set05 <- nrs[1:5] set13 <- nrs[1:13] rm(nrs, tock, firstPass, multiKept) @ \subsection{Reducing the six-gene model} In this section, we look at methods to (possibly) refine the six-gene model. First, we compute the full six-gene model to predict time from diagnosis to first significant treatment. <>= myform <- formula(paste("Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~", paste(set06, collapse="+"))) dataset <- data.frame(cardAB.clinical, cardAB.data[, set06]) mod06 <- coxph(myform, dataset) summary(mod06) @ Here we see that the overall six-gene model is highly significant. It also looks like SKI is the gene that contributes the most to the model, while AICDA is the gene that contributes the least. As we have done throughout the feature selection process, we can try using the Akaike Information Criterion (AIC) to remove unnecessary or redundant genes from the model. (The step-wise process works in two directions, so it can change its mind and put factors back into the model after removing them.) <>= mod06s <- step(mod06, trace=0) summary(mod06s) @ Based on all of the data, the optimal model by AIC removes four genes and results in a more statistically significant model. An alternative approach to decide which genes to retain in the model is to use Bayesian model averaging (BMA). <>= bma06 <- bic.surv(myform, data=dataset, factor.type=TRUE, factor.prior.adjust=TRUE) summary(bma06) @ There are three parts to this summary of the results of BMA. The first part gives the posterior probability of each term being included in a final model. Three genes (SKI, NT5C2, and SLAMF1) have posterior probabilities greater than $40\%$, and so we think they should be retained in the final model. The results for CD14, AICDA, and FGL2 are less clear, since the posterior probability is less than $25\%$. The second part gives the posterior estimates of the coefficients for each term in the model. The third part lists the five best models. Based on the Bayes Information Criterion (BIC), the best model only includes the two genes SKI and NT5C2. Significantly, AICDA snd FGL2 are omitted from all five of the best models, which agrees with the findings of the stepwise AIC model. For future reference, we keep the best BIC model with only three genes: <>= mod06b <- coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ SKI + NT5C2 + SLAMF1, dataset) summary(mod06b) @ and with four genes: <>= mod06c <- coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ SKI + NT5C2 + SLAMF1 + CD14, dataset) summary(mod06c) @ For both the six-gene model and the AIC and BIC reduced models, we compute numerical ``prognostic scores'' from the combinations that predict time-to-treatment. A pairwise scatter-plot (\fref{pairs06}) of the scores shows that they are (not surprisingly, since they use some of the same genes) highly correlated, which suggests that all three of the models constructed so far are likely to have similar performance when trying to predict the time from diagnosis to treatment. <>= scores06 <- predict(mod06) scores06s <- predict(mod06s) scores06b <- predict(mod06b) scores06c <- predict(mod06c) @ \begin{figure} <>= pairs(data.frame(SixGene=scores06, AIC6.2=scores06s, BIC6.3=scores06b, BIC6.4=scores06c), pch=16) @ \caption{Pair-wise scatter plots of the prognostic scores from three different models} \label{pairs06} \end{figure} \subsubsection{Six-gene model starting from sample collection} Here we want to look at the model that results when we start at the time of sample collection instead of the time of diagnosis. <>= myform2 <- formula(paste("Surv(TimeSample2SigTreat, NumericSigTreatment) ~", paste(set06, collapse="+"))) dataset <- data.frame(cardAB.clinical, cardAB.data[, set06]) mod06.sam <- coxph(myform2, dataset) mod06s.sam <- step(mod06.sam, trace=0) bma06.sam <- bic.surv(myform2, data=dataset, factor.type=TRUE, factor.prior.adjust=TRUE) @ The full six-gene model has coefficients that are quite similar to the coefficients for the model that predicts time from diagnosis to treatment. <>= summary(mod06.sam) @ The model selected by AIC contains a different set of three genes, with CD14 being the most significant. <>= summary(mod06s.sam) @ The model selected by BIC/BMA picks the same three genes as the AIC model. <>= summary(bma06.sam) @ We also compute ``prognostic scores'' using the sample-to-treatment models. A pairwise scatter plot (\fref{pairs.sam}) again shows that the scores are very highly correlated. <>= scores06.sam <- predict(mod06.sam) scores06s.sam <- predict(mod06s.sam) CatScore06 <- factor(ifelse(scores06 > median(scores06), "HighScore","LowScore")) @ \begin{figure} <>= pairs(data.frame(scores06, scores06s, scores06.sam, scores06s.sam), pch=16) @ \caption{Pair-wise scatter plots of the prognostic scores from two models starting at diagnosis and two models starting at sample collection. } \label{pairs.sam} \end{figure} Our basic conclusion is that it may not matter if we use all six genes or only two or three of them, and that similar models will work reasonably well to predict time from diagnosis or from sample collection to treatment. To test this idea more thoroughly, we can use a Cox proportional hazards model to test how well the ``diagnosis to treatment prognostic score'' predicts the time from sample to treatment. By splitting the prognostic score at the median, we can also prepare Kaplan-Meier plots to see how well this distinguishes between ptients needing early treatment and patients who will not need treatment for a long time (\fref{samp06}). <>= temp <- data.frame(cardAB.clinical, Score06=scores06, CatScore06=CatScore06) coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Score06, data=temp) coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore06, data=temp) @ \begin{figure} <>= #windows(width=5,height=8) par(mfrow=c(2,1)) plot(survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ CatScore06, data=temp), col=c("red", "blue"), lwd=2, main="Time Diagnosis to Treatment (six genes)", xlab="Time (months)", ylab="Fraction Surviving") legend("topright", levels(temp$CatScore06), col=c("red", "blue"), lwd=3) plot(survfit(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore06, data=temp), col=c("red", "blue"), lwd=2, main="Time Sample to Treatment (six genes)", xlab="Time (months)", ylab="Fraction Untreated") legend("topright", levels(temp$CatScore06), col=c("red", "blue"), lwd=3) par(mfrow=c(1,1)) @ \caption{Kaplan-Meier plot of time from (top) diagnosis or (bottom) sample collection to treatment, using a \textbf{six-gene} prognostic score developed to predict time from diagnosis to treatment. } \label{samp06} \end{figure} Because the model starting from diagnosis works so well, we will not (in the future) bother to construct separate models for the time from sample collection to treatment. \subsection{Reducing the five-gene model} We now repeat the previous analysis using the five-gene model. We do not construct separate models to predict sample-to-treatment; instead, we simply check that the ``diagnostic-to-treatment'' model works for this purpose (\fref{samp05}). <>= myform <- formula(paste("Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~", paste(set05, collapse="+"))) dataset <- data.frame(cardAB.clinical, cardAB.data[, set05]) mod05 <- coxph(myform, dataset) mod05s <- step(mod05, trace=0) bma05 <- bic.surv(myform, data=dataset, factor.type=TRUE, factor.prior.adjust=TRUE) summary(bma05) summary(mod05s) summary(mod05) scores05 <- predict(mod05) scores05s <- predict(mod05s) CatScore05 <- factor(ifelse(scores05 > median(scores05), "HighScore","LowScore")) @ <>= temp <- data.frame(cardAB.clinical, Score05=scores05, CatScore05=CatScore05) coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Score05, data=temp) coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore05, data=temp) @ \begin{figure} <>= par(mfrow=c(2,1)) plot(survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ CatScore05, data=temp), col=c("red", "blue"), lwd=2, main="Time Diagnosis to Treatment (five genes)", xlab="Time (months)", ylab="Fraction Surviving") legend("topright", levels(temp$CatScore05), col=c("red", "blue"), lwd=3) plot(survfit(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore05, data=temp), col=c("red", "blue"), lwd=2, main="Time Sample to Treatment (five genes)", xlab="Time (months)", ylab="Fraction Untreated") legend("topright", levels(temp$CatScore05), col=c("red", "blue"), lwd=3) par(mfrow=c(1,1)) @ \caption{Kaplan-Meier plot of time from (top) diagnosis or (bottom) sample collection to treatment, using a \textbf{five-gene} prognostic score developed to predict time from diagnosis to treatment. } \label{samp05} \end{figure} \subsection{Reducing the twelve-gene model} We now repeat the previous analysis using the twelve-gene model. We do not construct separate models to predict sample-to-treatment; instead, we simply check that the ``diagnostic-to-treatment'' model works for this purpose (\fref{samp12}). <>= myform <- formula(paste("Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~", paste(set12, collapse="+"))) dataset <- data.frame(cardAB.clinical, cardAB.data[, set12]) mod12 <- coxph(myform, dataset) mod12s <- step(mod12, trace=0) bma12 <- bic.surv(myform, data=dataset, factor.type=TRUE, factor.prior.adjust=TRUE) summary(bma12) summary(mod12s) summary(mod12) scores12 <- predict(mod12) scores12s <- predict(mod12s) CatScore12 <- factor(ifelse(scores12 > median(scores12), "HighScore","LowScore")) @ <>= temp <- data.frame(cardAB.clinical, Score12=scores12, CatScore12=CatScore12) coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Score12, data=temp) coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore12, data=temp) @ \begin{figure} <>= par(mfrow=c(2,1)) plot(survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ CatScore12, data=temp), col=c("red", "blue"), lwd=2, main="Time Diagnosis to Treatment (12 genes)", xlab="Time (months)", ylab="Fraction Surviving") legend("topright", levels(temp$CatScore12), col=c("red", "blue"), lwd=3) plot(survfit(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore12, data=temp), col=c("red", "blue"),, lwd=2, main="Time Sample to Treatment (12 genes)", xlab="Time (months)", ylab="Fraction Untreated") legend("topright", levels(temp$CatScore12), col=c("red", "blue"), lwd=3) par(mfrow=c(1,1)) @ \caption{Kaplan-Meier plot of time from (top) diagnosis or (bottom) sample collection to treatment, using a \textbf{twelve-gene} prognostic score developed to predict time from diagnosis to treatment. } \label{samp12} \end{figure} \subsection{Reducing the thirteen-gene model} We now repeat the previous analysis using the thirteen-gene model. We do not construct separate models to predict sample-to-treatment; instead, we simply check that the ``diagnostic-to-treatment'' model works for this purpose (\fref{samp13}). <>= myform <- formula(paste("Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~", paste(set13, collapse="+"))) dataset <- data.frame(cardAB.clinical, cardAB.data[, set13]) mod13 <- coxph(myform, dataset) mod13s <- step(mod13, trace=0) bma13 <- bic.surv(myform, data=dataset, factor.type=TRUE, factor.prior.adjust=TRUE) summary(bma13) summary(mod13s) summary(mod13) scores13 <- predict(mod13) scores13s <- predict(mod13s) CatScore13 <- factor(ifelse(scores13 > median(scores13), "HighScore","LowScore")) @ <>= temp <- data.frame(cardAB.clinical, Score13=scores13, CatScore13=CatScore13) coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Score13, data=temp) coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore13, data=temp) @ \begin{figure} <>= par(mfrow=c(2,1)) plot(survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ CatScore13, data=temp), col=c("red", "blue"), lwd=2, main="Time Diagnosis to Treatment (13 genes)", xlab="Time (months)", ylab="Fraction Surviving") legend("topright", levels(temp$CatScore13), col=c("red", "blue"), lwd=3) plot(survfit(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ CatScore13, data=temp), col=c("red", "blue"), lwd=2, main="Time Sample to Treatment (13 genes)", xlab="Time (months)", ylab="Fraction Untreated") legend("topright", levels(temp$CatScore13), col=c("red", "blue"), lwd=3) par(mfrow=c(1,1)) @ \caption{Kaplan-Meier plot of time from (top) diagnosis or (bottom) sample collection to treatment, using a \textbf{thirteen-gene} prognostic score developed to predict time from diagnosis to treatment. } \label{samp13} \end{figure} \subsection{Listing all the prognostic models} Here is a list of the models we have constructed to predict the time from diagnosis to treatment: <>= modlist <- c("mod06", "mod06s", "mod06b", "mod06c", "mod06s.sam", "mod05", "mod05s", "mod12", "mod12s", "mod13", "mod13s") @ We now want to determine how many genes included in which model. <>= whichGenes <- lapply(modlist, function(x) { y <- get(x, .GlobalEnv) attr(terms(y), "term.labels") }) ngenes <- unlist(lapply(whichGenes, length)) names(ngenes) <- modlist ngenes @ Finally, we can prepare a table that indicates whether a genes is or is not included in each of the five models (\tref{genemat}). <>= geneset <- sort(unique(unlist(whichGenes))) genemat <- matrix("--", nrow=length(geneset), ncol=length(modlist)) dimnames(genemat) <- list(geneset, modlist) for (i in 1:length(whichGenes)) { genemat[whichGenes[[i]],i] <- "Yes" } gcount <- apply(genemat, 1, function(x) sum(x=="Yes")) genemat <- genemat[rev(order(gcount)),order(ngenes)] rm(geneset, whichGenes, gcount, ngenes) @ <>= library(xtable) tab <- xtable(genemat, label="genemat", caption=paste("List of which genes are included", "in which prognostic models.")) print(tab, type="html", file="modelset.html") print(tab, size="\\tiny") @ \section{Predicting Survival} We now check whether any of the existing models are useful for predicting overall survival from the time of dignosis. \subsection{Prognostic scores} We start by computing the predicted ``prognostic scores'' for each model. <>= modscores <- lapply(modlist, function(x) { y <- get(x, .GlobalEnv) predict(y) }) dd <- as.data.frame(modscores) colnames(dd) <- modlist @ Next, we split each set of scores at the median to convert it to a categorical value. <>= cutoffs <- apply(dd, 2, median) catscores <- apply(dd, 2, function(x) { factor(ifelse(x > median(x), "HighScore","LowScore")) }) colnames(catscores) <- paste("Cat", colnames(catscores), sep='.') @ Now we add both the continuous and the categorical scores to the Card~AB dataset. <>= datasetAB <- data.frame(cardAB.clinical, dd, catscores) @ We also pause to clean up some of the redundant objects that are beginning to clutter the R workspace. <>= rm(dd, catscores, modscores, dataset, i)#, mod06.sam, mod06s.sam) rm(opar, myform, myform2, temp) rm(list=ls(pattern='bma')) rm(list=ls(pattern='CatScore')) rm(list=ls(pattern='scores')) @ \subsection{The four non-reduced models} \label{sec:surv} \subsubsection{Six genes} Now we text whether the six-gene model (which was derived to predict time-to-treatment) is at all useful for predicting overall survival on the training data. This is not significant as a binary variable, but it is significant as a continuous variable. <>= coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06, data=datasetAB) coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06, data=datasetAB) @ \subsubsection{Five genes} Next, we check the five-gene model. This turns out to be borderline significant as a binary variable but highly significant as a continuous variable. <>= coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod05, data=datasetAB) coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod05, data=datasetAB) @ \subsubsection{Twelve genes} Next, we check the twelve-gene model. This is significant as either a binary variable or as a continuous variable. <>= coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod12, data=datasetAB) coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod12, data=datasetAB) @ \subsubsection{Thirteen genes} Next, we check the thirteen-gene model. This is borderline signficant as a bionary variable and significant as a continuous variable. In \fref{os.km}, we present Kaplan-Meier plots for all four models. <>= coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod13, data=datasetAB) coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod13, data=datasetAB) @ \begin{figure} <>= par(mfrow=c(2,2)) plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06, data=datasetAB), col=c("red", "blue"), main="OS After Diagnosis (6 gene)", xlab="Time (months)", ylab="Fraction Surviving") legend("bottomleft", levels(datasetAB$Cat.mod06), col=c("red", "blue"), lwd=3) plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod05, data=datasetAB), col=c("red", "blue"), main="OS After Diagnosis (5 gene)", xlab="Time (months)", ylab="Fraction Surviving") legend("bottomleft", levels(datasetAB$Cat.mod05), col=c("red", "blue"), lwd=3) plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod12, data=datasetAB), col=c("red", "blue"), main="OS After Diagnosis (12 gene)", xlab="Time (months)", ylab="Fraction Surviving") legend("bottomleft", levels(datasetAB$Cat.mod12), col=c("red", "blue"), lwd=3) plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod13, data=datasetAB), col=c("red", "blue"), main="OS After Diagnosis (13 gene)", xlab="Time (months)", ylab="Fraction Surviving") legend("bottomleft", levels(datasetAB$Cat.mod13), col=c("red", "blue"), lwd=3) par(mfrow=c(1,1)) @ \caption{Kaplan-Meier plots using models that predict time-to-treatment to instead predict overall survival.} \label{os.km} \end{figure} \subsubsection{Reduced models} We also compute the results for various models obtained by reducing the gene sets using AIC or BIC (\fref{os.km.red}). <>= summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06s, data=datasetAB)) summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06b, data=datasetAB)) summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06s.sam, data=datasetAB)) summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod05s, data=datasetAB)) summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06c, data=datasetAB)) summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod05, data=datasetAB)) summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod12s, data=datasetAB)) summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod06, data=datasetAB)) summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod13s, data=datasetAB)) summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod12, data=datasetAB)) summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod13, data=datasetAB)) @ \begin{figure} <>= par(mfrow=c(3,2)) plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06b, data=datasetAB), col=c("red", "blue"), main="OS After Dx (mod06s; 2 genes)", xlab="Time (months)", ylab="Fraction Surviving") legend("bottomleft", levels(datasetAB$Cat.mod06b), col=c("red", "blue"), lwd=3) plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod05s, data=datasetAB), col=c("red", "blue"), main="OS After Dx (mod05s; 3 genes)", xlab="Time (months)", ylab="Fraction Surviving") legend("bottomleft", levels(datasetAB$Cat.mod05s), col=c("red", "blue"), lwd=3) plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06b, data=datasetAB), col=c("red", "blue"), main="OS After Dx (mod06b; 3 genes)", xlab="Time (months)", ylab="Fraction Surviving") legend("bottomleft", levels(datasetAB$Cat.mod06b), col=c("red", "blue"), lwd=3) plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06s.sam, data=datasetAB), col=c("red", "blue"), main="OS After Dx (mod06s.sam; 3 genes)", xlab="Time (months)", ylab="Fraction Surviving") legend("bottomleft", levels(datasetAB$Cat.mod06s.sam), col=c("red", "blue"), lwd=3) plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod06c, data=datasetAB), col=c("red", "blue"), main="OS After Dx (mod06c; 4 genes)", xlab="Time (months)", ylab="Fraction Surviving") legend("bottomleft", levels(datasetAB$Cat.mod06c), col=c("red", "blue"), lwd=3) plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod12s, data=datasetAB), col=c("red", "blue"), main="OS After Dx (mod12s; 5 genes)", xlab="Time (months)", ylab="Fraction Surviving") legend("bottomleft", levels(datasetAB$Cat.mod12s), col=c("red", "blue"), lwd=3) par(mfrow=c(1,1)) @ \caption{Kaplan-Meier plots using models that predict time-to-treatment to instead predict overall survival.} \label{os.km.red} \end{figure} \section{Composite Scores and Clinical Variables} We want to test if the composite scores perform better than---or at least add to---the existing clinical variables for predicting the time from diagnosis to first significant treatment. In a previous report, we found that the best model using exising clinical variables was: <>= dx2sig.model @ We update this model to use as many training samples as possible: <>= dd <- na.omit(datasetAB[, all.vars(formula(dx2sig.model))]) dx2sig.model <- coxph(formula(dx2sig.model), dd) @ \subsection{Prognostic score from the six-gene model} \label{sec:prog4} First, we construct a model that adds the prognostic score from the reduced five-gene model as a continuous variable along with the clinical variables. <>= timecol <- "TimeDiagnosis2SigTreat" statcol <- "NumericSigTreatment" tt <- attr(terms(dx2sig.model), "term.labels") dx.set <- na.omit(datasetAB[, c(timecol, statcol, "mod05s", tt)]) myform <- update(formula(dx2sig.model), ~ . + mod05s) dx2sig.score <- coxph(myform, data=dx.set) summary(dx2sig.score) @ The overall model is statistically significant, and the prognostic score appears to be an important factor. A chi-squared test based on an analysis-of-variance-like computation shows that the prognostic score adds independent information over-and-above that provided by the clinical variables. <>= anova(dx2sig.score,dx2sig.model) @ Using AIC to select the optimal model, the prognostic score is retained along with the clinical variables. <>= step(dx2sig.score,trace=0) @ Using BIC and BMA yields the same result. <>= bma <- bic.surv(myform, data=dx.set, factor.type=TRUE, factor.prior.adjust=TRUE) summary(bma) @ \subsection{Prognostic score from the five-gene model} We repeat the first part of this analysis for the score from the five-gene model. This also adds to the clinical variables. <>= dx.set <- na.omit(datasetAB[, c(timecol, statcol, "mod05", tt)]) dx2sig.score <- coxph(update(formula(dx2sig.model), ~ . + mod05), data=dx.set) summary(dx2sig.score) anova(dx2sig.score,dx2sig.model) step(dx2sig.score, trace=0) @ \subsection{Prognostic score from the twelve-gene model} We repeat the first part of this analysis for the score from the twelve-gene model. This also performs better than the clinical variables. <>= dx.set <- na.omit(datasetAB[, c(timecol, statcol, "mod12", tt)]) dx2sig.score <- coxph(update(formula(dx2sig.model), ~ . + mod12), data=dx.set) summary(dx2sig.score) anova(dx2sig.score,dx2sig.model) step(dx2sig.score, trace=0) @ \subsection{Prognostic score from the thirteen-gene model} We repeat the first part of this analysis for the score from the thirteen-gene model. This also performs better than the clinical variables. <>= dx.set <- na.omit(datasetAB[, c(timecol, statcol, "mod13", tt)]) dx2sig.score <- coxph(update(formula(dx2sig.model), ~ . + mod13), data=dx.set) summary(dx2sig.score) anova(dx2sig.score,dx2sig.model) step(dx2sig.score, trace=0) @ <>= ping <- paste("mod", c("06s", "06b", "06s.sam", "05s", "06c", "05", "12s", "06", "13s", "12", "13"), sep='') dx.set <- na.omit(datasetAB[, c(timecol, statcol, ping, tt)]) for (p in ping) { newform <- update(formula(dx2sig.model), as.formula(paste("~ . +", p))) dx2sig.score <- coxph(newform, data=dx.set) summary(dx2sig.score) print(anova(dx2sig.score,dx2sig.model)) step(dx2sig.score, trace=0) } @ \section{Individual Genes and Clinical Predictors} Finally, we find out what happens if we use AIC on a model that keeps the individual genes separate instead of pooling them as a prognostic score. <<>>= dx.set <- na.omit(data.frame(datasetAB[, c(timecol, statcol, tt)], cardAB.data[, set05])) myform <- formula(paste("Surv(", timecol, ",", statcol, ") ~ .")) opto <- step(coxph(myform, data=dx.set), trace=0) opto @ We see that two of the four genes (SKI and SLAMF1) and none of the clinical variables survive in the optimal model. We get the same result starting with mthe six-gene model: <<>>= dx.set <- na.omit(data.frame(datasetAB[, c(timecol, statcol, tt)], cardAB.data[, set06])) myform <- formula(paste("Surv(", timecol, ",", statcol, ") ~ .")) opto <- step(coxph(myform, data=dx.set), trace=0) opto @ or the twelve-gene model: <<>>= dx.set <- na.omit(data.frame(datasetAB[, c(timecol, statcol, tt)], cardAB.data[, set12])) myform <- formula(paste("Surv(", timecol, ",", statcol, ") ~ .")) opto <- step(coxph(myform, data=dx.set), trace=0) opto @ <>= ping <- paste("mod", c("06s", "06b", "06s.sam", "05s", "06c", "05", "12s", "06", "13s", "12", "13"), sep='') tempd <- data.frame(datasetAB, cardAB.data) for (p in ping) { varlist <- union(all.vars(formula(dx2sig.model)), all.vars(formula(get(p)))) dx.set <- na.omit(tempd[, varlist]) newform <- Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ . dx2sig.score <- coxph(newform, data=dx.set) modf <- step(dx2sig.score, trace=0) cat(paste("\n---------\n", p)) print(modf) } @ \section{Appendix} <>= rm(bma, myform, set06, set05, set13, set12) rm(dx.set, dx2sig.score, timecol, statcol, tt) save.image(file="predictors.rda") @ This analysis was run in the following directory: <>= getwd() @ <>= sessionInfo() @ \end{document}