\documentclass[11pt]{article} \usepackage{graphicx} \usepackage{cite} \usepackage{hyperref} \pagestyle{myheadings} \markright{microfluidicsModels07; revised 1 Sept 2010} \setlength{\topmargin}{0in} \setlength{\textheight}{8in} \setlength{\textwidth}{6.5in} \setlength{\oddsidemargin}{0in} \setlength{\evensidemargin}{0in} \def\rcode#1{\texttt{#1}} \def\fref#1{\textbf{Figure~\ref{#1}}} \def\tref#1{\textbf{Table~\ref{#1}}} \def\sref#1{\textbf{Section~\ref{#1}}} \title{Validation of a Two-Gene Prognostic Model in CLL} \author{Kevin R. Coombes} \date{16 July 2010; REVISED: 1 September 2010} \SweaveOpts{prefix.string=Figures/microfluidicsModels07,eps=FALSE} <>= options(width=92) options(SweaveHooks = list(fig = function() par(bg='white')),eps=FALSE) @ <>= if (!file.exists("Figures")) { dir.create("Figures") } @ \begin{document} \maketitle \tableofcontents \section{Executive Summary} \subsection{Introduction} The main goal of the study is to identify changes in gene expression in CLL that are associated with clinical outcome (including overall survival and time-to-treatment). \subsubsection{Aims/Objectives} We want to find individual genes and combinations of genes that are related to prognosis in CLL. Here ``prognosis'' refers to the ability to predict time-to-treatment or overall survival. \subsection{(Statistical) Methods} Exploratory data analyses and summaries. Cox proportional hazards models. Stepwise selection of predictive variables using the Akaike Information Criterion. Bayesian model averaging. \subsection{Results} \begin{itemize} \item We have validated numerous individual preditors of prognosis (pp. 3--7). \item The two-gene model to predict time from diagnosis to first significant treatment is convincingly validated (\fref{mod02}). \item As noted in a previous report, the best clinical model did not validate on this dataset, and so it is not surprising that the two-gene model is a better predictor of prognosis (pp. 11-13). \item The best model (by AIC) on the validation data set that is allowed to incorporate both clincal and gene predictors retains SKI, SLAMF1, and platelet levels (pp. 15--17). \item We produced a series of figures (\fref{mod02.clin} -- \fref{mod02.rai}) that show that the gene score adds value to all of the existing clinical variables that can predict prognosis, with the possible exception of mutation status (\fref{mod02.mut}). \end{itemize} \subsection{Conclusion} It is time to write the manuscript and submit it.... \section{Loading the Data} In the previous report (\rcode{microfluidicsModels06.pdf}), we stored all of the relevant data in a binary R file. We begin by loading that data. <>= load("withmod02.rda") ls() @ \subsection{Preliminaries} As mentioned perviously, we want to perform a variety of time-to-event (``survival'') analyses, with different starting and ending points. The basic code for these analyses is in the following libraries. <>= library(survival) library(BMA) @ \section{Validation of Univariate Predictors} We must first get a list of the genes that are common to both Card~ and Cards~A and~B. <>= commonGenes <- intersect(colnames(cardAB.data), colnames(cardC.data)) @ \subsection{Time to first significant treatment} We have previously computed $p$-values (for each gene based on the training data) for the time to first significant treatment from either diagnosis or sample collection, using either a continuous or a binary predictor. Here we apply the same methods to the validation data. <>= fourp.C <- fourFoldWay(cardC.data, cardC.clinical) colnames(fourp) <- paste("AB", colnames(fourp), sep='.') colnames(fourp.C) <- paste("C", colnames(fourp.C), sep='.') both.ways <- data.frame(fourp[commonGenes,], fourp.C[commonGenes,]) write.csv(both.ways, file="both-ways.csv") @ Now we can list the $p$-values based on continuous predictors, starting at diagnosis, for both the training set and the validation set. <>= temp <- both.ways[, c(4, 12)] temp <- temp[order(temp[,1]),] round(temp[1:20,], 5) @ We can also list the $p$-values based on binary predictors, starting at diagnosis, for both the training set and the validation set. <>= temp <- both.ways[, c(3, 11)] temp <- temp[order(temp[,1]),] round(temp[1:20,], 5) @ Finally, we can list the geometric mean $p$-value over all analyses, for both the training set and the validation set. <>= temp <- both.ways[, c(5, 13)] temp <- temp[order(temp[,1]),] round(temp[1:20,], 5) @ \subsection{Overall survival} We performed a similar analysis for overall survival. <>= oster.C <- fourFoldWay(cardC.data, cardC.clinical) colnames(oster) <- paste("AB", colnames(oster), sep='.') colnames(oster.C) <- paste("C", colnames(oster.C), sep='.') both.os <- data.frame(oster[commonGenes,], oster.C[commonGenes,]) write.csv(both.os, file="both-os.csv") @ Based on continuous predictors, starting at diagnosis. <>= temp <- both.os[, c(4, 12)] temp <- temp[order(temp[,1]),] round(temp[1:20,], 5) @ Based on binary predictors, starting at diagnosis. <>= temp <- both.os[, c(3, 11)] temp <- temp[order(temp[,1]),] round(temp[1:20,], 5) @ Based on the geometric mean $p$-value over all analyses <>= temp <- both.os[, c(5, 13)] temp <- temp[order(temp[,1]),] round(temp[1:20,], 5) @ \section{Computing prognostic scores} We start by computing the predicted ``prognostic scores'' for each model. <>= modscores <- lapply(modlist, function(x) { y <- get(x, .GlobalEnv) predict(y, newdata=as.data.frame(cardC.data)) }) dd <- as.data.frame(modscores) colnames(dd) <- modlist @ Using the training data, we have already split each set of scores at the median to convert it to a categorical value. <>= catscores <- lapply(1:length(modlist), function(i) { x <- dd[,i] cutter <- cutoffs[i] factor(ifelse(x > cutter, "HighScore","LowScore")) }) catscores <- as.data.frame(catscores) colnames(catscores) <- paste("Cat", modlist, sep='.') summary(catscores) @ Now we add both the continuous and the categorical scores to the Card~C dataset. <>= datasetC <- data.frame(cardC.clinical, dd, catscores) @ \section{Validating a two-gene model} We start by trying to validate the model that uses the fewest genes: SKI and SLAMF1. First, we validate the continuous prognostic score in the Card~C data: <>= summary(coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ mod02, data=datasetC)) @ Next, we check that the dichotomous version of this score (with the cutoff learned from the training data) also validates in the Card~C data. A Kaplan-Meier plot comparing the patients with a high score (and thus high risk) to patients with a low score shows that the difference in median time-to-treatment is approximately eight years (\fref{mod02}). <>= summary(coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ Cat.mod02, data=datasetC)) @ <>= plot(survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ Cat.mod02, data=datasetC), col=colorcode, lty=ltype, lwd=2, main="Validation", xlab="Time (months)", ylab="Fraction Untreated") legend("topright", levels(datasetC$Cat.mod02), col=colorcode, lty=ltype, lwd=3) @ <>= colorcode <- "black" ltype <- c("solid", "dashed") png(file="SKI-SLAM-paper/mfc-figure2b.png", width=600, height=600, pointsize=16, bg="white") par(bg="white") <> dev.off() pdf(file="SKI-SLAM-paper/mfc-figure2b.pdf", width=6, height=6, pointsize=12, bg="white") par(bg="white") <> dev.off() @ \begin{figure} <>= colorcode <- c("red", "blue") ltype <- "solid" <> @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02} \end{figure} \subsection{Time from Sample Collection} Next, we test whether the continuous prognostic score predicts time from sample to treatment in the Card~C data: <>= summary(coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ mod02, data=datasetC)) @ Next, we check the dichotomous version of this score in the Card~C data. <>= summary(coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Cat.mod02, data=datasetC)) @ \subsection{Overall Survival After Diagnosis} Next, we test whether the continuous prognostic score predicts overall survival in the Card~C data: <>= summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod02, data=datasetC)) @ Next, we check the dichotomous version of this score in the Card~C data. <>= summary(coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod02, data=datasetC)) @ \section{Comparison with clinical predictors} The next step is to determine if the prognostic score adds anything to the existing clinical variables. Recall that we derived the clinical model previously: <>= dx2sig.model @ \subsection{Continuous clinical and gene-based prognostic scores} There are several alternatives for combining the clinical model with the gene-based model. First, we can compute clinical prognostic scores (using the clinical model learned from the training set) in exactly the same way that we computed gene prognostic scores. <>= cutter <- median(predict(dx2sig.model)) clinscore <- predict(dx2sig.model, newdata=cardC.clinical) cat.clinscore <- factor(ifelse(clinscore > cutter, "HighScore","LowScore")) datasetC <- data.frame(datasetC, clinscore, cat.clinscore) rm(clinscore, cat.clinscore, cutter) @ Now we can validate the clinical prognostic score in the Card~C dataset, as either a continuous variable: <>= nullform <- formula(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ 1) tryclin <- coxph(update(nullform, ~ . + clinscore), data=datasetC) tryclin @ or as a dichotomous variable: <>= trycat <- coxph(update(nullform, ~ . + cat.clinscore), data=datasetC) trycat @ Neither of these clinical scores validate as prognostic on the Card~C dataset. We now add the gene-based (continuous) prognostic score to the (continuous) clinical score. <>= myform <- update(formula(tryclin), ~ . + mod02) tempset <- na.omit(datasetC[,all.vars(myform)]) tempmod <- coxph(myform, data=tempset) tempmod @ Both continuous factors appear to be significant. We can use AIC to see if it eliminates either of the continuous scores from the model. <>= step(tempmod, trace=0) @ Based on AIC, only the gene-based score is useful. BIC/BMA reaches the same conclusion: <>= bma <- bic.surv(myform, data=tempset, factor.type=TRUE, factor.prior.adjust=TRUE) summary(bma) @ Similarly, an explicit statistical test shows that the clinical and gene-based continuous scores give independent prognostic information: <>= anova(tryclin, tempmod) @ \subsection{Refitting the clinical model on the validation dataset} Here we add the gene-based prognostic score to the expanded clinical model (instead of to the clinical score). This requires us to refit the parameters of the clinical model on the validation data. <>= myform <- update(formula(dx2sig.model), ~ . + mod02) tempset <- na.omit(datasetC[,all.vars(myform)]) basemod <- coxph(formula(dx2sig.model), data=tempset) basemod @ Now we use AIC, BIC, and ANOVA as above. <>= tempmod <- coxph(myform, data=tempset) step(tempmod, trace=0) bma <- bic.surv(myform, data=tempset, factor.type=TRUE, factor.prior.adjust=TRUE) summary(bma) anova(basemod, tempmod) @ We conclude, from this persepctive, that the gene-based continuous score is a better predictor than most clinical variables, with independent information arising from the platelet levels. \subsection{Combining individual clinical and gene predictors} Yet another alternative is to combine the individual genes (SKI, NT5C2, CD14, and SLAMF1) with the individual clinical predictors and see which ones give the best results. <>= myform <- formula(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ .) tempnames <- unique(c(all.vars(formula(dx2sig.model)), all.vars(formula(mod02)))) tempset <- na.omit(data.frame(datasetC, cardC.data)[,tempnames]) mod0 <- coxph(myform, data=tempset) mod0 step(mod0, trace=0) bma <- bic.surv(myform, data=tempset, factor.type=TRUE, factor.prior.adjust=TRUE) summary(bma) @ Using this approach, we conclude that two of the four genes (SKI and SLAMF1) and the platelet levels are the significant predictors, with the other factors being redundant. \section{Kaplan-Meier plots and clinical covariates} Finally, in this section we produce Kaplan-Meier plots that compare the dichotomous gene score with various dichotomous clinical variables in order to show that the gene score reallty does add something. The following function will be used repeatedly. <>= baz <- function(facname) { myform <- update(geneform, paste("~ . +", facname)) tempnames <- all.vars(myform) tempset <- na.omit(data.frame(datasetC, cardC.data)[,tempnames]) table(tempset[,facname], tempset$Cat.mod02) } @ <>= geneform <- Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ Cat.mod02 foo <- function(facname, x="topright", y=NULL, dispname=facname) { myform <- update(geneform, paste("~ . +", facname)) tempnames <- all.vars(myform) tempset <- na.omit(data.frame(datasetC, cardC.data)[,tempnames]) model <- coxph(myform, data=tempset) mycols <- c("purple", "blue", "red", "brown") plot(survfit(myform, data=tempset), col=mycols, lwd=2, main=dispname, xlab="Time (months)", ylab="Fraction Untreated") fl <- rep(levels(tempset[, facname]), 2) hl <- rep(c("HighScore", "LowScore"), each=2) mynames <- paste(dispname, " ", fl, ", GPS ", rep(c("High", "Low"), each=2), sep='') tab <- baz(facname) mycounts <- unlist(sapply(1:4, function(i) { tab[fl[i], hl[i]] })) mynames <- paste(mynames, ' (N =', mycounts, ')', sep='') if (is.null(y)) { legend(x, mynames, col=mycols, lwd=3) } else { legend(x, y, mynames, col=mycols, lwd=3) } mod0 <- coxph(geneform, data=tempset) s <- summary(model) anova(mod0, model) # text(0, 0, paste('p =', round(s$sctest[3], 6)), adj=0) invisible(mod0) } @ \begin{figure} <>= foo("cat.clinscore", dispname="Clin") @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02.clin} \end{figure} \begin{figure} <>= foo("Sex") @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02.sex} \end{figure} \begin{figure} <>= foo("mutation.status", dispname="Mutation") @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02.mut} \end{figure} \begin{figure} <>= foo("CatB2M") @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02.b2m} \end{figure} \begin{figure} <>= foo("CatWBC") @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02.wbc} \end{figure} \begin{figure} <>= foo("Zap70Protein", dispname="Zap70", x=110, y=0.95) @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02.zap} \end{figure} \begin{figure} <>= foo("CatCD38") @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02.cd38} \end{figure} \begin{figure} <>= foo("CatCyto") @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02.cyt} \end{figure} \begin{figure} <>= foo("Matutes") @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02.mat} \end{figure} \begin{figure} <>= foo("CatRAI") @ \caption{Kaplan-Meier plot for time from diagnosis to first significant treatment in the validation dataset.} \label{mod02.rai} \end{figure} <>= mypng <- function(n, ...) { png(file=file.path("SKI-SLAM-paper", paste("mfc-fig3-", n, ".png", sep='')), width=600, height=600, bg="white", pointsize=16) foo(n, ...) dev.off() } mypng("cat.clinscore", dispname="Clin") mypng("mutation.status", dispname="Mutation") mypng("CatRAI") mypng("Sex") mypng("CatB2M") mypng("CatWBC") mypng("CatCD38") mypng("Zap70Protein", dispname="Zap70", x=110, y=0.95) mypng("CatCyto") @ We try to make one final giant figure <>= #windows(width=12, height=12) png(file="SKI-SLAM-paper/mfc-figure3.png", width=1440, height=1440, pointsize=20, bg="white") opar <- par(mfrow=c(3,3), bg="white") foo("Sex", dispname="Gender") foo("CatRAI", dispname="Rai Stage") foo("CatWBC", dispname="WBC") foo("CatB2M", dispname="B2M") foo("CatCD38", dispname="CD38") foo("mutation.status", dispname="Mut. Stat.") foo("Zap70Protein", dispname="Zap70") foo("CatCyto", dispname="Cytogenetics") foo("cat.clinscore", dispname="ClinScore") par(opar) dev.off() pdf(file="SKI-SLAM-paper/mfc-figure3.pdf", width=14.40, height=14.40, pointsize=16, bg="white") opar <- par(mfrow=c(3,3), bg="white") foo("Sex", dispname="Gender") foo("CatRAI", dispname="Rai Stage") foo("CatWBC", dispname="WBC") foo("CatB2M", dispname="B2M") foo("CatCD38", dispname="CD38") foo("mutation.status", dispname="Mut. Stat.") foo("Zap70Protein", dispname="Zap70") foo("CatCyto", dispname="Cytogenetics") foo("cat.clinscore", dispname="ClinScore") par(opar) dev.off() @ \section{Appendix} This analysis was run in the following directory: <>= getwd() @ <>= sessionInfo() @ \end{document}