\documentclass[11pt]{article} \usepackage{graphicx} \usepackage{cite} \usepackage{hyperref} \pagestyle{myheadings} \markright{microfluidicsModels02; revised: 2 Aug 2010} \setlength{\topmargin}{0in} \setlength{\textheight}{8in} \setlength{\textwidth}{6.5in} \setlength{\oddsidemargin}{0in} \setlength{\evensidemargin}{0in} \def\rcode#1{\texttt{#1}} \def\fref#1{Figure~\ref{#1}} \def\tref#1{Table~\ref{#1}} \title{Clinical Predictors of Survival and Time to Treatment in CLL} \author{Kevin R. Coombes} \date{23 June 2010; Revised: 2 Aug 2010} \SweaveOpts{prefix.string=Figures/microfluidicsModels02,eps=FALSE} <>= options(width=92) options(SweaveHooks = list(fig = function() par(bg='white')),eps=FALSE) @ <>= if (!file.exists("Figures")) { dir.create("Figures") } @ \begin{document} \maketitle \tableofcontents \section{Executive Summary} \subsection{Introduction} The main goal of the study is to identify changes in gene expression in CLL that are associated with clinical outcome (including overall survival and time-to-treatment). \subsubsection{Aims/Objectives} We want to find individual genes and combinations of genes that are related to prognosis in CLL. Here ``prognosis'' refers to the ability to predict time-to-treatment or overall suvival. \subsection{(Statistical) Methods} Exploratory data analyses and summaries. Cox proportional hazards models. Stepwise selection of predictive variables using the Akaike Information Criterion. Bayesian model averaging. \subsection{Results} We identified the best clinical models to predict time-to-event outcomes in the training set of $67$ samples assayed on Cards~A and~B. Specifically: \begin{itemize} \item The best model to predict the time from diagnosis to first significant treatment (\texttt{dx2sig.model}) incorporates \texttt{LogLDH}, \texttt{Light.chain.subtype}, and \texttt{Platelets}. \item The best model to predict the time from sample collection to first significant treatment (\texttt{sam2sig.model}) incorporates: \texttt{mutation.status}, \texttt{light.chain.subtype}, \texttt{Platelets}, \texttt{CatWBC}, and \texttt{LogB2M}. \item The best model to predict overall survival starting at the time of diagnosis (\texttt{dx2os.model}) incorporates \texttt{AgeAtDx} and \texttt{mutation.status}. \item The best model to predict overall survival starting at the time of sample collection (\texttt{sam2os.model}) incorporates \texttt{AgeAtDx}, \texttt{mutation.status}, and \texttt{LogB2M}. \end{itemize} \subsection{Conclusion} With the best clinical models saved as re-usable predictive models, we are ready to assess whether the QRT-PCR microfluifdics data can add anything to the accuracy of these predictions. \section{Loading the Data} We met and agreed that time-to-event analyses should be performed with two different starting points: \begin{itemize} \item Diagnosis \item Sample collection \end{itemize} and two different end points: \begin{itemize} \item First significant treatment \item Death \end{itemize} In the previous report (\rcode{microfluidicsModels.pdf}), we found that about $60\%$ of the samples in the validation (Card~C) dataset were treated within two days of sample acquisition. However, many of these treatments were simple rituximab-based treatments that we are calling ``not significant''. As a consequence, we have dropped the endpoint of ``first treatment of any kind'' from the analysis. We also stored all of the relevant data in a binary R file. We begin by loading that data. <>= load("micro01.rda") ls() @ \subsection{Preliminaries} As mentioned above, we want to perform a variety of time-to-event (``survival'') analyses, with different starting and ending points. The basic code for these analyses is in the following libraries. <>= library(survival) library(BMA) @ \section{Existing Markers} Our first task is to analyze the clinical and laboratory parameters (before incorporating high throughput data) in order to determine the best model that we can construct from pre-existing markers. As noted above, we have to construct four such models. Here is a list of the clinical predictors that we will explore. <>= cardAB.predictors <- colnames(cardAB.clinical)[keepers <- c(6,8, 11,48, 12,27, 13,28, 14:19, 26, 29:33)] cardAB.predictors @ \section{Residuals} We are going to check the continuous predioctors to decide if they need to be log-transformed before being incorporated into Cox proportional hazards models. The following function produces diagnostic plots using different notions of the residulas from the Cox model; we store a set of plots for each continuous predictor, both with and without log transformation. <>= testfigs <- function(X, xname, timeName, statusName, dataset=cardAB.clinical) { myform <- formula(paste("Surv(",timeName, ",", statusName, ") ~ 1")) nullmod <- coxph(myform, dataset) nullmart <- residuals(nullmod, type="martingale") myform <- formula(paste("Surv(",timeName, ",", statusName, ") ~", xname)) ds <- data.frame(dataset[, c(timeName, statusName)], X) colnames(ds)[3] <- xname model <- try(coxph(myform, data=ds)) if (inherits(model, "try-error")) { return(model) } Y <- coef(model)*X + nullmart cz <- cox.zph(model) nax <- is.na(X) X <- X[!nax] Y <- Y[!nax] nm <- nullmart[!nax] this.is.stupid <- as.character(myform) unformed <- paste(this.is.stupid[c(2, 1, 3)], collapse=' ') par(mfrow=c(2,2), bg='white') plot(nm, X, pch=16, ylab= xname, xlab="Null Martingale Residuals", main=unformed) abline(h=median(X)) lines(lowess(nm, X), col='red', lwd=2) plot(X, Y, pch=16, main="Test of Linearity", ylab="component + residual", xlab=xname) abline(coef(lm(Y ~ X))) lines(lowess(X, Y), col='red', lwd=2) plot(dfb <- resid(model, type='dfbeta'), pch=16, ylab=paste("dfbeta residuals of",xname), main="Test for Influential Samples") abline(h=0) abline(h=c(-2, -1, 1, 2)*coef(model), col='blue') plot(cz, main="Test of Proportional Hazards") abline(h=0, col="#666666") abline(h=coef(model), col='blue', lwd=2) par(mfrow=c(1,1)) } @ \subsection{Null martingale residuals: diagnosis to treatment} In this section, we prepare the plots associated with predicting time from diagnosis to first significant treatment. We illustrate this process with the plots for platelets (\fref{platelet}). The panel in the upper left shows the relationship between the Martingale residuals from a null model and the actual platelet levels; this figure gives an indication of whether the predictor can be incorporated as a linear term. The panel in the upper right relates the platelet count to the ``model plus residuals'', and should also follow a straight line. The lower left panel shows the influence of individual samples on the coefficient in the model. The panel in the lower right is a test of whether the coefficient (which is an estimate of the hazard ratio) appears to vary over time; under the proporional hazrad assumption at the core of the Cox model, this should follow a horizontal line. Note that the dashed lines give time-varying $95\%$ confidence intervals. <>= timecol <- "TimeDiagnosis2SigTreat" statcol <- "NumericSigTreatment" @ <>= cont.predictors <- cardAB.predictors[c(5, 7, 9:12, 19)] clinfigs <- file.path("Figures", "Clinical") if(!file.exists(clinfigs)) dir.create(clinfigs) subdir <- file.path(clinfigs, "Base") if (!file.exists(subdir)) dir.create(subdir) for (mypred in cont.predictors) { X <- cardAB.clinical[,mypred] testfigs(X, mypred, timecol, statcol) dev.copy(png, file.path(subdir, paste(mypred, "png", sep='.')), width=1200, height=1200, bg='white') dev.off() } subdir <- file.path(clinfigs, "Log") if (!file.exists(subdir)) dir.create(subdir) for (mypred in cont.predictors) { X <- cardAB.clinical[,mypred] testfigs(log(X), paste("log.", mypred, sep=''), timecol, statcol) dev.copy(png, file.path(subdir, paste(mypred, "png", sep='.')), width=1200, height=1200, bg='white') dev.off() } @ \begin{figure} \includegraphics[width=6.5in]{Figures/Clinical/Base/Platelets} \caption{Residual plots for predicting time to treatment using platelet levels.} \label{platelet} \end{figure} <>= lp <- function(p) log(p/(1-p)) par(ask=T) for (mypred in cont.predictors) { X <- cardAB.clinical[, mypred] par(mfrow=c(2,1)) hist(X, main=mypred) if (any(X==0, na.rm=TRUE)) { LP <- lp((0.1+X)/100) hist(LP, main=paste("logistic of", mypred)) } else { LX <- log(X) hist(LX,main=paste("log of", mypred)) } par(mfrow=c(1,1)) } par(ask=F) @ \subsection{Null martingale residuals: overall survival after diagnosis} Now we compute the analogous figures for models to predict overall survival. <>= timecol <- "OSAfterDiagnosis" statcol <- "NumericVitalStatus" @ <>= subdir <- file.path(clinfigs, "OS-Base") if (!file.exists(subdir)) dir.create(subdir) for (mypred in cont.predictors) { X <- cardAB.clinical[,mypred] testfigs(X, mypred, timecol, statcol) dev.copy(png, file.path(subdir, paste(mypred, "png", sep='.')), width=1200, height=1200, bg='white') dev.off() } subdir <- file.path(clinfigs, "OS-Log") if (!file.exists(subdir)) dir.create(subdir) for (mypred in cont.predictors) { X <- cardAB.clinical[,mypred] testfigs(log(X), paste("log.", mypred, sep=''), timecol, statcol) dev.copy(png, file.path(subdir, paste(mypred, "png", sep='.')), width=1200, height=1200, bg='white') dev.off() } @ After reviewing the figures, we concluded that three variables should be log-transformed for inclusion in predictive models: \begin{enumerate} \item Serum beta-2 microglobulin \item Serum LDH \item White blood count \end{enumerate} <>= cardAB.predictors <- colnames(cardAB.clinical)[keepers <- c(6,8, 11,48, 49,27, 51,28, 14:16, 50, 18:19, 26, 29:33)] cardAB.predictors @ \section{Diagnosis to Significant Treatment} We start by selecting the time-to-event variables in which we are interested. <>= timecol <- "TimeDiagnosis2SigTreat" statcol <- "NumericSigTreatment" @ Next, we perform univariate analysis with each clinical predictor. <>= unip <- rep(NA, length(cardAB.predictors)) names(unip) <- cardAB.predictors dataset <- cardAB.clinical[, c(timecol, statcol, cardAB.predictors)] for (who in cardAB.predictors) { myform <- formula(paste("Surv(",timecol, ",", statcol, ") ~ ", who)) s <- summary(coxph(myform, dataset)) unip[who] <- s$logtest[3] } round(sort(unip), 4) @ Now we only keep predictors that are (at least marginally) significant in the univariate setting. Here we use the criterion that the $p$-value must be less than $0.15$. <>= filter.predictors <- names(unip)[unip < 0.15] dataset <- cardAB.clinical[, c(timecol, statcol, filter.predictors)] nad <- na.omit(dataset) @ Since some of the clinical columns have missing values and the variable selection algorithms behave badly in the face of missing data, we have to omit any samples with missing data for the variables that passed filtering. This operation can substantially reduce the number of samples. <>= dim(dataset) dim(nad) @ We use the Akaike Information Criterion to stepwise add or subtract predictive variables. We can use this in a primarily top-down fashion (starting with all variables) or a bottom-up fashion (starting with no variables). Both directions allow you to change your mind along the way. <>= myform <- formula(paste("Surv(",timecol, ",", statcol, ") ~ 1")) myform2 <- formula(paste("Surv(",timecol, ",", statcol, ") ~ ", paste(filter.predictors, collapse="+"))) bottomup <- step(coxph(myform, nad), scope=myform2, trace=0) topdown <- step(coxph(myform2, nad), trace=0) @ An alternative method for selecting variables is to use Bayesian model averaging. <>= test.bma <- bic.surv(myform2, data=nad, factor.type=TRUE, factor.prior.adjust=TRUE) test.bma @ <>= bottomup topdown @ Finally, we include anything that was selected by any one of the three methods into a common model. <>= possible <- unique(c(test.bma$namesx[test.bma$probne0 > 20], attr(terms(topdown), "term.labels"), attr(terms(bottomup), "term.labels"))) myform3 <- formula(paste("Surv(",timecol, ",", statcol, ") ~ ", paste(possible, collapse="+"))) indy <- na.omit(dataset[, c(timecol, statcol, possible)]) midground <- coxph(myform3, indy) midground @ And we use a step-down method on that model. <>= final.model <- step(midground, trace=0) final.model @ <>= dx2sig.model <- final.model dx2sig.model @ \section{Time from Sample to First Significant Treatment} We repeat exactly the same analysis, using sample collection instead of diagnosis as the starting point. <>= timecol <- "TimeSample2SigTreat" statcol <- "NumericSigTreatment" <> <> <> <> <> @ Here are the ``bottom-up'' and ``top-down'' models. <>= <> @ <>= <> <> @ The best model to predict time from sample to first signifcant treatment agrees with the top-down AIC model. <>= sam2sig.model <- coxph(formula(topdown), dataset) sam2sig.model @ \section{Overall Survival After Diagnosis} In this section, we build models to predict overall survival, starting with the time of diagnosis. <>= timecol <- "OSAfterDiagnosis" statcol <- "NumericVitalStatus" <> <> <> <> <> @ Here are the ``bottom-up'' and ``top-down'' models. <>= <> @ <>= <> <> @ It is primarily age that predicts overall survival, with possible assistance frmo the mutation status. <>= temp <- step(midground, steps=2, trace=0) temp <- update(temp, . ~ . - CatCyto) dx2os.model <- step(coxph(formula(temp), dataset), trace=0) dx2os.model @ \section{Overall Survival After Sample Collection} <>= timecol <- "OSAfterSample" statcol <- "NumericVitalStatus" <> <> <> <> <> @ Here are the ``bottom-up'' and ``top-down'' models. <>= <> @ <>= <> <> @ <>= temp <- step(midground, steps=2, trace=0) #missing <- which(apply(dataset[, attr(terms(temp), "term.labels")], 1, function(x) sum(is.na(x))) > 0) #sam2os.model <- coxph(formula(temp), dataset[-missing,]) sam2os.model <- coxph(formula(temp), dataset) sam2os.model @ \section{Validation of Clinical Predictive Models} \subsection{Validation: diagnosis to treatment} Now we validate the clinical predictions for time from diagnosis to (first significant) treatment. <>= timecol <- "TimeDiagnosis2SigTreat" statcol <- "NumericSigTreatment" model <- dx2sig.model temp0 <- median(predict(model), na.rm=TRUE) temp <- as.vector(predict(model, newdata=cardC.clinical)) temp1 <- rep("HighScore", length(temp)) temp1[temp < temp0] <- "LowScore" vdata <- na.omit(data.frame(cardC.clinical[, c(timecol, statcol)], PRE=temp, CatPre=factor(temp1, levels=c("LowScore", "HighScore")))) summary(vdata) myform <- formula(paste("Surv(", timecol, ",", statcol, ") ~ PRE")) myformb <- formula(paste("Surv(", timecol, ",", statcol, ") ~ CatPre")) coxph(myform, vdata) coxph(myformb, vdata) @ \begin{figure} <>= colset <- c("red", "blue") plot(survfit(myformb, vdata), col=colset, lwd=2, xlab="Time from diagnosis to treatment (months)", ylab="Fraction untreated", main="Clinical model") legend("topright", levels(vdata$CatPre), col=colset, lwd=2) @ \caption{Kaplan-Meier plot on validation data set of predictions from the best clinical model to predict time from diagnosis to treatment.} \label{dx2sig} \end{figure} \subsection{Validation: Sample to treatment} Now we repeat for the time from sample collection to first significant treatment. <>= timecol <- "TimeSample2SigTreat" statcol <- "NumericSigTreatment" model <- sam2sig.model temp0 <- median(predict(model), na.rm=TRUE) temp <- as.vector(predict(model, newdata=cardC.clinical)) temp1 <- rep("HighScore", length(temp)) temp1[temp < temp0] <- "LowScore" vdata <- na.omit(data.frame(cardC.clinical[, c(timecol, statcol)], PRE=temp, CatPre=factor(temp1, levels=c("LowScore", "HighScore")))) summary(vdata) myform <- formula(paste("Surv(", timecol, ",", statcol, ") ~ PRE")) myformb <- formula(paste("Surv(", timecol, ",", statcol, ") ~ CatPre")) coxph(myform, vdata) coxph(myformb, vdata) @ \begin{figure} <>= plot(survfit(myformb, vdata), col=colset, lwd=2, xlab="Time from sample to treatment (months)", ylab="Fraction untreated", main="Clinical model") legend("topright", levels(vdata$CatPre), col=colset, lwd=2) @ \caption{Kaplan-Meier plot on validation data set of predictions from the best clinical model to predict time from sample collection to treatment.} \label{sam2sig} \end{figure} \subsection{Validation: OS after diagnosis} And for overall survival after diagnosis. <>= timecol <- "OSAfterDiagnosis" statcol <- "NumericVitalStatus" model <- dx2os.model temp0 <- median(predict(model), na.rm=TRUE) temp <- as.vector(predict(model, newdata=cardC.clinical)) temp1 <- rep("HighScore", length(temp)) temp1[temp < temp0] <- "LowScore" vdata <- na.omit(data.frame(cardC.clinical[, c(timecol, statcol)], PRE=temp, CatPre=factor(temp1, levels=c("LowScore", "HighScore")))) summary(vdata) myform <- formula(paste("Surv(", timecol, ",", statcol, ") ~ PRE")) myformb <- formula(paste("Surv(", timecol, ",", statcol, ") ~ CatPre")) coxph(myform, vdata) coxph(myformb, vdata) @ \begin{figure} <>= plot(survfit(myformb, vdata), col=colset, lwd=2, xlab="Overall survival after diagnosis (months)", ylab="Fraction untreated", main="Clinical model") legend("bottomright", levels(vdata$CatPre), col=colset, lwd=2) @ \caption{Kaplan-Meier plot on validation data set of predictions from the best clinical model to predict overall survival after diagnosis.} \label{dx2os} \end{figure} \subsection{Validation: OS after sample} And for overall survival after sample collection. <>= timecol <- "OSAfterSample" statcol <- "NumericVitalStatus" model <- sam2os.model temp0 <- median(predict(model), na.rm=TRUE) temp <- as.vector(predict(model, newdata=cardC.clinical)) temp1 <- rep("HighScore", length(temp)) temp1[temp < temp0] <- "LowScore" vdata <- na.omit(data.frame(cardC.clinical[, c(timecol, statcol)], PRE=temp, CatPre=factor(temp1, levels=c("LowScore", "HighScore")))) summary(vdata) myform <- formula(paste("Surv(", timecol, ",", statcol, ") ~ PRE")) myformb <- formula(paste("Surv(", timecol, ",", statcol, ") ~ CatPre")) coxph(myform, vdata) coxph(myformb, vdata) @ \begin{figure} <>= plot(survfit(myformb, vdata), col=colset, lwd=2, xlab="Overall survival after sample (months)", ylab="Fraction untreated", main="Clinical model") legend("bottomright", levels(vdata$CatPre), col=colset, lwd=2) @ \caption{Kaplan-Meier plot on validation data set of predictions from the best clinical model to predict overall survival after sample collection.} \label{sam2os} \end{figure} \section{Appendix} We save the current state of the workspace. <>= save.image(file="micro02.rda") @ This analysis was run in the following directory: <>= getwd() @ <>= sessionInfo() @ \end{document}