\documentclass[11pt]{article}
\usepackage{graphicx}
\usepackage{cite}
\usepackage{hyperref}

\pagestyle{myheadings}
\markright{microfluidicsModels06; Revised: 1 Sept 2010}

\setlength{\topmargin}{0in}
\setlength{\textheight}{8in}
\setlength{\textwidth}{6.5in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}

\def\rcode#1{\texttt{#1}}
\def\fref#1{\textbf{Figure~\ref{#1}}}
\def\tref#1{\textbf{Table~\ref{#1}}}
\def\sref#1{\textbf{Section~\ref{#1}}}

\title{Training a Two-Gene Prognostic Model in CLL}
\author{Kevin R. Coombes}
\date{5 August 2010; Revised 1 September 2010}

\SweaveOpts{prefix.string=Figures/microfluidicsModels06,eps=FALSE}

<>=
options(width=92)
options(SweaveHooks = list(fig = function() par(bg='white')),eps=FALSE)
@

<>=
if (!file.exists("Figures")) {
  dir.create("Figures")
}
@

\begin{document}
\maketitle
\tableofcontents

\section{Executive Summary}

\subsection{Introduction}
The main goal of the study is to identify changes in gene expression
in CLL that are associated with clinical outcome (including overall
survival and time-to-treatment).

\subsubsection{Aims/Objectives}
We want to find individual genes and combinations of genes that are
related to prognosis in CLL. Here ``prognosis'' refers to the ability
to predict time-to-treatment or overall survival.

\subsection{(Statistical) Methods}
Cox proportional hazards models. Kaplan-Meier plots.

\subsection{Results}
The two-gene model using SKI and SLAMF1 is an effective predictor of
time from diagnosis to treatment, or of time from sample to
treatment. While not significant, it has the correct trend to
potentially predict overall survival after diagnosis.

\subsection{Conclusion}
We should attempt to validate the two-gene model in our validation
dataset.

\section{Loading the Data}

In the previous report (\rcode{microfluidicsModels05.pdf}), we stored
all of the relevant data in a binary R file. We begin by loading that
data.
<>=
load("predictors.rda")
ls()
@

\subsection{Preliminaries}

As mentioned previously, we want to perform a variety of time-to-event
(``survival'') analyses, with different starting and ending points.
The basic code for these analyses is in the following packages.
<>=
library(survival)
library(BMA)
@

\section{Training a Two-Gene Model}

In the previous report, we considered a wide variety of different
models on the training data. In reviewing the results, we noticed the
following facts:
\begin{enumerate}
\item Many models were able to predict time-to-treatment from either
  diagnosis or sample collection, but were less effective at
  predicting overall survival. The latter predictions were usually
  better as continuous scores than as binary variables.
\item On reviewing Figure 1 of report \texttt{microfluidicsModels04},
  we noted that the most frequently retained genes to predict time
  from diagnosis to first significant treatment were SKI and SLAMF1.
\item When we used AIC to see which genes were significant in the
  presence of clinical variables that predicted time-to-treatment, the
  same two genes were always retained: SKI and SLAMF1.
\item Of the models we looked at, the smallest one that included SKI
  and SLAMF1 and appeared to predict overall survival in addition to
  time-to-treatment was \texttt{mod06c}, which included a total of
  four genes: SKI, SLAMF1, CD14, and NT5C2.
\end{enumerate}
In this report, we want to examine the performance (on the training
set) of a model that uses only SKI and SLAMF1 as predictors.

\section{Diagnosis to Treatment}

<>=
# Combine expression data and clinical covariates into one data frame.
dset <- data.frame(cardAB.data, cardAB.clinical)
# Two-gene Cox model for time from diagnosis to first significant treatment.
mod02 <- coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ SKI + SLAMF1,
               data=dset)
mod02
@
The two-gene model is a highly significant continuous predictor of
time from diagnosis to treatment.
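
For reference, the model fitted above can be written out explicitly.
What follows is the standard form of a Cox proportional hazards model
with two covariates; the symbols $h_0$, $\beta_1$, and $\beta_2$ are
our notation and do not appear by name in the R output.
\[
  h(t \mid \mathrm{SKI}, \mathrm{SLAMF1})
  = h_0(t)\,\exp\left(\beta_1\,\mathrm{SKI} + \beta_2\,\mathrm{SLAMF1}\right)
\]
Here $h(t \mid \cdot)$ is the hazard of starting treatment at time $t$
after diagnosis, and $h_0(t)$ is a baseline hazard that the model
leaves unspecified. The \texttt{coef} column printed by \rcode{coxph}
estimates $\beta_1$ and $\beta_2$, and \texttt{exp(coef)} is the
corresponding hazard ratio per unit increase in the expression
measurement of that gene.
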
Now we compare the two-gene model with the four-gene model
\texttt{mod06c} using a likelihood-ratio (chi-squared) test:
<>=
anova(mod06c, mod02)
@
Because the difference is not significant, we have no evidence that
the four-gene model fits better, which suggests that the two-gene
model is as good as the four-gene model.

\subsection{Prognostic scores}

We now convert the predictions into a continuous and a binary
prognostic score.
<>=
# Continuous score: fitted values from predict(); binary score: split at the median.
x <- datasetAB$mod02 <- predict(mod02)
datasetAB$Cat.mod02 <- factor(ifelse(x > median(x), "HighScore","LowScore"))
rm(x)
@
The score remains significant as a continuous or a binary predictor:
<>=
coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ mod02, datasetAB)
coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ Cat.mod02, datasetAB)
@

<>=
plot(survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ Cat.mod02,
             data=datasetAB),
     col=colorcode, lty=ltype, lwd=2, main="Training",
     xlab="Time (months)", ylab="Fraction Untreated")
legend("topright", levels(datasetAB$Cat.mod02), col=colorcode, lty=ltype, lwd=3)
@

<>=
# Black-and-white version of the plot for the paper, as PNG and PDF.
colorcode <- "black"
ltype <- c("solid", "dashed")
png(file="SKI-SLAM-paper/mfc-figure2a.png", width=600, height=600,
    bg="white", pointsize=16)
par(bg="white")
<>
dev.off()
pdf(file="SKI-SLAM-paper/mfc-figure2a.pdf", width=6, height=6,
    bg="white", pointsize=12)
par(bg="white")
<>
dev.off()
@

\begin{figure}
<>=
# Color version of the plot for this report.
colorcode <- c("red", "blue")
ltype <- "solid"
<>
@
\caption{Kaplan-Meier plot; training samples; prediction of
  time-to-treatment using two genes.}
\label{dx2rx}
\end{figure}

\section{Overall Survival From Diagnosis}

The prognostic score (from the time-to-treatment analysis) is
borderline significant as a continuous predictor of overall survival:
<>=
coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ mod02, datasetAB)
@
but it is not significant as a binary predictor:
<>=
coxph(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod02, datasetAB)
@
However, the trend is in the correct direction, and the quality of the
predictions is not much different from what we observed previously
when using more genes (\fref{dx2os}).
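
As a reminder, the scores used in this section and the next come from
the diagnosis-to-treatment fit, not from a model refitted to each new
endpoint. Assuming \rcode{predict} was called with its default
\rcode{type="lp"}, the continuous and binary scores for sample $i$ can
be sketched as follows; the symbols $s_i$, $\hat\beta_1$,
$\hat\beta_2$, $c$, and $m$ are our notation.
\[
  s_i = \hat\beta_1\,\mathrm{SKI}_i + \hat\beta_2\,\mathrm{SLAMF1}_i + c,
  \qquad
  \mathrm{Cat}_i = \left\{
    \begin{array}{ll}
      \mbox{HighScore} & \mbox{if } s_i > m, \\
      \mbox{LowScore}  & \mbox{otherwise.}
    \end{array}
  \right.
\]
Here $\hat\beta_1$ and $\hat\beta_2$ are the coefficients estimated in
\texttt{mod02}, $m$ is the median of the scores over the training
samples, and $c$ is a constant offset that may arise because
\rcode{predict} centers the linear predictor. Since $c$ is the same
for every sample, it changes neither the median split nor the Cox
coefficient of the score. A higher score corresponds to a higher
estimated hazard of treatment, so the HighScore group is expected to
have the shorter time-to-treatment.
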
\begin{figure}
<>=
plot(survfit(Surv(OSAfterDiagnosis, NumericVitalStatus) ~ Cat.mod02,
             data=datasetAB),
     col=c("red", "blue"), lwd=2,
     main="Training; Overall Survival After Dx (SKI + SLAMF1)",
     xlab="Time (months)", ylab="Fraction Surviving")
legend("topright", levels(datasetAB$Cat.mod02), col=c("red", "blue"), lwd=3)
@
\caption{Kaplan-Meier plot; training samples; prediction of overall
  survival using the score from a time-to-treatment model.}
\label{dx2os}
\end{figure}

\section{Sample Collection to Treatment}

The prognostic score (from the analysis of time from diagnosis to
treatment) is significant as a continuous predictor of time from
sample collection to first significant treatment:
<>=
coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ mod02, datasetAB)
@
and it remains significant as a binary predictor (\fref{sam2os}):
<>=
coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Cat.mod02, datasetAB)
@

\begin{figure}
<>=
plot(survfit(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Cat.mod02,
             data=datasetAB),
     col=c("red", "blue"), lwd=2,
     main="Training; Sample to Rx (SKI + SLAMF1)",
     xlab="Time (months)", ylab="Fraction Untreated")
legend("topright", levels(datasetAB$Cat.mod02), col=c("red", "blue"), lwd=3)
@
\caption{Kaplan-Meier plot; training samples; prediction of time from
  sample collection to treatment using the score from the
  diagnosis-to-treatment model.}
\label{sam2os}
\end{figure}

\section{Appendix}

<>=
# Record the new model and its median cutoff, then save the workspace.
modlist <- c(modlist, "mod02")
cutoffs <- c(cutoffs, median(datasetAB$mod02))
save.image(file="withmod02.rda")
@

This analysis was run in the following directory:
<>=
getwd()
@

<>=
sessionInfo()
@

\end{document}