\documentclass[11pt]{article} \usepackage{graphicx} \usepackage{cite} \usepackage{hyperref} \pagestyle{myheadings} \markright{microfluidicsModels01; revised 2 Aug 2010} \setlength{\topmargin}{0in} \setlength{\textheight}{8in} \setlength{\textwidth}{6.5in} \setlength{\oddsidemargin}{0in} \setlength{\evensidemargin}{0in} \def\rcode#1{\texttt{#1}} \def\fref#1{Figure~\ref{#1}} \def\tref#1{Table~\ref{#1}} \title{Validating Prognostic Models on Microfluidics Data (Part 1)} \author{Kevin R. Coombes} \date{21 June 2010; REVISED: 2 Aug 2010} \SweaveOpts{prefix.string=Figures/microfluidicsModels01,eps=FALSE} <>= options(width=92) options(SweaveHooks = list(fig = function() par(bg='white')),eps=FALSE) @ <>= if (!file.exists("Figures")) { dir.create("Figures") } @ \begin{document} \maketitle \tableofcontents \section{Executive Summary} \subsection{Introduction} The main goal of the study is to identify changes in gene expression in CLL that are associated with clinical outcome (including overall survival and time-to-treatment). \subsubsection{Aims/Objectives} We want to find individual genes and combinations of genes that are related to prognosis in CLL. Here ``prognosis'' refers to the ability to predict time-to-treatment or overall suvival. \subsection{(Statistical) Methods} Exploratory data analyses and summaries. Cox proportional hazards models. \subsection{Results} Recall that in a previous analysis, we identified three genes (AICDA, OAS3, and NRIP1) that predicte time from presentation at M.D. Anderson to first treatment of any kind. Since we have updated the clinical data nad decided on different defintions of the starting events, we refit the three gene model on the Card~A and~B data before trying to validate it. We found \begin{itemize} \item The three-gene model successfully predicts time from diagnosis to first signficant treatment (\fref{bin.dx2sig}) or to first treatment of any kind (\fref{bin.dx2tr}. \item The three-gene model \textbf{does not predict} time from sample collection to first significant treatment (\fref{bin.sam2sig}) or first treatment of any kind \fref{bin.sam2tr}). This failure is almost certainly due to the fact (\fref{bin.sam2tr}) that the first treatment occured within two days of sample collection for about $60\%$ of the patients who were profiled on Card~C. \end{itemize} \subsection{Conclusion} We are unlikely to be able to validate \textbf{any} models that try to predict time-to-any-treatment on the Card~C data, so we should restrict our attention to ``time-to-first signifcant-treatment'' or to overall survival. We also expect that the updated clinical data will allow us to find a slightly better (and validate-able) model than the three-gene model (even though this is actually quite significant on the validation set). We will explore this possibiiltiy in the next anlaysis. \section{Loading the Data} We met and agreed that time-to-event analyses should be performed with two different starting points: \begin{itemize} \item Diagnosis \item Sample collection \end{itemize} and three different end points: \begin{itemize} \item First treatment \item First significant treatment (defined below) \item Death \end{itemize} In the previous report (\rcode{cleanNewClinical.pdf}), we stored all of the relevant data in a binary R file. We begin by loading that data. <>= load("currData.rda") ls() @ In that previous report, we computed times betwen various events, but did not use the last-follow-up time to compute censoring variables. The next function performs that task. <>= getTime2EventInfo <- function(t.event, t.censor, eventName) { if (length(t.event) != length(t.censor)) { stop("time vectors must be the same length") } if (length(eventName) != 2) { stop("must supply character strings for event and its absence") } time <- t.event status <- rep(eventName[1], length(t.event)) time[is.na(t.event)] <- t.censor[is.na(t.event)] time <- time*12/365.25 # convert from days to months status[is.na(t.event)] <- eventName[2] numericStatus <- as.numeric(status==eventName[1]) data.frame(time=time, status=status, numericStatus=numericStatus) } @ We use this function to extract the censoring status and time-to-event measures that interest us. <>= attach(extraClinical) temp0 <- getTime2EventInfo(time.sample.2.rx1, time.sample.2.lfu, c("Treated", "Untreated")) colnames(temp0) <- c("TimeSample2Treat", "Treatment", "NumericTreatment") temp1 <- getTime2EventInfo(time.sample.2.sig.treat, time.sample.2.lfu, c("Treated", "Untreated")) colnames(temp1) <- c("TimeSample2SigTreat", "SigTreatment", "NumericSigTreatment") temp2 <- getTime2EventInfo(time.sample.2.death, time.sample.2.lfu, c("Dead", "Alive")) colnames(temp2) <- c("OSAfterSample", "VitalStatus", "NumericVitalStatus") temp3 <- getTime2EventInfo(time.diagnosis.2.rx1, time.diagnosis.2.lfu, c("Treated", "Untreated")) colnames(temp3) <- c("TimeDiagnosis2Treat", "Treatment", "NumericTreatment") temp4 <- getTime2EventInfo(time.diagnosis.2.sig.treat, time.diagnosis.2.lfu, c("Treated", "Untreated")) colnames(temp4) <- c("TimeDiagnosis2SigTreat", "SigTreatment", "NumericSigTreatment") temp5 <- getTime2EventInfo(time.diagnosis.2.death, time.diagnosis.2.lfu, c("Dead", "Alive")) colnames(temp5) <- c("OSAfterDiagnosis", "VitalStatus", "NumericVitalStatus") detach() outcomeData <- data.frame(temp0, temp1, temp2, temp3, temp4, temp5) outcomeData <- outcomeData[, c(1, 10, 2:3, 4, 13, 5:6, 7, 16, 8:9)] rm(temp0, temp1, temp2, temp3, temp4, temp5) @ We perform a few sanity checks to make sure we got the event coding correct. <>= table(outcomeData$Treatment, outcomeData$NumericTreatment) table(outcomeData$SigTreatment, outcomeData$NumericSigTreatment) table(outcomeData$VitalStatus, outcomeData$NumericVitalStatus) table(extraClinical$Type.1st.sig.treat, outcomeData$SigTreatment) table(newclin$Survival.status.at.LFU, outcomeData$VitalStatus) # 17 FCR without data of death @ Finally, we assemble the clinical data columns we need into a common structure. <>= extraclinCols <- c(1:7, 10:11) # outcomeData: all newclinCols <- c(7:8, 31, 34:35, 43:47, 56:58, 60:63, 67, 70, 76:81, 99, 106) finalClinical <- data.frame(newclin[, newclinCols], extraClinical[, extraclinCols], outcomeData) dim(finalClinical) #write.csv(colnames(finalClinical), "colnames.csv") newnames <- read.csv("colnames.csv", row.names=2, as.is=TRUE) if (all(rownames(newnames) == colnames(finalClinical))) { colnames(finalClinical) <- newnames$NewName } else { stop("Column names have changed!") } summary(finalClinical) @ We note that some samples are taken from the same patient at different times. <>= st <- as.character(finalClinical$SampleType) names(st) <- rownames(finalClinical) hip <- read.csv("ClinicalHIPPA.csv") rownames(hip) <- rownames(finalClinical) hip$spl.date <- newclin$Date.of.sample tab <- table(hip$MRN) w <- names(tab)[which(tab>1)] for (i in 1:length(w)) { ptr <- which(hip$MRN %in% w[i]) x <- hip[ptr,] r <- rank(x$spl.date) base <- rownames(x)[r==1] st[ptr[r > 1]] <- base } finalClinical$SampleType <- factor(st) rm(st, hip, tab, w, i, ptr, x, r, base) @ I know I already said this was final, but I was wrong. We add a few more computed columns. <>= cr <- rep("Low", nrow(finalClinical)) cr[finalClinical$Rai.Stage %in% c(3, 4)] <- "High" finalClinical$CatRAI <- factor(cr, levels=c("Low", "High")) @ <>= finalClinical$LogB2M <- log(finalClinical$Serum.beta.2.microglobulin) finalClinical$LogLDH <- log(finalClinical$Serum.LDH) finalClinical$LogWBC <- log(finalClinical$White.blood.count) finalClinical <- finalClinical[, c(1:26, 28:52, 27)] dim(finalClinical) @ Now we are really at the final clinical data. \subsection{Microfluidics Data} Now we check the microfluidics data. First, the data from Card~A and Card~B on the same samples must be combined. <>= dim(cardA.data) dim(cardB.data) temp <- cardB.data[rownames(cardA.data),] cardAB.data <- data.frame(cardA.data, temp) rm(temp) cardAB.clinical <- finalClinical[rownames(cardAB.data),] w <- !( cardAB.clinical$SampleType %in% c("PreTreated", "NotCLL") ) cardAB.data <- cardAB.data[w,] cardAB.clinical <- cardAB.clinical[w,] dim(cardAB.data) @ Second, we have to change a couple of the names on Card~C to match the corresponding names used on Cards~A and~B. <>= dim(cardC.data) colnames(cardC.data)[1] <- "r18S" colnames(cardC.data)[5] <- "ATRX" cardC.clinical <- finalClinical[rownames(cardC.data),] w <- !( cardC.clinical$SampleType %in% c("PreTreated", "NotCLL") ) cardC.data <- cardC.data[w,] cardC.clinical <- cardC.clinical[w,] rm(w) dim(cardC.clinical) dim(cardC.data) @ Finally, we also restrict the~A and~B data to only the genes that were selected to produce Card~C. <>= onC <- colnames(cardAB.data) %in% colnames(cardC.data) ABsubset <- cardAB.data[, onC] @ \section{Univariate Analysis of Time to Treatment} As mentioned above, we want to perform a variety of time-to-event (``survival'') analyses,with different starting and ending points. The basic code for these analyses is in the following library. <>= library(survival) @ We want to perform univariate analyses (one gene at a time) using the microfluidics QRT-PCR data as a continuous predictor of the outcome. The following function will perform these analyses and record the results. <>= continuous.univariate <- function(timecol, statcol, dset=cardAB.data, clindata=cardAB.clinical) { GeneUnivariate <- matrix(NA, ncol(dset), 5) colnames(GeneUnivariate) <- c("coef", "HR", "SE", "Z", "P") rownames(GeneUnivariate) <- colnames(dset) for (i in 1:ncol(dset)) { mygene <- colnames(dset)[i] dataset <- data.frame(clindata[, c(timecol, statcol)], Y=dset[,i]) colnames(dataset)[3] <- mygene myform <- formula(paste("Surv(", timecol, ", ", statcol, ") ~", mygene)) model <- coxph(myform, data=dataset) GeneUnivariate[i,] <- summary(model)$coef[mygene,] } GeneUnivariate } @ We also want to convert each continuous gene expression variable into a dichotomous variable, by splitting at the median. The following functions will perfrom this analysis. <>= hilo <- function(x) { factor(ifelse(x < median(x), "LowCycle", "HighCycle"), levels=c("HighCycle", "LowCycle")) } binary.univariate <- function(timecol, statcol, dset=cardAB.data, clindata=cardAB.clinical) { GeneUnivariate <- matrix(NA, ncol(dset), 5) colnames(GeneUnivariate) <- c("coef", "HR", "SE", "Z", "P") rownames(GeneUnivariate) <- colnames(dset) for (i in 1:ncol(dset)) { mygene <- colnames(dset)[i] dataset <- data.frame(clindata[, c(timecol, statcol)], Y=hilo(dset[,i])) colnames(dataset)[3] <- mygene myform <- formula(paste("Surv(", timecol, ", ", statcol, ") ~", mygene)) model <- coxph(myform, data=dataset) GeneUnivariate[i,] <- summary(model)$coef[paste(mygene, "LowCycle", sep=''),] } GeneUnivariate } @ \section{Previous Model} \subsection{Predicting time from diagnosis to significant treatment} In our original analysis of the Card~A and~B data, we developed a model that predicted time from first M.D. Anderson presentation to first treatment of any kind. The best univariate model used the following three genes: <>= old0 <- coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ AICDA + OAS3 + NRIP1, data.frame(cardAB.data, cardAB.clinical)) summary(old0) @ We note that the coefficients have the same signs but slightly different values than the ones obtained previously on this data. This change is to be expected, since we are using (a) updated clinical data and (b) different definitions of the starting and ending point for the time-to-event analysis. In spite of those changes, however, we can find a statistically signficant model on the training (AB) dataset. Now we look at the test set from Card~C. We start by computing the value of the linear predictor using this model on the new data. <>= new0 <- predict(old0, newdata=data.frame(cardC.data, cardC.clinical)) @ Next, we test whether this score is statistically significant as a predictor of time from diagnosis to significant treatment in the new dataset. <>= cutter <- median(predict(old0)) bing <- factor(c("LowScore", "HighScore")[1+1*(new0 > cutter)], levels=c("LowScore", "HighScore")) temp <- data.frame(cardC.data, cardC.clinical, Bing=bing, New0=new0) mod0.cont <- coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ New0, data=temp) summary(mod0.cont) @ We see that the score computed from the model trained on the AB data is highly significant on the Card~C validation dataset. We also convert the score into a binary predictor. <>= mod0.bin <- coxph(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ Bing, data=temp) summary(mod0.bin) @ This binary score is also highly significant (\fref{bin.dx2sig}). \begin{figure} <>= mycol <- c("Red", "Blue") plot( survfit(Surv(TimeDiagnosis2SigTreat, NumericSigTreatment) ~ Bing, data=temp) , xlab="Time from diagnosis to treatment (months)", ylab="Fraction Untreated", main="Validation of Binary Predictor(OAS3, AICDA, NRIP1)", col=mycol, lwd=2) legend("topright", levels(temp$Bing), col=mycol, lwd=2) tab <- table(temp$SigTreat, temp$Bing) tags <- paste("N =", paste(tab["Treated",], apply(tab, 2, sum), sep='/')) text(165, 0.38, tags[1]) text(165, 0.1, tags[2]) text(180, 0.85, paste("logrank p =", round((summary(mod0.bin)$logtest)[3], 4)), adj=0) @ \caption{Kaplan-Meier plot of time from diagnosis to treatment in the validation dataset (Card~C) using a three-gene (AICDA, OAS3, NRIP1) model completely specified by the training dataset (Cards~A and~B).} \label{bin.dx2sig} \end{figure} \subsection{Predicting time from sample to significant treatment} We now repeat the same analysis with a different choice of the time-to-event outcome. <>= old0 <- coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ AICDA + OAS3 + NRIP1, data.frame(cardAB.data, cardAB.clinical)) summary(old0) @ The coefficients again have the same signs but slightly different values than the earlier ones. However, the model remains significant on the training (AB) dataset. Now we look at the test set from Card~C. We start by computing the value of the linear predictor using this model on the new data. <>= new0 <- predict(old0, newdata=data.frame(cardC.data, cardC.clinical)) @ Next, we test whether this score is statistically significant as a predictor of time from diagnosis to significant treatment in the new dataset. <>= cutter <- median(predict(old0)) bing <- factor(c("LowScore", "HighScore")[1+1*(new0 > cutter)], levels=c("LowScore", "HighScore")) temp <- data.frame(cardC.data, cardC.clinical, Bing=bing, New0=new0) mod0.cont <- coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ New0, data=temp) summary(mod0.cont) @ We see that this score computed from the model trained on the AB data is \textbf{not} significant on the Card~C validation dataset. We try converting the score into a binary predictor. <>= mod0.bin <- coxph(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Bing, data=temp) summary(mod0.bin) @ This binary score is almost significant (\fref{bin.sam2sig}). \begin{figure} <>= mycol <- c("Red", "Blue") plot( survfit(Surv(TimeSample2SigTreat, NumericSigTreatment) ~ Bing, data=temp) , xlab="Time from sample to significant treatment (months)", ylab="Fraction Untreated", main="Validation of Binary Predictor (AICDA, OAS3, NRIP1)", col=mycol, lwd=2) legend("topright", levels(temp$Bing), col=mycol, lwd=2) tab <- table(temp$SigTreat, temp$Bing) tags <- paste("N =", paste(tab["Treated",], apply(tab, 2, sum), sep='/')) text(80, 0.40, tags[1]) text(80, 0.18, tags[2]) text(80, 0.85, paste("logrank p =", round((summary(mod0.bin)$logtest)[3], 4)), adj=0) @ \caption{Kaplan-Meier plot of time from sample to significant treatment in the validation dataset (Card~C) using a three-gene (AICDA, OAS3, NRIP1) model completely specified by the training dataset (Cards~A and~B).} \label{bin.sam2sig} \end{figure} \subsection{Predicting time from diagnosis to any treatment} Now we try replacing 'first significant treatment'' by ``first treatment of any kind''. <>= old0 <- coxph(Surv(TimeDiagnosis2Treat, NumericTreatment) ~ AICDA + OAS3 + NRIP1, data.frame(cardAB.data, cardAB.clinical)) summary(old0) @ Once more, the values of the coefficients have changed. In spite of those changes, however, the model remains significant on the training (AB) dataset. Now we look at the test set from Card~C. We start by computing the value of the linear predictor using this model on the new data. <>= new0 <- predict(old0, newdata=data.frame(cardC.data, cardC.clinical)) @ Next, we test whether this score is statistically significant as a predictor of time from diagnosis to any treatment in the new dataset. <>= cutter <- median(predict(old0)) bing <- factor(c("LowScore", "HighScore")[1+1*(new0 > cutter)], levels=c("LowScore", "HighScore")) temp <- data.frame(cardC.data, cardC.clinical, Bing=bing, New0=new0) mod0.cont <- coxph(Surv(TimeDiagnosis2Treat, NumericTreatment) ~ New0, data=temp) summary(mod0.cont) @ We see that the score computed from the model trained on the AB data is \textbf{significant} on the Card~C validation dataset. We try converting the score into a binary predictor. <>= mod0.bin <- coxph(Surv(TimeDiagnosis2Treat, NumericTreatment) ~ Bing, data=temp) summary(mod0.bin) @ This binary score is also \textbf{significant} (\fref{bin.dx2tr}). \begin{figure} <>= mycol <- c("Red", "Blue") plot( survfit(Surv(TimeDiagnosis2Treat, NumericTreatment) ~ Bing, data=temp) , xlab="Time from diagnosis to treatment (months)", ylab="Fraction Untreated", main="Validation of Binary Predictor", col=mycol, lwd=2) legend("topright", levels(temp$Bing), col=mycol, lwd=2) tab <- table(temp$Treat, temp$Bing) tags <- paste("N =", paste(tab["Treated",], apply(tab, 2, sum), sep='/')) text(170, 0.35, tags[1]) text(170, 0.1, tags[2]) text(180, 0.85, paste("logrank p =", round((summary(mod0.bin)$logtest)[3], 4)), adj=0) @ \caption{Kaplan-Meier plot of time from diagnosis to \textbf{any} treatment in the validation dataset (Card~C) using a three-gene (AICDA, OAS3, NRIP1) model completely specified by the training dataset (Cards~A and~B).} \label{bin.dx2tr} \end{figure} \subsection{Predicting time from sample to any treatment} We now repeat the same analysis for the time from sample to any first treatment. <>= old0 <- coxph(Surv(TimeSample2Treat, NumericTreatment) ~ AICDA + OAS3 + NRIP1, data.frame(cardAB.data, cardAB.clinical)) summary(old0) @ The model remains significant on the training (AB) dataset. Now we look at the test set from Card~C. We start by computing the value of the linear predictor using this model on the new data. <>= new0 <- predict(old0, newdata=data.frame(cardC.data, cardC.clinical)) @ Next, we test whether this score is statistically significant as a predictor of time from diagnosis to any treatment in the new dataset. <>= cutter <- median(predict(old0)) bing <- factor(c("LowScore", "HighScore")[1+1*(new0 > cutter)], levels=c("LowScore", "HighScore")) temp <- data.frame(cardC.data, cardC.clinical, Bing=bing, New0=new0) mod0.cont <- coxph(Surv(TimeSample2Treat, NumericTreatment) ~ New0, data=temp) summary(mod0.cont) @ We see that this score computed from the model trained on the AB data is \textbf{not significant} on the Card~C validation dataset. We try converting the score into a binary predictor. <>= mod0.bin <- coxph(Surv(TimeSample2Treat, NumericTreatment) ~ Bing, data=temp) summary(mod0.bin) @ This binary score is also \textbf{not significant} (\fref{bin.sam2tr}). \begin{figure} <>= mycol <- c("Red", "Blue") plot( survfit(Surv(TimeSample2Treat, NumericTreatment) ~ Bing, data=temp) , xlab="Time from sample to any treatment (months)", ylab="Fraction Untreated", main="Validation of Binary Predictor", col=mycol, lwd=2) legend("topright", levels(temp$Bing), col=mycol, lwd=2) tab <- table(temp$Treat, temp$Bing) tags <- paste("N =", paste(tab["Treated",], apply(tab, 2, sum), sep='/')) text(75, 0.33, tags[1]) text(75, 0.17, tags[2]) text(80, 0.85, paste("logrank p =", round((summary(mod0.bin)$logtest)[3], 4)), adj=0) @ \caption{Kaplan-Meier plot of time from sample to \textbf{any} treatment in the validation dataset (Card~C) using a three-gene (AICDA, OAS3, NRIP1) model completely specified by the training dataset (Cards~A and~B).} \label{bin.sam2tr} \end{figure} \section{Appendix} <>= save.image(file="micro01.rda") write.csv(finalClinical, file="finalClinical.csv") @ This analysis was run in the following directory: <>= getwd() @ <>= sessionInfo() @ \end{document}