\documentclass[11pt]{article}
\usepackage{graphicx}
\usepackage{cite}
\usepackage{hyperref}
\pagestyle{myheadings}
\markright{cleanNewClinical; revised 2 Aug 2010}

\setlength{\topmargin}{0in}
\setlength{\textheight}{8in}
\setlength{\textwidth}{6.5in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}

\def\rcode#1{\texttt{#1}}
\def\fref#1{Figure~\ref{#1}}
\def\tref#1{Table~\ref{#1}}

\title{Cleaning the CLL Clinical Data}
\author{Kevin R. Coombes}
\date{14 June 2010; REVISED: 2 August 2010}

\SweaveOpts{prefix.string=Figures/cleanNewClinical,eps=FALSE}
<<options,echo=FALSE>>=
options(width=88)
options(SweaveHooks = list(fig = function() par(bg='white')),eps=FALSE)
@ 

<<makeFiguresDirectory,echo=FALSE>>=
if (!file.exists("Figures")) {
  dir.create("Figures")
}
@ 

\begin{document}
\maketitle
\tableofcontents

\section{Executive Summary}
\subsection{Introduction}
The main goal of the study is to identify genetic abnormalities in CLL
that are associated with clinical outcome (including overall survival
and time-to-treatment).

\subsubsection{Aims/Objectives}
We want to check the clinical data for consistency.

\subsection{(Statistical) Methods}
Exploratory data analyses and summaries.

\subsection{Results}

\begin{itemize}
\item A variety of typographical errors have been identified and corrected.
\item Clinical columns that represent factors (i.e.,  variables that
  can take on a fixed finite set of values) have been checked for consistency.
\item Columns that are supposed to hold percentages have been checked
  to ensure that they only take on values between 0 and 100.
\item Dates have been tested for consistency of format.  They have
  also been checked to ensure that events occur in the proper order.
  We found the following anomalies:
  \begin{itemize}
%  \item Five CLL samples have ``presentation to MDACC'' dates before
%    diagnosis.  \textbf{These should be checked and corrected as
%    appropriate.}
  \item Five FCR samples show that the sample was collected before
    presentation at MDACC. This is explainable since the samples were
    collected at CRC as part of the FCR300 trial.
%  \item Two CLL samples indicate that the last follow up occured
%    before the most recent (3rd) treatment.  \textbf{Presumably, the
%    LFU date should be corrected.}
%  \item Two CLL samples indicate that the last follow up occured after
%    death.  It is unclear whether the date of death or the date of LFU
%    was recorded incorrectly.  \textbf{This should be determined and
%    the dates corrected.}
  \end{itemize}
\item Information on mutation status and homology to baseline was
  checked for consistency.  We found that the mutation status is
  recorded for six FCR samples without any supporting evidence (i.e.,
  percentage mutated).  \textbf{The missing evidence should be included.}
\item We created categorical variables for ZAP70, B2M, WBC,
  conventional cytogenetics, CD38, and Matutes score.
\item We computed the date of and response to the first significant
  treatment.
\item We also computed the time between various critical events in the
  natural history of the disease.% \textbf{We should meet and agree on
%    definitions for relevant starting times and end times for these
%    kinds of analyses.}
\item All computed values and cleaned data were stored in a data frame;
  everything was then saved to an R binary file called
  ``\rcode{currData.rda}''.
\end{itemize}

\subsection{Conclusion}

We are almost ready to use this clinical data for real analyses.  
%The few anomalies listed above should be corrected, and we ned to
%agree on start and end times for time-to-event analyses.
We have met and agreed that time-to-event analyses should be performed
with starting points
\begin{itemize}
\item Diagnosis
\item Sample collection
\end{itemize}
and end points
\begin{itemize}
\item First treatment
\item First significant treatment (defined below)
\item Death
\end{itemize}

\section{Loading the Data}
In the previous report (\rcode{documentDataSets.pdf}), we loaded all
of the relevant data into a single binary R file.  We begin by
loading that data.
<<load>>=
load("combinableData.rda")
dim(newclin)
@ 
We found a number of typographical errors, most of which consist of
extra spaces at the end of an entry.  Here we correct them globally,
and we also recode empty entries and ``ND'' entries as missing values.
<<typos>>=
newclin[newclin==""] <- NA
newclin[newclin=="ND"] <- NA
newclin[newclin=="ND "] <- NA
newclin[newclin=="NA$"] <- NA
newclin[newclin=="positive "] <- "positive"
newclin[newclin=="M "] <- "M"
newclin[newclin=="U "] <- "U"
newclin[newclin=="CR "] <- "CR"
newclin[newclin=="lamda"] <- "lambda"
newclin[newclin=="FR  "] <- "FR"
newclin[newclin=="FR "] <- "FR"
newclin[newclin=="FCR  "] <- "FCR"
newclin[newclin=="FCR "] <- "FCR"
newclin[newclin=="Campath "] <- "Campath"
newclin[newclin=="Rituxan "] <- "Rituxan"
@ 
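
The explicit replacements above could, in principle, be generalized.
The following unevaluated sketch shows one way to strip trailing white
space from every character column at once.  The data frame
\rcode{toy} and the helper \rcode{stripTrailing} are hypothetical
names used only for illustration; the sketch is not part of the
cleaning actually applied to \rcode{newclin}.
<<sketch.trim,eval=FALSE>>=
# Hypothetical toy data; not taken from the clinical file.
toy <- data.frame(resp=c("CR ", "FR  ", "ND ", "PR"),
                  sex=c("M ", "F", "", "M"),
                  stringsAsFactors=FALSE)
# Remove trailing white space from a character vector.
stripTrailing <- function(x) sub("[[:space:]]+$", "", x)
toy[] <- lapply(toy, stripTrailing)
# Recode empty strings and "ND" as missing.
toy[toy == "" | toy == "ND"] <- NA
toy
@ 
A generic rule of this kind would not catch misspellings such as
``lamda'', which still need an explicit replacement.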

\subsection{Column Types}

We manually assign a ``type'' to each column, reflecting the kind of
values it is expected to hold.
<<coltype>>=
coltype <- rep(NA, ncol(newclin))
names(coltype) <- colnames(newclin)
@ 
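We can almost identify the date columns automatically: any column
containing an entry of the form mm/dd/yy is a candidate.  We then
manually drop two of the candidates (the ``magic'' indices below) that
should not be treated as dates.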
<<dateCol>>=
countdates <- apply(newclin, 2, function(x) sum(regexpr("(\\d\\d)/(\\d\\d)/(\\d\\d)", x) > 0, na.rm=TRUE))
dateCol <- which(countdates > 0)
magic <- c(10,19)
colnames(newclin)[dateCol][magic]
dateCol <- dateCol[-magic]
coltype[dateCol] <- "date"
@ 
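
As a minimal illustration of the detection step, the unevaluated
sketch below applies the same regular expression to a small made-up
data frame (\rcode{toy} is hypothetical).  It also suggests why some
manual curation is needed: a free-text column that merely mentions a
date is flagged as well.
<<sketch.dates,eval=FALSE>>=
# Hypothetical toy data; not taken from the clinical file.
toy <- data.frame(visit=c("01/15/09", "12/03/08"),
                  note=c("seen on 02/01/09", "no change"),
                  wbc=c("45.1", "102"),
                  stringsAsFactors=FALSE)
pattern <- "(\\d\\d)/(\\d\\d)/(\\d\\d)"
# Count the entries in each column that contain a date-like string.
hits <- sapply(toy, function(x) sum(regexpr(pattern, x) > 0, na.rm=TRUE))
hits
names(hits)[hits > 0]   # both "visit" and "note" are flagged
@ 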

<<coltype.2>>=
coltype[c(6:8, 10:11, 13:14, 16:17, 24:25, 27, 31:32, 35, 37, 41:42,
          43:46, 50, 52:53, 55:56, 67, 70, 74, 99, 106)] <- "factor"
coltype[c(18, 82, 85)] <- "integer"
coltype[c(20, 34, 40, 84)] <- "character"
coltype[c(21:23, 57:59, 61:66, 68:69, 71:73, 94, 98, 100, 104:105)] <- "numeric"
newclin[,104] <- sub("a", '', newclin[,104]) # remove footnotes
coltype[c(47:49, 51, 54, 60, 76:81, 87:90, 93, 95:97, 101:103)] <- "percent"
table(coltype)

for(w in which(coltype=="factor")) newclin[,w] <- factor(newclin[,w])
for(w in which(coltype %in% c("numeric", "percent", "integer"))) newclin[,w] <- as.numeric(newclin[,w])
@ 

We handle treatment columns separately.
<<treatCols>>=
temp <- grep("reat", colnames(newclin))         # columns mentioning treatment
recurse <- grep("ype", colnames(newclin)[temp]) # of those, the "type" columns
treatCols <- temp[recurse]
treatTypes <- unique(na.omit(as.character(as.matrix(newclin[, treatCols]))))
for(w in treatCols) newclin[,w] <- factor(newclin[,w], levels=treatTypes)
@ 
We also handle response columns separately.
<<respCol>>=
respCol <- grep("Response", colnames(newclin))[1:4]
colnames(newclin)[respCol]
for(w in respCol) newclin[,w] <- factor(newclin[,w], levels=c("CR", "nPR", "PR", "NR", "ongoing"))
rm(respCol)
@ 

We now summarize the full clinical dataset, in several parts based on
the column type.  For now we ignore the dates (which are handled in a
later section) and the comments (character strings), which will never
be used in computations.

We start with the factors, since those should have well-specified
values.  These should be checked to see that there are no odd
spellings or other strangeness.
<<summ.f>>=
summary(newclin[, coltype=="factor"])
@ 
Now we check the percentages.  When values exist, they should be
between 0 and 100.
<<summ.p>>=
summary(newclin[, coltype=="percent"])
@ 
Now we check the other numeric columns.
<<summ.n>>=
summary(newclin[, coltype %in% c("numeric", "integer")])
@ 

\section{Making Sense of Dates}

First, we extract all of the date-related columns in the clinical dataset.
<<justDates>>=
justDates <- newclin[, dateCol]
@ 
Next, we have to rewrite all of the dates in a consistent format that
can be used by the usual date routines.  The preferred format is a
four-digit year, followed by a two-digit month and a two-digit day,
with hyphens as separators.
<<iHateDates>>=
iHateDates <- function(x) {
  upper <- as.Date("2011-01-01")
  offset <- upper - as.Date("1911-01-01")
  y <- as.Date(x, format="%m/%d/%y")
  if (!is.na(y) && y > upper) {
    y <- as.Date(y - offset)
  }
  as.character(y)
}

fixup <- matrix(NA, nrow=nrow(justDates), ncol=ncol(justDates))
dimnames(fixup) <- dimnames(justDates)
for(j in 1:ncol(fixup)) {
  for (i in 1:nrow(fixup)) {
    fixup[i,j] <- iHateDates(justDates[i,j])
  }
}

newclin[,dateCol] <- fixup
@
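
The element-by-element loop above is easy to follow, but the same
correction can be written in vectorized form.  The sketch below is an
unevaluated alternative, not the code used to produce this report; the
function name \rcode{fixCentury} is an assumption introduced for
illustration.  It relies on the fact that R reads two-digit years 00
to 68 as 2000 to 2068, so any parsed date after the start of 2011 must
really belong to the previous century.
<<sketch.century,eval=FALSE>>=
fixCentury <- function(x) {
  upper <- as.Date("2011-01-01")
  # Number of days in the 100 years ending at 'upper'.
  offset <- as.numeric(upper - as.Date("1911-01-01"))
  y <- as.Date(x, format="%m/%d/%y")
  # Dates parsed into the future are shifted back by one century.
  late <- !is.na(y) & y > upper
  y[late] <- y[late] - offset
  as.character(y)
}
res <- fixCentury(c("03/05/45", "07/04/09", NA, "not a date"))
res
@ 
Entries that cannot be parsed as dates come back as missing values,
just as they do in the loop above.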

\subsection{Rules about dates}

The correct ordering of events should be:
\begin{enumerate}
\item Birth
\item Diagnosis
\item Presentation at MDACC
\item Sample collection
\item First treatment (if any)
\item Subsequent treatments (if any)
\item Death or LFU
\end{enumerate}
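
Each of the checks in the following subsections compares one pair of
events from this list.  As a compact illustration of the idea, the
unevaluated sketch below wraps that comparison in a helper function;
\rcode{checkOrder} and the toy dates are hypothetical and are not used
elsewhere in this report.
<<sketch.order,eval=FALSE>>=
# Return the indices where 'later' precedes 'earlier';
# pairs with a missing date are ignored.
checkOrder <- function(earlier, later) {
  gap <- as.numeric(as.Date(later) - as.Date(earlier), units="days")
  which(gap < 0)
}
# Hypothetical dates for three patients.
dx  <- c("2001-03-10", "2004-06-01", "2005-01-20")
smp <- c("2002-01-15", "2004-05-20", NA)
bad <- checkOrder(dx, smp)
bad   # flags the second patient
@ 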

\subsubsection{Required dates exist}

We should always have dates for birth, diagnosis, presentation,
sample, and LFU.
<<dates.exist>>=
summary(fixup[,c("Date.of.birth",
                 "First.diagnosis.of.CLL",
                 "X1st.Presentation.at.MDACC",
                 "Date.of.sample",
                 "Last.follow.up..LFU."
                 )])
@ 
Good; everything that should always be there actually is.
<<no.lfu,echo=FALSE,eval=FALSE>>=
w <- which(is.na(fixup[, "First.diagnosis.of.CLL"]))
newclin[w,1:6]
@ 

\subsubsection{Birth before diagnosis}

We use these two dates to compute the age at diagnosis in years; every
value should be positive.
<<age>>=
date.birth <- as.Date(fixup[, "Date.of.birth"])
date.diagnosis <- as.Date(fixup[,"First.diagnosis.of.CLL"])
age.at.dx <- as.numeric(date.diagnosis - date.birth, units="days")/365.25
summary(age.at.dx)
@ 

Okay.

\subsubsection{Diagnosis before presentation at MDACC}

Unless otherwise specified, all time spans are computed in days.
<<dx.mdacc>>=
date.MDACC  <- as.Date(fixup[, "X1st.Presentation.at.MDACC"])
time.diagnosis.2.MDACC <- as.numeric(date.MDACC - date.diagnosis, units="days")
summary(time.diagnosis.2.MDACC)
#w <- which(time.diagnosis.2.MDACC < 0)
#fixup[w,1:6]
@ 

%We have identified five patients who presented to MDACC before
%diagnosis.  For two of the five (specifically, CL015 and CL180), the
%difference is a matter of days, and it seems reasonable to replace the
%diagnosis date by the presentation date.  For two others
%(specifically, CL018 and CL041), the first presentation is many years
%before the diagnosis.  This suggest that they came to Anderson
%previously for a completely different reason, and the presenation date
%should probably be replaced by the diagnosis date.  The final patient
%(CL227) shows a difference of a few months, so it is unclear if this
%is a typo or something else.

Okay; the errors detected in a previous analysis have been corrected.

\subsubsection{Diagnosis before sample}

We should at least ensure that the sample was taken after diagnosis.
<<sample.dx>>=
date.sample <- as.Date(fixup[, "Date.of.sample"])
time.diagnosis.2.sample <- as.numeric(date.sample - date.diagnosis, units="days")
summary(time.diagnosis.2.sample)
@ 

Okay.

\subsubsection{First presentation before sample}
<<sample.mdacc>>=
time.MDACC.2.sample <- as.numeric(date.sample - date.MDACC, units="days")
summary(time.MDACC.2.sample)
w <- which(time.MDACC.2.sample < 0)
fixup[w,1:6]
@ 

%Since all five of the odd samples are part of the FCR study, they can
%be corrected later....

All five of the ``odd'' samples were part of the FCR300 study, and
samples were obtained through the CRC before presentation at MDACC.

\subsubsection{Sample at or before 1st treatment}
<<sample.rx>>=
date.1st.rx <- as.Date(fixup[, "Date.of.1st.treatment"])
time.sample.2.rx1 <- as.numeric(date.1st.rx - date.sample, units="days")
summary(time.sample.2.rx1)
w <- which(time.sample.2.rx1 < 0)
newclin[w,c(1:6,106)]
@ 

Okay; we already knew that both of these patients were treated before
the sample was collected.

\subsubsection{Correct order of treatments}
<<rx>>=
date.2nd.rx <- as.Date(fixup[,"Date.of.2nd.treatment"])
date.3rd.rx <- as.Date(fixup[,"Date.of.3rd.treatment"])
date.4th.rx <- as.Date(fixup[,paste("Date.of.4th.treatment..in.case.it.was.the.first.time.",
                                    "of.FCR.or.any.other..first.significant..therapy..data.",
                                    "entered.only.in.case.it.was.the.1st.or.2nd.time.of.a.",
                                    "significant..treatment.", sep='')])

summary(as.numeric(date.2nd.rx - date.1st.rx, units="days"))
summary(as.numeric(date.3rd.rx - date.2nd.rx, units="days"))
summary(as.numeric(date.4th.rx - date.3rd.rx, units="days"))
@ 

These are all okay.

\subsubsection{Sample before last follow up}

<<sample.lfu>>=
date.lfu <- as.Date(fixup[,"Last.follow.up..LFU."])
time.sample.2.lfu <- as.numeric(date.lfu - date.sample, units="days")
summary(time.sample.2.lfu)
@ 

Okay.

\subsubsection{All treatments before last follow up}

<<rx.lfu>>=
summary(as.numeric(date.lfu - date.1st.rx, units="days"))
summary(as.numeric(date.lfu - date.2nd.rx, units="days"))
summary(odd.time <- as.numeric(date.lfu - date.3rd.rx, units="days"))
summary(as.numeric(date.lfu - date.4th.rx, units="days"))
@ 

%Treatments 1, 2, and 4 are okay, but there is something odd about the
%third treatment.
<<odd3,eval=FALSE,echo=FALSE>>=
w <- which(odd.time < 0)
fixup[w,c(1:3,6:7,12)]
@ 
%Here we find two samples where the last follow up claims to have
%occurred before the most recent treatment.  We should probably replace
%the LFU by the 3rd treatment date.

Okay; the errors found in a previous analysis have been corrected.

\subsubsection{Death before last follow up}

If a date of death is recorded, it should equal the date of last
follow up.
<<death>>=
date.death <- as.Date(fixup[, "Date.of.death"])
summary(odd.time <- as.numeric(date.lfu - date.death, units="days"))

w <- which(is.na(date.death) & newclin$Survival.status.at.LFU=="DEAD")
newclin[w, 1:6]
#w <- which(odd.time > 0)
#fixup[w,c(1:5,12,14)]
@ 
%The date of death for patient CL017 is almost certainly incorrect
%(unless we treated someone two years after they died).  For patient
%CL038, I suspect that the date of last follow up should be changed to
%equal the date of death.

Okay; previous errors have been corrected.  However, two patients are
indicated as dead, but without a recorded date of death.

\section{More Rules}

Every ``percentage'' column should have values restricted to lie
between 0 and 100.
<<percentages>>=
for (w in which(coltype=="percent")) {
  x <- newclin[,w]
  print(paste(all(x >= 0 & x <= 100, na.rm=TRUE), names(coltype)[w], sep=": "))
}
@

The two columns with mutation/homology percentages should sum to 100.
<<mut.sum>>=
all(newclin[,47] + newclin[,48] == 100, na.rm=TRUE)
@ 
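
Because percentages may be stored as non-integer values, an exact test
for equality can fail through rounding alone.  A tolerant variant is
sketched below on made-up numbers; the two vectors and the tolerance
of $10^{-6}$ are assumptions, and the chunk is shown for illustration
only and is not evaluated.
<<sketch.tolerance,eval=FALSE>>=
# Hypothetical percentages; not taken from the clinical file.
homology <- c(97.3, 100, 92.15, NA)
mutation <- c(2.7, 0, 7.85, 3.1)
tol <- 1e-6
# Pairs with a missing value are ignored, as in the check above.
sumsOK <- all(abs(homology + mutation - 100) < tol, na.rm=TRUE)
sumsOK
@ 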

A sample is only called mutated if the homology percentage is less
than \textbf{or equal} to $98\%$.
<<rule98>>=
summary(newclin[newclin[,45]=="M",47])
summary(newclin[newclin[,45]=="U",47])
@ 
Equivalently, a sample is called mutated if the mutation percentage is
greater than \textbf{or equal} to $2\%$.
<<rule2>>=
summary(newclin[newclin[,45]=="M",48])
summary(newclin[newclin[,45]=="U",48])
@ 

Wherever possible, the mutation status should be supported by
computations of the percent homology.  However, these supporting data
are missing for the patients listed below:
<<>>=
w <- which(is.na(newclin$Consensus.IGHV.gene.identity.to.germline....)
           & ! is.na(newclin$Consensus.mutation.status))
newclin[w, c(45,47)]

@ 

\section{Computed Quantities}

We must convert some of the numerical measures into dichotomous
variables for future analyses.  These include:
\begin{itemize}
\item ZAP70 status
<<zap70>>=
zap.flow <- newclin$ZAP70.result.UCSD....positive.cells.by.flow.
zap.ihc <- newclin$ZAP70.result.MDA..by.immunohistochemistry.
zap70 <- zap.ihc
zap70[is.na(zap.ihc) & zap.flow > 20] <- "POS"
zap70[is.na(zap.ihc) & zap.flow < 20] <- "NEG"
sum(is.na(zap.ihc) & zap.flow==20, na.rm=TRUE) # oh good
@ 
\item Serum beta-2 microglobulin
<<b2m>>=
b2m <- rep(NA, nrow(newclin))
b2m[newclin$Serum.beta.2.microglobuline..mg.L. <= 4] <- "Low"
b2m[newclin$Serum.beta.2.microglobuline..mg.L. > 4] <- "High"
@ 
\item White blood count
<<wbc>>=
wbc <- rep(NA, nrow(newclin))
wbc[newclin$White.blood.count..G.L. < 150] <- "Low"
wbc[newclin$White.blood.count..G.L. >= 150] <- "High"
@ 
\item Conventional cytogenetics
<<cyt>>=
cyt <- rep(NA, nrow(newclin))
cyt[newclin$Absolute.number.of.abnormalities.in.conventional..cytogenetics < 3] <- "Simple"
cyt[newclin$Absolute.number.of.abnormalities.in.conventional..cytogenetics >= 3] <- "Complex"
@ 
\item CD38 levels
<<cd38>>=
cd38 <- rep(NA, nrow(newclin))
temp <- newclin$X..CD19.CD38.positive.CLL.cells.per.BM.lymphocytes..cases.with.PB.values.in.yellow.
cd38[temp >= 30] <- "High"
cd38[temp < 30] <- "Low"
@ 
\item Typical or atypical CLL
<<matutes>>=
matutes <- grep("Matutes", colnames(newclin))
atyp <- rep(NA, nrow(newclin))
atyp[newclin[, matutes] > 3] <- "Typical"
atyp[newclin[, matutes] <= 3] <- "Atypical"
@ 
\end{itemize}
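
Several of the dichotomizations above follow a common pattern, and the
base R function \rcode{cut} can express it in one call while
preserving missing values.  The following unevaluated sketch uses
made-up measurements; only the cutoffs (4 mg/L for B2M and 150 G/L for
WBC) are taken from the code above, and the variable names are
hypothetical.
<<sketch.cut,eval=FALSE>>=
# Hypothetical measurements; not taken from the clinical file.
b2m.val <- c(2.1, 4, 6.3, NA)
wbc.val <- c(80, 150, 210, NA)
# right=TRUE (the default): a value equal to 4 falls in "Low".
catB2M <- cut(b2m.val, breaks=c(-Inf, 4, Inf), labels=c("Low", "High"))
# right=FALSE: a value equal to 150 falls in "High".
catWBC <- cut(wbc.val, breaks=c(-Inf, 150, Inf), labels=c("Low", "High"),
              right=FALSE)
data.frame(b2m.val, catB2M, wbc.val, catWBC)
@ 
The \rcode{right} argument controls which category receives a value
that lies exactly on the cutoff, so it should be set to match the
inequalities used in the chunks above.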


We also want to be able to compute ``time to first significant
treatment''.  Here the treatments that are not significant have been
specified as follows:
<<nst>>=
notSignificantTreatment <- c("autologous activated T cells",
                             "Idiotype Vaccine",
                             "ISF-35 transduced autologous cell therapy",
                             "Rituxan",
                             "Rituxan + Cidofovir",
                             "Rituxan Early Stage",
#                             "Rituxan + GM-CSF",
                             "Rituxan + Steroids",
                             "Sunesis",
                             "Vidaza",
                             "Anti-CD40"
                             )
@ 
The next messy block of code marches through the first four
treatments to determine which one is the first significant treatment.
<<mess>>=
treatDates <- c(5, 9, 12, 15)
treatCols <- c(7, 10, 13, 16)
treatResp <- c(8, 11, 14, 17)

#godawful <- unlist(lapply(treatCols, function(x) as.character(newclin[,x])))
#
td <- newclin[, treatDates[1]]
tr <- as.character(newclin[, treatResp[1]])
tt <- as.character(newclin[, treatCols[1]])
whichTreat <- curDate <- curResp <- curTreat <- rep(NA, length(tt))
curTreat[!is.na(tt) & !(tt %in% notSignificantTreatment)] <- "Significant"
curTreat[!is.na(tt) & (tt %in% notSignificantTreatment)] <- "NotSignificant"
table(tt, curTreat)
sig <- !is.na(curTreat) & curTreat=="Significant"
curDate[sig] <- td[sig]
curResp[sig] <- tr[sig]
whichTreat[sig] <- 1

for (a in 2:4) {
  pending <- !is.na(curTreat) & curTreat=="NotSignificant"
  td <- newclin[, treatDates[a]]
  tr <- as.character(newclin[, treatResp[a]])
  tt <- as.character(newclin[, treatCols[a]])
  curTreat[pending & !is.na(tt) & !(tt %in% notSignificantTreatment)] <- "Significant"
  curTreat[pending & !is.na(tt) & (tt %in% notSignificantTreatment)] <- "NotSignificant"
  sig <- pending & !is.na(curTreat) & curTreat=="Significant"
  if (any(sig)) {
    curDate[sig] <- td[sig]
    curResp[sig] <- tr[sig]
    whichTreat[sig] <- a
  }
}
ns <- curTreat=="NotSignificant"
curDate[ns] <- NA
curResp[ns] <- NA
curTreat[ns] <- NA
curTreat[is.na(curTreat)] <- "NoSig"
@ 
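
The logic of the block above can be summarized as follows: for each
patient, find the earliest of the four treatments that is not on the
``not significant'' list.  The unevaluated sketch below expresses the
same idea on a toy matrix.  The shortened list \rcode{notSig}, the
matrix \rcode{toyTreat}, and the helper \rcode{firstSig} are
hypothetical, and the sketch assumes that treatments are recorded
without gaps starting from the first one; it is not the code used to
produce the results in this report.
<<sketch.firstsig,eval=FALSE>>=
notSig <- c("Rituxan", "Vidaza")
# Rows are patients; columns are the 1st through 4th treatment types.
toyTreat <- rbind(c("FCR",     NA,       NA, NA),
                  c("Rituxan", "FCR",    NA, NA),
                  c("Rituxan", "Vidaza", NA, NA),
                  c(NA,        NA,       NA, NA))
# Index of the first recorded treatment that is significant, else NA.
firstSig <- function(tt) {
  ok <- which(!is.na(tt) & !(tt %in% notSig))
  if (length(ok) > 0) ok[1] else NA_integer_
}
whichSig <- apply(toyTreat, 1, firstSig)
whichSig
@ 
The resulting index plays the role of \rcode{whichTreat}; it could
then be used to look up the matching date and response for each
patient.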

Now we can compute the time from sample or from diagnosis to the first
significant treatment.
<<times>>=
date.sig <- as.Date(curDate)
time.sample.2.sig.treat <- as.numeric(date.sig - date.sample, units="days")
time.diagnosis.2.sig.treat <- as.numeric(date.sig - date.diagnosis, units="days")
time.diagnosis.2.rx1 <- as.numeric(date.1st.rx - date.diagnosis, units="days")
time.diagnosis.2.lfu <- as.numeric(date.lfu - date.diagnosis, units="days")
time.diagnosis.2.death <- as.numeric(date.death - date.diagnosis, units="days")
time.sample.2.death <- as.numeric(date.death - date.sample, units="days")
@ 

We assemble all of this extra clinical information into a data frame.
<<extra>>=
extraClinical <- data.frame(CatB2M=factor(b2m, levels=c("Low", "High")),
                            CatWBC=factor(wbc, levels=c("Low", "High")),
                            CatCD38=factor(cd38, levels=c("Low", "High")),
                            CatCyto=factor(cyt, levels=c("Simple", "Complex")),
                            Matutes=factor(atyp),
                            AgeAtDx=as.numeric(age.at.dx),
                            ZAP70=zap70,
                            Date.1st.sig.treat=curDate,
                            Type.1st.sig.treat=curTreat,
                            Response.1st.sig.treat=curResp,
                            whichTreat=factor(paste("T", whichTreat, sep='')),
                            time.sample.2.sig.treat,
                            time.diagnosis.2.sig.treat,
                            time.sample.2.rx1,
                            time.diagnosis.2.rx1,
                            time.sample.2.lfu,
                            time.diagnosis.2.lfu,
                            time.sample.2.death,
                            time.diagnosis.2.death
                            )
rownames(extraClinical) <- rownames(newclin)
summary(extraClinical)
@ 

\section{Save}

We remove some objects that we will not need in the future.
<<cleanup>>=
rm(coltype, countdates, data.source, dateCol, iHateDates, justDates, fixup)
rm(temp, i, j, x, w, magic, recurse, odd.time, makecol)
rm(list=ls(pattern="date"))
rm(list=ls(pattern="dir"))
rm(list=ls(pattern="time"))
rm(zap.flow, zap.ihc, zap70, age.at.dx, b2m, wbc, cyt, cd38)
rm(atyp, matutes, notSignificantTreatment, td, tr, tt, pending)
rm(whichTreat, curTreat, curDate, curResp, sig, a, ns)
rm(treatCols, treatDates, treatResp, treatTypes)
@ 

Now we save everything so that we have it all in one place for future
analyses. 
<<save>>=
save.image(file='currData.rda')
@ 

\section{Appendix}



This analysis was run in the following directory:
<<getwd>>=
getwd()
@


<<lib2,echo=FALSE>>=
sessionInfo()
@ 

\end{document}