\documentclass[11pt]{article}
\usepackage{graphicx}
\usepackage{cite}
\usepackage{hyperref}

\pagestyle{myheadings}
\markright{documentDataSets; revised 2 Aug 2010}

\setlength{\topmargin}{0in}
\setlength{\textheight}{8in}
\setlength{\textwidth}{6.5in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}

\def\rcode#1{\texttt{#1}}
\def\fref#1{Figure~\ref{#1}}
\def\tref#1{Table~\ref{#1}}

\title{Documenting the CLL Datasets}
\author{Kevin R. Coombes}
\date{14 June 2010; REVISED 2 August 2010}

\SweaveOpts{prefix.string=Figures/documentDataSets,eps=FALSE}

<<echo=FALSE>>=
options(width=88)
options(SweaveHooks = list(fig = function() par(bg='white')),eps=FALSE)
@
<<echo=FALSE>>=
if (!file.exists("Figures")) {
  dir.create("Figures")
}
@

\begin{document}
\maketitle
\tableofcontents

\section{Executive Summary}
\subsection{Introduction}
The main goal of the study is to identify genetic abnormalities in CLL
that are associated with clinical outcome (including overall survival
and time-to-treatment).

\subsubsection{Aims/Objectives}
The goals of this analysis are to document the existing microfluidics,
SNP chip, and clinical datasets. We are particularly interested in
determining which patient samples were included in which study and in
ensuring that we have clinical data for all patients.

\subsection{(Statistical) Methods}
We simply load the different datasets, make sure the identifiers are in
the same format, and count overlaps.

\subsection{Results}
We have the following datasets:
\begin{enumerate}
\item Microfluidics Card A: 94 genes in 67 samples.
\item Microfluidics Card B: 96 genes in 79 samples.
\item Union of Cards~A and~B: 190 genes in 67 samples.
\item Microfluidics Card C: 48 genes in 76 samples.
\item Union of Cards~A, B, and~C: 42 genes in 143 samples.
\item Illumina SNP chip: Data on 173 samples.
\item Union of A, B, C with SNP data: 101 samples.
\item Union of A, B with SNP data: 50 samples.
\item Clinical data: 105 clinical data columns on 296 samples.
\item Affymetrix U133A data on 30 samples. Of these, 19 were profiled
  on SNP chips, 28 were profiled on the set of 48 microfluidics genes,
  and 17 were profiled on all platforms.
\end{enumerate}
We also verified that all patients whose samples were used in any of
the microfluidics or SNP studies are present in the clinical dataset.

\subsection{Conclusions}
The next step is to check that the clinical data meets all of the rules
that Carmen specified to ensure consistency.

\section{Understanding the Data Sets}
We begin by defining the paths to the data sources. These paths differ
depending on the operating system of the machine we are using to
access them.
<<>>=
dir.root <- ifelse(.Platform$OS.type == "windows", "//mdabam1", "/data")
dir.base <- file.path(dir.root, "bioinfo", "CLL", "Abruzzo")
dir.ldoc <- file.path(dir.base, "LDOC1")
dir.Boutput <- file.path(dir.base, "Analysis-MF-CardB", "Output")
dir.cardC <- file.path(dir.base, "Data-MF-CardC")
dir.affy <- file.path(dir.base, "Data-Affymetrix", "QuantifiedArrays")
dir.workspace <- ifelse(.Platform$OS.type == 'windows',
                        "//mdadqsfs02/workspace", "/workspace")
dir.SNP <- file.path(dir.workspace, "krc", "Abruzzo", "SNP-Analysis")
@

\subsection{Card A Dataset}
We load the older version of the clinical data along with the first
microfluidics study.
<<>>=
cardA.clin <- read.table(file.path(dir.ldoc, "clinical-mf-data.tsv"),
                         header=TRUE, sep="\t", quote='',
                         comment.char='', row.names=1)
dim(cardA.clin)
@
We separate the clinical data from the microfluidics gene expression
values.
<<>>=
cardA.data <- cardA.clin[, 39:132]
cardA.genes <- colnames(cardA.data)
cardA.clin <- cardA.clin[, 1:38]
@
Next, we standardize the sample names. In the old microfluidics study,
we used ``CLL\#'', where the number had no initial padding. In the SNP
study, we have used ``CL\#\#\#'', where the numbers are always padded
with leading zeros to reach three digits.
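
As a compact illustration of this renaming rule, the following sketch
zero-pads the numeric part with \rcode{sprintf}. The helper name
\rcode{standardizeName} and the example identifiers are hypothetical
and are not part of the analysis; the chunk is marked
\rcode{eval=FALSE} so that it does not affect any results.
<<eval=FALSE>>=
## Hypothetical helper (an illustration only, not used in the analysis):
## convert old-style "CLL#" identifiers to the padded "CL###" form.
standardizeName <- function(x) {
  num <- as.integer(sub("^CLL", "", x))  # drop the prefix, keep the number
  sprintf("CL%03d", num)                 # pad with zeros to three digits
}
standardizeName(c("CLL7", "CLL42", "CLL153"))
@
The analysis itself applies the same rule explicitly, by building the
padding string and pasting it in front of the number.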
<<>>=
mf.names <- rownames(cardA.clin)
cll.number <- as.integer(substring(mf.names, 4))
padding <- rep("", length(cll.number))
padding[cll.number < 100] <- "0"
padding[cll.number < 10] <- "00"
mf.standard.names <- paste("CL", padding, cll.number, sep='')
cardA.names <- rownames(cardA.clin) <- rownames(cardA.data) <- mf.standard.names
rm(mf.names, cll.number, padding, mf.standard.names)
@
In summary, the data from Card~A consists of clinical information on
\Sexpr{nrow(cardA.clin)} patient samples together with microfluidics
semiquantitative real-time polymerase chain reaction (QRT-PCR) data on
\Sexpr{ncol(cardA.data)} genes.

\subsection{Card B Dataset}
Next we load the data from the second part of the microfluidics study.
<<>>=
cardB.data <- read.table(file.path(dir.Boutput, "normCtValuesCardB.tsv"),
                         header=TRUE, sep="\t", quote='',
                         comment.char='', row.names=1)
dim(cardB.data)
@
The microfluidics data looks at \Sexpr{nrow(cardB.data)} genes (five
of which are housekeeping genes) in \Sexpr{ncol(cardB.data)} patient
samples. As before, we have to standardize the sample names.
<<>>=
mf.names <- sapply(colnames(cardB.data), function(x) strsplit(x, "\\.")[[1]][1])
names(mf.names) <- NULL
cll.number <- as.integer(substring(mf.names, 4))
padding <- rep("", length(cll.number))
padding[cll.number < 100] <- "0"
padding[cll.number < 10] <- "00"
mf.standard.names <- paste("CL", padding, cll.number, sep='')
cardB.names <- colnames(cardB.data) <- mf.standard.names
cardB.data <- t(cardB.data)
rm(mf.names, cll.number, padding, mf.standard.names)
@
Here we verify that all of the samples run on Card~A were also run on
Card~B.
<<>>=
length(cardA.names)
length(cardB.names)
sum(cardA.names %in% cardB.names)
all(cardA.names %in% cardB.names)
@
We also check the gene lists. Specifically, we want to confirm that
the only overlap is the five housekeeping genes.
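
A check of this kind can also be made self-enforcing. The sketch
below, on invented gene lists (the names \rcode{genesA},
\rcode{genesB}, and \rcode{expectedHK} are hypothetical and do not
come from the study), uses \rcode{intersect} and \rcode{setequal} to
stop with an error if the overlap is anything other than the expected
set. It is marked \rcode{eval=FALSE} and serves only as an
illustration.
<<eval=FALSE>>=
## Toy gene lists, invented for illustration only
genesA <- c("hkA", "hkB", "geneX", "geneY")
genesB <- c("hkB", "hkA", "geneZ")
expectedHK <- c("hkA", "hkB")
shared <- intersect(genesA, genesB)      # genes present on both cards
stopifnot(setequal(shared, expectedHK))  # error unless the overlap is exactly HK
@
The analysis instead counts the overlap and prints the shared genes so
that they can be inspected directly.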
<<>>=
cardB.genes <- colnames(cardB.data)
length(cardA.genes)
length(cardB.genes)
# check that the only overlap is the set of housekeeping genes
sum(overlap <- cardA.genes %in% cardB.genes)
HK.genes <- cardA.genes[overlap]
HK.genes
rm(overlap)
length(cardAB.genes <- unique(c(cardA.genes, cardB.genes))) # 5 of these are HK
@
In summary, the microfluidics data for Card~B consists of measurements
of \Sexpr{ncol(cardB.data)-5} new genes measured on
\Sexpr{nrow(cardB.data)} patient samples. However, only
\Sexpr{nrow(cardA.data)} of these samples were also measured on
Card~A.

\subsection{Card C Dataset}
Now we load the microfluidics data from Card~C.
<<>>=
cardC.data <- read.csv(file.path(dir.cardC, "normedCardC.csv"), row.names=1)
cardC.data <- t(cardC.data)
dim(cardC.data)
@
In this dataset, we padded the CLL identifier numbers with zeros, but
we used the ``CLL'' prefix rather than the ``CL'' prefix. So, we again
update the names to the current standard.
<<>>=
cardC.names <- sub("CLL", "CL", rownames(cardC.data))
rownames(cardC.data) <- cardC.names
@
We check that none of the samples on Card~C were run on Card~A.
However, nine of the twelve samples that were run on Card~B but not on
Card~A were repeated on Card~C.
<<>>=
sum(cardC.names %in% cardA.names)
cardAC.names <- sort(union(cardA.names, cardC.names))
sum(bc.common <- cardC.names %in% cardB.names)
cardC.names[bc.common]
rm(bc.common)
@
Now we gather some data on the genes used on Card~C. Since two of the
strings used to name genes changed, we have to fix them to make them
consistent with the earlier datasets.
<<>>=
cardC.genes <- colnames(cardC.data)
cardC.genes[cardC.genes == "18S"] <- "r18S"
cardC.genes[cardC.genes == "ATRX;LOC728849"] <- "ATRX"
@
Now we count various overlaps:
<<>>=
length(cardC.genes)
sum(cardC.genes %in% HK.genes)
sum(cardC.genes %in% cardA.genes)
sum(cardC.genes %in% cardB.genes)
# commonGenes is a logical vector; its sum counts the shared genes
sum(commonGenes <- cardC.genes %in% cardAB.genes)
new.genes <- cardC.genes[!(cardC.genes %in% cardAB.genes)]
new.genes
@
In summary, Card~C contains data on \Sexpr{length(cardC.genes)} genes
measured on \Sexpr{nrow(cardC.data)} patient samples. Since none of
these samples were included on Card~A, we have data on a total of
\Sexpr{nrow(cardA.data)+nrow(cardC.data)} different patients for
\Sexpr{sum(commonGenes)-5} non-housekeeping genes.

\subsection{SNP Dataset: Sample Names}
Because the SNP data set is so large, we only load the sample names at
this point.
<<>>=
load(file.path(dir.SNP, "allSampleNames.rda"))
snp.names <- names(sampleNames)
rm(sampleNames, shortNames)
@
As part of the analysis of the SNP data, we identified three samples
that failed QC, so we have to remove them from the SNP-chip list of
names.
<<>>=
failedQC <- c("CL013", "CL072", "CL153")
snp.names <- setdiff(snp.names, failedQC)
rm(failedQC)
length(snp.names)
@
Next, we count how many samples were included in both the
microfluidics study and the SNP study.
<<>>=
sum(snp.names %in% cardAC.names)
sum(cardAC.names %in% snp.names)
sum(snp.names %in% cardA.names)
sum(cardA.names %in% snp.names)
@
Just so we have a record, we list the sample IDs that were profiled on
microfluidics cards but not on SNP chips:
<<>>=
cardAC.names[!(cardAC.names %in% snp.names)]
@
We also list samples that were profiled on SNP chips but not on
microfluidics cards.
<<>>=
snp.names[!(snp.names %in% cardAC.names)]
@

\section{New Clinical Dataset}
We are now ready to load the current version of the clinical database.
This data comes from an Excel spreadsheet; the next lines of code
report the last modification date and time for this file, from the
operating system. This timestamp serves as a surrogate for the file
version.
<<>>=
data.source <- "CARMEN'S DREAM_08-09-10.xls"
file.info(data.source)$mtime
@
Now we actually load the clinical data, which comes from two different
worksheets within an Excel file.
<<>>=
fixFCRnames <- function(x) {
  y <- sub("\\(FCR\\) ", "FCR", x)
  y <- sub("FCR-GM", "F.G", y)
  w <- which(regexpr("FCR", x) > 0 )
  prefix <- substring(y, 1, 3)
  counter <- as.numeric(substring(y, 4))
  pad <- rep("", length(counter))
  pad[counter < 100] <- "0"
  pad[counter < 10] <- "00"
  y[w] <- paste(prefix, pad, counter, sep='')[w]
  y
}
library(gdata) # for read.xls
newclin.treatment <- read.xls(data.source, sheet=2, header=TRUE, as.is=TRUE)
newclin.treatment <- newclin.treatment[newclin.treatment$Date.of.birth != '',]
temp <- newclin.treatment$Sample.code
rownames(newclin.treatment) <- fixFCRnames(temp)
newclin.phenotype <- read.xls(data.source, sheet=3, header=TRUE, as.is=TRUE)
newclin.phenotype <- newclin.phenotype[newclin.phenotype$Date.of.birth != '',]
temp <- newclin.phenotype$Sample.code
rownames(newclin.phenotype) <- fixFCRnames(temp)
rm(temp, fixFCRnames)
@
We check for consistency between the two parts of the clinical data.
<<>>=
dim(newclin.treatment)
dim(newclin.phenotype)
all(rownames(newclin.treatment) == rownames(newclin.phenotype))
for (i in 1:5) {
  print(all(newclin.treatment[,i] == newclin.phenotype[,i]))
}
rm(i)
@
We remove the MRN, patient name, and redundant sample code.
<<>>=
write.csv(newclin.treatment[,1:5], file="ClinicalHIPPA.csv", row.names=FALSE)
newclin <- data.frame(newclin.treatment[, c(3, 5:45)],
                      newclin.phenotype[,6:68])
rm(newclin.treatment, newclin.phenotype)
dim(newclin)
@
In principle, all samples that were used in the microfluidics study
should still be included in the clinical dataset. Here we verify this
assertion.
<<>>=
all(cardA.names %in% rownames(newclin))
all(cardB.names %in% rownames(newclin))
all(cardC.names %in% rownames(newclin))
all(snp.names %in% rownames(newclin))
@
Phew. Every patient included in either the microfluidics study or in
the SNP chip study is included in the current clinical database.

\section{Bad Samples}
Carmen's review of the charts identified several patients who had
either been previously treated or had been misdiagnosed. In order to
track these patients (or to remove them from certain analyses), we add
another column to the clinical data.
<<>>=
SampleType <- rep("OK", nrow(newclin))
names(SampleType) <- rownames(newclin)
SampleType[names(SampleType) %in% c("CLZ.14", "FCR157", "FCR282")] <- "NotCLL"
SampleType[names(SampleType) %in% c("CL177", "CL215", "CL218", "CL114",
                                    "CL142", "CL130", "CL128", "CL030",
                                    "CL187", "CL164", "CL163", "CL115",
                                    "CL077", "CL172", "CL065", "CL089")] <- "PreTreated"
newclin$SampleType <- SampleType
rm(SampleType)
@

\section{Affymetrix Data}
A long time ago (in a galaxy far, far away), we profiled some CLL
samples using Affymetrix U133A microarrays. Here is a list of the
samples used in that study.
<<>>=
affy.list <- read.table(file.path(dir.affy, "mut-stat.txt"),
                        header=TRUE, sep="\t", row.names=2)
affy.names <- sub("CLL", "CL0", rownames(affy.list))
@
There were \Sexpr{length(affy.names)} patients included in that study,
and we have updated clinical data on all of them:
<<>>=
length(affy.names)
sum(affy.names %in% rownames(newclin))
@
Nineteen of these samples were included in the SNP study, 23 were
profiled on Cards~A and~B, and 28 were profiled on Cards~``A and B or
C''.
<<>>=
sum(affy.names %in% snp.names)
sum(affy.names %in% cardA.names)
sum(affy.names %in% cardAC.names)
sum(affy.names %in% intersect(cardAC.names, snp.names))
@

\section{Summary}
We also put together a summary of which samples have been profiled on
which platforms.
<<>>=
makecol <- function(namelist) {
  x <- rep("No", nrow(newclin))
  names(x) <- rownames(newclin)
  if (all(namelist %in% names(x))) {
    x[namelist] <- "Yes"
  } else {
    stop("Unknown names included in list")
  }
  factor(x)
}
temp <- data.frame(CardA=makecol(cardA.names),
                   CardB=makecol(cardB.names),
                   A.and.B=makecol(intersect(cardA.names, cardB.names)),
                   cardC=makecol(cardC.names),
                   A.and.B..or.C=makecol(union(cardA.names, cardC.names)),
                   snp=makecol(snp.names),
                   snp.and.MF=makecol(intersect(snp.names, cardAC.names)),
                   snp.and.A=makecol(intersect(snp.names, cardA.names)),
                   affy=makecol(affy.names),
                   snp.and.affy=makecol(intersect(snp.names, affy.names)),
                   MF.and.affy=makecol(intersect(cardAC.names, affy.names)),
                   A.and.affy=makecol(intersect(cardA.names, affy.names))
                   )
summary(temp)
write.csv(temp, "datasetSummary.csv")
@

\section{Save}
We load the SNP summary data.
<<>>=
load(file.path(dir.SNP, "colorset.rda"))
load(file.path(dir.SNP, "fullPool.rda"))
load(file.path(dir.SNP, "markerSet.rda"))
@
Now we save everything so that we have it all in one place for future
analyses.
<<>>=
save.image(file='combinableData.rda')
@

\section{Appendix}
This analysis was run in the following directory:
<<>>=
getwd()
@
<<>>=
sessionInfo()
@

\end{document}