Analyzing high-throughput bioinformatics data is hard. The data is highly multivariate and complex in structure, one must account for many factors associated with processing, and one must place the results of the analysis in a broader biological context. Unfortunately, this very complexity can serve to hide novel methodology against the noisy background of other structure, even though there is a great need for broader methodological insight.
Wandaliz Torres-Garcia won the award for best poster presentation at the 8th Annual Biotechnology and Bioinformatics Symposium (2011) that was held at Rice University, October 20-21 2011.
In an attempt to place more focus on the methodology of analyzing microarray data, the Duke University Bioinformatics Shared Resource launched the annual Critical Analysis of Microarray Data (CAMDA) competitions in 2001. These conferences seek to focus on the methodology by requiring presenters to analyze one of a small number of publicly available data sets, thus largely factoring out differential effects due to the biological context.
The MD Anderson Bioinformatics Section has entered and won numerous competitions. In all of these competitions, our entries were designed to address issues that we felt would advance the analysis of cancer data, and were related to data that we had encountered here at MD Anderson. We believe such advances are needed, and the evidence suggests that we're among the best there are at coming up with them.
The papers from the various CAMDA competitions have appeared in a series of books from Kluwer, "Methods of Microarray Data Analysis" I, II and III, with IV forthcoming. The papers from the Proteomics Data Mining Competition appeared in the September 2003 issue of the journal Proteomics.
The MD Anderson Bioinformatics Section first entered the CAMDA competition in 2001. (The section itself was formed in 2000.) One of the data sets chosen for analysis that year consisted of microarray profiles of the NCI60 cell lines. We chose to examine this data set because of the implications for cancer knowledge: characterizing broad differences in major cancer subgroups. To this end, we focused our attention on clustering the data, but only after using known biological subgroupings (e.g., which chromosome the gene resides on) to filter the data. In generic clustering, multiple signals can be obscured by broader background noise, and intelligent filtering can throw different groups into sharper relief. Our paper, "Biology-Driven Clustering of Microarray Data: Applications to the NCI60 Data Set" by Kevin R. Coombes, Keith A. Baggerly, David N. Stivers, Jing Wang, David Gold, Hsi-Guang Sung and Sang-Joon Lee won CAMDA 2001.
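The idea of filtering on a biological grouping before comparing profiles can be illustrated with a minimal sketch. This is not the analysis from the paper: the gene names, chromosome labels, expression values, and the helper functions `filter_genes` and `nearest` are all hypothetical, and a simple nearest-neighbor lookup stands in for a full clustering method.

```python
import math

def filter_genes(profiles, gene_chromosome, chromosome):
    """Keep only the genes annotated to the given chromosome."""
    keep = [g for g, c in gene_chromosome.items() if c == chromosome]
    return {s: {g: p[g] for g in keep} for s, p in profiles.items()}

def nearest(profiles, sample):
    """Return the other sample closest to `sample` in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((profiles[a][g] - profiles[b][g]) ** 2
                             for g in profiles[a]))
    others = [s for s in profiles if s != sample]
    return min(others, key=lambda s: dist(sample, s))

# Hypothetical annotations and log-expression values (illustration only).
gene_chromosome = {"g1": "1", "g2": "1", "g3": "17", "g4": "17"}
profiles = {
    "line_A": {"g1": 5.0, "g2": 5.0, "g3": 1.0, "g4": 1.0},
    "line_B": {"g1": 5.1, "g2": 4.9, "g3": 2.0, "g4": 2.0},
    "line_C": {"g1": 8.0, "g2": 8.0, "g3": 1.1, "g4": 0.9},
}

# Using all genes, the chromosome 1 genes dominate the distances.
print(nearest(profiles, "line_A"))
# Restricting to chromosome 17 reveals a different grouping.
print(nearest(filter_genes(profiles, gene_chromosome, "17"), "line_A"))
```

In this toy example the unfiltered comparison pairs line_A with line_B, while the chromosome 17 subset pairs it with line_C, which is the kind of signal that a generic analysis of all genes can hide.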
In 2002, one of the data sets provided for analysis involved data from "Project Normal", where replicate samples from 3 organs from each of 6 genetically identical male mice were run to establish the scale of noise to be expected in a biologically "null" context. Our interest initially focused on the problem of finding appropriate ways of comparing profiles across organs, which we saw as having implications for tracking the evolution of cancer throughout the body. The key difficulty here is that one of the standard assumptions in analyzing microarray data, that "most genes don't change" between contexts, can be violated when data are acquired from different tissues. As many common methods for calibrating the data (including those used in the initial paper) implicitly make this assumption, inferences using these methods can be misleading. We succeeded in showing that loess normalization was inappropriate in this context, and that other methods gave better results.
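Why the "most genes don't change" assumption matters can be seen in a small sketch. It uses global median centering as a simpler stand-in for loess normalization (both lean on the same assumption); the log-ratios, the bias value, and the function names are hypothetical and not taken from the Project Normal data.

```python
from statistics import median

def median_center(log_ratios):
    """Subtract the median log-ratio; assumes most genes are unchanged."""
    m = median(log_ratios)
    return [x - m for x in log_ratios]

def worst_error_after_centering(truth, bias):
    """Add a constant technical bias, normalize, and report the largest error."""
    observed = [t + bias for t in truth]
    recovered = median_center(observed)
    return max(abs(r - t) for r, t in zip(recovered, truth))

# Hypothetical true log2 ratios for 10 genes.
same_tissue = [0.0] * 8 + [2.0, -2.0]   # most genes unchanged
cross_tissue = [1.0] * 7 + [0.0] * 3    # most genes differ between organs

print(worst_error_after_centering(same_tissue, bias=0.5))
print(worst_error_after_centering(cross_tissue, bias=0.5))
```

When most genes are unchanged, centering removes the technical bias exactly. When most genes genuinely differ, the same step subtracts real biology along with the bias, so every gene ends up a full log2 unit off in this toy case.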
In addressing this question, however, we were confronted with a much larger and unforeseen problem: the data itself had been corrupted. In particular, midway through the experiment the links between gene annotations and the corresponding expression levels had been randomly scrambled for roughly 1/3 of the genes on the array. We identified the problem, determined how far it pervaded the data, and narrowed the candidates down to two annotation/expression linkages that might be correct. Then, by exploiting information available at the NCBI about which genes are more strongly expressed in which organs, we were able to deduce which of the two linkages was the right one. Our team was the only one to recognize the problem, and the only one to derive a solution. In addition, our presentation reoriented much of the conference that year to a discussion of how to ensure data quality in microarray experiments. Our paper, "Organ-Specific Differences in Gene Expression and UniGene Annotations Describing Source Material", by David Stivers, Jing Wang, Gary Rosner, and Kevin Coombes won CAMDA 2002. Further, we were invited to supply a second report for publication, "Monitoring the Quality of Microarray Experiments", by Kevin Coombes, Jing Wang, and Lynne Abruzzo.
In 2003, the context of the contest was changed to address a growing problem in the microarray community: the synthesis of results from multiple studies. Four data sets on lung cancer were supplied, and participants were required to analyze at least two. Such meta-analysis requires adjusting for differences in patient populations, differences in the physical structures of the arrays themselves, and potentially differences between labs. We elected to combine the data from studies done at Michigan and Harvard on two different types of Affymetrix gene chips, and further to use the data to address a biological question not addressed in the initial papers: identifying genes whose expression profiles supplied information about survival above and beyond that which could already be forecast using easily available clinical covariates such as age, gender, and smoking history. Most previous studies had looked for association with survival, but without the further conditioning. To combine the data from the two platforms, we used the actual sequence information from the individual probes to construct new synthetic probe sets across both chip types using the latest build of UniGene. This led to different groupings of probes than those initially suggested by Affymetrix; ours took into account the information gained in the years since the initial experiment. Quantifying gene expression with probes common to the two platforms thus enabled quantitative comparison. Having addressed the quantification problem, we then used multivariate Cox models to assess the predictive power added by each gene. A short list of 26 genes was assembled; several of these genes had been previously linked to lung cancer and only 1 of them had been found in either of the two initial studies. Our paper, "Identification of Prognostic Genes, Combining Information Across Different Studies and Oligonucleotide Arrays" by Jeff Morris, Guosheng Yin, Keith Baggerly, Chunlei Wu, and Li Zhang won CAMDA 2003.
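The probe-regrouping step can be sketched roughly as follows, assuming each chip is represented as a set of probe sequences and that a sequence-to-cluster lookup is available. The function `shared_probe_sets`, the short sequences, and the cluster identifiers are hypothetical placeholders, not the actual pipeline or real UniGene data.

```python
from collections import defaultdict

def shared_probe_sets(chip_a, chip_b, seq_to_cluster):
    """Group probe sequences present on both chips by the cluster they map to."""
    common = set(chip_a) & set(chip_b)
    probe_sets = defaultdict(list)
    for seq in sorted(common):
        cluster = seq_to_cluster.get(seq)
        if cluster is not None:          # skip probes with no current annotation
            probe_sets[cluster].append(seq)
    return dict(probe_sets)

# Hypothetical short sequences; real probes are longer oligonucleotides.
chip_a = {"ACGTAC", "TTGACA", "GGCATT", "CCATGA"}
chip_b = {"ACGTAC", "TTGACA", "CCATGA", "AATTCC"}
seq_to_cluster = {
    "ACGTAC": "Hs.1", "TTGACA": "Hs.1", "CCATGA": "Hs.2",
    "GGCATT": "Hs.3", "AATTCC": "Hs.2",
}

print(shared_probe_sets(chip_a, chip_b, seq_to_cluster))
```

Only probes found on both chip types survive, so expression summaries built from each resulting set are measured by the same sequences on either platform.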
Now, microarray data is not the only type of high-throughput biological data generating wide interest. Another type of data of increasing importance is proteomic profiling, derived for the most part from mass spectrometry. Drawing in part on the success of the CAMDA conferences, another group at Duke decided to host a Proteomics Data Mining competition in 2002 organized along the same lines. A data set consisting of MALDI-TOF (matrix-assisted laser desorption/ionization time-of-flight) spectra from a group of lung cancer patients and a separate group of normal controls was provided for analysis. We chose to enter because we were beginning to encounter this type of problem ourselves with data from MD Anderson.
Proteomic data is currently less well-characterized than microarray data, so a much larger part of our analysis was centered on gaining a better understanding of the structure of such data -- how spikes correspond to proteins, how there is a base level of noise associated with the chemical fixatives used in preparing the samples themselves, and how fractionation can be used to better focus on proteins of interest. In the process of this analysis, we also developed several different ways of visualizing the data. While we found structure, we also found that at this stage most of the structure was due to non-biological artifacts that needed to be removed before the biology could be found. Looking at the data, we were able to identify sinusoidal noise likely deriving from feedback from a loose power cord, recurrent periodic spikes associated with the flushing of a computer buffer in the recording stream, a breakdown of the fractionation machine being used, and evidence of detergent residue. We were able to identify several peaks of interest after cleaning out these confounding factors. None of the other entrants noticed these features in the data. Our paper, "A Comprehensive Approach to the Analysis of Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Proteomics Spectra from Serum Samples" by Keith Baggerly, Jeff Morris, Jing Wang, David Gold, Lian-Chun Xiao, and Kevin Coombes won the Proteomics Data Mining Competition.
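One way to spot periodic interference of the kind described above is to look for a dominant component in the Fourier spectrum of a trace. The following is a minimal sketch on synthetic data, not the procedure used in the paper; the function `dominant_period`, the signal length, the 50-sample period, and the peak position are all assumptions chosen for illustration.

```python
import cmath
import math

def dominant_period(signal):
    """Return the period (in samples) of the strongest non-constant Fourier component."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        # Naive discrete Fourier transform coefficient at frequency index k.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k

# Synthetic trace: a sinusoidal artifact plus one narrow peak.
n = 200
spectrum = [
    0.3 * math.sin(2 * math.pi * t / 50)            # periodic interference
    + math.exp(-((t - 120) ** 2) / (2 * 2.0 ** 2))  # one genuine peak
    for t in range(n)
]

print(dominant_period(spectrum))  # 50.0 for this synthetic input
```

A strong, isolated component like this is a hint that the signal contains an instrumental artifact, which can then be modeled and subtracted before peaks are examined.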