From MD Anderson Bioinformatics

Kevin R. Coombes, Ph.D.

In this document, we describe the standard analysis methods applied by the Section of Bioinformatics to a single microarray produced in the Cancer Genomics Core Laboratory. Experiments performed in the Core Laboratory use the two-color fluorescence technique, so each experiment involves two experiment samples. One sample is labeled with a green dye (Cy3) and the other is labeled with a red dye (Cy5). An experiment involving a single microarray is typically a screening experiment, whose goal is to identify genes that are differentially expressed between the two samples.

A paper describing the statistical model underlying the method decribed here is awaiting publication:

Baggerly, K.A., Coombes, K.R., Hess, K.R., Stivers, D.N., Abruzzo, L.V., Zhang, W. Identifying differentially expressed genes in cDNA microarray experiments. J Comp Biol. In press.

The S-Plus library described here is available from our Software and Services page.

## Contents |

Before starting the statistical analysis, we make it a point to look at the scanned images of the microarray to ensure that the quality is reasonable. We usually rely on MATLAB (The MathWorks, Inc., Natick MA) for this step, since it allows us to view the images with a variety of colormaps. (People aren't very good at discriminating shades of gray, and they are even less good at distinguishing colors that shade from black into red or green.) A false color image prepared in MATLAB can give a good idea of the quality of the spots and of any strange behavior in the background.

For more details about dissecting microarray images, see the tutorial by Keith Baggerly.

After we have decided that the images are good enough, we load the quantification data into S-Plus (Insightful Corp., Seattle WA) and start our analysis. (Note: All figures included in this document were generated using our working code in S-Plus from the images of an actual microarray.) We begin with a couple of sanity checks to make sure that this data set resembles other data sets we've seen in the past.

We start by estimating the distributions of the logarithm of the raw volume intensity estimates and the raw background estimates for each channel. The example we are using here shows a fairly typical distribution pattern. The volume has a high peak with a long right tail. The background has a tighter peak located at about the same point, without the extended tail. The background-corrected volume is much more spread out. We interpret these distributions as telling us that a large number of spots are near or below the background levels; the spots that truly show significant gene expression are the ones in the long tail at the right.

We also look at cartoon images of the log background; the log transformation allows us to see more detail at lower intensity levels, where the background is more important. In our example, the red channel background is plotted in the top figure and shows very few significant features; the most visible glitch consists of a few spots near the left end. As usually happens, the background in the green channel is both slightly higher on average (as reflected in the estimate of the median) and more variable. In this case, the background is substantially higher on the right third of the green channel.

Normalization is a critical step in the analysis of any microarray project. Because different channels have been scanned at different gains, we cannot simply take the ratios between the background-corrected values and use them immediately. They have to be normalized to remove systematic biases from the data.

Any normalization method relies on some assumptions about the underlying biology. Some researchers normalize the values on any slide so that the median ratio between channels is set equal to 1. This normalization method makes several assumptions. First, it assumes that most spots on the microarray show measurable levels of gene expression, so that it is possible to compute meaningful ratios. Second, it assumes that most genes are not differentially expressed in the biological samples being compared in the two channels. Finally, it assumes that the genes that do change are as likely to be overexpressed as underexpressed.

We see a number of problems with this common normalization method. First, we have often seen experiments where fewer than half of the spots are expressed at levels far enough above background to be considered reliable. Second, we have also seen experiments where one expects more genes to be overexpressed (because some known pathway has been activated) than underexpressed. Third, by normalizing ratios instead of intensities, one can only compare the ratios when analyzing experiments that use more than one microarray. This last restriction may be the most important one, since it has nontrivial implications for the design of multi-array experiments that restrict how many different comparisons can be carried out.

Our S-Plus tools allow for various normalization methods. By default, we normalize channels separately by rescaling the background-corrected values to set the 75th percentile of the spots to equal 1000. Under assumptions similar to the ones described above, this normalization method should bring the median ratio between reliably expressed spots very close to 1, thus giving results similar to those obtained by normalizing ratios. It has the added virtue of allowing us to extract individual channels from different experiments and compare them directly.

We can vary the percentile that is being normalized. We can also switch from using a fixed percentile of all spots to using the median of any preselected subset of spots.

After normalization, we typically replace all negative and all small positive normalized values by some threshold value. The default value of the threshold is 150 (where the 75th percentile has been scaled to 1000). This number was arrived at by trial and error over a moderate number of experiments. It has no statistical justification at the moment; it's just a magic number. In some experiments, we find it helpful to change the number. Sometimes we use a threshold as low as 1; we rarely go higher than 150. We are investigating better ways to set the threshold. The most promising idea is to set it at the 95th percentile of the estimates obtained at spots on the array that are known to be blanks (or negative controls).

We set thresholds primarily to allow us to take logarithms without feeling the pain inflicted by trying to compute the logarithm of a negative number or zero. We also don't want to divide by zero to take ratios; by setting a threshold, we make conservative estimates of ratios in the situation where we have clear signal in one channel and a signal that is not truly detectable above background in the other channel.

The microarrays produced in the Cancer Genomics Core Laboratory have the wonderful virtue that every gene of interest is spotted on the array in duplicate. Because of these replications, we can estimate how variable the measurements are. We make scatter plots comparing the first and second member of each pair of replicates. More useful, however, are the Bland-Altman plots we get by rotating those scatter plots by 45 degrees. The latter plots correspond to plotting the difference between the log intensity of the replicates (whose absolute value is an estimate of the standard deviation) as a function of the mean log intensity. We typically see a much wider spread (a "fan" or a "fishtail") at the left end of the plot (low intensity) and much tighter reproducibility at the right end (high intensity). We can fit a smooth curve estimating the typical variability as a function of the mean intensity. In the graphs included here, we have plotted confidence bands at plus and minus four times those smooth curves. Replicates whose difference lies outside these confidence bands are intrinsically less reliable than replicates whose difference lies within the bands.

Note that we do not discard these poor replicates from our analysis; we simply compute a value that tells us how much more variable they are than the typical replicate, so we can decide later on whether or not we want to trust them.

Now that we have estimated the variability within a channel by looking at the replicate spots, we are ready to compare the two channels. The smooth curves that estimate the within-channel standard deviation are pooled to produce a single curve estimating the standard deviation (sigma) across the entire array. Using this estimate, we can compute "smooth t-statistics" by looking at

tscore = (log(B) - log(A)) / sigma

for each replicated gene. We then choose some kind of significance level, and select the genes whose t-scores exceed this level as the ones that are differentially expressed.

We can display the results graphically in two ways. First, we can simply make a scatter plot of one channel against the other, highlighting the genes that are differentially expressed. Given the way the results are obtained, however, it makes more sense to again rotate the axes by 45 degrees. This rotation corresponds to plotting the log ratio as a function of the mean log intensity. We can include significance bands on this plot, which correspond to plotting a multiple of the pooled smooth curves, where the multiple used is the chosen significance cutoff. In the example below, we have picked out all genes with t-score greater than 3.

We usually produce a final set of pictures to decide how much we believe the results. We produce a histogram of all the t-scores arising from the genes on the microarray. This should look like a t-distribution (of all things), which is rather like a normal distribution with heavier tails. We also produce a normal probability plot (also called a qq-plot) to see that the extreme t-scores really do deviate from the sort of straight line one would expect from pure chance. We plot the t-scores as a function of the log intensity. Finally, we produce a plot.