Keith Baggerly, Ph. D.
Most of the currently available publications on the analysis of microarray data focus on one stage of the problem: given the relative expression levels for various genes across different samples, which clusters of genes exhibit different levels of expression across some subset of the samples? While this problem is interesting, and we will be addressing it ourselves shortly, we're going to focus for now on where the data comes from.
Specifically, the "relative expression level" is typically quoted as a number. In most cases, this number nominally corresponds to the log of the ratio of two intensity measurements, corresponding to the two dyes with which the two types of cells being compared have been respectively tagged. Again, most commonly, these dyes are Cy5, red, and Cy3, green. Thus, the single number quoted is a derived value; it derives from the two intensity values.
These intensity values are also derived quantities. They are derived from images. For our purposes, these images will represent bedrock. Images are our raw data.
These images are scans of slides with lots of dots on them, each dot corresponding to the location of a cDNA probe to which mRNA from the cells of interest have been bound. In some of our experiments here, there are approximately 4800 dots on a slide, arranged in a 4 by 12 grid of clusters, with each cluster being a 10 by 10 grid of dots. When the image of the slide is produced, we get a 3248 by 1248 array of grayscale pixel values. Each pixel is a 16-bit intensity measurement, so values range from 0 to 65535. Each of these images is about 8 megabytes in size, which is large enough to make manipulation and transmission somewhat unwieldy. It should be noted that the 16-bit nature of the images can make things difficult to work with in wasy not having to do with file size. Some image viewing software assumes that the values are 8-bit, ranging from 0 to 255, and consequently either fails to show the large image or shows it as full white (all values set to 255). The values can be converted to 8-bit fairly simply, as 8-bit = floor(16-bit/256), but we lose gradation information. As the dynamic range of these images is quite large, this loss would be damaging for the purposes of analysis. The loss is not damaging for the purposes of getting a qualitative feel for what the data looks like, so we have taken the full data sets, initially 16-bit TIFF files, and converted them to 8-bit JPG files. The compression routines for JPG images are better, so the files have shrunk from 8.1M apiece to about 0.5-0.7M apiece. These large images are given below:
While the slide and dots look fairly regular, there is obvious speckling around the edges, and there is a scratch cutting across the right half of the fourth row of grids. The main point is that the data ar somewhat gritty. This is a statement of fact. Given that we know that variability and imperfections in our dot measurements are present, the question becomes one of how to handle them best.
To make things more concrete in getting down to the actual dot level, we will focus on a single 10 by 10 grid, the one in the top left of the large images. The red and green images are shown below; we will discuss how we cropped the images after some quick comments on the figures.
A few things are immediately apparent.
First, the "dots" are not really "dot-like" in most cases. Rather, there are rings of high intensity about lower-level centers. This is true across both channels, indicating that the ring pattern matches the amount of cDNA on the slide. It is not immediately clear why this ring pattern should form. Two mechanisms have been suggested to me so far: the "needle-like" deposition tool may cause a bowl-shaped distortion, or surface tension on the drop as it dries may cause clumping at the edges. A more detailed explanation would be welcome. In any event, how does morphology affect our measurements?
Second, the dots are not of equal size. This may make it difficult for an automatic procedure to find the appropriate placement of a dot-shaped target ring.
Third, there is some mottling in the upper left corner (most visible in the Green channel). How does this affect our assessment of how intense the dots in that region are?
To address these questions in more detail requires working with numerical arrays, not just judging by eye. We're going to do this by loading the images into Matlab(TM) for manipulation. We will focus on the small Green channel image, describing how we got the small image along the way. If you have matlab, you can follow along with the analysis after we get to the small image, grabbing the image from the page above (right-click) to work with.
First off, we read in the data. Our raw images were 16-bit grayscale TIFF files. Unfortunately, some of our image viewing software (xv) did not handle these files, so we needed to convert them to another format (using matlab).
ja = imread('July1#3.1626.04A.TIF','tif'); % Green channel jb = imread('July1#3.1626.04B.TIF','tif'); % Red channel
Invoking "whos" in matlab at this point returns
Name Size Bytes Class ja 3248x1248 8107008 uint16 array jb 3248x1248 8107008 uint16 array
Two 8M matrices of 16-bit unsigned integer arrays. To look at these, we want to 1) restrict our attention to submatrices of more manageable size, such as 300 by 300 or so, and 2) convert the uint16 class to double so that numerical operations (including surf) can be performed on the resultant matrices.
We did this in two stages:
jasmall = ja(1:700,1:500); a first guess as to an appropriate size jasmall = double(jasmall) + 1 % suggested in Matlab Documentation
To convert the image back to uint16 for TIFF output, we use
jasmall = uint16(jasmall - 1).
Note: double is 4 times the size of uint16, so converting all of ja would result in a 32M matrix. As it is, matlab is already slow in drawing mesh or surf plots with jasmall, so we will refrain from this to the extent possible.
gives a bird's eye view of the data. It also shows how matlab reads in the image data - it has read in the top left corner of the figure, but it plots it in a figure with top and bottom reversed. We have added a white box around the grid to be investigated further.
Viewing this part of the image shows the dot pattern that we expect, but it also shows brightness - the image is saturated at the edges of the TIF file. These edges need to be trimmed off before looking at the 3-d plots will be informative. In this case, we can grab the 10 by 10 dot grid of interest by letting
jasmall = jasmall(425:675,175:400);
Note that the first index (rows of the matrix) is on the vertical axis. This figure is small enough to work with, so we save it so that we can just work with this one in future sessions. To get the figures to show up in our version of Netscape, we'll save the small figures in PNG format.
We now start things over again.
Read in the data:
ja = imread('jatiny.png','png'); % load image in specified format ja = double(ja) + 1; % ja was initially of type "unsigned integer, 16-bit"
Now the data is in the form of a matrix of numbers that we can manipulate. This allows us to look at the data in different ways:
surf(ja); % produce a surface plot shading interp; % get rid of black lines showing the mesh view(2); % look at the surface from on top axis([1 226 1 251]); % trim the axes to get rid of white space at the edges
axis([1 226 1 251]);
All of the above images were grabbed from the matlab display as follows:
[X,map] = capture(1); % The figure is labeled figure 1 imwrite(X,map,'matlab0#.png'); other file formats work, depends on suffix
With this last figure, we have returned to an image like the grayscale one presented before. However, the image has had top and bottom reversed, as can be seen from the location of the mottling in the lower left. This has to do with matlab's indexing of image entries. Now let's take a closer look at a single dot, specifically the one in the lower right corner.
jadot = ja(30:50,190:210); surf(jadot); shading interp; view(2); axis([1 21 1 21]);
In looking at the dot, a few things show up - specifically, there is variability around the radius (it is not radially symmetric), and the intensity shows a definite "volcano" or "biali" shape where there is a ring of high intensity surrounding a central depression. (Side note: a biali is a deli item akin to a bagel, where the hole in the center is not complete. We found this a tempting analogy.) This seems to be the norm, rather than the exception. It is not completely clear to me why this ring structure should form. Wei has suggested that this may be the result of surface tension effects during drying. I would like to arrive at some certainty about this eventually.
One question about these dots pertains to how the overall intensity should be measured. Suggestions include using the mean intensity over the region, or the total intensity, or the median intensity, or some other robust estimate of "central intensity". In making this measurement, we also need do define the region. For now, we shall take the region to be the entire square area. Some automated array fitting software uses only the central disk, where the size of the disk is typically the same for all dots. The same analysis issues that are apparent with the rectangle will be present for all shapes. Intensity is to be measured against "local background". To get a better feel for this, we plotted the 441 sorted pixel values associated with jadot,
plot(sort(jadot(:))); xlabel('Sorted Pixel Number'); ylabel('Intensity'); title('Sorted Pixel Intensities for a Single Dot');
Looking at the plot of sorted intensities from this first dot, the median seems like a poor choice of typical intesity - almost half of the image is at the background level. This is of course affected by the size of the boundary assumed. Trimming away the outermost pixel on all sides would remove 80 background-level observations. This "boundary" size affects the mean intensity as well. Howver, it does not affect the total (summed) intensity if the background level is subtracted away so that all background pixels contribute zero. This leaves us with the problems of 1) assessing the background level, 2) cleaning out abnormalities of dust specks or scratches, 3) being wary of saturated intensity values. To illustrate the need for looking at different sized sqaures, let's look at the next dot over.
jadot2 = ja(30:50,170:190); % moving to the left of the first dot
This dot is slightly smaller in size, and the boundary clips the next dot over. We need to trim it back a bit.
jadot2 = ja(30:50,172:189);
Now we're pretty focused on just the dot we're interested in. Let's take another look at the intensity values.
This plot of sorted intensity values looks slightly different than the one before. In the first plot, there is a flat region, an S curve, and then a second increase. In the second plot, there is no bump in the ascension. The reason this is important is that we'd like to be able to define a "typical" plot so that we can better automate the procedure of deciding when the data is clean and how to perform operations on it. We'd like to look at a whole bunch of these plots to get a feeling for the variability of shapes, not so much of scale. To do this, we start by gridding the 10 by 10 dot setup, identifying good cut points. We do this by playing with the axis command, trimming off bits at a time. Good dividing values found by trial and error are:
horizontal_cuts = [16 34 54 73 92 114 132 152 172 190 210]; vertical_cuts = [29 50 70 90 110 130 150 170 190 209 230];
These divide up the dots as follows:
There seem to be more slight vertical offsets than horizontal ones, in that different columns may be at slightly different vertical offsets than others. This may be due to a column being put down all at once. The vertical displacement of the dots from one column to the next is subject to some random displacement, but the horizontal displacement is roughly constant between columns.
Looking at pixel intensities again, let's take a look at the sorted intensities for all 100 dots.
No common overall shape is readily apparent here, but we can see two things right away. (1) There are some dots where the intensity is saturated - the reading is 65536. The true reading should be higher, but we've run out of room. Saturated dots are a problem, because the intensity that we assess will be less than it should be, severely biasing ratios and other derived estimates. In this case the saturated dots are the two dots that show up as dark red. In this case, that made sense. When the slides are run, the same cDNA probe is often put down at different locations, to provide replicates which can let us assess the level of random variation in intensity. On the slide corresponding to this image, every dot on the left half of the grid was replicated in the same position on the right half. Thus, the two red dots correspond to the same gene. There is another pair of dots which saturates over a small portion of the signal; these are the dots with red rings that appear right above the solid red dots. (2) Looking at the left end of the sorted pixel intensity plot, we can see that the background intensity value (typically the lowest intensity in a square) varies quite a bit from square to square. It turns out that the lowest value at the left-hand side is about 3700, and the largest is about 6100. This variability highlights the importance of using a local estimate of background to correct for the total intensity present.