|

A Close Look at a Microarray Image
Keith Baggerly, Ph. D.
Most of the currently available publications on the analysis of
microarray data focus on one stage of the problem: given the
relative expression levels for various genes across different
samples, which clusters of genes exhibit different levels of
expression across some subset of the samples? While this problem
is interesting, and we will be addressing it ourselves shortly,
we're going to focus for now on where the data comes from.
Specifically, the "relative expression level" is typically
quoted as a number. In most cases, this number nominally corresponds to
the log of the ratio of two intensity measurements, corresponding
to the two dyes with which the two types of cells being compared
have been respectively tagged. Again, most commonly, these dyes
are Cy5, red, and Cy3, green. Thus, the single number quoted
is a derived value; it derives from the two intensity values.
These intensity values are also derived quantities. They are
derived from images. For our purposes, these images will represent
bedrock. Images are our raw data.
These images are scans of slides with lots of dots on them,
each dot corresponding to the location of a cDNA probe to
which mRNA from the cells of interest have been bound. In some
of our experiments here, there are approximately 4800 dots on
a slide, arranged in a 4 by 12 grid of clusters, with each
cluster being a 10 by 10 grid of dots. When the image of the
slide is produced, we get a 3248 by 1248 array of grayscale
pixel values. Each pixel is a 16-bit intensity measurement,
so values range from 0 to 65535. Each of these images is about
8 megabytes in size, which is large enough to make manipulation
and transmission somewhat unwieldy.
It should be noted that the 16-bit nature of the images can
make things difficult to work with in wasy not having to do with
file size. Some image viewing software
assumes that the values are 8-bit, ranging from 0 to 255, and
consequently either fails to show the large image or shows it
as full white (all values set to 255). The values can be converted
to 8-bit fairly simply, as 8-bit = floor(16-bit/256), but we lose
gradation information. As the dynamic range of these images is
quite large, this loss would be damaging for the purposes of analysis.
The loss is not damaging for the purposes of getting a qualitative
feel for what the data looks like, so we have taken the full data
sets, initially 16-bit TIFF files, and converted them to 8-bit JPG
files. The compression routines for JPG images are better, so
the files have shrunk from 8.1M apiece to about 0.5-0.7M apiece. These
large images are given below:
Red Channel (Cy5, sample, 0.74Meg)
Green Channel (Cy3, control, 0.55Meg)
While the slide and dots look fairly regular, there is obvious
speckling around the edges, and there is a scratch cutting across
the right half of the fourth row of grids. The main point is that
the data ar somewhat gritty. This is a statement of fact. Given
that we know that variability and imperfections in our dot measurements
are present, the question becomes one of how to handle them best.
To make things more concrete in getting down to the actual dot level,
we will focus on a single 10 by 10 grid, the one in the top left of
the large images.
The red and green images are shown below; we will discuss how we cropped
the images after some quick comments on the figures.
Red:
Green:
A few things are immediately apparent.
First, the "dots" are not
really "dot-like" in most cases. Rather, there are rings of high
intensity about lower-level centers. This is true across both channels,
indicating that the ring pattern matches the amount of cDNA on the
slide. It is not immediately clear why this ring pattern should form.
Two mechanisms have been suggested to me so far: the "needle-like"
deposition tool may cause a bowl-shaped distortion, or surface tension
on the drop as it dries may cause clumping at the edges. A more
detailed explanation would be welcome. In any event, how does morphology
affect our measurements?
Second, the dots are not of equal size. This may make it difficult for
an automatic procedure to find the appropriate placement of a dot-shaped
target ring.
Third, there is some mottling in the upper left corner (most visible
in the Green channel). How does this affect our assessment of how
intense the dots in that region are?
To address these questions in more detail requires working with
numerical arrays, not just judging by eye. We're going to do this
by loading the images into Matlab(TM) for manipulation. We will
focus on the small Green channel image, describing how we got the
small image along the way. If you have matlab, you
can follow along with the analysis after we get to the small
image, grabbing the image from the page above (right-click)
to work with.
First off, we read in the data. Our raw images were 16-bit grayscale
TIFF files. Unfortunately, some of our image viewing software (xv)
did not handle these files, so we needed to convert them to another
format (using matlab).
ja = imread('July1#3.1626.04A.TIF','tif'); % Green channel
jb = imread('July1#3.1626.04B.TIF','tif'); % Red channel
Invoking "whos" in matlab at this point returns
Name Size Bytes Class
ja 3248x1248 8107008 uint16 array
jb 3248x1248 8107008 uint16 array
Two 8M matrices of 16-bit unsigned integer arrays.
To look at these, we want to
1) restrict our attention to submatrices of more manageable
size, such as 300 by 300 or so, and
2) convert the uint16 class to double so that numerical
operations (including surf) can be performed on
the resultant matrices.
We did this in two stages:
jasmall = ja(1:700,1:500); a first guess as to an appropriate size
jasmall = double(jasmall) + 1 % suggested in Matlab Documentation
To convert the image back to uint16 for TIFF output,
we use jasmall = uint16(jasmall - 1).
Note: double is 4 times the size of uint16, so converting
all of ja would result in a 32M matrix. As it is, matlab is
already slow in drawing mesh or surf plots with jasmall,
so we will refrain from this to the extent possible.
Invoking
mesh(jasmall);
view(2);
gives a bird's eye view of the data.
It also shows how matlab
reads in the image data - it has read in the top left corner of
the figure, but it plots it in a figure with top and bottom reversed.
We have added a white box around the grid to be investigated further.
Viewing this part of the image shows the dot pattern that
we expect, but it also shows brightness - the image is saturated
at the edges of the TIF file. These edges need to be trimmed off
before looking at the 3-d plots will be informative.
In this case, we can grab the 10 by 10 dot grid of interest
by letting
jasmall = jasmall(425:675,175:400);
Note that the first index (rows of the matrix) is on the vertical axis.
This figure is small enough to work with, so we save it so
that we can just work with this one in future sessions. To
get the figures to show up in our version of Netscape, we'll
save the small figures in PNG format.
imwrite(jasmall,'jatiny.png');
We now start things over again.
Read in the data:
ja = imread('jatiny.png','png'); % load image in specified format
ja = double(ja) + 1; % ja was initially of type "unsigned integer, 16-bit"
Now the data is in the form of a matrix of numbers that we can manipulate.
This allows us to look at the data in different ways:
surf(ja); % produce a surface plot
shading interp; % get rid of black lines showing the mesh
view(2); % look at the surface from on top
axis([1 226 1 251]); % trim the axes to get rid of white space at the edges
surf(ja);
shading interp;
view(2);
axis([1 226 1 251]);
All of the above images were grabbed from the matlab display as follows:
[X,map] = capture(1); % The figure is labeled figure 1
imwrite(X,map,'matlab0#.png'); other file formats work, depends on suffix
With this last figure, we have returned to an image like the
grayscale one presented before. However, the image has had top
and bottom reversed, as can be seen from the location of the mottling
in the lower left. This has to do with matlab's indexing of image
entries. Now let's take a closer look at a single dot, specifically
the one in the lower right corner.
jadot = ja(30:50,190:210);
surf(jadot);
shading interp;
view(2);
axis([1 21 1 21]);
In looking at the dot, a few things show up - specifically, there
is variability around the radius (it is not radially symmetric),
and the intensity shows a definite "volcano" or "biali" shape
where there is a ring of high intensity surrounding a central depression.
(Side note: a biali is a deli item akin to a bagel, where the hole
in the center is not complete. We found this a tempting analogy.)
This seems to be the norm, rather than the exception. It is not completely
clear to me why this ring structure should form. Wei has suggested
that this may be the result of surface tension effects during drying.
I would like to arrive at some certainty about this eventually.
One question about these dots pertains to how the overall intensity
should be measured. Suggestions include using the mean intensity
over the region, or the total intensity, or the median
intensity, or some other robust estimate of "central intensity".
In making this measurement, we also need do define the region.
For now, we shall take the region to be the entire square area.
Some automated array fitting software uses only the central
disk, where the size of the disk is typically the same for all dots.
The same analysis issues that are apparent with the rectangle will be
present for all shapes.
Intensity is to be measured against "local background".
To get a better feel for this, we plotted the 441 sorted pixel values
associated with jadot,
plot(sort(jadot(:)));
xlabel('Sorted Pixel Number');
ylabel('Intensity');
title('Sorted Pixel Intensities for a Single Dot');
Looking at the plot of sorted intensities from this first dot,
the median seems like a poor choice of typical intesity -
almost half of the image is at the background level.
This is of course affected by the size of
the boundary assumed. Trimming away the outermost pixel on all
sides would remove 80 background-level observations. This "boundary"
size affects the mean intensity as well.
Howver, it does not affect the
total (summed) intensity if the background level is subtracted
away so that all background pixels contribute zero. This leaves us
with the problems of 1) assessing the background level, 2) cleaning
out abnormalities of dust specks or scratches, 3) being wary of
saturated intensity values. To illustrate the need for looking at
different sized sqaures, let's look at the next dot over.
jadot2 = ja(30:50,170:190); % moving to the left of the first dot
This dot is slightly smaller in size, and the boundary clips the
next dot over. We need to trim it back a bit.
jadot2 = ja(30:50,172:189);
Now we're pretty focused on just the dot we're interested in.
Let's take another look at the intensity values.
This plot of sorted intensity values looks slightly different than
the one before. In the first plot, there is a flat region, an S curve,
and then a second increase. In the second plot, there is no bump in the
ascension. The reason this is important is that we'd like to be able
to define a "typical" plot so that we can better automate the procedure
of deciding when the data is clean and how to perform operations on it.
We'd like to look at a whole bunch of these plots to get a
feeling for the variability of shapes, not so much of scale. To do
this, we start by gridding the 10 by 10 dot setup, identifying good
cut points. We do this by playing with the axis command, trimming off
bits at a time. Good dividing values found by trial and error are:
horizontal_cuts = [16 34 54 73 92 114 132 152 172 190 210];
vertical_cuts = [29 50 70 90 110 130 150 170 190 209 230];
These divide up the dots as follows:
There seem to be more slight vertical offsets than horizontal ones,
in that different columns may be at slightly different vertical offsets
than others.
This may be due to a column being put down all at once. The vertical
displacement of the dots from one column to the next is subject to
some random displacement, but the horizontal displacement is roughly
constant between columns.
Looking at pixel intensities again, let's take a look at the sorted
intensities for all 100 dots.
No common overall shape is readily apparent here, but we can see
two things right away. (1) There are some dots where the intensity
is saturated - the reading is 65536. The true reading should be
higher, but we've run out of room. Saturated dots are a problem,
because the intensity that we assess will be less than it should
be, severely biasing ratios and other derived estimates. In this case
the saturated dots are the two dots that show up as dark red.
In this case, that made sense. When the slides are run, the same
cDNA probe is often put down at different locations, to provide
replicates which can let us assess the level of random variation
in intensity. On the slide corresponding to this image, every
dot on the left half of the grid was replicated in the same position
on the right half. Thus, the two red dots correspond to the same gene.
There is another pair of dots which saturates over a small portion
of the signal; these are the dots with red rings that appear right
above the solid red dots.
(2) Looking at the left end of the sorted pixel intensity plot,
we can see that the background intensity value (typically the
lowest intensity in a square) varies quite a bit from square to
square. It turns out that the lowest value at the left-hand side is
about 3700, and the largest is about 6100. This variability
highlights the importance of using a local estimate of background
to correct for the total intensity present.
|