%
% Lectures for Microarray Course
%
% This is the template file for all lectures to ensure consistent
% layout, fonts, and colors.
%
% Remember to update the hypersetup to include individual lecture
% titles, change the leftheader to include the lecture title, and
% then start at \begin{document} ....
% Also, produce a printable (black and white version) according
% to the cues immediately following \begin{document}.
%
% Author: Keith Baggerly and Kevin R. Coombes
% Date : 19 September 2006
%
% Compile this document with:
%    pdflatex <lecture>.tex
% and then:
%    ppower4pb <input>.pdf <output>.pdf
%
%\documentclass[Screen4to3,25pt,headrule,footrule]{foils}
\documentclass[landscape,25pt,headrule,footrule]{foils}
\usepackage[pdftex]{color} %for colors
\definecolor{Teal}{rgb}{0.0,0.47,0.46}
\definecolor{PaleTeal}{rgb}{0.0,0.8,0.6}
\definecolor{DarkTeal}{rgb}{0.0,0.2157,0.2118}
\definecolor{Lemon}{rgb}{1.0,1.0,0.4}
\definecolor{Orange}{rgb}{1.0,0.6,0.0}
\definecolor{PaleBlue}{rgb}{0.0,0.8,1.0}
\definecolor{Gray7}{rgb}{0.7,0.7,0.7}
\definecolor{DarkBurgundy}{rgb}{0.2745,0.0,0.0}
\definecolor{Burgundy}{rgb}{0.62,0.06,0.41}
\definecolor{DarkGreen}{rgb}{0.5,0.8,0.6}
\definecolor{RGBblack}{rgb}{0.0,0.0,0.0} %black doesn't work (?) for
% {h,v}pagecolor so use this
\usepackage[pdftex]{graphicx} %for importing graphics
\usepackage{times} %for nice fonts
\usepackage[pdftex]{hyperref} %for hyperlinking and other effects
\hypersetup{
pdftitle={Lecture 7: Affymetrix, R, and BioConductor},
pdfsubject={Analysis of Microarray Data},
pdfauthor={Keith A. Baggerly and Kevin R. Coombes,
Department of Biostatistics and Applied Mathematics,
UT M.D. Anderson Cancer Center},
pdfkeywords={microarray,R},
pdfpagemode={None},
linkcolor=PaleBlue,
citecolor=Gray7,
pagecolor=PaleBlue,
urlcolor=PaleBlue
}
\usepackage{pause} %required for ppower4
\usepackage{background} %required for ppower4
\usepackage{pp4slide} %required for ppower4
\usepackage{pp4link}
\usepackage[pdftex]{geometry} %for resizing the header areas
\geometry{hscale=0.85, %make the slides 85% of horizontal
vscale=0.8 % and 80% of vertical
}
\setcounter{page}{0} %don't count the title page
\setlength{\foilheadskip}{-0.5in} %skip less space below the title
% use different colored bullets in nested lists
% color choices come from MDACC graphics standards
\renewcommand{\labelitemi}{\textcolor{PaleTeal}{$\bullet$}}
\renewcommand{\labelitemii}{\textcolor{Orange}{$\bullet$}}
\renewcommand{\labelitemiii}{\textcolor{PaleBlue}{$\bullet$}}
\renewcommand{\labelitemiv}{\textcolor{Lemon}{$\bullet$}}
% change color of slide titles
%\renewcommand\normalcolor{\color{black}}
%\def\Black#1{\color{black}#1}
% change the default background, again an MDACC standard
\vpagecolor[RGBblack]{Teal}
\raggedright %no right-justify
% put lecture specific items in the header and consistent course
% items in the footer
\leftheader{\hyperlink{start}{\textsc{Affymetrix, R, and BioConductor}}}
\rightheader{\textsf{\thepage}}
\MyLogo{{\copyright\ Copyright 2004--2007 Kevin R.~Coombes and Keith
A.~Baggerly}} % left footer
\rightfooter{\textsc{GS01 0163: Analysis of Microarray Data}}
\def\rcode#1{\texttt{\textcolor{Orange}{#1}}}
\SweaveOpts{prefix.string=Figs/ma07}
\SweaveOpts{strip.white=TRUE}
\begin{document}
\hypersetup{pdfpagetransition=Replace}
<<echo=FALSE>>=
options(width=52)
options(SweaveHooks=list(fig=function() par(bg="white")))
@
%\color{black} % for black-and-white version
% Also, run M-x replace-string \textcolor{white} \textcolor{black}.
% Then, using pdflatex only (no ppower4 step) should produce the
% black and white version with pauses shown as colored boxes.
% Remember to run M-x replace-string \textcolor{black} \textcolor{white}
% afterwards to reset!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Title Page
\title{\textcolor{Lemon}{GS01 0163\\ Analysis of Microarray Data}}
\author{\textcolor{white}{Keith Baggerly and Kevin Coombes}\\
\textcolor{white}{Section of Bioinformatics}\\
\textcolor{white}{Department of Bioinformatics and Computational Biology}\\
\textcolor{white}{UT M. D. Anderson Cancer Center}\\
\textcolor{Orange}{\tt kabagg@mdanderson.org}\\
\textcolor{Orange}{\tt kcoombes@mdanderson.org}}
\date{\textcolor{Lemon}{20 September 2007}}
\LogoOff
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Individual Lecture Title Page
\foilhead{\hypertarget{start}{Lecture 7}: Affymetrix, R, and BioConductor}
\LogoOn
% In most cases, this slide should be an outline of topics to be
% covered in this lecture, presented as an itemized list. Each item in
% the list should provide a hyperlink to the slide where discussion of
% that topic begins.
\begin{itemize}
\item \hyperlink{readaffy}{Reading Affymetrix data with BioConductor}
\item \hyperlink{affyprocess}{Processing Affymetrix data}
\item \hyperlink{sum}{Quantification = summarization}
\item \hyperlink{quant}{Description of quantification methods}
\begin{itemize}
\item \hyperlink{mas5}{MAS 5.0}
\item \hyperlink{rma}{RMA}
\item \hyperlink{pdnn}{PDNN}
\end{itemize}
\item \hyperlink{qc}{Quality control assessment}
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{readaffy}{Reading Affymetrix data with BioConductor}}
Once upon a time, the BioConductor \texttt{affy} package included a
graphical interface (via the function \rcode{ReadAffy()}) that made it
easier to read in Affymetrix data and construct \texttt{AffyBatch}
objects. This method started to break in version 2.4, and is now
essentially useless in version 2.5. If we are really lucky, they may
fix it in version 2.6.

In the meantime, we have to do everything by writing scripts by hand.
Since we are interested in documenting things and making them
reproducible, it may be to our advantage that the (irreproducible) GUI
no longer works.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Getting started}
We start by loading the basic Affymetrix R library.
<<>>=
library(affy)
@
% This output is not captured by Sweave, so we have to copy and paste
% it manually.
\begin{verbatim}
Loading required package: Biobase
Loading required package: tools
Welcome to Bioconductor
Vignettes contain introductory material. To view, type
'openVignette()'. To cite Bioconductor, see
'citation("Biobase")' and for packages 'citation(pkgname)'.
Loading required package: affyio
\end{verbatim}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{File locations}
As we did with dChip, we have to let the system know where the files
are located. In this example (as in most examples), all the CEL files
are stored in a single directory. The associated sample information
files are nearby.
<<>>=
datapath <- "G:/Public/Singh-Prostate-Affymetrix"
celpath <- file.path(datapath, "CelFiles")
@
There is a function that makes it easy to get a list of the CEL files:
<<>>=
filenames <- list.celfiles(celpath)
filenames[1:3]
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{AnnotatedDataFrame}
In earlier versions of BioConductor, the ``sample information file''
was loaded into an object of class \rcode{phenoData}. In the most
recent version, this has been replaced by the
\rcode{AnnotatedDataFrame} class. Both objects basically represent a
single data frame, with auxiliary information that expands on the
column names.
We can read the same sample information file that we used for dChip
into R and make it into an AnnotatedDataFrame:
<<>>=
adf <- read.AnnotatedDataFrame(file.path(datapath, "subsamples.txt"),
                               header=TRUE, sep="\t", row.names=2)
adf
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Actually creating the annotations}
As you just saw, the extra annotations are not created when you read a
file this way. We prepared another file describing the columns.
<<>>=
tmp <- read.table(file.path(datapath, "explain.txt"),
header=TRUE, sep="\t", quote="", row.names="Id")
varMetadata(adf) <- tmp
rm(tmp)
adf
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{MIAME}
MIAME = "Minimum Information About a Microarray Experiment"
Here is a MIAME object for the Singh experiment:
<<>>=
miame <- new("MIAME",
name = "Dinesh Singh",
lab = "William F. Sellers",
title = paste("Gene",
"expression correlates of clinical",
"prostate cancer behavior"),
pubMedIds = c("12086878"))
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Creating an AffyBatch}
R (on a 32-bit machine) is unable to read all 103 CEL files into an
AffyBatch at once. (We will talk later about how to get around some
of these limitations.) In this case, the sample file we read in
earlier actually only contains descriptions of $20$ of the CEL files,
which is an amount we can easily read.
<<makeAbatch,eval=FALSE>>=
filenames <- paste(pData(adf)$Array.name, "CEL", sep=".") #$
abatch <- read.affybatch(filenames=file.path(celpath, filenames),
                         phenoData=adf, description=miame)
rm(adf, miame)
@
<<echo=FALSE>>=
if (file.exists("abatch.Rda")) {
  load("abatch.Rda")
} else {
<<makeAbatch>>
  save(abatch, file="abatch.Rda")
}
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{The AffyBatch}
Now we can look at a summary of what we have so far:
<<>>=
abatch
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{The Dilution DataSet}
For the rest of this lecture, we are going to use the sample data set
that ships with BioConductor.
<<>>=
library(affydata)
data(Dilution)
Dilution
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{affyprocess}{Processing Affymetrix Data}}
BioConductor breaks down the low-level processing of Affymetrix data
into four steps. The design is highly modular, so you can choose
different algorithms at each step. It is highly likely that the
results of later (high-level) analyses will change depending on your
choices at these steps.
\begin{itemize}
\item Background correction
\item Normalization (on the level of features = probes)
\item PM-correction
\item Summarization
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Background correction}
The list of available background correction methods is stored in a
variable:
<<>>=
bgcorrect.methods
@
\begin{description}
\item[none]{Do nothing}
\item[mas]{Use the algorithm from MAS 5.0}
\item[rma]{Use the algorithm from the current version of RMA}
\item[rma2]{Use the algorithm from an older version of RMA}
\end{description}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Background correction in MAS 5.0}
MAS 5.0 divides the microarray (more precisely, the CEL file) into 16
regions. In each region, the intensity of the dimmest 2\% of features
is used to define the background level. Each probe is then adjusted by
a weighted average of these 16 values, with the weights depending on
the distance to the centroids of the 16 regions.
\begin{center}
\includegraphics[height=3in]{Figs/masbgcel}
\end{center}
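The zonal scheme can be sketched in a few lines of R. This is an
illustrative toy, not the Affymetrix implementation: it assumes a
$4\times4$ grid, a made-up smoothing constant, and simulated
intensities.

```r
# Illustrative sketch of MAS 5.0 zonal background (NOT the Affymetrix code).
# Assumptions: 4x4 grid of zones, 2% quantile per zone, inverse squared
# distance weights with an arbitrary smoothing constant; toy intensities.
mas5.zone.bg <- function(x, y, intensity, nzones = 4, smooth = 100) {
  zx <- cut(x, nzones, labels = FALSE)
  zy <- cut(y, nzones, labels = FALSE)
  zone <- (zy - 1) * nzones + zx
  zbg <- tapply(intensity, zone, quantile, probs = 0.02)  # zone background
  cx <- tapply(x, zone, mean)   # zone centroids
  cy <- tapply(y, zone, mean)
  sapply(seq_along(x), function(i) {
    w <- 1 / ((x[i] - cx)^2 + (y[i] - cy)^2 + smooth)
    sum(w * zbg) / sum(w)       # weighted average of zone backgrounds
  })
}
set.seed(1)
x <- runif(1000, 0, 100); y <- runif(1000, 0, 100)
intensity <- 50 + rexp(1000, 1/200)
bg <- mas5.zone.bg(x, y, intensity)
```

Because every probe borrows strength from all 16 zones, the estimated
background varies smoothly across the array instead of jumping at zone
boundaries.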
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Background correction in RMA}
RMA takes a different approach to background correction. First, only
the PM values are adjusted; the MM values are not changed. Second,
the observed PM intensities are modeled as a sum of
\vskip -0.5in
\begin{itemize}
\item an exponential signal with rate $\lambda$
\item normal noise with mean $\mu$ and variance $\sigma^2$ (truncated
at $0$ to avoid negatives).
\end{itemize}
If we observe a signal $X=x$ at a PM feature, we adjust it by
$$
E(s \mid X=x) = a + b \frac{\phi(a/b) - \phi((x-a)/b)}
{\Phi(a/b) + \Phi((x-a)/b) - 1}
$$
where $b=\sigma$ and $a = x -\mu-\lambda\sigma^2$.
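The adjustment formula translates directly into R, using \rcode{dnorm}
and \rcode{pnorm} for $\phi$ and $\Phi$. The parameter values below
are made up for illustration; in practice RMA estimates $\mu$,
$\sigma$, and $\lambda$ from the observed PM distribution.

```r
# Sketch of the RMA background adjustment E(s | X = x).
# mu, sigma, lambda below are made-up values; RMA estimates them from
# the observed PM distribution.  dnorm/pnorm supply phi and Phi.
rma.bg.adjust <- function(x, mu, sigma, lambda) {
  a <- x - mu - lambda * sigma^2
  b <- sigma
  a + b * (dnorm(a / b) - dnorm((x - a) / b)) /
          (pnorm(a / b) + pnorm((x - a) / b) - 1)
}
adjusted <- rma.bg.adjust(c(100, 200, 500), mu = 60, sigma = 20, lambda = 0.01)
```

Note that the adjusted values are always positive, which is one
selling point of the RMA background model.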
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Comparing background methods}
<<computebg,eval=FALSE>>=
d.mas <- bg.correct(Dilution[,1], "mas")
d.rma <- bg.correct(Dilution[,1], "rma")
bg.with.mas <- pm(Dilution[,1]) - pm(d.mas)
bg.with.rma <- pm(Dilution[,1]) - pm(d.rma)
@
<<echo=FALSE>>=
if (file.exists("bg.Rda")) {
  load("bg.Rda")
} else {
<<computebg>>
  save(d.mas, d.rma, bg.with.mas, bg.with.rma, file="bg.Rda")
}
@
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{}
\begin{center}
<<fig=TRUE>>=
hist(bg.with.mas)
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{}
\begin{center}
<<fig=TRUE>>=
hist(bg.with.rma)
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{}
<<>>=
summary(data.frame(bg.with.mas, bg.with.rma))
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Difference in background estimates}
On this array, RMA gives slightly larger background estimates, and its
estimates are more nearly constant across the array. The overall
differences can be displayed in a histogram.
\foilhead{}
\begin{center}
<<fig=TRUE>>=
hist(bg.with.rma - bg.with.mas)
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{How big is 20 units?}
<<>>=
tmp <- data.frame(pm(Dilution[,1]), mm(Dilution[,1]))
colnames(tmp) <- c("PM", "MM")
summary(tmp)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{sum}{Quantification = summarization}}
I'm going to avoid talking about normalization and PM correction for
the moment, and jump ahead to summarization. This step is the critical
final component in analyzing Affymetrix arrays, since it's the one
that combines all the numbers from the PM and MM probe pairs in a
probe set into a single number that represents our best guess at the
expression level of the targeted gene.
The available summarization methods, like the other available methods,
can be obtained from a variable.
<<>>=
express.summary.stat.methods
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{expresso}
The recommended way to put together all the steps for processing
Affymetrix arrays in BioConductor is with the function
\texttt{\textcolor{Orange}{expresso}}. Here is an example that blocks
everything except the summarization:
<<firstsum,eval=FALSE>>=
tempfun <- function(method) {
  expresso(Dilution, bg.correct=FALSE,
           normalize=FALSE,
           pmcorrect.method="pmonly",
           summary.method=method)
}
ad <- tempfun("avgdiff")      # MAS4.0
al <- tempfun("liwong")       # dChip
am <- tempfun("mas")          # MAS5.0
ar <- tempfun("medianpolish") # RMA
@
<<echo=FALSE>>=
if (file.exists("firstSum.Rda")) {
  load("firstSum.Rda")
} else {
<<firstsum>>
  save(ad, al, am, ar, file="firstSum.Rda")
}
@
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Bland-Altman (M-versus-A) plots}
Early in the study of microarrays, several groups (including ours)
introduced what have come to be known in the microarray world as
``M-versus-A'' plots or sometimes just MA-plots. Statisticians knew
these as ``Bland-Altman'' plots long before anyone started studying
microarrays, since they were among the first people to use them.
The problem being solved by a Bland-Altman MA-plot is that of
providing a useful graphical display of two vectors of data, $x$ and
$y$, which typically represent two measurements that should (almost
always) be the same. The first thing that comes to mind is to plot
$y$ against $x$ in the usual way, and see how well the points follow
the identity line $y = x$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Bland-Altman (M-versus-A) plots}
The difficulties with this simple approach are
\begin{enumerate}
\item Humans can recognize horizontal lines much more easily than they
can recognize diagonal lines.
\item Different aspect ratios (i.e., different scales along the axes)
can move the line away from the diagonal.
\item Deviations from a tilted diagonal line are hard to estimate
accurately by eye.
\end{enumerate}
The Bland-Altman solution is to rotate the plot by 45 degrees, which
turns the diagonal line into a horizontal line. To do this, they plot
the average ($(x+y)/2$) along the horizontal axis and the difference
($y-x$) along the vertical axis.
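As a toy illustration (simulated data, not the Dilution arrays), the
rotation is just two lines of arithmetic:

```r
# Toy Bland-Altman rotation (simulated data, not the Dilution arrays):
# two measurements of the same quantity, replotted as average vs difference.
set.seed(1)
x <- rnorm(500, mean = 100, sd = 10)
y <- x + rnorm(500, sd = 2)    # second measurement, should agree with x
A <- (x + y) / 2               # horizontal axis: average
M <- y - x                     # vertical axis: difference
plot(A, M)
abline(h = 0)                  # agreement shows up as a horizontal band
```

Systematic disagreement between the two measurements would appear as a
drift of the point cloud away from the horizontal line.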
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{MA-plots in BioConductor}
The \texttt{affy} package includes a function called
\texttt{\textcolor{Orange}{mva.pairs}} to make it easier to generate
these plots. (You should also check out the \textcolor{Orange}{MAplot}
function.) We are going to use this to compare the different
quantification/summary methods.
<<mvapairs,eval=FALSE>>=
temp <- data.frame(exprs(ad)[,1], exprs(al)[,1],
                   exprs(am)[,1], 2^exprs(ar)[,1])
dimnames(temp)[[2]] <- c('Mas4', 'dChip',
                         'Mas5', 'RMA')
mva.pairs(temp)
@
\foilhead{}
\begin{center}
<<fig=TRUE,echo=FALSE>>=
<<mvapairs>>
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Alternate preprocessing}
It is possible that the differences we see in the MA-plots are caused
because we did no processing before summarization. We will try again,
but this time we will use \texttt{expresso} to correct background with
the RMA method, perform quantile normalization, and just use the PM
values for summarization.
<<>>=
tempfun <- function(method) {
expresso(Dilution,
bgcorrect.method="rma",
normalize.method="quantiles",
pmcorrect.method="pmonly",
summary.method=method)
}
@
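Quantile normalization itself is simple enough to sketch in a few
lines of base R. This is a toy version of the idea; the \texttt{affy}
implementation handles ties and large matrices more carefully.

```r
# Toy sketch of quantile normalization: every column (array) is forced
# to share the same empirical distribution, namely the mean quantile
# profile.  (Illustrative only; ties are broken arbitrarily.)
quantile.normalize <- function(m) {
  ranks <- apply(m, 2, rank, ties.method = "first")
  means <- rowMeans(apply(m, 2, sort))   # average distribution
  apply(ranks, 2, function(r) means[r])  # substitute by rank
}
m <- cbind(a = c(5, 2, 3, 4), b = c(4, 1, 4, 2))
qn <- quantile.normalize(m)
```

After normalization the two columns contain exactly the same set of
values; only their order (which probe got which value) differs.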
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Alternate preprocessing}
Now we repeat the same commands as before, which use the four
different summarization methods on the same array and put them into a
temporary data frame for display.
<<secondsum,eval=FALSE>>=
ad <- tempfun("avgdiff")      # MAS4.0
al <- tempfun("liwong")       # dChip
am <- tempfun("mas")          # MAS5.0
ar <- tempfun("medianpolish") # RMA
@
<<echo=FALSE>>=
if (file.exists("secondSum.Rda")) {
  load("secondSum.Rda")
} else {
<<secondsum>>
  save(ad, al, am, ar, file="secondSum.Rda")
}
@
@
<<mvapairs2,eval=FALSE>>=
temp <- data.frame(exprs(ad)[,1], exprs(al)[,1],
                   exprs(am)[,1], 2^exprs(ar)[,1])
dimnames(temp)[[2]] <- c('Mas4', 'dChip',
                         'Mas5', 'RMA')
mva.pairs(temp)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Comparison of summarization methods}
\begin{center}
<<fig=TRUE,echo=FALSE>>=
<<mvapairs2>>
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{quant}{How do the summarization methods work?}}
Recall first what we know about the oldest method for processing
Affymetrix data, the AvDiff method of MAS4.0. This method
\begin{enumerate}
\item Uses the background-correction method described above, based on
the bottom 2\% of probes.
\item Normalizes by scaling the median intensity to a fixed value.
\item Computes the PM $-$ MM differences.
\item Trims outliers and computes the average (mean) of the
differences.
\end{enumerate}
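A toy version of the AvDiff computation, with a crude trimming rule
(the probe-set numbers and the trimming rule are illustrative; MAS
4.0's actual outlier rule was more elaborate):

```r
# Toy AvDiff computation (made-up probe-set numbers; MAS 4.0's actual
# outlier rule was more elaborate than dropping the extremes).
pm <- c(500, 620, 480, 550, 5000, 530)   # note one outlier probe
mm <- c(100, 150, 120, 110,  130, 140)
d <- pm - mm
keep <- d > min(d) & d < max(d)  # crude trim: drop smallest and largest
avdiff <- mean(d[keep])
```

Without the trim, the single outlier probe would dominate the average;
with it, the summary stays near the bulk of the differences.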
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Review of dChip}
We have also looked previously at the dChip method:
%\item Does not perform background correction.
\begin{enumerate}
\item Normalizes using an ``invariant set'' method (described later).
\item Optionally uses both the PM and MM values, or the PM values only.
\item Fits a statistical model for sample $i$, and probe $j$,
$$
MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon
$$
$$
PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \epsilon
$$
Focusing on the PM $-$ MM differences, this model estimates the probe
affinities $\phi_j$ and the expression values $\theta_i$.
\end{enumerate}
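The heart of the dChip fit can be sketched as alternating least
squares on toy data. This ignores dChip's outlier handling and uses a
simple scale constraint on the affinities.

```r
# Sketch of the dChip fit: alternating least squares for the rank-one
# model d_ij = theta_i * phi_j on toy PM-MM differences (ignores
# dChip's outlier handling; uses a simple scale constraint on phi).
set.seed(1)
theta.true <- c(1, 2, 4)                  # 3 toy arrays
phi.true <- runif(10, 0.5, 2)             # 10 toy probes
d <- outer(theta.true, phi.true) + rnorm(30, sd = 0.1)
theta <- rep(1, 3); phi <- rep(1, 10)
for (it in 1:50) {
  theta <- as.vector(d %*% phi) / sum(phi^2)        # update expressions
  phi   <- as.vector(t(d) %*% theta) / sum(theta^2) # update affinities
  phi   <- phi * sqrt(length(phi) / sum(phi^2))     # fix the scale
}
```

The constraint on $\sum_j \phi_j^2$ is needed because the model is
only identified up to a rescaling of $\theta$ and $\phi$.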
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{mas5}{Improving Robustness: MAS 5.0}}
Affymetrix learned something from the modelling process. In
particular, they noted the importance of multiplicative adjustments
and statistical measures with some means of identifying outliers. They
also noted that negative values from AvDiff just were not well received
by biologists or statisticians.
They modified their algorithm in several ways. Instead of the straight
MM value, they subtract a ``change threshold'' (CT) which is
guaranteed to be lower than the PM value. Basically, they ``fix'' the
MM value whenever it is at least as large as PM.
Next, they shifted to the log scale to capture multiplicative
effects.
Finally, they used a robust statistical method to downweight outliers
instead of their earlier \textit{ad hoc} method.
\begin{center}
signal = exp(Tukey Biweight(log($PM_{j} - CT_j$)))
\end{center}
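A sketch of a one-step Tukey biweight on the log scale, using the
constants $c=5$ and $\epsilon=0.0001$ from the Affymetrix
documentation and toy PM $-$ CT values:

```r
# Sketch of a one-step Tukey biweight mean on the log scale, using the
# constants c = 5 and eps = 0.0001 from the Affymetrix documentation;
# the PM - CT values are toy numbers.
tukey.bw <- function(x, c = 5, eps = 1e-4) {
  m <- median(x)
  s <- median(abs(x - m))                    # MAD (no consistency factor)
  u <- (x - m) / (c * s + eps)
  w <- ifelse(abs(u) < 1, (1 - u^2)^2, 0)    # zero weight for outliers
  sum(w * x) / sum(w)
}
signal <- exp(tukey.bw(log(c(400, 470, 360, 440, 4870, 390))))
```

The wild value 4870 gets weight zero, so the signal lands near the
bulk of the other differences instead of being dragged upward.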
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{MAS 5.0 vs MAS 4.0}
It was at this stage that Affy decided it wasn't going to fight to
have the best algorithm; it would let others play that game. Indeed,
it could reap the benefits of better algorithms by selling more chips.
To let people test their own models, they created and posted a test
dataset: The Affy Latin Square Experiment.
Using the test set, they could demonstrate that the MAS5 signal
statistic is an improvement on AvDiff. It tracks nominal fold changes
better, and it is less variable.
What it still doesn't do is use information across chips.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{rma}{Robust Multichip Analysis: RMA}}
RMA (Irizarry et al, Biostatistics 2003) tries to take the better
aspects of both dChip and MAS 5.0, and to add some further twists.
Earlier in this lecture, we described the statistical model used by
RMA to perform background correction.
They normalize using the ``quantile normalization'' method, which we
will describe in a later lecture.
They throw away the MM values entirely. They contend that there are
too many cases where MM $>$ PM, and hence including the MMs introduces
more variability than the correction is worth. (They are probably
correct.)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Robust Multichip Analysis: RMA}
As with dChip, the RMA summarization method is built around a model:
$$
\log_2(PM_{ij} - BG) = \mu_i + \alpha_j + \epsilon_{ij}
$$
(array $i$, probe $j$), fit robustly using ``median polish''.

The parameters of this model are fit using multiple chips.

Unlike dChip, the random jitter ($\epsilon_{ij}$) is introduced
on the log scale as opposed to the raw scale. This more
accurately captures the fact that more intense probes
are more variable.
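We can mimic the fit on a toy probe set with \rcode{medpolish} from
the base \texttt{stats} package (rows play the role of arrays here;
the real RMA code is more involved):

```r
# Toy RMA-style summarization with median polish (stats::medpolish).
# Rows play the role of arrays, columns of probes; values are simulated
# log2 background-corrected PM intensities.
set.seed(1)
mu <- c(8, 9, 10)                 # true log2 expression on 3 arrays
alpha <- rnorm(6, sd = 0.5)       # probe affinities for 6 probes
y <- outer(mu, rep(1, 6)) + outer(rep(1, 3), alpha) + rnorm(18, sd = 0.1)
fit <- medpolish(y, trace.iter = FALSE)
estimate <- fit$overall + fit$row # per-array expression estimates
```

The overall-plus-row effects recover the per-array expression levels,
while the column effects absorb the probe affinities.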
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{pdnn}{Incorporating other information: PDNN}}
The above methods are all mathematical, in that they focus solely on
the observed values without trying to explain those values.
Why should some probes give consistently stronger signals than others?
What governs nonspecific binding?
In general, these will depend on the exact sequence of the probe, and
the thermodynamics of the binding.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Fitting the thermodynamics}
Li Zhang introduced the Position-Dependent Nearest Neighbor
(PDNN) model (\textit{Nat Biotech}, 2003; 21:818). Unlike dChip and
RMA, the parameters for the PDNN model can all be estimated from a
single chip, in large part because the number of parameters is much
smaller.
He posits a scenario where the chance of binding is dictated by the
probe sequence, and shifts the mathematical modeling back from the
expression values to the probe sequences.
The model parameters depend on:
\begin{enumerate}
\item The position $k$ of a base pair in the sequence.
\item Interactions with nearest neighbors: knowing $k$, we must also
know what is at $k-1$ and $k+1$.
\end{enumerate}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{What is the model?}
The observed signal $Y_{ij}$ for probe
$i$ in the probe set for gene $j$ is modeled as
$$
Y_{ij} = \frac{N_j}{1 + \exp(E_{ij})}
       + \frac{N^*}{1 + \exp(E^*_{ij})} + B
$$
In this model, there are two global terms that need to be estimated:
the background, $B$, and the number, $N^*$, of RNA molecules
contributing to non-specific binding (NSB).
The quantity of interest is $N_j$, the number of expressed mRNA
molecules contributing to gene-specific binding (GSB).
The binding energies $E_{ij}$ and $E^*_{ij}$ are sums over
contributions from each position.
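As a sketch, the signal equation is a one-liner once the parameters
are known; all the values below are made up for illustration.

```r
# Sketch: the PDNN signal for one probe, given (entirely made-up)
# parameter values.  N.gene and E.gsb are the gene-specific terms,
# N.star and E.nsb the non-specific ones, B the global background.
pdnn.signal <- function(N.gene, E.gsb, N.star, E.nsb, B) {
  N.gene / (1 + exp(E.gsb)) + N.star / (1 + exp(E.nsb)) + B
}
y <- pdnn.signal(N.gene = 2000, E.gsb = 0.5, N.star = 500, E.nsb = 2, B = 100)
```

Lower binding energies make the logistic terms closer to $1/2$, so
more of the available molecules contribute to the observed signal.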
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{What are the model parameters?}
For example, consider a probe with sequence
\begin{center}
CACCCAGCTGGTCCTGTGGATGGGA
\end{center}
\begin{itemize}
\item We write this as an ordered list of neighboring pairs:
\begin{enumerate}
\item C, A \hskip100pt energy $= w_1 \nu(b_1, b_2)$
\item A, C \hskip100pt energy $= w_2 \nu(b_2, b_3)$
\item C, C \hskip100pt energy $= w_3 \nu(b_3, b_4)$
\item C, C \hskip100pt energy $= w_4 \nu(b_4, b_5)$
\item C, A \hskip100pt energy $= w_5 \nu(b_5, b_6)$
\item A, G \hskip100pt energy $= w_6 \nu(b_6, b_7)$
\item G, C \hskip100pt energy $= w_7 \nu(b_7, b_8)$
\item C, T \hskip100pt energy $= w_8 \nu(b_8, b_9)$
\item etc.
\end{enumerate}
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Free energy}
The free energy for a perfect match is
$$
E_{ij} = \sum_k w_k \nu(b_k, b_{k+1})
$$
There is a similar formula for binding in the presence of
mismatches. In total, there are $2\times24$ weight parameters,
$2\times16$ neighboring base-pair parameters, 2 global parameters,
plus one expression parameter per probe set. Because there are many
probes in each probe set, we can fit all these parameters with a
single chip.
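The energy sum can be sketched for the example probe above. The
weights $w_k$ and nearest-neighbor energies $\nu$ used here are random
placeholders, not Zhang's fitted values.

```r
# Sketch of the free-energy sum for the example probe; the weights w_k
# and the nearest-neighbor energies nu() are random placeholders, NOT
# Zhang's fitted values.
probe <- "CACCCAGCTGGTCCTGTGGATGGGA"
bases <- strsplit(probe, "")[[1]]
pairs <- paste0(bases[-length(bases)], bases[-1])  # 24 neighboring pairs
set.seed(1)
nu <- setNames(runif(16, 0.5, 1.5),                # one value per dinucleotide
               as.vector(outer(c("A","C","G","T"),
                               c("A","C","G","T"), paste0)))
w <- dnorm(1:24, mean = 12.5, sd = 6)              # weights peak mid-probe
E <- sum(w * nu[pairs])
```

Each 25-mer probe contributes 24 dinucleotide terms, which is why a
single chip with many probes per probe set suffices to fit the model.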
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Fitted weight parameters}
\begin{center}
\includegraphics[height=6in]{Figs/zhang-weight}
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Using PDNN in R}
The implementation of the PDNN method is contained in a separate
BioConductor package. When you load the package library, it updates
the list of available methods.
<<>>=
library(affypdnn)
express.summary.stat.methods
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Using PDNN in R}
One should note that the PDNN method does not follow the standard
four-step procedure used by \texttt{expresso}. Instead of background
correction, the method starts immediately with quantile normalization.
The model can be fit separately on the PM and MM probes, or the MM
probes can be discarded. Background is estimated along with the energy
parameters and expression parameters as part of a single model.
In particular, you must use a variant of \texttt{expresso} called
\texttt{\textcolor{Orange}{expressopdnn}}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Which method is best?}
Well, all of the above methods are implemented in BioConductor, and
we're going to try a few head-to-head comparisons later.

In this context, it's worth thinking about how we can
define a measure of ``goodness''. Hmm?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{qc}{Quality control assessment}}
A critical part of the analysis of any set of microarray experiments
is ensuring that the arrays are of reasonable quality. BioConductor
includes several methods to assist with the QC effort for Affymetrix
projects.
The first step is typically to look at the images, which we did in the
previous lecture.
We can also look at the distributions of intensities to see how well
they match.

BioConductor includes tools to compute some summary statistics that
tell us about the relative quality and comparability of arrays.
\foilhead{A summary view of four images}
\begin{center}
<<fig=TRUE>>=
boxplot(Dilution, col=2:5)
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{The distribution of feature intensities}
\begin{center}
<<fig=TRUE>>=
hist(Dilution, col=2:5, lty=1)
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Simple Affy}
We can use the \texttt{\textcolor{Orange}{simpleaffy}} package to
compute some QC summary statistics.
<<>>=
require(simpleaffy) # load the library
Dil.qc <- qc(Dilution) # compute the QC metrics
@
Computing the metrics will take a little time, and then we can start
to look at them.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Background}
The first check we make is that the background values across the
arrays should be roughly comparable. In the four arrays from the
dilution experiment, that seems to be the case.
<<>>=
avbg(Dil.qc) # average background
maxbg(Dil.qc) # maximum background
minbg(Dil.qc) # minimum background
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Global scaling factors}
As mentioned above, the standard Affymetrix normalization procedure
involves globally rescaling the arrays to set the median probe
intensity to the same level. Affymetrix says that the scaling factors
should not differ by more than 3-fold if we want to compare arrays.
<<>>=
sfs(Dil.qc)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Percent present calls}
Extremely low (below about 30\%) or high (above about 60\%) values for
the percentage of probes called present also signal potential quality
problems.
<<>>=
percent.present(Dil.qc)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{3'/5' ratios}
Affymetrix includes probes at the 3' and 5' ends of some control
genes; the ratios should be less than 3.
% cheat to get it to fit on one page
\begin{verbatim}
> ratios(Dil.qc)
AFFX-HSAC07.3'/5' AFFX-HUMGAPDH.3'/5'
20A 0.6961423 0.4429746
20B 0.7208418 0.3529890
10A 0.8712069 0.4326566
10B 0.9313709 0.5726650
AFFX-HSAC07.3'/M AFFX-HUMGAPDH.3'/M
20A 0.1273385 -0.0602414
20B 0.1796231 -0.0136629
10A 0.2112914 0.4237527
10B 0.2725534 0.1125823
\end{verbatim}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{RNA degradation}
Individual (perfect match) probes in each probe set are ordered by
location relative to the 5' end of the targeted mRNA molecule. We also
know that RNA degradation typically starts at the 5' end, so we would
expect probe intensities to be lower near the 5' end than near the 3'
end.
The \texttt{affy} package of BioConductor includes functions to
summarize and plot the degree of RNA degradation in a series of
Affymetrix experiments. These methods pretend that something like
``the fifth probe in an Affymetrix probe set'' is a meaningful notion,
and they average these things over all probe sets on the array.
<<>>=
degrade <- AffyRNAdeg(Dilution)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Visualizing RNA degradation}
\begin{center}
<<fig=TRUE>>=
plotAffyRNAdeg(degrade, col=2:5)
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Model-based QC}
As we have seen, methods like dChip and RMA fit statistical models to
the probe intensities in order to summarize gene expression
values. The quantities associated with such models can also provide QC
information. The BioConductor package
\texttt{\textcolor{Orange}{affyPLM}} fits a probe-level model (PLM)
similar to RMA.
<<>>=
library("affyPLM")
pset <- fitPLM(Dilution)
@
The residuals (unexplained variation) from the model can be plotted
using the \texttt{image} function. Patterns here typically indicate
spatial flaws in the image that are not captured by the model. No such
features were noted in the Dilution experiment, so I will not
reproduce the pictures.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Relative Log Expression plots}
\begin{center}
<<fig=TRUE>>=
Mbox(pset)
@
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Normalized Unscaled Standard Error}
\begin{center}
<<fig=TRUE>>=
boxplot(pset)
@
\end{center}
\end{document}
% do this so we can re-run Sweave from a clean start
<<echo=FALSE>>=
detach("package:affypdnn")
detach("package:hgu95av2probe")
detach("package:matchprobes")
detach("package:affydata")
detach("package:hgu95av2cdf")
detach("package:affy")
detach("package:affyio")
detach("package:Biobase")
detach("package:tools")
rm(list=ls())
@