% % Lectures for Microarray Course % % This is the template file for all lectures to ensure consistent % layout, fonts, and colors. % % Remember to update the hypersetup to include individual lecture % titles, change the leftheader to include the lecture title, and % then start at \begin{document} .... % Also, produce a printable (black and white version) according % to the cues immediately following \begin{document}. % % Author: Keith Baggerly and Kevin R. Coombes % Date : 19 September 2006 % % Compile this document with: % pdflatex .tex % and then: % ppower4pb .pdf .pdf % %\documentclass[Screen4to3,25pt,headrule,footrule]{foils} \documentclass[landscape,25pt,headrule,footrule]{foils} \usepackage[pdftex]{color} %for colors \definecolor{Teal}{rgb}{0.0,0.47,0.46} \definecolor{PaleTeal}{rgb}{0.0,0.8,0.6} \definecolor{DarkTeal}{rgb}{0.0,0.2157,0.2118} \definecolor{Lemon}{rgb}{1.0,1.0,0.4} \definecolor{Orange}{rgb}{1.0,0.6,0.0} \definecolor{PaleBlue}{rgb}{0.0,0.8,1.0} \definecolor{Gray7}{rgb}{0.7,0.7,0.7} \definecolor{DarkBurgundy}{rgb}{0.2745,0.0,0.0} \definecolor{Burgundy}{rgb}{0.62,0.06,0.41} \definecolor{DarkGreen}{rgb}{0.5,0.8,0.6} \definecolor{RGBblack}{rgb}{0.0,0.0,0.0} %black doesn't work (?) for % {h,v}pagecolor so use this \usepackage[pdftex]{graphicx} %for importing graphics \usepackage{times} %for nice fonts \usepackage[pdftex]{hyperref} %for hyperlinking and other effects \hypersetup{ pdftitle={Lecture 7: Affymetrix, R, and BioConductor}, pdfsubject={Analysis of Microarray Data}, pdfauthor={Keith A. Baggerly and Kevin R. Coombes, Department of Biostatistics and Applied Mathematics, UT M.D. 
Anderson Cancer Center},
  pdfkeywords={microarray,R},
  pdfpagemode={None},
  linkcolor=PaleBlue,
  citecolor=Gray7,
  pagecolor=PaleBlue,
  urlcolor=PaleBlue
}
\usepackage{pause}      %required for ppower4
\usepackage{background} %required for ppower4
\usepackage{pp4slide}   %required for ppower4
\usepackage{pp4link}
\usepackage[pdftex]{geometry} %for resizing the header areas
\geometry{hscale=0.85, %make the slides 85% of horizontal
  vscale=0.8           % and 80% of vertical
}
\setcounter{page}{0} %don't count the title page
\setlength{\foilheadskip}{-0.5in} %skip less space below the title

% use different colored bullets in nested lists
% color choices come from MDACC graphics standards
\renewcommand{\labelitemi}{\textcolor{PaleTeal}{$\bullet$}}
\renewcommand{\labelitemii}{\textcolor{Orange}{$\bullet$}}
\renewcommand{\labelitemiii}{\textcolor{PaleBlue}{$\bullet$}}
\renewcommand{\labelitemiv}{\textcolor{Lemon}{$\bullet$}}

% change color of slide titles
%\renewcommand\normalcolor{\color{black}}
%\def\Black#1{\color{black}#1}

% change the default background, again an MDACC standard
\vpagecolor[RGBblack]{Teal}
\raggedright %no right-justify

% put lecture specific items in the header and consistent course
% items in the footer
\leftheader{\hyperlink{start}{\textsc{Affymetrix, R, and BioConductor}}}
\rightheader{\textsf{\thepage}}
\MyLogo{{\copyright\ Copyright 2004--2007 Kevin R.~Coombes and Keith A.~Baggerly}} % left footer
\rightfooter{\textsc{GS01 0163: Analysis of Microarray Data}}

\def\rcode#1{\texttt{\textcolor{Orange}{#1}}}

\SweaveOpts{prefix.string=Figs/ma07}
\SweaveOpts{strip.white=TRUE}

\begin{document}
\hypersetup{pdfpagetransition=Replace}
<>=
options(width=52)
options(SweaveHooks=list(fig=function() par(bg="white")))
@
%\color{black} % for black-and-white version
% Also, run M-x replace-string \textcolor{white} \textcolor{black}.
% Then, using pdflatex only (no ppower4 step) should produce the
% black and white version with pauses shown as colored boxes.
% Remember to run M-x replace-string \textcolor{black} \textcolor{white}
% afterwards to reset!

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Title Page
\title{\textcolor{Lemon}{GS01 0163\\ Analysis of Microarray Data}}
\author{\textcolor{white}{Keith Baggerly and Kevin Coombes}\\
  \textcolor{white}{Section of Bioinformatics}\\
  \textcolor{white}{Department of Bioinformatics and Computational Biology}\\
  \textcolor{white}{UT M. D. Anderson Cancer Center}\\
  \textcolor{Orange}{\tt kabagg@mdanderson.org}\\
  \textcolor{Orange}{\tt kcoombes@mdanderson.org}}
\date{\textcolor{Lemon}{20 September 2007}}
\LogoOff
\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Individual Lecture Title Page
\foilhead{\hypertarget{start}{Lecture 7}: Affymetrix, R, and BioConductor}
\LogoOn

% In most cases, this slide should be an outline of topics to be
% covered in this lecture, presented as an itemized list. Each item in
% the list should provide a hyperlink to the slide where discussion of
% that topic begins.
\begin{itemize}
\item \hyperlink{readaffy}{Reading Affymetrix data with BioConductor}
\item \hyperlink{affyprocess}{Processing Affymetrix data}
\item \hyperlink{sum}{Quantification = summarization}
\item \hyperlink{quant}{Description of quantification methods}
  \begin{itemize}
  \item \hyperlink{mas5}{MAS 5.0}
  \item \hyperlink{rma}{RMA}
  \item \hyperlink{pdnn}{PDNN}
  \end{itemize}
\item \hyperlink{qc}{Quality control assessment}
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{readaffy}{Reading Affymetrix data with BioConductor}}

Once upon a time, the BioConductor \texttt{affy} package included a
graphical interface (via the function \rcode{ReadAffy()}) that made it
easier to read in Affymetrix data and construct \texttt{AffyBatch}
objects.

This method started to break in version 2.4, and is now essentially
useless in version 2.5. If we are really lucky, they may fix it in
version 2.6.

In the meantime, we have to do everything by writing scripts by hand.
Since we are interested in documenting things and making them
reproducible, it may be to our advantage that the (irreproducible) GUI
no longer works.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Getting started}

We start by loading the basic Affymetrix R library.
<>=
library(affy)
@
% This output is not captured by Sweave, so we have to copy and paste
% it manually.
\begin{verbatim}
Loading required package: Biobase
Loading required package: tools
Welcome to Bioconductor
Vignettes contain introductory material. To view, type
'openVignette()'. To cite Bioconductor, see 'citation("Biobase")'
and for packages 'citation(pkgname)'.
Loading required package: affyio
\end{verbatim}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{File locations}

As we did with dChip, we have to let the system know where the files
are located. In this example (as in most examples), all the CEL files
are stored in a single directory. The associated sample information
files are nearby.
<>=
datapath <- "G:/Public/Singh-Prostate-Affymetrix"
celpath <- file.path(datapath, "CelFiles")
@

There is a function that makes it easy to get a list of the CEL files:
<>=
filenames <- list.celfiles(celpath)
filenames[1:3]
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{AnnotatedDataFrame}

In earlier versions of BioConductor, the ``sample information file''
was loaded into an object of class \rcode{phenoData}. In the most
recent version, this has been replaced by the
\rcode{AnnotatedDataFrame} class.

Both objects basically represent a single data frame, with auxiliary
information that expands on the column names.
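To make the idea concrete, here is a small sketch (in plain Python rather than R, with made-up sample names and column descriptions; this is not Biobase's implementation) of what an AnnotatedDataFrame pairs together: the sample table itself, plus a second table whose entries explain the columns of the first.

```python
# Illustrative sketch of the AnnotatedDataFrame idea (NOT Biobase's
# implementation): a sample table plus a parallel table of column
# annotations. All names and values here are invented.
samples = {
    "Array.name": ["N01", "N02", "T01"],      # hypothetical CEL base names
    "Status":     ["normal", "normal", "tumor"],
}
var_metadata = {                               # expands on the column names
    "Array.name": "Base name of the CEL file",
    "Status":     "Tissue type of the sample",
}

def describe(col):
    """Return a column's values together with its annotation."""
    return (samples[col], var_metadata[col])

values, meaning = describe("Status")
print(values)   # ['normal', 'normal', 'tumor']
print(meaning)  # Tissue type of the sample
```

The point is only that the column metadata travels with the data, so later code can ask what a column means instead of relying on a separate README.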
We can read the same sample information file that we used for dChip
into R and make it into an AnnotatedDataFrame:
<>=
adf <- read.AnnotatedDataFrame(file.path(datapath, "/subsamples.txt"),
                               header=TRUE, sep="\t", row.names=2)
adf
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Actually creating the annotations}

As you just saw, the extra annotations are not created when you read a
file this way. We prepared another file describing the columns.
<>=
tmp <- read.table(file.path(datapath, "explain.txt"),
                  header=TRUE, sep="\t", quote="", row.names="Id")
varMetadata(adf) <- tmp
rm(tmp)
adf
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{MIAME}

MIAME = ``Minimum Information About a Microarray Experiment''

Here is a MIAME object for the Singh experiment:
<>=
miame <- new("MIAME",
             name = "Dinesh Singh",
             lab = "William F. Sellers",
             title = paste("Gene",
               "expression correlates of clinical",
               "prostate cancer behavior"),
             pubMedIds = c("12086878"))
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Creating an AffyBatch}

R (on a 32-bit machine) is unable to read all 103 CEL files into an
AffyBatch at once. (We will talk later about how to get around some of
these limitations.) In this case, the sample file we read in earlier
actually only contains descriptions of $20$ of the CEL files, which is
an amount we can easily read.
<>=
filenames <- paste(pData(adf)$Array.name, "CEL", sep=".") #$
abatch <- read.affybatch(filenames=file.path(celpath, filenames),
                         phenoData=adf,
                         description=miame)
rm(adf, miame)
@
<>=
if (file.exists("abatch.Rda")) {
  load("abatch.Rda")
} else {
<>
  save(abatch, file="abatch.Rda")
}
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{The AffyBatch}

Now we can look at a summary of what we have so far:
<>=
abatch
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{The Dilution DataSet}

For the rest of this lecture, we are going to use the sample data set
that ships with BioConductor.
<>=
library(affydata)
data(Dilution)
Dilution
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{affyprocess}{Processing Affymetrix Data}}

BioConductor breaks down the low-level processing of Affymetrix data
into four steps. The design is highly modular, so you can choose
different algorithms at each step. It is highly likely that the
results of later (high-level) analyses will change depending on your
choices at these steps.
\begin{itemize}
\item Background correction
\item Normalization (on the level of features = probes)
\item PM-correction
\item Summarization
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Background correction}

The list of available background correction methods is stored in a
variable:
<>=
bgcorrect.methods
@
\begin{description}
\item[none]{Do nothing}
\item[mas]{Use the algorithm from MAS 5.0}
\item[rma]{Use the algorithm from the current version of RMA}
\item[rma2]{Use the algorithm from an older version of RMA}
\end{description}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Background correction in MAS 5.0}

MAS 5.0 divides the microarray (more precisely, the CEL file) into 16
regions. In each region, the intensity of the dimmest 2\% of features
is used to define the background level. Each probe is then adjusted by
a weighted average of these 16 values, with the weights depending on
the distance to the centroids of the 16 regions.
\begin{center}
\includegraphics[height=3in]{Figs/masbgcel}
\end{center}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Background correction in RMA}

RMA takes a different approach to background correction. First, only
the PM values are adjusted; the MM values are not changed. Second,
they try to model the distribution of PM intensities as the sum of
\vskip -0.5in
\begin{itemize}
\item an exponential signal with mean $\lambda$
\item normal noise with mean $\mu$ and variance $\sigma^2$ (truncated
  at $0$ to avoid negatives).
\end{itemize}
If we observe a signal $X=x$ at a PM feature, we adjust it by
$$
E(s | X=x) = a + b \frac{\phi(a/b) - \phi((x-a)/b)}
                        {\Phi(a/b) + \Phi((x-a)/b) - 1}
$$
where $b=\sigma$ and $a = x-\mu-\lambda\sigma^2$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Comparing background methods}

<>=
d.mas <- bg.correct(Dilution[,1], "mas")
d.rma <- bg.correct(Dilution[,1], "rma")
bg.with.mas <- pm(Dilution[,1]) - pm(d.mas)
bg.with.rma <- pm(Dilution[,1]) - pm(d.rma)
@
<>=
if (file.exists("bg.Rda")) {
  load("bg.Rda")
} else {
<>
  save(d.mas, d.rma, bg.with.mas, bg.with.rma, file="bg.Rda")
}
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{}

\begin{center}
<>=
hist(bg.with.mas)
@
\end{center}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{}

\begin{center}
<>=
hist(bg.with.rma)
@
\end{center}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{}

<>=
summary(data.frame(bg.with.mas, bg.with.rma))
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Difference in background estimates}

On this array, RMA gives slightly larger background estimates, and
gives estimates that are more nearly constant across the array. The
overall differences can be displayed in a histogram.

\foilhead{}

\begin{center}
<>=
hist(bg.with.rma - bg.with.mas)
@
\end{center}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{How big is 20 units?}

<>=
tmp <- data.frame(pm(Dilution[,1]), mm(Dilution[,1]))
colnames(tmp) <- c("PM", "MM")
summary(tmp)
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{sum}{Quantification = summarization}}

I'm going to avoid talking about normalization and PM correction for
the moment, and jump ahead to summarization.

This step is the critical final component in analyzing Affymetrix
arrays, since it's the one that combines all the numbers from the PM
and MM probe pairs in a probe set into a single number that represents
our best guess at the expression level of the targeted gene.
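Before looking at the concrete methods, it may help to see the shape of the problem on a toy probe set. The sketch below (plain Python with invented intensities; it is not the affy package's code) contrasts a mean of the PM $-$ MM differences, in the spirit of average difference, with a robust log-scale summary, in the spirit of RMA.

```python
# Toy illustration of two summarization strategies; NOT the affy
# package's implementations, and the intensities are invented.
import math

pm = [1200.0, 950.0, 1100.0, 5000.0, 1050.0]  # toy perfect-match values
mm = [300.0, 280.0, 350.0, 320.0, 310.0]      # toy mismatch values

def avgdiff(pm, mm):
    """MAS 4.0-flavored average difference: mean of PM - MM.
    (The real AvDiff also trims outliers first.)"""
    diffs = [p - m for p, m in zip(pm, mm)]
    return sum(diffs) / len(diffs)

def robust_log_summary(pm):
    """RMA-flavored summary: median of the log2 PM values,
    mapped back to the raw scale; MM values are ignored."""
    logs = sorted(math.log2(p) for p in pm)
    n = len(logs)
    med = logs[n // 2] if n % 2 else (logs[n // 2 - 1] + logs[n // 2]) / 2
    return 2.0 ** med

print(avgdiff(pm, mm))         # dragged upward by the 5000 outlier
print(robust_log_summary(pm))  # barely notices it
```

The single aberrant probe pulls the plain mean far from the bulk of the probes, while the log-scale median stays put; the methods below differ mainly in how carefully they handle exactly this kind of probe-level misbehavior.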
The available summarization methods, like the other available methods,
can be obtained from a variable.
<>=
express.summary.stat.methods
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{expresso}

The recommended way to put together all the steps for processing
Affymetrix arrays in BioConductor is with the function
\texttt{\textcolor{Orange}{expresso}}. Here is an example that blocks
everything except the summarization:
<>=
tempfun <- function(method) {
  expresso(Dilution,
           bg.correct=FALSE,
           normalize=FALSE,
           pmcorrect.method="pmonly",
           summary.method=method)
}
ad <- tempfun("avgdiff")      # MAS4.0
al <- tempfun("liwong")       # dChip
am <- tempfun("mas")          # MAS5.0
ar <- tempfun("medianpolish") # RMA
@
<>=
if (file.exists("firstSum.Rda")) {
  load("firstSum.Rda")
} else {
<>
  save(ad, al, am, ar, file="firstSum.Rda")
}
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Bland-Altman (M-versus-A) plots}

Early in the study of microarrays, several groups (including ours)
introduced what have come to be known in the microarray world as
``M-versus-A'' plots or sometimes just MA-plots. Statisticians knew
these as ``Bland-Altman'' plots long before anyone started studying
microarrays, since they were among the first people to use them.

The problem being solved by a Bland-Altman MA-plot is that of
providing a useful graphical display of two vectors of data, $x$ and
$y$, which typically represent two measurements that should (almost
always) be the same.

The first thing that comes to mind is to plot $y$ against $x$ in the
usual way, and see how well the points follow the identity line
$y = x$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Bland-Altman (M-versus-A) plots}

The difficulties with this simple approach are
\begin{enumerate}
\item Humans can recognize horizontal lines much more easily than they
  can recognize diagonal lines.
\item Different aspect ratios (i.e., different scales along the axes)
  can move the line away from the diagonal.
\item Deviations from a tilted diagonal line are hard to estimate
  accurately by eye.
\end{enumerate}
The Bland-Altman solution is to rotate the plot by 45 degrees, which
turns the diagonal line into a horizontal line. To do this, they plot
the average ($(x+y)/2$) along the horizontal axis and the difference
($y-x$) along the vertical axis.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{MA-plots in BioConductor}

The \texttt{affy} package includes a function called
\texttt{\textcolor{Orange}{mva.pairs}} to make it easier to generate
these plots. (You should also check out the \textcolor{Orange}{MAplot}
function.) We are going to use this to compare the different
quantification/summary methods.
<>=
temp <- data.frame(exprs(ad)[,1], exprs(al)[,1],
                   exprs(am)[,1], 2^exprs(ar)[,1])
dimnames(temp)[[2]] <- c('Mas4', 'dChip', 'Mas5', 'RMA')
mva.pairs(temp)
@

\foilhead{}

\begin{center}
<>=
<>
@
\end{center}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Alternate preprocessing}

It is possible that the differences we see in the MA-plots arise
because we did no processing before summarization. We will try again,
but this time we will use \texttt{expresso} to correct background with
the RMA method, perform quantile normalization, and just use the PM
values for summarization.
<>=
tempfun <- function(method) {
  expresso(Dilution,
           bgcorrect.method="rma",
           normalize.method="quantiles",
           pmcorrect.method="pmonly",
           summary.method=method)
}
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Alternate preprocessing}

Now we repeat the same commands as before, which use the four
different summarization methods on the same array and put them into a
temporary data frame for display.
<>= ad <- tempfun("avgdiff") # MAS4.0 al <- tempfun("liwong") # dChip am <- tempfun("mas") # MAS5.0 ar <- tempfun("medianpolish") # RMA @ <>= if (file.exists("secondSum.Rda")) { load("secondSum.Rda") } else { <> save(ad, al, am, ar, file="secondSum.Rda") } @ <>= temp <- data.frame(exprs(ad)[,1], exprs(al)[,1], exprs(am)[,1], 2^exprs(ar)[,1]) dimnames(temp)[[2]] <- c('Mas4', 'dChip', 'Mas5', 'RMA') mva.pairs(temp) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \foilhead{Comparison of summarization methods} \begin{center} <>= <> @ \end{center} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \foilhead{\hypertarget{quant}{How do the summarization methods work?}} Recall first what we know about the oldest method for processing Affymetrix data, the AvDiff method of MAS4.0. This method \begin{enumerate} \item Uses the background-correction method described above, based on the bottom 2\% of probes. \item Normalizes by scaling the median intensity to a fixed value. \item Computes the PM $-$ MM differences. \item Trims outliers and computes the average (mean) of the differences. \end{enumerate} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \foilhead{Review of dChip} We have also looked previously at the dChip method: %\item Does not perform background correction. \begin{enumerate} \item Normalizes using an ``invariant set'' method (described later). \item Optionally uses either both PM and MM values or PM-only. \item Fits a statistical model for sample $i$, and probe $j$, $$ MM_{ij} = \nu_j + \theta_i \alpha_j + \epsilon $$ $$ PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \epsilon $$ Focusing on the PM $-$ MM differences, this model estimates the probe affinities $\phi_j$ and the expression values $\theta_i$. \end{enumerate} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \foilhead{\hypertarget{mas5}{Improving Robustness: MAS 5.0}} Affymetrix learned something from the modelling process. 
In particular, they noted the importance of multiplicative adjustments
and statistical measures with some means of identifying outliers. They
also noted that negative values from AvDiff just were not well
received by biologists or statisticians.

They modified their algorithm in several ways. Instead of the straight
MM value, they subtract a ``change threshold'' (CT), which is
guaranteed to be lower than the PM value. Basically, they ``fix'' the
MM value so that it is always smaller than the PM value.

Next, they shifted to the log scale to capture multiplicative
effects. Finally, they used a robust statistical method to downweight
outliers instead of their earlier \textit{ad hoc} method.
\begin{center}
signal = exp(Tukey Biweight(log($PM_{j} - CT_j$)))
\end{center}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{MAS 5.0 vs MAS 4.0}

It was at this stage that Affy decided it wasn't going to fight to
have the best algorithm; it would let others play that game. Indeed,
it could reap the benefits of better algorithms by selling more chips.

To let people test their own models, they created and posted a test
dataset: the Affy Latin Square Experiment.

Using the test set, they could demonstrate that the MAS5 signal
statistic is an improvement on AvDiff. It tracks nominal fold changes
better, and it is less variable.

What it still doesn't do is use information across chips.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{rma}{Robust Multichip Analysis: RMA}}

RMA (Irizarry et al., \textit{Biostatistics} 2003) tries to take the
better aspects of both dChip and MAS 5.0, and to add some further
twists.

Earlier in this lecture, we described the statistical model used by
RMA to perform background correction.

They normalize using the ``quantile normalization'' method, which we
will describe in a later lecture.

They throw away the MM values entirely. They contend that there are
too many cases where MM $>$ PM, and hence including the MMs introduces
more variability than the correction is worth.
(They are probably correct.)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Robust Multichip Analysis: RMA}

As with dChip, the RMA summarization method is built around a model:
$$
\log(\mathrm{medpol}(PM_{ij} - BG)) = \mu_i + \alpha_j + \epsilon_{ij}
$$
(array $i$, probe $j$). The parameters of this model are fit using
multiple chips.

Unlike dChip, the random jitter ($\epsilon$) is introduced on the log
scale as opposed to the raw scale. This more accurately captures the
fact that more intense probes are more variable.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{pdnn}{Incorporating other information: PDNN}}

The above methods are all mathematical, in that they focus solely on
the observed values without trying to explain those values.

Why should some probes give consistently stronger signals than others?
What governs nonspecific binding?

In general, these will depend on the exact sequence of the probe, and
the thermodynamics of the binding.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Fitting the thermodynamics}

Li Zhang introduced the Position-Dependent Nearest Neighbor (PDNN)
model (\textit{Nat Biotech}, 2003; 21:818). Unlike dChip and RMA, the
parameters for the PDNN model can all be estimated from a single chip,
in large part because the number of parameters is much smaller.

He posits a scenario where the chance of binding is dictated by the
probe sequence, and shifts the mathematical modeling back from the
expression values to the probe sequences.

The model parameters are:
\begin{enumerate}
\item The position $k$ of a base pair in the sequence.
\item Interactions with nearest neighbors: knowing $k$, we must also
  know what is at $k-1$ and $k+1$.
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{What is the model?}

The observed signal $Y_{ij}$ for probe $i$ in the probe set for gene
$j$ is modeled as
$$
Y_{ij} = \frac{N_j}{1 + \exp(E_{ij})} + \frac{N^*}{1 + \exp(E^*_{ij})} + B
$$
In this model, there are two global terms that need to be estimated:
the background, $B$, and the number, $N^*$, of RNA molecules
contributing to non-specific binding (NSB).

The quantity of interest is $N_j$, the number of expressed mRNA
molecules contributing to gene-specific binding (GSB).

The binding energies $E_{ij}$ and $E^*_{ij}$ are sums over
contributions from each position.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{What are the model parameters?}

For example, consider a probe with sequence
\begin{center}
CACCCAGCTGGTCCTGTGGATGGGA
\end{center}
\begin{itemize}
\item We write this as an ordered list of neighboring pairs:
  \begin{enumerate}
  \item C, A \hskip100pt energy $= w_1 \nu(b_1, b_2)$
  \item A, C \hskip100pt energy $= w_2 \nu(b_2, b_3)$
  \item C, C \hskip100pt energy $= w_3 \nu(b_3, b_4)$
  \item C, C \hskip100pt energy $= w_4 \nu(b_4, b_5)$
  \item C, A \hskip100pt energy $= w_5 \nu(b_5, b_6)$
  \item A, G \hskip100pt energy $= w_6 \nu(b_6, b_7)$
  \item G, C \hskip100pt energy $= w_7 \nu(b_7, b_8)$
  \item C, T \hskip100pt energy $= w_8 \nu(b_8, b_9)$
  \item etc.
  \end{enumerate}
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Free energy}

The free energy for a perfect match is
$$
E_{ij} = \sum_k w_k \nu(b_k, b_{k+1})
$$
There is a similar formula for binding in the presence of many
mismatches.

In total, there are $2 \times 24$ weight parameters, $2 \times 16$
neighboring base pair parameters, and $2$ global parameters, plus one
expression parameter per probe set. Because there are many probes in
each probe set, we can fit all these parameters with a single chip.
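The bookkeeping in the free-energy sum is easy to get wrong, so here is a minimal sketch of it in plain Python. The weights $w_k$ and stacking energies $\nu(\cdot,\cdot)$ below are invented purely for illustration (in the real model they are fitted from the chip); only the structure of the computation follows the formulas above.

```python
# Sketch of the PDNN free-energy bookkeeping. The weights w and the
# nearest-neighbor energies nu below are MADE UP for illustration; in
# the real model they are fitted parameters.
import math

probe = "CACCCAGCTGGTCCTGTGGATGGGA"  # the 25-mer from the slide

# hypothetical position weights, one per adjacent pair (24 of them),
# peaking near the middle of the probe
w = [1.0 / (1.0 + abs(k - 12)) for k in range(24)]

# hypothetical nearest-neighbor energies nu(b_k, b_{k+1}),
# one per ordered base pair (16 of them)
nu = {a + b: 0.1 * ((i % 5) - 2)
      for i, (a, b) in enumerate((a, b) for a in "ACGT" for b in "ACGT")}

def free_energy(seq):
    """E = sum_k w_k * nu(b_k, b_{k+1}) over the 24 adjacent pairs."""
    return sum(w[k] * nu[seq[k:k + 2]] for k in range(len(seq) - 1))

def pdnn_signal(E, E_star, N_gene, N_star, B):
    """Y = N_j/(1+exp(E)) + N*/(1+exp(E*)) + B, as in the model."""
    return N_gene / (1 + math.exp(E)) + N_star / (1 + math.exp(E_star)) + B

E = free_energy(probe)
print(round(E, 4))
print(pdnn_signal(E, E_star=5.0, N_gene=1000.0, N_star=200.0, B=50.0))
```

Note how few numbers are involved: one weight per position and one energy per ordered base pair, shared by every probe on the chip, which is why the whole model can be fit from a single array.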
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Fitted weight parameters}

\begin{center}
\includegraphics[height=6in]{Figs/zhang-weight}
\end{center}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Using PDNN in R}

The implementation of the PDNN method is contained in a separate
BioConductor package. When you load the package library, it updates
the list of available methods.
<>=
library(affypdnn)
express.summary.stat.methods
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Using PDNN in R}

One should note that the PDNN method does not follow the standard
four-step procedure used by \texttt{expresso}. Instead of background
correction, the method starts immediately with quantile
normalization. The model can be fit separately on the PM and MM
probes, or the MM probes can be discarded. Background is estimated
along with the energy parameters and expression parameters as part of
a single model.

In particular, you must use a variant of \texttt{expresso} called
\texttt{\textcolor{Orange}{expressopdnn}}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Which method is best?}

Well, all of the above methods are implemented in BioConductor. We're
going to try a few head-to-head comparisons later. In this context,
it's worth thinking about how we can define a measure of
``goodness''. Hmm?

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{\hypertarget{qc}{Quality control assessment}}

A critical part of the analysis of any set of microarray experiments
is ensuring that the arrays are of reasonable quality. BioConductor
includes several methods to assist with the QC effort for Affymetrix
projects.

The first step is typically to look at the images, which we did in the
previous lecture. We can also look at the distributions of intensities
to see how well they match.
BioConductor includes tools to compute some summary statistics that
tell us about the relative quality and comparability of arrays.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{A summary view of four images}

\begin{center}
<>=
boxplot(Dilution, col=2:5)
@
\end{center}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{The distribution of feature intensities}

\begin{center}
<>=
hist(Dilution, col=2:5, lty=1)
@
\end{center}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Simple Affy}

We can use the \texttt{\textcolor{Orange}{simpleaffy}} package to
compute some QC summary statistics.
<>=
require(simpleaffy)    # load the library
Dil.qc <- qc(Dilution) # compute the QC metrics
@
Computing the metrics will take a little time, and then we can start
to look at them.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Background}

The first check we make is that the background values across the
arrays should be roughly comparable. In the four arrays from the
dilution experiment, that seems to be the case.
<>=
avbg(Dil.qc)  # average background
maxbg(Dil.qc) # maximum background
minbg(Dil.qc) # minimum background
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Global scaling factors}

As mentioned above, the standard Affymetrix normalization procedure
involves globally rescaling the arrays to set the median probe
intensity to the same level. Affymetrix says that the scaling factors
should not differ by more than 3-fold if we want to compare arrays.
<>=
sfs(Dil.qc)
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{Percent present calls}

Extremely low (below about 30\%) or high (above about 60\%) values for
the percentage of probes called present also signal potential quality
problems.
<>=
percent.present(Dil.qc)
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\foilhead{3'/5' ratios}

Affymetrix includes probes at the 3' and 5' ends of some control
genes; the ratios should be less than 3.
% cheat to get it to fit on one page \begin{verbatim} > ratios(Dil.qc) AFFX-HSAC07.3'/5' AFFX-HUMGAPDH.3'/5' 20A 0.6961423 0.4429746 20B 0.7208418 0.3529890 10A 0.8712069 0.4326566 10B 0.9313709 0.5726650 AFFX-HSAC07.3'/M AFFX-HUMGAPDH.3'/M 20A 0.1273385 -0.0602414 20B 0.1796231 -0.0136629 10A 0.2112914 0.4237527 10B 0.2725534 0.1125823 \end{verbatim} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \foilhead{RNA degradation} Individual (perfect match) probes in each probe set are ordered by location relative to the 5' end of the targeted mRNA molecule. We also know that RNA degradation typically starts at the 5' end, so we would expect probe intensities to be lower near the 5' end than near the 3' end. The \texttt{affy} package of BioConductor includes functions to summarize and plot the degree of RNA degradation in a series of Affymetrix experiments. These methods pretend that something like ``the fifth probe in an Affymetrix probe set'' is a meaningful notion, and they average these things over all probe sets on the array. <>= degrade <- AffyRNAdeg(Dilution) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \foilhead{Visualizing RNA degradation} \begin{center} <>= plotAffyRNAdeg(degrade, col=2:5) @ \end{center} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \foilhead{Model-based QC} As we have seen, methods like dChip and RMA fit statistical models to the probe intensities in order to summarize gene expression values. The quantities associated with such models can also provide QC information. The BioConductor package \texttt{\textcolor{Orange}{affyPLM}} fits a probe-level model (PLM) similar to RMA. <>= library("affyPLM") pset <- fitPLM(Dilution) @ The residuals (unexplained variation) from the model can be plotted using the \texttt{image} function. Patterns here typically indicate spatial flaws in the image that are not captured by the model. No such features were noted in the Dilution experiment, so I will not reproduce the pictures. 
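The plots on the next two slides are easier to read if you know what is being computed. Here is a toy sketch (plain Python with invented log intensities; it is not the affyPLM code) of the relative log expression (RLE) idea: subtract each gene's median log expression across arrays, so a well-behaved array should have its boxplot centered near zero.

```python
# Toy sketch of relative log expression (RLE); NOT affyPLM's
# implementation. Rows = genes, columns = 3 arrays (log2 scale);
# all numbers are invented.
exprs = [
    [8.0, 8.2, 9.5],
    [5.0, 5.1, 6.4],
    [7.0, 6.9, 8.3],
]

def rle(matrix):
    """Each value minus its row (gene) median across arrays."""
    out = []
    for row in matrix:
        med = sorted(row)[len(row) // 2]  # median for odd-length rows
        out.append([v - med for v in row])
    return out

for row in rle(exprs):
    print([round(v, 2) for v in row])
# The third array sits well above 0 on every gene -- a warning sign.
```

In the real plot, each array's full distribution of these deviations is drawn as one boxplot; an array whose box is shifted away from zero, or is much wider than its neighbors, deserves a closer look.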
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \foilhead{Relative Log Expression plots} \begin{center} <>= Mbox(pset) @ \end{center} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \foilhead{Normalized Unscaled Standard Error} \begin{center} <>= boxplot(pset) @ \end{center} \end{document} % do this so we can re-run Sweave from a clean start <>= detach("package:affypdnn") detach("package:hgu95av2probe") detach("package:matchprobes") detach("package:affydata") detach("package:hgu95av2cdf") detach("package:affy") detach("package:affyio") detach("package:Biobase") detach("package:tools") rm(list=ls()) @