\documentclass[a4paper,11pt]{article} \usepackage{url} \begin{document} \title{Linear Models in Microarrays: An Introduction} \author{by James Wettenhall} \date{15 October 2004\\ (with minor edits 23 November 2015)} \maketitle \section{Introduction} This document is intended to be only a very brief introduction to linear models in microarrays. For more detailed information see the limma User's Guide \url{https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf}. \section{M and A} We will first discuss M and A from the point of two-color cDNA microarrays, in which one can talk about comparisons within a slide (between the two colors) or comparisons between slides. Rather than representing microarray data for a cDNA slide with raw Red and Green intensities, it is better to use log intensities (base 2 is the standard base used for the logarithms). This gives a more symmetrical distribution about the mean values of $log_{2}R$ and $log_{2}G$ than you would get if you used R and G directly. Furthermore, we are most interested in differences between R and G and overall intensities of spots (the geometric mean of the Red and Green intensities), so we define $M$, the log differential- expression ratio as $M=\log_2(R/G)$ and $A$, the log intensity as $A=\frac{1}{2} \log_2(RG)$. \\ \begin{eqnarray*} M=\log_2(R/G) = \log_2(R) - \log_2(G) \\ A=\frac{1}{2} \log_2(RG) = \frac{1}{2} (\log_2(R) + \log_2(G)) \end{eqnarray*} $M$ can in fact be used to compare any pair of RNA types in the microarray experiments, whether they are on the same slide or on different slides. It is still called the log differential-expression ratio. \section{Linear Models} To illustrate the simplest case of a linear model, consider two slides for which the same hybridization has been performed for both slides with the same dye colors. In this case, the best estimate for the M value for each gene is simply the average of the two M values for that gene, one from each slide. If a dye-swap was performed, then, one of the M values would have to be multiplied by $-1$ before taking the average. The fitted M values in limmaGUI refer to the M values after the ``averaging'' has been done. Things become more complicated when you want to estimate confidence in your average M values. The t statistic, B statistic and P value in the toptables in limmaGUI are used to provide an overall ranking of genes in order of evidence for differential expression. (By default, ranking is done by the B statistic.) In order to calculate these statistics, limmaGUI must consider all replicates of each gene (whether on the same slide or different slides) and consider the \emph{variation} in M values as well as the magnitude of the M values to decide which genes are differentially expressed. If you just use M (the log differential-expression ratio) to rank genes from a microarray experiment, then you are ignoring all of the information about variability between replicates. As well as using a linear model to estimate ``average'' M values, it is also possible to estimate M values for comparisons which were not directly performed in the experiment. If A is hybridized with B and B is hybridized with C, then you can estimate two M values with a linear model. One possible choice is to estimate M for the comparison (A,B) and estimate M for the comparison (B,C). But it would also be possible to estimate M for comparison (A,B) and M for comparison (A,C), even though there was no direct hybridization between A and C. In limmaGUI, this is known as choosing a parameterization. A parameterization can be specified in terms of simple comparisons between RNA types, e.g. (A,B) and (A,C), or for people with a bit of statistical experience, the parameterization can be specified in terms of a design matrix (by pressing an ``Advanced'' button.) The columns of the design matrix represent the parameters to be estimated by the linear model (e.g. M values for comparisons (A,B) and (A,C)) and the rows of the design matrix are the slides in the experiment. For the simple averaging scenario described at the beginning of this section, the design matrix would simply be a column of two 1's. If there was a dye-swap, then one of the 1's would become a -1. In limmaGUI you can try specifying a parameterization in the simple way (comparisons between pairs of RNA types) and then press the ``Advanced'' button to see what the corresponding design matrix looks like. \end{document}