%&pdflatex %\VignetteIndexEntry{User manual for R-Package PMM} %\VignetteDepends{pmm} \documentclass[a4paper]{article} \usepackage[left=3cm,right=3cm,top=3cm,bottom=4cm]{geometry} \parindent0ex \parskip1ex \usepackage{Sweave} \usepackage{amsmath, amssymb} \begin{document} \SweaveOpts{prefix.string=plot, eps = FALSE, pdf = TRUE} \title{Fitting the Parallel Mixed Model with the \texttt{PMM} R-package} \author{Anna Drewek} \maketitle \section{Introduction} The Parallel Mixed Model (PMM) approach is suitable for hit selection and cross-comparison of RNAi screens generated in experiments that are performed in parallel under several conditions. For example, we could think of the measurements or readouts from cells under RNAi knock-down, which are infected with several pathogens or which are grown from different cell lines. PMM simultaneously takes into account all the knock-down effects in order to gain statistical power for the hit detection. As a special feature, PMM allows incorporating RNAi weights that can be assigned according to the additional information on the used RNAis or the screening quality. The theory behind PMM is shortly described in the second section (more details can be found in \cite{Drewek2014}). The third section shows the functionality of the \texttt{PMM} R-package by using an RNAi dataset as example. \section{Background} PMM is composed of a linear mixed model and an assessment of the local False Discovery Rate. The linear mixed model consists of a fixed effect for condition and of two random effects for gene $g$ and for gene $g$ within a condition $c$. We denote the readout or measurement result of the RNAi $s$ silencing the gene $g$ as $y_{gcs}$ for the condition $c$. The linear mixed model of the PMM is defined as the following linear model $$ Y_{gcs} = \mu_{c} + a_{g} + b_{cg} + \beta X_{gcs} + \varepsilon_{gcs} $$ where $\mu_{c}$ is the fixed effect for condition $c$ (typically close to 0 if the data is Z-Scored), $a_{g}$ is the gene effect overall pathogens, $b_{cg}$ is the gene effect within a pathogen and $\varepsilon_{gcs}$ denotes the error term.\\ The effect of a certain gene $g$ within a condition $c$ is described by the sum of the two random effects: $$ c_{cg} = a_{g} + b_{cg} .$$ A positive estimated $c_{cg}$ effect means that the RNAi readout for condition $c$ is enhanced if the corresponding gene $g$ is knocked down. A negative effect means that the RNAi readout is reduced. The linear mixed model is estimated by the \texttt{lmer} function from the \texttt{lme4} R-package.\\ To distinguish hit genes, PMM provides as second step an estimate of the local False Discovery Rate (FDR). We define the local false discovery rate as $$ \widehat{fdr}(c) = \frac{\widehat{\pi_{0}} \widehat{f_{0}}(c)}{\widehat{f(c)}} $$ where $\pi_{0}$ stands for the proportion of true hits, $f_{0}$ for the distribution of the readout for all genes that are hits, $f_{1}$ for the distribution of readout for all genes that are no hits and $f(c) = \pi_{0} f_{0}(c) + (1 - \pi_{0}) f_{1}(c)$. The three quantities are separately estimated by using Maximum Likelihood, Poisson regression and moment estimation (for details see \cite{Efron2007} and \cite{Efron2010}).\\ Additionally, a sharedness score $sh_{g}$ is offered for an easier cross-comparison of the results from PMM. The sharedness score indicates if a gene is a hit in only one condition or if the hit appears among all conditions. The sharedness score is a combination of two quantities: $$ sh_{g} = \frac{1}{2} \left( (1 - \text{mean}(fdr_{cg})) + \sum_{c} (fdr_{cg} < 1) \right) $$ The first part defines the shift away from 1 and the second part describes how many pathogens support the shift. \section{Working Example} The \texttt{PMM} R-package contains the following functions: \begin{center} \begin{tabular}{c|l} Function & Description \\ \hline \texttt{pmm} & fits the PMM \\ \texttt{hitheatmap} & visualizes the results of PMM \\ \texttt{sharedness} & computes the sharedness \\ \end{tabular} \end{center} Moreover, an RNAi dataset on infection with several pathogens is included in the R-package. <>= library(pmm) data(kinome) head(kinome) @ The dataset contains the readouts of 826 kinases knock-down experiments - each targeted by a total of 12 independent siRNAs coming from three manufactures: Ambion (3 siRNAs), Qiagen (4 siRNAs) and Dharmacon (4 siRNAs + 1 pool siRNA). After knock-down the cells were infected with a pathogen, imaged with a microscope and the infection rate, as well as the number of cells were extracted from the microscope images. All experiments were conducted for 8 different pathogens. For example, the normalized number of cells (\Sexpr{kinome[1,"CellCount"]}) and the normalized infection score (\Sexpr{kinome[1,"InfectionIndex"]}) from the frist row of the data matrix are readouts of the microscope image from the siRNA experiment where gene \textit{ABL1} was knocked down with siRNA 1 from the manufacturer Ambion (for details see \cite{Drewek2014}). \subsection{Input to PMM} In order to use \texttt{pmm} your data needs to be stored as \texttt{data.frame}. Each row should correspond to one independent RNAi experiment. The data frame should have at least the following three columns: \begin{enumerate} \item gene identifier \item condition \item RNAi readout \end{enumerate} In our example, the column \textit{GeneID} identifies the genes, the column \textit{condition} corresponds to the pathogens which indicate the different conditions and as siRNA readout serve the columns \textit{InfectionIndex} and \textit{CellCount}. \subsubsection*{Note:} \begin{enumerate} \item The data should contain several independent siRNAs (different seeds) measurements per gene and condition. \item The biological replicates (experimental results with identical siRNA (same seeds) and identical condition) should be averaged. \end{enumerate} \subsection{Fitting PMM} In order to fit the PMM, take the data frame with your measurements as input and specify the correct column names by using the arguments \texttt{gene.col} and \texttt{condition.col}. As example, we fit the PMM for the readout \textit{InfectionIndex}. <>= fit1 <- pmm(kinome, "InfectionIndex", gene.col = "GeneName", condition.col = "condition") head(fit1) @ The default output of \texttt{pmm} is a matrix that contains the estimated $c_{cg}$ effects for each condition c and gene g, as well as an estimate for the local false discovery rate. A positive estimated $c_{cg}$ effect means that the response was enhanced when the corresponding gene was knocked down. A negative effect means that the response was reduced. Another version of the output giving some more information is also available by using the argument \texttt{simplify}: <>= fit2 <- pmm(kinome, "InfectionIndex", gene.col = "GeneName", condition.col = "condition", simplify = FALSE) class(fit2) names(fit2) identical(fit1,fit2[[1]]) @ The non-simplified output of \texttt{pmm} is a list of three components. The first component contains the simpilified output, i.e the matrix with the estimated $c_{cg}$ effects and the estimated local false discovery rate, the second component contains the fit of the linear mixed model and the third component contains the estimated $a_{g}$ and the estimated $b_{cg}$ values. Additional arguments of \texttt{pmm} can be used to add weights to the linear mixed model fit (\texttt{weight}) or change the number of minimal required siRNA replicates for each gene (\texttt{ignore}). Moreover \texttt{pmm} can deal with missing values. Missing values appear, for example, if your data doesn't contain a full set of combinations for conditions and genes, meaning that for each gene not every condition was performed. As an example, we set all measurements of the gene \textit{AAK1} and the pathogen \textit{Adenovirus} to NA. The result of the pmm fit looks as follows: <>= kinome$InfectionIndex[kinome$GeneName == "AAK1" & kinome$condition == "ADENO"] <- rep(NA,12) fit3 <- pmm(kinome ,"InfectionIndex", gene.col = "GeneName") head(fit3,3) @ The output shows now NA for the removed combination. \subsection{Visualization of Results} The results of PMM can be illustrated by a heat map using the function \texttt{hitheatmap}. \SweaveOpts{width=10, height=4} <>= hitheatmap(fit1, threshold = 0.2) @ \setkeys{Gin}{width=.9\textwidth} \begin{figure}[!h] \centering <>= <> @ \caption{Visualization of PMM results for an easier cross-comparison between the different conditions.} \label{Plot1} \end{figure} Red color indicates a positive estimated $c_{cg}$ effect, blue color a negative estimated $c_{cg}$ effect. The darker the color, the stronger is the estimated $c_{cg}$ effect. The heat map contains only the genes for which the local false discovery rate is below a given threshold for at least one condition. The yellow star indicates the significant genes. The plot can be modified by passing further arguments to the \texttt{plot} and the \texttt{par} function \SweaveOpts{width=10, height=4} <>= hitheatmap(fit1, threshold = 0.4, cex.main = 2, main = "My modified plot", col.main = "white", col.axis = "white", cex.axis = 0.8, bg = "black", mar = c(6,8,4,6)) @ \setkeys{Gin}{width=.9\textwidth} \begin{figure}[!h] \centering <>= <> @ \caption{Modified Heat map.} \label{Plot2} \end{figure} Missing combinations are plotted in white color and marked by NA. Use the argument \texttt{na.omit = na.omit} to plot only complete combinations. \subsection{Adding Sharedness Score} The sharedness score returns a value between $0$ and $1$ for each gene. Score $0$ indicates that a gene is not shared among the condition and score $1$ that the gene is significant among all conditions. The sharedness score is only computed for genes that pass a given \texttt{threshold}. <>= sh <- sharedness(fit1, threshold = 0.2) sh[order(sh$Sharedness),] @ The sharedness score can also be visualized within the \texttt{hitheatmap}. Use the argument \texttt{sharedness.score = TRUE} to add a row for the sharedness score. The darker the green color, the stronger is the sharedness among the conditions. \SweaveOpts{width=10, height=5} <>= hitheatmap(fit1, sharedness.score = TRUE, main = "My hits found by PMM") @ \setkeys{Gin}{width=.9\textwidth} \begin{figure}[ht] \centering <>= <> @ \caption{Visualization of PMM results with sharedness score.} \label{Plot3} \end{figure} \begin{thebibliography}{50} \bibitem{Drewek2014} R\"am\"o, P., Drewek, A., Arrieumerlou, C., Beerenwinkel, N., Ben-Tekaya H., Cardel, B., Casanova, A., Conde-Alvarez. R., Cossart, P., Csucs, G., Eicher, S., Emmenlauer, M. Greber, U., Hardt, W.-D., Helenius, A., Kasper, C., Kaufmann, A., Kreibich, S., K\"ubacher, A., Kunszt, P., Low, S.H., Mercer, J., Mudrak, D., Muntwiler, S., Pelkmans, L., Pizarro-Cerda, J., Podvinec, M., Pujadas, E., Rinn, B., Rouilly, V., Schmich F., Siebourg, J., Snijder, B., Stebler, M., Studer, G., Szczurek, E., Truttmann, M., von Mering, C., Vonderheit, A., Yakimovich, A., B\"uhlmann, P. and Dehio, C. \textsl{Simultaneous analysis of large-scale RNAi screens for pathogen entry}. BMC Genomics, 15(1162):1471-2164, 2014. \bibitem{Efron2007} Efron, B. \textsl{Size, Power and False Discovery Rates}. The Annals of Statistics, 35(4):1351-1377, 2007. \bibitem{Efron2010} Efron, B. \textsl{Empirical Bayes Methods for Estimation Testing and Prediction}. Cambridge University Press, 2014. \end{thebibliography} \end{document}