% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteDepends{graph,ScISI,ppiData} %\VignetteKeywords{Protein Interaction, Graphs} %\VignettePackage{ppiStats} %\VignetteIndexEntry{ppiStats} \documentclass[11pt]{article} \usepackage{amsmath,pstricks} % With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running % pdflatex will fail. Note that using auto-pst-pdf requires to set environment % variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in % texmf.cnf for Tex Live on Unix/Linux/Mac. \usepackage{auto-pst-pdf} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{times} \usepackage{comment} \usepackage{subfigure} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \title{Exploring protein interaction data using \Rpackage{ppiStats}} \author{Denise Scholtens and Tony Chiang} \date{} \begin{document} \maketitle \section*{Introduction} \Rpackage{ppiStats} is a package that provides tools for the description and analysis of protein interaction data represented in node-and-edge graph form. The package is specifically designed to address data generated by bait-prey technologies (e.g. yeast-two-hybrid (Y2H) and affinity purification-mass spectrometry (AP-MS)) in which directed edges extend from bait proteins toward detected prey proteins, with the nature of the detected interaction depending on the technology. This vignette demonstrates the main functionality of the \Rpackage{ppiStats} package including: 1) summarizing observed data, 2) identifying systematically biased proteins in a data set, and 3) estimating stochastic error probabilities. In order to demonstrate \Rpackage{ppiStats}, we will make use of the protein interaction data sets available in \Rpackage{ppiData}. <>= library("ppiStats") library("ppiData") @ In this vignette, we will use to AP-MS data sets reported by \cite{Gavin2002} (Gavin02) and \cite{Gavin2006} (Gavin06) for \textit{Saccharomyces cerevisiae}. AP-MS technology is a bait-prey system designed to detect complex co-membership on behalf of proteins. Most of the functions available in \Rpackage{ppiStats} require objects that extend the class \Robject{graph}, defined in the \Rpackage{graph} package. For example, \Robject{Gavin2002BPGraph} and \Robject{Gavin2006BPGraph} are \Robject{graphNEL} objects available in \Rpackage{ppiData}. For assistance in converting raw data files into the appropriate object instance, see the help pages for the functions \Rfunction{bpMatrix} and \Rfunction{genBPGraph} in the \Rpackage{ppiStats} package. The \Rpackage{graph} and \Rpackage{RpsiXML} packages also contain helpful functionality. <>= data(Gavin2002BPGraph) data(Gavin2006BPGraph) GavinGraphs = list(Gavin02=Gavin2002BPGraph,Gavin06=Gavin2006BPGraph) GavinGraphs @ \section*{Summarizing observed data} While AP-MS data can in concept assay all possible complex co-membership interactions in a cell, in fact only a certain portion of those interactions are assayed in an experiment. The sampled portion of the `interactome' depends on which proteins are selected as baits, the subset of those baits that are amenable to the technology, and the set of proteins in the cell that are available for detection as prey under the conditions of the experiment. \cite{Chiang2007} present the notion of `viability' to summarize these ideas, defining `viable baits' (VBs) as baits that detect at least one prey and `viable prey' (VP) as proteins that are detected by at least one bait. It is possible for a protein to be both a viable bait and a viable prey (VBP). These definitions are important since only the edges that extend from VB nodes to VP nodes are deemed `tested' in the analyses presented here. All other potential interactions (e.g. an edge between two nodes that are VP but not VBP) are treated as untested. The function \Rfunction{createSummaryTables} lists several summary statistics about the viable protein population in an experiment. In addition to describing the viable protein population, the total number of directed interactions (TI) is reported. <>= summaryTables = createSummaryTables(GavinGraphs) summaryTables @ When examining the VB and VP populations, it can be helpful to compare the number of represented proteins to the total population in the cell, especially if the experiment is intended to be genome-wide. Figure \ref{fig:viabilityCharts} demonstrates VB and VP coverage for the Gavin02 and Gavin06 experiments, relative to the entire 6000+ genes in yeast. <>= viabilityCharts(GavinGraphs) @ \begin{figure}[tp] \centering \includegraphics[width=5in]{viabilityCharts} \caption{\label{fig:viabilityCharts} Enumerations of the VB, VP, and VBP populations for the Gavin02 and Gavin06 datasets in comparison with the entire set of 6000+ genes in yeast.} \end{figure} Rather than simply enumerating the viable protein population, the function \Rfunction{idViableProteins} identifies the specific VB, VP, and VBP proteins represented in a data set. <>= GavinViableProteins = lapply(GavinGraphs,FUN=idViableProteins) sapply(GavinViableProteins,FUN=function(x) lapply(x,FUN=length)) Gavin06VBPproteins = GavinViableProteins[["Gavin06"]]$VBP Gavin06VBPproteins[1:4] @ \section*{Identifying systematically biased proteins} In some bait-prey experiments, certain proteins are prone to systematic bias. Specifically, in the VBP population, some proteins have a high imblance between in- and out-degree. In-degree in the VBP population represents the number of times a VBP node is detected as prey by another VBP protein. Out-degree represents the number of times that VBP node detects other prey. When in-degree is much higher than out-degree, or vice versa, it may be an indication that the node is subject to more than simply stochastic measurement error in line with the sensitivity and specificity of the technology. The \Rfunction{inOutScatterCharts} function compares in- and out-degree for the VBP nodes and provides boundaries for determining which nodes may be particularly prone to systematic error according to the Binomial model of \cite{Chiang2007}. <>= inOutScatterCharts(GavinGraphs) @ \begin{figure}[htp] \centering \subfigure{ \includegraphics[width=2.5in]{scp-Gavin02-ident.pdf} } \subfigure{ \includegraphics[width=2.5in]{scp-Gavin02-sqrt.pdf} } \subfigure{ \includegraphics[width=4in]{scp-Gavin02-hist.pdf} } \caption{Scatter plots corresponding to the in- and out-degree (and the square root of each) for the VBP nodes in Gavin02. The points within the diagonal bands represent those proteins for which there is not enough evidence to reject the null hypothesis that the protein might be affected by some systematic bias while the points outside the bands are rejected at the 0.01 p-value. The bottom plot describes the distribution of the p-values for each data point. While the majority of proteins are not rejected, we can see that a substantial number proteins are. Those proteins which are rejected should not be analyzed using the standard statistical methods to model stochastic variability nor should they be used to estimate the underlying features within the data.} \label{splots:Gavin02} \end{figure} \begin{figure}[htp] \centering \subfigure{ \includegraphics[width=2.5in]{scp-Gavin06-ident.pdf} } \subfigure{ \includegraphics[width=2.5in]{scp-Gavin06-sqrt.pdf} } \subfigure{ \includegraphics[width=4in]{scp-Gavin06-hist.pdf} } \caption{The same plots as described in Figure \ref{splots:Gavin02} presented here for the Gavin06 data.} \label{splots:Gavin06} \end{figure} It appears that the Gavin06 dataset contains more proteins that are subject to systematic error than the Gavin02 dataset. We can identify these proteints using the function \Rfunction{idSystematic}, remove them from the graph, and then proceed with estimation of stochastic error probabilities using the remaining VBP nodes. <>= systematicVBPsGavin06 = idSystematic(Gavin2006BPGraph,Gavin06VBPproteins,bpGraph=TRUE) unbiasedGavin06VBPs = setdiff(Gavin06VBPproteins,systematicVBPsGavin06) Gavin06VBPsubgraph = subGraph(unbiasedGavin06VBPs,Gavin2006BPGraph) Gavin06VBPsubgraph @ \section*{Estimating stochastic error probabilities} Each edge between the subgraph induced by the VBP nodes is tested twice, once in each direction. With perfectly sensitive and specific technology, each edge would either be observed twice or not observed twice. The presence of unreciprocated edges indicates the presence of measurement error - either a false positive (FP) or a false negative (FN) observation. The function \Rfunction{estimateCCMErrorRates} estimates stochastic error probabilities, specifically the false positive ($p_{FP}$) and false negative ($p_{FN}$) probabilities, using the method of moments approach described by \cite{Chiang2007}. \Rfunction{estimateCCMErrorRates} calls \Rfunction{estErrProbMethodOfMoments} to compute a solution manifold and then selects a single pair of estimates for $p_{FP}$ and $p_{FN}$ using a gold standard approach as described in \cite{Scholtens2008}. In the example below, we use the \Robject{ScISIC} dataset from the package \Rpackage{ScISI} as the gold standard. This is a set of manurally curated complexes culled from GO and MIPS and serves to define a set of edges that should be detected by AP-MS technology. <>= data(ScISIC) Gavin06m = as(Gavin06VBPsubgraph,"matrix") errorProbabilities = estimateCCMErrorRates(Gavin06m,ScISIC) pTPestimate = errorProbabilities$globalpTP pFPestimate =errorProbabilities$globalpFP solutionManifold = errorProbabilities$probPairs pTPestimate pFPestimate @ Figure \ref{fig:stochErr} plots the calculated solution manifold and highlights the gold standard estimates. <>= plot(solutionManifold[,"pFPs"],solutionManifold[,"pFNs"],type="l",col="violet",xlab="pFP",ylab="1-pTP") points(pFPestimate,1-pTPestimate,cex=3,pch=8,col="turquoise") @ \begin{figure}[tp] \centering \includegraphics[width=4in]{plotErrorEstimates} \caption{\label{fig:stochErr} A plot of the solution manifold for estimates of $p_{FP}$ and $p_{FN}$ with the star denoting the gold standard point estimate.} \end{figure} Estimates of $p_{FP}$ and $p_{FN}$ are helpful in using statistical models to estimate features and parameters for that dataset. One popular statististic is node degree, def. the number of interacting partners for a protein. Nodes in an experimental AP-MS graph have an unknown true degree. An AP-MS experiment yields an observed number of reciprocated edges, as well as unreciprocated in- and out-edges. A few examples are listed below. <>= Gavin06VBPobserved = calcInOutDegStats(Gavin06VBPsubgraph) Gavin06VBPobserved$unrecipInDegree[5:10] Gavin06VBPobserved$unrecipOutDegree[5:10] Gavin06VBPobserved$recipIn[5:10] @ The function \Rfunction{degreeEstimates} estimates degree for VBP nodes according to the parametric model described by \cite{Scholtens2008} using the observed data and the stochastic error probability estimates. <>= Gavin06VBPdegree = degreeEstimates(Gavin06m,pTPestimate,pFPestimate) Gavin06VBPdegree[5:10] @ \section*{Conclusion} The \Rpackage{ppiStats} package is useful for describing data in graphs and is aimed at bait-prey systems. It is useful for summarizing graph data, identifying systematically biased baits and estimating stochastic error probabilities in preparation for estimation of relevant graph features. <>= sessionInfo() @ \begin{thebibliography}{9} \providecommand{\natexlab}[1]{#1} \providecommand{\url}[1]{\texttt{#1}} \expandafter\ifx\csname urlstyle\endcsname\relax \providecommand{\doi}[1]{doi: #1}\else \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi \bibitem[Gavin et~al. (2002)]{Gavin2002} A.~C. Gavin et~al. \newblock Functional organization of the yeast proteome by systematic analysis of protein complexes. \newblock \emph{Nature}, 415:\penalty0 141--147, 2002. \bibitem[Gavin et~al. (2006)]{Gavin2006} A.~C. Gavin et al. \newblock Proteome survey reveals modularity of the yeast cell machinery. \newblock \emph{Nature}, 440:\penalty0 631--636, 2006. \bibitem[Chiang et~al. (2007)]{Chiang2007} T. Chiang et~al. \newblock Coverage and error models of protein-protein interaction data by directed graph analysis. \newblock \emph{Genome Biology}, 8:R186, 2007. \bibitem[Scholtens et~al. (2008)]{Scholtens2008} D. Scholtens et~al. \newblock Estimating node degree in bait-prey graphs. \newblock \emph{Bioinformatics}, 24:218-224, 2008. \end{thebibliography} \end{document}