<%@meta language="R-vignette" content="-------------------------------- DIRECTIVES FOR R: %\VignetteIndexEntry{The babel vignette} %\VignetteKeyword{example, R, package, vignette} %\VignetteAuthor{Adam B. Olshen, Richard A. Olshen, and Barry S. Taylor} %\VignetteEngine{R.rsp::rsp} --------------------------------------------------------------------"%> \documentclass[10pt]{article} \usepackage[top=0.8in, bottom=0.8in, left=1in, right=0.8in]{geometry} \usepackage{graphicx} <% R.utils::use("R.utils, R.devices") options(width=100) %> \graphicspath{{<%=getOption("devEval/args/path", "figures/")%>}} \title{<%@meta name="title"%>} \author{<%@meta name="author"%>} \begin{document} \maketitle \section{Introduction} This document presents an overview of the {\tt babel} package. This package is for analyzing ribosome profiling data. Ribosome profiling data arises from paired next generation sequencing experiments ~\cite{aingolia09,bhsieh12}. One experiment is standard RNA-seq, which we abbreviate as mRNA. The other is RNA-seq of ribosome-protected fragments, which we abbreviate as RPF. It implements our methodology for finding genes that have unusual RPF counts given their mRNA counts ~\cite{colshen13}. Our model is that the mRNA counts follow a negative binomial distribtion, as is standard ~\cite{drobinson08,eanders10}. Because sample sizes are always low in this context, we estimate a single over-dispersion parameter across genes using the Robinson and Smyth methodology found in edgeR ~\cite{frobinson10}. Given the mRNA counts, we model the RPF counts also as negative binomial with mean according to a trimmed least sequares regression fit of RPF counts on mRNA counts and over-dispersion estimated in an iterative fashion ~\cite{colshen13}. As we detail below, our package does three main things. First, we identify genes with unusual RPF counts given their mRNA counts {\it within} conditions. Second, we combine p-values across multiple experiments within conditions. Third, we use the within-condition p-values to identify genes where the RPF/mRNA relationship has changed {\it between} conditions. \section{Data} We selected a ribosome profiling data set consisting of 1000 genes from two conditions (GroupA and GroupB) with each condition having two replicates (Sample1 and Sample2). We are calling this data set {\tt ribo.prof}. \section{An Example} Here we perform an analysis on the {\tt ribo.prof} data described above. First we load the library and data. \begin{verbatim}<%=withCapture({ library(babel) data(ribo.prof) })%>\end{verbatim} Next we print the first five lines of the mRNA ({\tt test.rna}) and RPF ({\tt test.rp}) elements of {\tt ribo.prof}. Note that these are raw counts that are not normalized in any way. Raw counts must be used with {\tt babel}. \begin{verbatim}<%=withCapture({ test.rna <- ribo.prof$test.rna print(test.rna[1:5,]) })%>\end{verbatim} \begin{verbatim}<%=withCapture({ test.rp <- ribo.prof$test.rp print(test.rp[1:5,]) })%>\end{verbatim} We see paired count data for the first five genes. We plot the Sample1 and GroupA data, with both mRNA and RPF counts on the log scale. We expect and actually do see an increasing function between the mRNA counts and RPF counts. \begin{figure}[htbp] \begin{center} \resizebox{0.6\textwidth}{!}{ \includegraphics{<%=toPDF("scatter", tags="1A", { plot(test.rna[,1]+1,test.rp[,1]+1,xlab="mRNA counts",ylab="RPF counts",pch=16,log="xy",xlim=c(1,10000),ylim=c(1,10000),font.lab=2) })%>} } \caption{Scatterplot of Sample1 and GroupA data.} \label{fig:scatter} \end{center} \end{figure} Similar patterns are repeated across the three other experiments. Next we estimate p-values for the RPF counts given the mRNA counts within every experiment. The tests are one-sided; higher than expected RPF counts lead to low p-values, while lower than expected RPF counts lead to high p-values. We set group labels corresponding to the columns of the mRNA and RPF inputs. \begin{verbatim}<%=withCapture({ test.group <- c("A","B","A","B") })%>\end{verbatim} Then, for the purpose of reproducibility, we fix the number of cores at 1. Two cores are used by default, unless the machine is running Windows, where one is the default because Windows cannot use the fork command in the parallel library. Users are encouraged to use multiple cores if they are available. Future versions of the software will allow computation over multiple nodes of a computational cluster. \begin{verbatim}<%=withCapture({ options(mc.cores=1) })%>\end{verbatim} We set the seed for reproducibility purposes. \begin{verbatim}<%=withCapture({ set.seed(12345) })%>\end{verbatim} We run the main function, which is called {\tt babel()}. The argument {\tt nreps} is set to 100000. This is the number of permutations, so the minimum p-value is {\tt 1/nreps}. Therefore, when correcting for multiple comparisons, We would prefer a million, or, even better, ten million reps, but here we are just demonstrating the procedure on one core. The argument {\tt min.rna} is set to $10$. This is the minimum number of mRNA counts acrossing all experiments for a gene to be included. This cutoff leads to the removal of $9$ genes so that $991$ are analyzed. \begin{verbatim}<%=withCapture({ test.babel <- babel(test.rna,test.rp,group=test.group,nreps=1e+05,min.rna=10) })%>\end{verbatim} Now we examine the first five lines of the within element of the list created by the {\tt babel()} run. \begin{verbatim}<%=withCapture({ within.babel <- test.babel$within print(within.babel[[1]][1:5,]) })%>\end{verbatim} The element {\tt Direction} is whether the RPF count is greater (1) or less (-1) than expected given the mRNA count. The {\tt P-value (one-sided)} tells us how unusual the RPF count is given the mRNA count, with low or high being interesting. The {\tt P-value (two-sided)} correspond to how unusual the RPF count is relative to the mRNA count, with only low being interesting. The {\tt FDR} is the estimated false discovery rate corresponding to the gene for the two-sided p-value. We plot the interesting genes from the within analysis on this sample. Genes in red have (one-sided) p-values $< 0.025$, while genes in green have p-values $> 0.975.$ \begin{figure}[htbp] \begin{center} \resizebox{0.6\textwidth}{!}{ \includegraphics{<%=toPDF("scatter2", tags="1A", { which.025 <- which(within.babel[[1]]$"P-value (one-sided)"<0.025) which.975 <- which(within.babel[[1]]$"P-value (one-sided)">0.975) plot(test.rna[,1]+1,test.rp[,1]+1,xlab="mRNA counts",ylab="RPF counts",pch=16,log="xy",xlim=c(1,10000),ylim=c(1,10000),font.lab=2) points(test.rna[which.025,1]+1,test.rp[which.025,1]+1,pch=16,col=2) points(test.rna[which.975,1]+1,test.rp[which.975,1]+1,pch=16,col=3) })%>} } \caption{Scatterplot of Sample1 and GroupA data with unusual genes highlighted.} \label{fig:scatter2} \end{center} \end{figure} Next we combine the p-values from the two experiments from Group A using the technique described in our manuscript ~\cite{colshen13}, and show the first five lines from the results. Note there is now a single (two-sided) combined p-value and a corresponding FDR. The Direction let us know whether RPF counts are higher or lower than expected. \begin{verbatim}<%=withCapture({ combined.babel <- test.babel$combined print(combined.babel[[1]][1:5,]) })%>\end{verbatim} We plot the genes at the same p-value cutoffs as before. Note the increase in power from combining experiments and that all highlighted genes are unusual in both samples. \begin{figure}[htbp] \begin{center} \resizebox{0.6\textwidth}{!}{ \includegraphics{<%=toPDF("scatter3", tags="1A", { which.025 <- which(combined.babel[[1]]$"P-value"<0.025) plot(test.rna[,1]+1,test.rp[,1]+1,xlab="mRNA counts",ylab="RPF counts",pch=16,log="xy",xlim=c(1,10000),ylim=c(1,10000),font.lab=2) points(test.rna[which.025,1]+1,test.rp[which.025,1]+1,pch=16,col=2) })%>} } \caption{Scatterplot of the Sample1 and GroupA data with unusual genes highlighted from combined analysis.} \label{fig:scatter3} \end{center} \end{figure} \begin{figure}[htbp] \begin{center} \resizebox{0.6\textwidth}{!}{ \includegraphics{<%=toPDF("scatter4", tags="1A", { which.025 <- which(combined.babel[[1]]$"P-value"<0.025) plot(test.rna[,3]+1,test.rp[,3]+1,xlab="mRNA counts",ylab="RPF counts",pch=16,log="xy",xlim=c(1,10000),ylim=c(1,10000),font.lab=2) points(test.rna[which.025,3]+1,test.rp[which.025,3]+1,pch=16,col=2) })%>} } \caption{Scatterplot of the Sample2 and GroupA data with unusual genes highlighted from combined analysis.} \label{fig:scatter4} \end{center} \end{figure} Next we look for genes whose RPF to mRNA counts vary between group A and group B. Here there are only two groups, but babel automatically tests all pairwise combinations. We again print the first five lines of the output. \begin{verbatim}<%=withCapture({ between.babel <- test.babel$between print(between.babel[[1]][1:5,]) })%>\end{verbatim} The elements {\tt mRNA$\_$logFC} and {\tt mRNA$\_$FDR} are based on tests for differential expression just on the mRNA data ~\cite{frobinson10}. Genes with {\tt mRNA$\_$FDR} $< 0.05$ are labeled as "both," meaning change is in expression and translation; other genes are labeled "translation$\_$only". There is a single (two-sided) p-value for the difference in RPF relative to mRNA and the corresponding FDR. The element {\tt Direction} determines the type of change (1 for translation higher in the first group label, -1 for lower in the first group label). Note that an FDR of $25\%$ there are $11$ significant genes. <%------------------------------------------------------------------- REFERENCES -------------------------------------------------------------------%> \bibliographystyle{ieeetr} \bibliography{babel} <%------------------------------------------------------------------- APPENDIX -------------------------------------------------------------------%> \clearpage \section*{Appendix} \subsection*{Session information} <%=toLatex(sessionInfo())%> This report was generated using the R.rsp package. \end{document}