%\VignetteIndexEntry{Lessons learned exposing web services}
%\VignetteKeywords{Web services}
%\VignettePackage{RWebServices}
\documentclass[]{article}

\usepackage[colorlinks,linkcolor=blue,pagecolor=blue,urlcolor=blue]{hyperref}
\usepackage{Sweave}

\newcommand{\lang}[1]{{\texttt{#1}}}
\newcommand{\pkg}[1]{{\textsf{#1}}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\func}[1]{{\texttt{#1}}}
\newcommand{\method}[1]{{\texttt{#1}}}
\renewcommand{\arg}[1]{{\texttt{#1}}}
\newcommand{\ret}[1]{{\texttt{#1}}}
\newcommand{\obj}[1]{{\texttt{#1}}}
\newcommand{\class}[1]{{\textit{#1}}}

\newcommand{\R}{\textsf{R}}
\newcommand{\Java}{\textsf{Java}}
\newcommand{\XML}{\textsf{XML}}
\newcommand{\tomcat}{\textsf{tomcat}}
\newcommand{\activemq}{\textsf{activeMQ}}

\newcommand{\RWebServices}{\pkg{RWebServices}}
\newcommand{\TypeInfo}{\pkg{TypeInfo}}
\newcommand{\SJava}{\pkg{SJava}}
\newcommand{\introduce}{\pkg{introduce}}
\newcommand{\caGrid}{\pkg{caGrid}}
\newcommand{\javadoc}{\pkg{javadoc}}

\begin{document}

\title{Enabling \R{} packages for web or grid services: lessons learned}
\author{
  Martin Morgan\footnote{Fred Hutchinson Cancer Research Center, 1100
    Fairview Ave.\ N., PO Box 19024 Seattle, WA 98109},
  Nianhua Li, Seth Falcon,\\
  Robert Gentleman
}
\date{16 February, 2007}
\maketitle

\section{Prelude: motivation}

There are many reasons for exposing \R{} packages as web or grid
services. The main reasons motivating our work are as follows:
\begin{enumerate}
\item Provide standardized workflows. A well-defined web service
  simplifies and consolidates complex steps in an analysis into a
  single service. This standardizes the analysis so that it is
  reproducible in the hands of different users, including users with
  no \R{} experience.
\item Aid interoperability. Web services require strongly typed
  service inputs and outputs, and (strive to) represent data in a
  language-neutral manner.
  Strongly typed data output from an \R{} web service can be used as
  inputs to web services written in other languages.
\item Enhance access to powerful analytic methods. \R{} methods and
  packages exposed as web services allow the unique strengths of \R{}
  (e.g., statistical modeling) to be exposed to and accessed by other
  programming languages.
\item Access specialized computing resources. Web services separate
  the computing resources required of the client from those required
  by the server. This sets the stage for powerful computing resources
  to be accessed and shared by many users.
\item Centralize computing administration, while easing `end-user'
  maintenance requirements. The often complex task of maintaining
  \R{}, including regular updates to \R{} itself and continual updates
  to available packages, can be managed in a centralized fashion in a
  way that minimizes disruptions to the user's work.
\item Leverage \Java{} resources. The large \Java{} community has many
  active projects that help to effectively expose \R{} as a web
  service. These \Java{} resources range from the core functionality
  provided by \tomcat{} to the messaging and queue management
  facilities of \activemq{}.
\item Expose \R{} statistical functionality to \Java{}
  programmers. Easy facilities for producing web services from \R{}
  packages via a \Java{} intermediary mean that the statistical
  computational abilities of \R{} are more readily accessible to
  \Java{} programmers.
\end{enumerate}

\section{Overcoming technical issues}

A primary technical challenge to offering \R{} packages as web
services is to interface the \R{} programming language with
\XML{}-based web services.
\begin{itemize}
\item We chose to target \Java{} as the initial translation from \R{},
  rather than a more ambitious attempt to write web services
  functionality for \R{} directly. Key issues include:
  \begin{itemize}
  \item Pro: This approach provides access to mature \Java{} web
    service resource tools.
  \item Con: Typically, this introduces a data and service invocation
    translation layer (from \R{} to \Java{}, in addition to the
    translation to \XML{}). This additional translation layer is not
    likely to impose a significant cost in terms of overall execution
    time, if only because parsing between native types and \XML{} (a
    necessary step regardless of strategy) is typically very slow.
  \item Con: The available object model (classes and methods) is
    reduced to the intersection of \R{}, \Java{}, and \XML{} object
    models. This requires that certain \R{} constructs (e.g., class
    unions) be employed with caution and that others (e.g., multiple
    inheritance) be avoided entirely.
  \end{itemize}
\item Web services and \Java{} require strongly typed methods and
  well-defined data objects. The \TypeInfo{} package provides
  facilities for strongly typing \R{} functions and S4 methods. The
  S4 class system provides enough structure for well-defined data
  types. Both \TypeInfo{} and the S4 system provide language
  introspection to programmatically translate \R{} methods into
  strongly typed \Java{} signatures.

  Even with these solutions, many \R{} methods and classes cannot be
  easily represented in a way appropriate for web services. S3-style
  classes do not contain enough information for language introspection
  to determine the mapping between \R{} and \Java{} types. The
  \obj{list} type translates to \Java{} \obj{Object[]}, but this is
  not sufficiently rich for use in a web services context. A solution
  is to wrap such data objects as S4 classes.

  Conversely, a common data paradigm in \Java{} or \XML{} is a
  collection of objects of complex type \obj{T}. While \R{} might
  represent this as a \obj{list}, with each member of the list
  implicitly of type \obj{T}, \TypeInfo{} does not provide an idiom
  for recognizing this paradigm and recovering appropriate information
  programmatically.
  A solution is to provide moderate type information in \R{} (object
  of type \obj{list}) and strong typing in \Java{}. An alternative
  solution is to recast the data structure in a way that can be
  strongly typed, e.g., a list of numeric vectors might be represented
  as a numeric matrix.
\item We use \SJava{} and additional facilities to accomplish
  \R{}$\leftrightarrow$\Java{} data and method
  mapping. \RWebServices{} implements two different object models for
  base \R{} types. The \texttt{robject} model more-or-less faithfully
  represents the underlying structure of \R{} objects in \Java{}
  (e.g., a `matrix' is a vector of data values, a vector of
  dimensions, a type label, facilities for `names' and \texttt{NA}
  values, etc.). The \texttt{javalib} model is more faithful to
  \Java{} data representations; a \obj{matrix} must be typed as, e.g.,
  \class{NumericMatrix}, and is represented in \Java{} as, e.g.,
  column-major \obj{double[]} and associated dimensions
  \obj{int[]}. There are no provisions for \texttt{NA} or \R{}
  attributes such as \obj{names}. These two different object models
  have consequences for interoperability (likely easier to achieve
  with the \texttt{javalib} model) and representation of statistical
  data (better with the \texttt{robject} model).
\end{itemize}

A second group of technical challenges revolves around service
availability and evaluation.
\begin{itemize}
\item The architecture adopted separates \Java{}-based service
  functionality from \R{} / \Java{} worker functionality. This means
  that \R{} does not need to be available to the web server,
  simplifying deployment and reducing the risk of server-side exposure
  to nefarious activities.
\item A realistic service model requires the ability to manage
  multiple requests simultaneously. We use \activemq{} to implement a
  messaging layer including customizable queues. \activemq{} is
  deployed separately from the web service, e.g., inside a
  firewall. Computation is performed by \R{} workers.
  Workers can be dynamically added to the pool, deployed on separate
  hosts, and customized to be capable of evaluating one or several
  services. Workers are persistent, minimizing service invocation
  costs.
\end{itemize}

Insights into additional technical challenges include:
\begin{itemize}
\item It is important to be able to conveniently encapsulate the
  service portion of \RWebServices{} into other web service
  containers, e.g., using the \introduce{} tool of \caGrid{}. The
  architecture of the service side of \RWebServices{} accommodates
  this separation.
\item Statistical data offers unique challenges, for instance:
  \begin{itemize}
  \item `Missing' or \texttt{NA} values are distinct from
    non-computable (\texttt{NaN}) or not representable (e.g.,
    \texttt{Inf}) values. These must be propagated successfully, both
    as input and return values. The \texttt{robject} model facilitates
    this (at the expense of greater client complexity, to continue the
    contract of dealing appropriately with \texttt{NA}); the
    \texttt{javalib} model assumes (at the risk of a runtime error)
    that any \texttt{NA} values are removed before service invocation
    (client responsibility) and before service return (\R{} service
    responsibility).
  \end{itemize}
\item Web service methods require programmatic (e.g., brief method and
  class description) and user (e.g., detailed description,
  interpretation of return values) documentation. \RWebServices{}
  parses \R{} \texttt{man} pages for method and class descriptions,
  annotating these as \javadoc{} to provide programmatic
  documentation. Complete user-level documentation is only available
  inside the \R{} package.
\end{itemize}

\section{Adapting to a web services environment}

The interactive, exploratory aspects of \R{} translate poorly to the
stateless and high-latency web services environment. Lessons learned
in addressing this issue include:
\begin{itemize}
\item Construct a coarse workflow granularity.
  Do this by identifying and consolidating common sequences of
  analytic steps, typically accomplished by arranging a sequence of
  \R{} package function calls into a logical workflow. Enhance the
  utility of the workflow by selectively exposing parameters available
  for manipulation -- this represents the transition from \R{}
  research software to web-based service.
\item Simplify result types. Many \R{} functions rely on side-effects
  (e.g., plots), but these are not useful for subsequent
  computation. Detailed results are sometimes only useful within
  \R{}. In these cases it is appropriate to simplify result types to
  emphasize computable data.
\item Imprimatur of scientific authority. \R{}'s pre-eminence as a
  research tool means that exploratory or experimental methods may be
  implemented, but these are often not appropriate for general or
  uncritical use. The services exposed need to be vetted to include
  only scientifically sound and established methods.
\end{itemize}

\section{Future opportunities}

Lessons learned during this project point to several future
opportunities.
\begin{itemize}
\item Implementing stateful services represents an opportunity to
  reduce data latency and restore some sense of interactive
  analysis. For instance, stateful services might facilitate services
  returning data for subsequent analysis, and services for return of
  non-computable results like plots.
\item Separating analytic services from the clients using them places
  an interpretive burden on the client. For instance, an \R{} user
  might combine input and output data into a figure, and use this to
  visually assess and guide subsequent analyses. This requires
  knowledge about how to appropriately superpose input and output
  data, in addition to the tools to do this. These tools are implicit
  in \R{}, but must be made explicitly available in the
  client. Possible solutions include:
  \begin{itemize}
  \item Burden lies with client.
    This solution requires that the client be programmed to interpret
    results, rather than merely retrieve them.
  \item Service returns complex data objects (e.g., a graphical
    summary of input and output, in addition to output data). The
    client can access parts of the object as appropriate for
    subsequent workflow, but needs to decompose the returned structure
    appropriately. Sufficiently complex return types could be
    difficult to document in a semantically meaningful way.
  \item User interacts repeatedly with stateful services. This
    solution requires that the client maintain a sense of state, and
    offers the user an indication of dependencies amongst services
    (e.g., viewing a plot only makes sense after an analysis has been
    performed).
  \end{itemize}
\item Documentation. Existing documentation tools and requirements
  emphasize programmatic descriptions of the API (e.g., \javadoc{}) or
  (in a \caGrid{} context) semantic classification of argument and
  return types. This level of documentation is inadequate for the
  user, who requires access to full manual pages for methods or
  tutorial-like documents summarizing appropriate use of functions.
\end{itemize}

\end{document}