%\VignetteIndexEntry{Overview of the robCompositions package}
\documentclass[a4paper,11pt]{scrartcl}
\usepackage[pdftex]{hyperref}
\usepackage{subfigure}
\usepackage{CoDaWork}
\hypersetup{colorlinks, citecolor=blue, linkcolor=blue, urlcolor=blue}
\newcommand{\R}[1]{\texttt{#1}}
%\usepackage{CoDaWork}
%opening
\title{Overview of the robCompositions package. Compositional Data Analysis using Robust Methods.}
\authors{M. TEMPL$^{1,3}$, P. FILZMOSER$^1$ and K. HRON$^2$}
\affiliation{$^1$Department of Statistics and Probability Theory - Vienna University of Technology, Austria \email{templ@tuwien.ac.at}\\
$^2$Department of Mathematical Analysis and Applications of Mathematics - Palack\'y University, Czech Republic\\
$^3$Statistics Austria, Vienna, Austria}

\begin{document}
\maketitle

\section{A Few Words about R and CoDa}
The free and open-source programming language and software environment \texttt{R} \citep{R} is currently both the most widely used and the most popular software for statistics and data analysis. In addition, \texttt{R} is becoming quite popular as a general programming language: as of February 2011 it ranks 25th in the TIOBE Programming Community Index (for comparison, Matlab ranks 29th and SAS 30th; see \href{http://www.tiobe.com}{http://www.tiobe.com}). The basic \texttt{R} environment can be downloaded from the Comprehensive \texttt{R} Archive Network (\href{http://cran.r-project.org}{http://cran.r-project.org}). \texttt{R} can be extended via \textit{packages}, which consist of code and structured standard documentation, including code examples and optional further documents (so-called \textit{vignettes}) that show further applications of the packages. \\
Two contributed packages for compositional data analysis are available for \texttt{R}, version~2.12.1: the package \texttt{compositions} \citep{Boo10} and the package \texttt{robCompositions} \citep{Templ11R}.
Package \texttt{compositions} provides functions for the consistent analysis of compositional data and positive numbers in the way originally proposed by John Aitchison \citep[see][]{Boo10}. In addition to the basic functionality and estimation procedures in package \texttt{compositions}, package \texttt{robCompositions} provides tools for classical and robust multivariate statistical analysis of compositional data, together with corresponding graphical tools. Several data sets and useful utility functions are provided as well.

\section{Motivation for Robust Statistics}
Both measurement errors and population outliers can have a high influence on classical estimators. The resulting estimates can be arbitrary and may then be interpreted wrongly. In addition, checking model assumptions is often no longer possible, since outliers may distort the fitted model itself. All these problems may be avoided by applying methods based on robust estimators. \\
To be more specific, a simple analysis is carried out in the following by applying principal component analysis, using function \texttt{pcaCoDa()} of package \texttt{robCompositions}, to the \textit{Arctic Lake Sediment Data} \citep{Ait86}. We show the effect of outliers on a \textbf{simplified example for demonstration purposes}. However, the same problems occur in higher dimensions, where principal component analysis is typically applied, mostly for dimension reduction.
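A minimal sketch of how such an analysis could be started in \texttt{R} is given below. It uses only names that appear in this overview (the data set \texttt{arcticLake}, \texttt{pcaCoDa()} with \texttt{method = "robust"}, and the generic \texttt{plot()}); the object name \texttt{pc\_rob} is purely illustrative, and argument names or defaults may differ between package versions.
\begin{verbatim}
library(robCompositions)   # assumes the package is installed from CRAN
data(arcticLake)           # the 3-part Arctic Lake compositions

# Robust principal component analysis for compositional data;
# method = "robust" is the setting listed in the function overview.
pc_rob <- pcaCoDa(arcticLake, method = "robust")
print(pc_rob)

# Robust compositional biplot via the plot method for the result.
plot(pc_rob)
\end{verbatim}
The sketch does not reproduce the figure discussed in this section; it only indicates the entry point for a robust principal component analysis of a compositional data set.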
\begin{figure}[ht]
\centering
\subfigure[Ternary diagram of the arcticLake data]{
\includegraphics[scale=0.3]{ternary}
\label{fig:ternary}
}
\subfigure[Ilr-transformed arcticLake data]{
\includegraphics[scale=0.3]{ilr}
\label{fig:ilr}
}
\subfigure[First principal component for ilr-transformed data]{
\includegraphics[scale=0.3]{pca1}
\label{fig:pca1}
}
\subfigure[First principal component back-transformed to original scale]{
\includegraphics[scale=0.3]{pcaOrig}
\label{fig:pca2}
}
\caption[]{%
The upper left graphic \subref{fig:ternary} shows a ternary diagram of the \textit{Arctic Lake Sediment Data}, and the upper right graphic \subref{fig:ilr} shows the ilr-transformed data. The first principal component is displayed for the ilr-transformed data in \subref{fig:pca1} and back-transformed to the ternary diagram in \subref{fig:pca2}.}
\label{fig:pca}
\end{figure}

Figure~\ref{fig:ternary} shows the 3-part compositions of the Arctic Lake Sediment Data in a ternary diagram. A few outliers are clearly visible, such as the two observations with higher percentages in the \textit{silt} part. After transforming the parts with the isometric log-ratio transformation \citep{EPMB03}, the outliers are still visible (see Figure~\ref{fig:ilr}). To obtain the principal components, the eigenvalues and eigenvectors of the covariance matrix need to be computed. A robust estimate of the underlying covariance matrix leads to robust principal components. Figure~\ref{fig:pca1} shows the direction of the first principal component for different covariance estimators: classical estimation (black solid line), robust estimation using the MM estimator \citep[see, e.g.,][]{MMY06} (grey dotted line), and the (fast) MCD estimator \citep{Rousseeuw99} (black dashed line), which has a high degree of robustness. It is easy to see that the first principal component from classical estimation is attracted by the few outliers in the lower right plot, while the principal components obtained from robust estimates are not.
Finally, Figure~\ref{fig:pca2} shows the first principal components of the classical and the robust estimators in the ternary diagram. Again, it is easy to see that the first principal component from classical estimation is highly influenced by outliers, especially by the two observations with a higher concentration in silt. The line does not follow the main part of the data. \\
This example shows that robust estimation is important for obtaining reliable estimates in the multivariate analysis of compositional data, especially for data that are more complex than this simple 3-part composition.

\section{Available Functionality}

In the following, the data sets and the most important functions of package \texttt{robCompositions} are briefly described. Note that most print and summary functions are not listed here; they are described in \cite{Templ11R}.

\subsection{Data sets}

Several compositional data sets are included in the package, including:\\

\begin{tabular}{p{4.5cm}p{11cm}}
%\begin{description}
\texttt{arcticLake} & The Arctic Lake Sediment Data from the Aitchison book \citep{Ait86}. \\ \hline
\texttt{coffee} & The Coffee Data contain 27 commercially available coffee samples of different origins \citep[see][]{Kor09}. \\ \hline
\texttt{expenditures} & The Household Expenditures Data on five commodity groups of 20 single men from the Aitchison book \citep{Ait86}. \\ \hline
\texttt{expendituresEU} & Mean consumption expenditure of households at EU level (2005), provided by Eurostat.\\ \hline
\texttt{haplogroups} & Distribution of European Y-chromosome DNA (Y-DNA) haplogroups by region in percentages, from Eupedia. \\ \hline
\texttt{machineOperators} & This data set from \cite[][p. 382]{Ait86} contains compositions of eight-hour shifts of 27 machine operators. \\ \hline
\texttt{phd} & PhD students in Europe based on the standard classification system, split by kind of studies (given as percentages), provided by Eurostat 2009.
\\ \hline
\texttt{skyeLavas} & AFM compositions of 23 aphyric Skye lavas \citep[][p. 360]{Ait86}.
%\end{description}
\end{tabular}

\subsection{Basic functions}

The package implements basic utility functions such as log-ratio transformations, as well as functions written in \texttt{C} (e.g., to compute distances between compositions). The most important are:\\

%\begin{table}[ht]
\begin{tabular}{p{4.5cm}p{11cm}}
%\begin{description}
\texttt{aDist(x, y)} & Computes the Aitchison distance between two observations or between two data sets. The underlying code is written in \texttt{C} and allows fast computation even for large data sets. \\ \hline
\texttt{constSum(x, const=1)} & Closes compositions so that they sum up to a given constant (default 1). \\ \hline
\texttt{robVariation(x, robust=TRUE)} & Estimates the variation matrix with robust or classical methods. \\ \hline
\texttt{ternaryDiag(x, \ldots)} & Ternary diagram, optionally with grid. \\ \hline
\texttt{alr(x, ivar=ncol(x))} & The alr transformation moves $D$-part compositional data from the simplex into a ($D-1$)-dimensional real space. \\ \hline
\texttt{invalr(x, \ldots)} & Inverse additive log-ratio transformation, often called additive logistic transformation. The function can also preserve absolute values when class information is provided. \\ \hline
\texttt{clr(x)} & The clr transformation moves $D$-part compositional data from the simplex into a $D$-dimensional real space. \\ \hline
\texttt{invclr(x, useClassInfo = TRUE)} & The inverse clr transformation. Absolute values are optionally preserved. \\ \hline
\texttt{ilr(x)} & An isometric log-ratio transformation with a special choice of the balances according to \cite{hron10}. \\ \hline
\texttt{invilr(x.ilr)} & The inverse transformation of \texttt{ilr()}.
%\end{description}
\end{tabular}
%\end{table}

\subsection{Exploratory Tools}

Multivariate outlier detection can give a first impression of the general data structure and quality \citep{MMY06}. This is also true for compositional data. The compositions are first transformed to real space before robust methods are applied for outlier detection \citep{FH08}. The (robust) compositional biplot displays both samples and variables of a data matrix graphically in the form of scores and loadings of a principal component analysis, preferably after clr transformation of the data, which eases the interpretation of the biplot \citep{FHR09}. The package provides the following functionality for exploratory compositional data analysis: \\

\begin{tabular}{p{4.5cm}p{11cm}}
\texttt{outCoDa(x, \ldots)} & Outlier detection for compositional data using classical and robust statistical methods \citep{FH08}. \\ \hline
\texttt{plot.outCoDa} or \texttt{plot()} & Plots the Mahalanobis distances to detect potential outliers. \\ \hline
\texttt{pcaCoDa(x, method = "robust")} & This function applies robust principal component analysis for compositional data \citep{FHR09}. \\ \hline
\texttt{plot.pcaCoDa()} or \texttt{plot()} & Provides robust compositional biplots.
\end{tabular}

\subsection{Model-based Multivariate Estimation and Tests}

Outliers may lead to model misspecification, biased parameter estimation and incorrect results. The main functionality of package \texttt{robCompositions} concerns model-based estimation, namely factor analysis \citep{FHRG09}, discriminant analysis \citep{FHT09} and imputation of rounded zeros \citep{Palarea08} or missing values \citep{hron10}. The package provides the following functions:\\

\begin{tabular}{p{4.5cm}p{11cm}}
\texttt{adtest(x, R = 1000, locscatt = "standard")} & This function provides three kinds of Anderson-Darling normality tests.
\\ \hline
\texttt{adtestWrapper(x, alpha = 0.05, R = 1000, robustEst = FALSE)} & A set of Anderson-Darling tests is applied as proposed by \cite{Ait86}. \\ \hline
\texttt{summary.adtestWrapper} or \texttt{summary()} & Summary of the adtestWrapper results.\\ \hline
\texttt{alrEM(x, pos = ncol(x), dl = \ldots)} & A modified EM alr-algorithm for replacing rounded zeros in compositional data sets \citep{Palarea08}. \\ \hline
\texttt{daFisher(x, grp, \ldots)} & Discriminant analysis by Fisher's rule \citep[as described in][]{FHT09}. \\ \hline
\texttt{impCoda(x, method = "ltsReg", \ldots)} & Iterative model-based imputation of missing values using special balances \citep{hron10}. \\ \hline
\texttt{impKNNa(x, method = "knn", k = 3, \ldots)} & This function offers several k-nearest neighbor methods for the imputation of missing values in compositional data \citep{hron10}. \\ \hline
\texttt{plot.imp()} or \texttt{plot()} & This function provides several diagnostic plots for the imputed data set, showing how the imputed values are distributed in comparison with the original data values \citep{Templ09}. \\ \hline
\texttt{pfa(x, factors, \ldots)} & Computes the principal factor analysis of the input data, which are clr transformed first. \\
\end{tabular}

\section{Conclusion and Outlook}

In this contribution we started with a short motivation of why robustness is a major concern in compositional data analysis. We then briefly introduced and listed the methods implemented in package \texttt{robCompositions}. More details about each function can be found in the manual of the package \citep{Templ11R} and in the book chapter about \texttt{robCompositions} in the forthcoming book \textit{Compositional Data Analysis: Theory and Applications} \citep{Vera11}.
\\ The package is distributed under the \textit{General Public Licence, version 2}, and can simply be downloaded from \href{http://cran.r-project.org/package=robCompositions}{http://cran.r-project.org/package=robCompositions}.\\
Future developments include further methods for replacing rounded zeros in the data as well as methods for handling structural zeros. Furthermore, a graphical user interface is currently being developed. Comments and collaborations regarding the development of the package are warmly welcome.

\bibliographystyle{chicago}
% name your BibTeX data base
\begin{thebibliography}{}

\bibitem[\protect\citeauthoryear{Aitchison}{Aitchison}{1986}]{Ait86}
Aitchison, J. (1986).
\newblock {\em {T}he {S}tatistical {A}nalysis of {C}ompositional {D}ata}.
\newblock Monographs on {S}tatistics and {A}pplied {P}robability. Chapman \& Hall Ltd., London (UK). (Reprinted in 2003 with additional material by The Blackburn Press).
\newblock 416 p.

\bibitem[\protect\citeauthoryear{van~den Boogaart, Tolosana, and Bren}{van~den Boogaart et~al.}{2010}]{Boo10}
van~den Boogaart, K.~G., R.~Tolosana, and M.~Bren (2010).
\newblock {\em compositions: Compositional Data Analysis}.
\newblock R package version 1.10-1.

\bibitem[\protect\citeauthoryear{Egozcue, Pawlowsky-Glahn, Mateu-Figueras, and Barcel\'o-Vidal}{Egozcue et~al.}{2003}]{EPMB03}
Egozcue, J.J., V.~Pawlowsky-Glahn, G.~Mateu-Figueras, and C.~Barcel\'o-Vidal (2003).
\newblock Isometric logratio transformations for compositional data analysis.
\newblock {\em Mathematical Geology\/}~{\em 35\/}(3), 279--300.

\bibitem[\protect\citeauthoryear{Filzmoser and Hron}{Filzmoser and Hron}{2008}]{FH08}
Filzmoser, P., and K.~Hron (2008).
\newblock Outlier detection for compositional data using robust methods.
\newblock {\em Mathematical Geosciences\/}~{\em 40\/}(3), 233--248.

\bibitem[\protect\citeauthoryear{Filzmoser, Hron, and Reimann}{Filzmoser et~al.}{2009}a]{FHR09}
Filzmoser, P., K.~Hron, and C.~Reimann (2009a).
\newblock Principal component analysis for compositional data with outliers.
\newblock {\em Environmetrics\/}~{\em 20\/}(6), 621--632.

\bibitem[\protect\citeauthoryear{Filzmoser, Hron, Reimann, and Garrett}{Filzmoser et~al.}{2009}b]{FHRG09}
Filzmoser, P., K.~Hron, C.~Reimann, and R.G.~Garrett (2009b).
\newblock Robust factor analysis for compositional data.
\newblock {\em Computers and Geosciences\/}~{\em 35}, 1854--1861.

\bibitem[\protect\citeauthoryear{Filzmoser, Hron, and Templ}{Filzmoser et~al.}{2009}c]{FHT09}
Filzmoser, P., K.~Hron, and M.~Templ (2009c).
\newblock Discriminant analysis for compositional data and robust parameter estimation.
\newblock Technical Report SM-2009-3, Vienna University of Technology, Austria. Submitted for publication.

\bibitem[\protect\citeauthoryear{Hron, Templ, and Filzmoser}{Hron et~al.}{2010}]{hron10}
Hron, K., M.~Templ, and P.~Filzmoser (2010).
\newblock Imputation of missing values for compositional data using classical and robust methods.
\newblock {\em Computational Statistics and Data Analysis\/}~{\em 54\/}(12), 3095--3107.
\newblock DOI:10.1016/j.csda.2009.11.023.

\bibitem[\protect\citeauthoryear{Korho\v{n}ov\'a, Hron, Klim\v{c}\'ikov\'a, M\"uller, Bedn\'a\v{r}, and Bart\'ak}{Korho\v{n}ov\'a et~al.}{2009}]{Kor09}
Korho\v{n}ov\'a, M., K.~Hron, D.~Klim\v{c}\'ikov\'a, L.~M\"uller, P.~Bedn\'a\v{r}, and P.~Bart\'ak (2009).
\newblock Coffee aroma - statistical analysis of compositional data.
\newblock {\em Talanta\/}~{\em 80\/}(2), 710--715.

\bibitem[\protect\citeauthoryear{Maronna, Martin, and Yohai}{Maronna et~al.}{2006}]{MMY06}
Maronna, R., D.~Martin, and V.~Yohai (2006).
\newblock {\em Robust {S}tatistics: {T}heory and {M}ethods}. John Wiley {\&} Sons Canada Ltd., Toronto, ON.

\bibitem[\protect\citeauthoryear{Palarea-Albaladejo and Mart\'{i}n-Fern\'{a}ndez}{Palarea-Albaladejo and Mart\'{i}n-Fern\'{a}ndez}{2008}]{Palarea08}
Palarea-Albaladejo, J., and J.A.~Mart\'{i}n-Fern\'{a}ndez (2008).
\newblock A modified EM alr-algorithm for replacing rounded zeros in compositional data sets.
\newblock {\em Computers and Geosciences\/}~{\em 34}, 902--917.

\bibitem[\protect\citeauthoryear{Pawlowsky-Glahn and Buccianti}{Pawlowsky-Glahn and Buccianti}{2011}]{Vera11}
Pawlowsky-Glahn, V., and A.~Buccianti (2011).
\newblock {\em Compositional Data Analysis: Theory and Applications}. John Wiley {\&} Sons Canada Ltd., Toronto, ON. Accepted for publication.

\bibitem[\protect\citeauthoryear{{R Development Core Team}}{{R Development Core Team}}{2010}]{R}
{R Development Core Team} (2010).
\newblock {\em R: A Language and Environment for Statistical Computing}.
\newblock Vienna, Austria: R Foundation for Statistical Computing.
\newblock {ISBN} 3-900051-07-0.

\bibitem[\protect\citeauthoryear{Rousseeuw and Van~Driessen}{Rousseeuw and Van~Driessen}{1999}]{Rousseeuw99}
Rousseeuw, P.J., and K.~Van~Driessen (1999).
\newblock A fast algorithm for the minimum covariance determinant estimator.
\newblock {\em Technometrics\/}~{\em 41}, 212--223.

\bibitem[\protect\citeauthoryear{Templ, Hron, and Filzmoser}{Templ et~al.}{2011}]{Templ11R}
Templ, M., K.~Hron, and P.~Filzmoser (2011).
\newblock {\em robCompositions: Robust Estimation for Compositional Data}.
\newblock Manual and package, version 1.4.4.

\bibitem[\protect\citeauthoryear{Templ, Filzmoser, and Hron}{Templ et~al.}{2009}]{Templ09}
Templ, M., P.~Filzmoser, and K.~Hron (2009).
\newblock Imputation of item non-responses in compositional data using robust methods.
\newblock {\em Work Session on Statistical Data Editing}, Neuchatel, Switzerland, 11 pages.

\end{thebibliography}
\end{document}