%\VignetteEngine{knitr}
%\VignetteIndexEntry{metaX tutorial}
%\VignetteKeywords{Metabolomics, Quality Control,LC/GC-MS, Statistical analysis}
%\VignettePackage{metaX}
%\VignetteIndexEntry{Bioconductor LaTeX Style}
%\VignettePackage{BiocStyle}

\documentclass[12pt]{article}

<<style, eval=TRUE, echo=FALSE, results='asis'>>=
BiocStyle::latex()
@


\bioctitle{A short tutorial on using \Biocpkg{metaX} for high-throughput mass
  spectrometry-based metabolomic data analysis}

\author{Bo Wen}


\begin{document}

\maketitle

\tableofcontents

<<env, echo=FALSE,warning=FALSE,message=FALSE>>=
suppressPackageStartupMessages(library("metaX"))
#suppressPackageStartupMessages(library("R.utils"))
@

<<figSetup, echo = FALSE >>=
x <- NULL
ii <- 0 # Current figure number
fn <- function(x) {
    ii <<- ii + 1 
    figString <<- paste('\\textbf{Figure ', ii, '.} ', x, sep = '') 
    return(x) 
  }
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}\label{sec:intro} 

The \Biocpkg{metaX} package provides a integrated pipeline for mass
spectrometry-based metabolomic data analysis. It includes the stages peak 
detection, data preprocessing, normalization, missing value imputation, 
univariate statistical analysis, multivariate statistical analysis such as 
PCA and PLS-DA, metabolite identification, pathway analysis, power analysis, 
feature selection and modeling, data quality assessment and 
HTML-based report generation. This document describes how to use the function 
included in the R package \Biocpkg{metaX}.


\section{Example data}\label{sec:data} 
We are going to use two public datasets to show the functions in this tutorial. One is from the reference \cite{saghatelian2004assignment}. 
This data can be accessed through the \Biocpkg{faahKO} package. The samples in 
this data set can be divided into two groups (group knockout or KO, group wild 
type or WT) which each group includes six samples. The other is from the reference \cite{kirwan2014direct} and can be downloaded from \href{http://www.ebi.ac.uk/metabolights/MTBLS79}{MetaboLights}.


\section{Using metaX}\label{sec:use}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% 
\subsection{Data import}

\Biocpkg{metaX} has been designed to accept diverse data types including compound concentration/intensity tables which generated by \Biocpkg{XCMS}, MZmine \cite{katajamaa2006mzmine}, \href{http://www.nonlinear.com/progenesis/qi/}{Progenesis QI} (csv format) or other software which can be used for peak picking, as well as open file format MS data (such as mzXML, NetCDF).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Input compound intensity table file
\subsubsection{Input compound intensity table file}
\Biocpkg{metaX} accepts several peak intensity/concentration formats.

\begin{itemize}
\item \href{http://www.nonlinear.com/progenesis/qi/}{Progenesis QI} peak picking result file (csv format). It can be imported by the following function:
<< eval=FALSE>>=
## not run
para <- importDataFromQI(para,file="qi.csv")
@

\item \Biocpkg{XCMS} peak picking result file (txt or csv format). It can be imported by the following function:
<< eval=FALSE>>=
## not run
para <- importDataFromXCMS(para,file="xcms.txt")
@

\item \href{http://www.metaboanalyst.ca}{MetaboAnalyst} comatible peak intensity table (csv format). There is an example input file (\href{http://www.metaboanalyst.ca/resources/data/lcms_table.csv}{lcms\_table.csv}) in the website of MetaboAnalyst. It can be imported by the following function:
<< eval=FALSE>>=
## not run
para <- importDataFromMetaboAnalyst(para,
 file="http://www.metaboanalyst.ca/resources/data/lcms_table.csv")
@

\item \Biocpkg{metaX} comatible peak intensity table (txt or csv format), in which both sample or feature names must be unique. In this file, there must be a column named \textbf{"name"} which listed the feature IDs. The other columns contains the intensity/concentration data and the names of the columns are sample names. This data file can be imported by the following method:
<< eval=TRUE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
head(para@rawPeaks[,1:4])
@
\end{itemize}

After importing the peak intensity data, the user also need to set the sample list file. A very important step in the \Biocpkg{metaX} pipeline is the definition of a sample list file, that provides the file names (sample), batch number (batch), sample class (class) and the sample injection order (order). Please note that if the sample list file contains quality control (QC) sample, the value in the column of class must be "NA". The sample list file can be set as below:

<<eval=TRUE>>=
## set the sample list file path
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
sampleListFile(para) <- sfile

## print the object:
para
@

The content of this sample list is shown below:

<<eval=TRUE,echo=FALSE>>=
library(data.table)
ss <- read.delim(para@sampleListFile)
head(ss)
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% MS Data
\subsubsection{Input MS data files}
If the user provides MS data (NetCDF, mzXML and so on), \Biocpkg{metaX} uses the \Biocpkg{XCMS} to perform peak picking, followed by the CAMERA \cite{kuhl2011camera} package to perform peak annotation. In this situation, the MS data must be placed in two subdirectories of a single folder like below:

<<exampleDir, eval=TRUE>>=
list.files(system.file("cdf", package = "faahKO"),
           recursive = TRUE,full.names = TRUE)
@

In the \Biocpkg{metaX} package, it uses a \Rclass{metaXpara-class} object to manage the 
file path information and other parameters for data processing. We can set the 
input files path like below:

<< eval=TRUE>>=
## create a metaXpara-class object
#library("metaX")
para <- new("metaXpara")
## set the MS data path
dir.case(para) <- system.file("cdf/KO", package = "faahKO")
dir.ctrl(para) <- system.file("cdf/WT", package = "faahKO")
@

After setting the MS data file path, the user also need to set the sample list file:

<<eval=TRUE>>=
## set the sample list file path
sampleFile <- system.file("extdata/faahKO_sampleList.txt", 
                              package = "metaX")
sampleListFile(para) <- sampleFile
samList <- read.delim(sampleFile)
print(samList)
@

In general, the user also needs to set several other parameters for peak picking and annotation. Several parameters related to peak picking and annotation must be set. 

\begin{itemize}
\item Peak picking. The peak picking related parameters can be set as below:
<<eval=TRUE>>=
## set parameters for peak picking
xcmsSet.peakwidth(para) <- c(20,50)
xcmsSet.snthresh(para) <- 10
xcmsSet.prefilter(para) <- c(3,100)
xcmsSet.noise(para) <- 0
xcmsSet.nSlaves(para) <- 4
@

\item Peak annotation. The peak annotation related parameters can be set as below:
<<eval=TRUE>>=
## set parameters for peak annotation
group.bw(para) <- 5
group.minfrac(para) <- 0.3
group.mzwid(para) <- 0.015
group.max(para) <- 1000
@
\end{itemize}

For the complete parameters, please see the help page of metaXpara-class. 

In \Biocpkg{metaX}, there is a function \Rfunction{peakFinder()}, which can be used to do the peak picking and annotation.
<<eval=FALSE>>=
p <- peakFinder(para)
@

For the complete parameters, please see the help page of metaXpara-class. 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% 
\subsection{Integrated function \Rfunction{metaXpipe()}}

The function metaXpipe automates the whole data analysis process. In general, the user only need to use this function to do most of the analysis. It includes the following steps:

\begin{enumerate}
\item Peak picking and annotation when input MS data.
\item Data pre-processing: Firstly, if an metabolite feature is detected in \textless 50\% of QC samples or detected in \textless 20\% of experimental samples, it is removed from the rest data analysis. 
\item Missing value imputation.
\item Removal of outliers (sample).
\item Normalization.
\item Feature filter according to the CV (30\%) of features in QC sample. It only works when there are QC samples.
\item Univariate statistical analysis, such as \textbf{t test}, \textbf{Mann-Whitney U test} and \textbf{ROC} analysis. 
\item PCA (score plot, loading plot)
\item PLS-DA (score plot, loading plot, permutation test, Q2, R2)
\item Assessment of data quality:
    \begin{enumerate}
    \item The peak number distribution
    \item The number of missing value distribution
    \item The boxplot of peak intensity
    \item The total peak intensity distribution
    \item The correlation heatmap of QC samples if available
    \item Metabolite m/z (or mass) distribution
    \item Plot of m/z versus retention time
    \item PCA score or loading plot of all samples
    \end{enumerate}

\end{enumerate}

A complete example to show how to run the integrated analysis is shown below:
<<rawPeaks,eval=FALSE,echo=TRUE,results='hide',warning=FALSE,message=FALSE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
ratioPairs(para) <- "S:C"
plsdaPara <- new("plsDAPara")
para@outdir <- "output"
p <- metaXpipe(para,plsdaPara=plsdaPara,cvFilter=0.3,remveOutlier = TRUE,
  outTol = 1.2, doQA = TRUE, doROC = TRUE, qcsc = FALSE,
  nor.method = "pqn", pclean = TRUE, t = 1, scale = "uv",)
@

In the above example, the class \Rclass{plsDAPara} from \Biocpkg{metaX} is used to control the parameters for \textbf{PLS-DA} analysis. For the complete parameters, please see the help page of \Rclass{plsDAPara-class}.  

After the analysis has completed, the file "index.html" in the output directory can be opened in a web browser to access report generated.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% 
\subsection{Function modules}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% 
\subsubsection{Peak picking and annotation}

In metaX, the function \Rfunction{peakFinder()} can be used to do the peak picking and annotation. It uses the \Biocpkg{XCMS} to perform peak picking, followed by the CAMERA \cite{kuhl2011camera} package to perform peak annotation.


\subsubsection{Missing value imputation}

Missing value imputation. Missing values is a common phenomenon in a 
typical quantitative metabolomics dataset. There are several methods provided by
\Biocpkg{metaX} to process the missing value. Currently, we implemented a 
variety of methods which enable users to automatically perform missing value 
imputation by min, Probabilistic PCA (PPCA), Bayesian PCA (BPCA), k nearest-neighbor 
(KNN), missForest and Singular Value Decomposition Imputation (SVDImpute).

When the user uses the function \Rfunction{metaXpipe} to do the analysis, the following code can be used to set the imputation method:
<<missValueImputeMethod1,eval=TRUE>>=
## bpca, svdImpute, knn, rf, min
missValueImputeMethod(para) <- "knn"
@

Also, the user can use the function \Rfunction{missingValueImpute} to perform the missing value imputation:

<<eval=TRUE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
## bpca, svdImpute, knn, rf, min
para <- missingValueImpute(para,method="knn")
@

\subsubsection{Normalization}

Currently, we implemented several methods to perform data 
normalization, such as the QC-robust spline batch correction (QC-RSC) \cite{dunn2011procedures}, sum, VSN, probabilistic quotient 
normalization (PQN) \cite{dieterle2006probabilistic}, quantiles and robust quantiles. 

If there are pooled QC samples in the experiment (the experiment is like figure~\ref{qcdesign}), the function \Rfunction{doQCRLSC} can be used to perform the QC-RSC normalization. This method is implemented to correct data within batch experiment analytical variation, and batch-to-batch variation in large-scale studies.
<<nor11,fig.height=2,fig.width=6,fig.align='center',echo=FALSE,fig.cap=fn("\\label{qcdesign}The experiment design which contained QC samples.")>>=
par(mar=c(0,0,0,0))
sname <- c(rep("QC",3),rep("S",3),"...",rep("QC",1),rep("S",3),"...",rep("QC",3))
cname <- ifelse(sname=="QC","green","yellow")
xname <- 1:length(sname)
plot(xname,rep(1,length(xname)),type="n",axes=FALSE,ylim=c(0.6,1.2))
points(xname,rep(1,length(xname)),pch=22,cex=4.5,col=cname,bg=cname)
arrows(x0 = 1,x1 = 4,y0 = 0.8,y1 = 0.8,lwd = 3)
text(xname,rep(1,length(xname)),labels = sname)
text(1,0.85,labels = "Sample injection order",cex=1,adj = c(0,0))
@
<<QC,eval=FALSE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)[1:100,]
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
res <- doQCRLSC(para,cpu=1)
@

Except the QC-RSC method, the user can use the other method (PQN, sum et al.) as below:
<<eval=FALSE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
para <- metaX::normalize(para,method="pqn",valueID="value")
@



\subsubsection{Pre-processing of raw peak data}
In \Biocpkg{metaX}, two functions, \Rfunction{filterQCPeaks()} and \Rfunction{filterPeaks()} can be used to filter features. In general, an metabolite feature is detected in \textless 50\% of QC samples (by using \Rfunction{filterQCPeaks()}) or detected in \textless 20\% (by using \Rfunction{filterPeaks()}) of experimental samples, it is removed from the rest data analysis.
<<eval=FALSE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
p <- filterPeaks(para,ratio=0.2)
p <- filterQCPeaks(para,ratio=0.5)
@

\subsubsection{Removal of outliers}
\Biocpkg{metaX} provides function (Rfunction{autoRemoveOutlier()}) to automatically remove the outlier samples in the pre-processed data based on expansion of the Hotellings T2 distribution ellipse. A sample within the first and second component PCA score plot beyond the expanded ellipse is removed, then the PCA model is recalculated. In default, three rounds of outlier removal are performed.
<<eval=FALSE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
rs <- autoRemoveOutlier(para,valueID="value")
@

\subsubsection{Power analysis and sample size estimation}
In \Biocpkg{metaX}, the function \Rfunction{powerAnalyst()} can be used to do power analysis and sample size estimation.
<<eval=TRUE,fig.align='center',fig.height=4,fig.width=6,fig.cap=fn("\\label{powerplot}The power and sample size distribution.")>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
para <- metaX::normalize(para)
para <- transformation(para,valueID = "value")
para <- preProcess(para,scale = "pareto",valueID="value")
res <- powerAnalyst(para,group=c("C","S"),log=FALSE,
                    maxInd=200,showPlot = TRUE)
print(res)
@


\subsubsection{Metabolite correlation network analysis}

In \Biocpkg{metaX}, the function \Rfunction{plotNetwork()} can be used to do correlation network analysis.
<<eval=TRUE,fig.align='center',fig.height=6,fig.width=6,fig.cap=fn("\\label{netplot}The correlation network."),warning=FALSE,cache=TRUE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
op <- par()
par(mar=c(0,0,0,0))
gg <- plotNetwork(para,group=c("S","C"),degree.thr = 10,
                  cor.thr = 0.85,showPlot=TRUE)
par(op)
@

\subsubsection{Metabolite identification}

In \Biocpkg{metaX}, the function \Rfunction{metaboliteAnnotation()} can be used to do the metabolite identification. Currently, only HMDB \cite{wishart2012hmdb} is supported.

\subsubsection{Functional analysis}

At the present, metaX supports the function for metabolite pathway analysis based on IMPaLA \cite{kamburov2011integrated}. The function \Rfunction{pathwayAnalysis()} can be used to do the pathway analysis.

<<eval=TRUE,cache=TRUE>>=
res <- pathwayAnalysis(id=c("HMDB00060","HMDB00056","HMDB00064"),
                       outfile="pathway.csv")
@
Part of the pathway analysis result is shown below:
<<eval=TRUE,results='asis'>>=
xtable::xtable(head(res[,2:4]))
@

\subsubsection{Biomarker analysis}
\Biocpkg{metaX} uses the functions from the R package "caret" to perform the biomarker selection, model creation and performance evaluation.
<<eval=TRUE,warning=FALSE,message=FALSE,cache=TRUE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)[1:100,]
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
res <- featureSelection(para, group=c("S","C"), method = "rf", 
                        valueID = "value", fold = 5)

@

The biomarker selection result is shown below:
<<eval=TRUE,results='asis'>>=
xtable::xtable(head(res$results))
@

The best feature(s) is below:
<<eval=TRUE,results='asis'>>=
print(res$optVariables)
@

\subsubsection{Data transformation}
There are two methods which can be used to do transformation, log transformation and cube root transformation. The function \Rfunction{transformation()} can be used to do transformation like below:

<<eval=TRUE,results='asis',cache=TRUE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
para <- transformation(para,valueID = "value",method=1)
@

\subsubsection{Data scaling}
There are three methods which can be used to do scaling, "pareto", "vector", "uv". The function \Rfunction{preProcess()} can be used to do scaling like below:

<<eval=TRUE,results='asis',cache=TRUE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
para <- metaX::preProcess(para,valueID = "value",scale="uv")
@

\subsubsection{Univariate statistical analysis}
For univariate statistical analysis, the parametric statistical test (t test), non-parametric statistical test (Mann-Whitney U test), and classical univariate receiver operating characteristic (ROC) curve analysis are implemented. The function \Rfunction{peakStat()} can be used to do these analysis.

<<eval=TRUE,results='asis',cache=TRUE,message=FALSE,warning=FALSE,fig.show="hide">>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)[1:50,]
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
ratioPairs(para) <- "S:C"
addValueNorm(para) <- para
plsdaPara <- new("plsDAPara")
plsdaPara@nperm <- 10
res <- peakStat(para,plsdaPara,doROC = TRUE)
@

The part of the analysis result is shown below: 
<<eval=TRUE>>=
head(res@quant)
@


\subsubsection{PCA}
The PCA analysis can be performed by the function \Rfunction{plotPCA()} in \Biocpkg{metaX}. The score plot (as shown in figure~\ref{pcaplot}) for 2D plot, figure~\ref{3dpcaplot}) for 3D plot) and loading plot (as shown in figure~\ref{pcaloadingplot})) are outputed.

<<eval=TRUE,warning=FALSE,cache=TRUE,message=FALSE,fig.align='center',fig.height=5,fig.width=5,fig.cap=fn("\\label{pcaplot}The PCA score plot.")>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
para <- transformation(para,valueID = "value")
metaX::plotPCA(para,valueID="value",scale="pareto",center=TRUE,rmQC = FALSE)
@

<<eval=TRUE,echo=FALSE,fig.align='center',fig.height=5,fig.width=5,fig.cap=fn("\\label{pcaplot}The PCA score plot."),warning=FALSE,message=FALSE>>=
library(ggplot2)
library(reshape2)
library(dplyr)
pca.res <- readRDS("metaX-pca.rds")

x <- para@peaksData
x$class <- as.character(x$class)
x$class[is.na(x$class)] <- "QC"
x<-dcast(x,ID~sample,value.var = "value")
              row.names(x) <- x$ID
              x$ID <- NULL
plotData <- data.frame(x=pca.res@scores[,1],
                                     y=pca.res@scores[,2],
                                     z=pca.res@scores[,3],
                                     sample=names(x))
sampleList  <- read.delim(para@sampleListFile)
plotData <- merge(plotData,sampleList,by="sample",sort=FALSE)
plotData$class <- as.character(plotData$class)
plotData$class[is.na(plotData$class)] <- "QC" 
plotData$batch <- as.character(plotData$batch)
ggobj <-ggplot(data = plotData,aes(x=x,y=y,colour=class))+
                  geom_hline(yintercept=0,colour="white",size=1)+
                  geom_vline(xintercept=0,colour="white",size=1)+
                  #geom_point()+
                  xlab(paste("PC1"," (",sprintf("%.2f%%",100*pca.res@R2[1]),") ",sep=""))+
                  ylab(paste("PC2"," (",sprintf("%.2f%%",100*pca.res@R2[2]),") ",sep=""))+
                  #theme_bw()+
                  theme(#legend.justification=c(1,1), 
                      #legend.position=c(1,1),
                      legend.position="top",
                      panel.grid.major = element_blank(), 
                      panel.grid.minor = element_blank(),
                      panel.background=element_rect(fill="#E3E3EE"))+
                  #theme(legend.direction = 'horizontal', legend.position = 'top')+
                  stat_ellipse(geom = "polygon", type="euclid",alpha = 0.4, 
                               aes(fill = class))+
                  stat_ellipse(geom = "path")
              
ggobj <- ggobj + geom_text(aes(label=order),size=4,hjust=-0.2)
ggobj <- ggobj + geom_point(aes(shape=batch))+
                      scale_shape_manual(values=1:n_distinct(plotData$batch))
##print(ggobj)              
           
@
<<eval=TRUE,echo=FALSE,fig.align='center',fig.height=5,fig.width=5,fig.cap=fn("\\label{3dpcaplot}The 3D PCA score plot."),warning=FALSE,message=FALSE>>=
library(scatterplot3d)
col <- as.numeric(as.factor(plotData$class))
              s3d <- scatterplot3d(plotData$x,plotData$y,plotData$z,type="h",angle = 24,
                                   col.grid="lightblue",lty.hplot=2,pch="",color="gray",
                                   xlab = paste("PC1"," (",
                                                sprintf("%.2f%%",100*pca.res@R2[1]),") ",sep=""),
                                   ylab = paste("PC2"," (",
                                                sprintf("%.2f%%",100*pca.res@R2[2]),") ",sep=""),
                                   zlab = paste("PC3"," (",
                                                sprintf("%.2f%%",100*pca.res@R2[3]),") ",sep="")
              )#color = as.numeric(as.factor(plotData$class)))
              s3d$points(plotData$x,plotData$y,plotData$z, pch = 1,col = col)
              s3d.coords <- s3d$xyz.convert(plotData$x,plotData$y,plotData$z)
              text(s3d.coords$x, s3d.coords$y, labels = plotData$order,
                   pos = 4,cex=0.5,col = col)
              
              
              
              classLabel <- levels(as.factor(plotData$class))
              legend(s3d$xyz.convert(max(plotData$x)*0.7, 
                                     max(plotData$y), min(plotData$z)), 
                     col=as.numeric(as.factor(classLabel)), yjust=0,pch=1,
                     legend = classLabel, cex = 0.8)
@


<<eval=TRUE,echo=FALSE,fig.align='center',fig.height=5,fig.width=5,fig.cap=fn("\\label{pcaloadingplot}The PCA loading plot."),warning=FALSE,message=FALSE>>=
loadings_data <- as.data.frame(pca.res@loadings)
loadings_data <- as.matrix(loadings_data)
    HotEllipse<-abs(cbind(metaX:::HotE(loadings_data[,1],loadings_data[,2])))*0.9
    outliers<-as.numeric()
    for (i in 1:nrow(loadings_data)){

        sample<-abs(loadings_data[i,])
        out.PC1<-which(HotEllipse[,1]<sample[1])
        out.PC1.PC2<-any(HotEllipse[out.PC1,2]<sample[2])*1

        outlier<-ifelse(out.PC1.PC2>0,1,0)#+out.PC1.PC3+out.PC2.PC3>0,1,0)
        outliers<-c(outliers,outlier)
    }

    dat <- as.data.frame(loadings_data)
    dat$label <- row.names(dat)
    dat$label[outliers==0] <- ""
    dat$col <- as.character(outliers)
    dat$alllabel <- row.names(dat)

    gg <- ggplot(data=dat,aes(x=PC1,y=PC2,colour=col))+
        geom_hline(yintercept=0,linetype=2)+
        geom_vline(xintercept=0,linetype=2)+
        geom_point(alpha=0.5)+
        theme_bw()+
        theme(legend.position="none")
    gg <- gg + geom_text(aes(label=label),size=2.5,vjust=0,hjust=0)
    print(gg)
@

\subsubsection{PLS-DA}

The PLS-DA analysis can be performed by the function \Rfunction{runPLSDA()} in \Biocpkg{metaX}. The R2, Q2 and the permutation test p-value are calculated and outputed. The permutation test plot is shown in figure~\ref{permutationtest}. The score plot and loading plot are shown in figure~\ref{scoreplotplsda} and figure~\ref{loadingplotplsda}, respectively.

<<eval=TRUE,warning=FALSE,cache=TRUE,message=FALSE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
para <- metaX::normalize(para,method="pqn")
plsdaPara <- new("plsDAPara")
plsdaPara@nperm <- 100
plsda.res <- runPLSDA(para = para,plsdaPara = plsdaPara,
                      sample = c("S","C"),valueID="value")

cat("R2Y: ",plsda.res$plsda$res$R2,"\n")
cat("Q2Y: ",plsda.res$plsda$res$Q2,"\n")

## permutation test 
cat("P-value R2Y: ",plsda.res$pvalue$R2,"\n")
cat("P-value Q2Y: ",plsda.res$pvalue$Q2,"\n")
@


<<eval=TRUE,echo=FALSE,fig.align='center',fig.height=5,fig.width=5,fig.cap=fn("\\label{permutationtest}The permutation test plot for PLS-DA."),warning=FALSE,message=FALSE>>=
plotdat <- as.matrix(cbind(unlist(plsda.res$plsda$res),plsda.res$plsda$perm))
    plotdat <- as.data.frame(t(plotdat))
    x1 <- plotdat$cor[1]
    y1 <- plotdat$R2[1]
    y2 <- plotdat$Q2[1]
    plotdat$cor <- abs(plotdat$cor)    
    plotdat <- plotdat[order(plotdat$cor),]
    par(mar=c(3,3,2,1),mgp=c(1.6,0.6,0),cex.lab=1.2,cex.main=0.9)
    plot(plotdat$cor,plotdat$R2,ylim=c(min(plotdat$R2,plotdat$Q2),1),pch=16,
         xlab="Cor",ylab="Value",col="blue")
    points(plotdat$cor,plotdat$Q2,pch=15,col="red")
    
    lm.r <- lm(I(R2-y1)~I(cor-x1)+0,data=plotdat)
    lm.q <- lm(I(Q2-y2)~I(cor-x1)+0,data=plotdat)
    #lines(plotdat$cor,predict(lm.r,data=plotdat$cor),col="blue",lty=2)
    #lines(plotdat$cor,predict(lm.q,data=plotdat$cor),col="red",lty=2)
    int.R <- predict(lm.r,newdata=list(cor=0))+y1
    int.Q <- predict(lm.q,newdata=list(cor=0))+y2
    
    abline(int.R,coef(lm.r),lty=2,col="blue")
    abline(int.Q,coef(lm.q),lty=2,col="red")
    #abline(lm.q,lty=2,col="red")
    legend("bottomright",pch=c(16,15),legend = c("R2","Q2"),col=c("blue","red"))
    
    title(main = paste("Intercepts:","R2=(0.0,",sprintf("%.4f",int.R),
                       "), Q2=(0.0,",sprintf("%.4f",int.Q),")"))
@

<<eval=TRUE,echo=FALSE,fig.align='center',fig.height=5,fig.width=5,fig.cap=fn("\\label{scoreplotplsda}The score plot for PLS-DA."),warning=FALSE,message=FALSE>>=
library(pls)
result <- plsda.res
sampleList  <- read.delim(para@sampleListFile)
    
peaksData <- para@peaksData
peaksData <- peaksData[peaksData$class %in% c("S","C"),]
peaksData$class <- as.character(peaksData$class)
plsData <- dcast(peaksData,sample+class~ID,
                 value.var = "value")
plsData$class[is.na(plsData$class)] <- "QC"
plotData <- data.frame(x=result$model$scores[,1],
                       y=result$model$scores[,2],
                       sample=plsData$sample,
                       class=result$class$rawLabel)
plotData$class <- as.character(plotData$class)
plotData$class[is.na(plotData$class)] <- "QC" 
sampleList$class <- NULL
plotData <- merge(plotData,sampleList,by="sample",sort=FALSE)
ggobj <-ggplot(data = plotData,aes(x=x,y=y,colour=class))+
        geom_hline(yintercept=0,colour="white",size=1)+
        geom_vline(xintercept=0,colour="white",size=1)+
        geom_point()+

        xlab(paste("PC1"," (",sprintf("%.2f%%",explvar(result$model)[1]),") ",
                   sep=""))+
        ylab(paste("PC2"," (",sprintf("%.2f%%",explvar(result$model)[2]),") ",
                   sep=""))+
        #theme_bw()+
        theme(legend.justification=c(1,1), 
              legend.position=c(1,1),
              panel.grid.major = element_blank(), 
              panel.grid.minor = element_blank(),
              panel.background=element_rect(fill="#E3E3EE"))+
        #theme(legend.direction = 'horizontal', legend.position = 'top')+
        stat_ellipse(geom = "polygon", type="euclid",alpha = 0.4, 
                     aes(fill = class))+
        stat_ellipse(geom = "path",level=0.95)
ggobj <- ggobj + geom_text(aes(label=order),size=4,hjust=-0.2)
print(ggobj)
@   


<<eval=TRUE,echo=FALSE,fig.align='center',fig.height=5,fig.width=5,fig.cap=fn("\\label{loadingplotplsda}The loading plot for PLS-DA."),warning=FALSE,message=FALSE>>=
loadings_data <- as.data.frame(plsda.res$model$loadings[,1:2])
names(loadings_data) <- c("PC1","PC2")
loadings_data <- as.matrix(loadings_data)
    HotEllipse<-abs(cbind(metaX:::HotE(loadings_data[,1],loadings_data[,2])))*0.9
    outliers<-as.numeric()
    for (i in 1:nrow(loadings_data)){

        sample<-abs(loadings_data[i,])
        out.PC1<-which(HotEllipse[,1]<sample[1])
        out.PC1.PC2<-any(HotEllipse[out.PC1,2]<sample[2])*1

        outlier<-ifelse(out.PC1.PC2>0,1,0)#+out.PC1.PC3+out.PC2.PC3>0,1,0)
        outliers<-c(outliers,outlier)
    }

    dat <- as.data.frame(loadings_data)
    dat$label <- row.names(dat)
    dat$label[outliers==0] <- ""
    dat$col <- as.character(outliers)
    dat$alllabel <- row.names(dat)

    gg <- ggplot(data=dat,aes(x=PC1,y=PC2,colour=col))+
        geom_hline(yintercept=0,linetype=2)+
        geom_vline(xintercept=0,linetype=2)+
        geom_point(alpha=0.5)+
        theme_bw()+
        theme(legend.position="none")
    gg <- gg + geom_text(aes(label=label),size=2.5,vjust=0,hjust=0)
    print(gg)
@   


\subsubsection{Assessment of data quality}

In \Rfunction{metaXpipe}, pre- and post-normalization, the data quality is visually assessed in several aspects:
\begin{enumerate}
\item The peak number distribution
\item The number of missing value distribution
\item The boxplot of peak intensity
\item The total peak intensity distribution
\item The correlation heatmap of QC samples if available
\item The metabolite m/z (or mass) distribution
\item The plot of m/z versus retention time
\item The PCA score or loading plot of all samples (only available for post-normalization).
\end{enumerate}

The example figures can be viewed in website \href{http://wenbostar.github.io/metaX/}{http://wenbostar.github.io/metaX/}.

\section{Frequently asked questions}


\subsection{How to select the best number of components for PLS-DA?}
The function \Rfunction{selectBestComponent} can be used to select the best number of components for PLS-DA. This function calculates the R2 and Q2 for each component. 
<<eval=FALSE>>=
para <- new("metaXpara")
pfile <- system.file("extdata/MTBLS79.txt",package = "metaX")
sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX")
rawPeaks(para) <- read.delim(pfile,check.names = FALSE)
sampleListFile(para) <- sfile
para <- reSetPeaksData(para)
para <- missingValueImpute(para)
para <- metaX::normalize(para,method="pqn",valueID="value")
selectBestComponent(para,np=10,sample=c("S","C"),scale="pareto",valueID="value",k=5)
@

\subsection{How to set the comparison groups?}
We can use the following method to set the comparison groups:

<<eval=TRUE>>=
ratioPairs(para) <- "KO:WT"
@

If multiple comparison groups must be set in a single analysis, the user can set
the "para@ratioPairs" like "A:B;C:B;D:B", each comparison group is separated by 
semicolon.
<<eval=TRUE>>=
ratioPairs(para) <- "A:B;C:B;D:B"
@


\subsection{How to set the output file directory?}
The user can set the output directory and the prefix of the output files as below:

<<eval=TRUE>>=
## set the output parameters
outdir(para) <- "test"
prefix(para) <- "metaX"
@









%## set parameters for output
%para@outdir <- "./"
%para@prefix <- "test"
%
%## set comparison group
%para@ratioPairs <- "KO:WT"



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section*{Session information}\label{sec:sessionInfo} 

All software and respective versions used to produce this document are listed below.

<<sessioninfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@

\bibliography{metaX}

\end{document}