%\VignetteEngine{knitr} %\VignetteIndexEntry{metaX tutorial} %\VignetteKeywords{Metabolomics, Quality Control,LC/GC-MS, Statistical analysis} %\VignettePackage{metaX} %\VignetteIndexEntry{Bioconductor LaTeX Style} %\VignettePackage{BiocStyle} \documentclass[12pt]{article} <>= BiocStyle::latex() @ \bioctitle{A short tutorial on using \Biocpkg{metaX} for high-throughput mass spectrometry-based metabolomic data analysis} \author{Bo Wen} \begin{document} \maketitle \tableofcontents <>= suppressPackageStartupMessages(library("metaX")) #suppressPackageStartupMessages(library("R.utils")) @ <>= x <- NULL ii <- 0 # Current figure number fn <- function(x) { ii <<- ii + 1 figString <<- paste('\\textbf{Figure ', ii, '.} ', x, sep = '') return(x) } @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction}\label{sec:intro} The \Biocpkg{metaX} package provides a integrated pipeline for mass spectrometry-based metabolomic data analysis. It includes the stages peak detection, data preprocessing, normalization, missing value imputation, univariate statistical analysis, multivariate statistical analysis such as PCA and PLS-DA, metabolite identification, pathway analysis, power analysis, feature selection and modeling, data quality assessment and HTML-based report generation. This document describes how to use the function included in the R package \Biocpkg{metaX}. \section{Example data}\label{sec:data} We are going to use two public datasets to show the functions in this tutorial. One is from the reference \cite{saghatelian2004assignment}. This data can be accessed through the \Biocpkg{faahKO} package. The samples in this data set can be divided into two groups (group knockout or KO, group wild type or WT) which each group includes six samples. The other is from the reference \cite{kirwan2014direct} and can be downloaded from \href{http://www.ebi.ac.uk/metabolights/MTBLS79}{MetaboLights}. \section{Using metaX}\label{sec:use} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% \subsection{Data import} \Biocpkg{metaX} has been designed to accept diverse data types including compound concentration/intensity tables which generated by \Biocpkg{XCMS}, MZmine \cite{katajamaa2006mzmine}, \href{http://www.nonlinear.com/progenesis/qi/}{Progenesis QI} (csv format) or other software which can be used for peak picking, as well as open file format MS data (such as mzXML, NetCDF). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Input compound intensity table file \subsubsection{Input compound intensity table file} \Biocpkg{metaX} accepts several peak intensity/concentration formats. \begin{itemize} \item \href{http://www.nonlinear.com/progenesis/qi/}{Progenesis QI} peak picking result file (csv format). It can be imported by the following function: << eval=FALSE>>= ## not run para <- importDataFromQI(para,file="qi.csv") @ \item \Biocpkg{XCMS} peak picking result file (txt or csv format). It can be imported by the following function: << eval=FALSE>>= ## not run para <- importDataFromXCMS(para,file="xcms.txt") @ \item \href{http://www.metaboanalyst.ca}{MetaboAnalyst} comatible peak intensity table (csv format). There is an example input file (\href{http://www.metaboanalyst.ca/resources/data/lcms_table.csv}{lcms\_table.csv}) in the website of MetaboAnalyst. It can be imported by the following function: << eval=FALSE>>= ## not run para <- importDataFromMetaboAnalyst(para, file="http://www.metaboanalyst.ca/resources/data/lcms_table.csv") @ \item \Biocpkg{metaX} comatible peak intensity table (txt or csv format), in which both sample or feature names must be unique. In this file, there must be a column named \textbf{"name"} which listed the feature IDs. The other columns contains the intensity/concentration data and the names of the columns are sample names. This data file can be imported by the following method: << eval=TRUE>>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) head(para@rawPeaks[,1:4]) @ \end{itemize} After importing the peak intensity data, the user also need to set the sample list file. A very important step in the \Biocpkg{metaX} pipeline is the definition of a sample list file, that provides the file names (sample), batch number (batch), sample class (class) and the sample injection order (order). Please note that if the sample list file contains quality control (QC) sample, the value in the column of class must be "NA". The sample list file can be set as below: <>= ## set the sample list file path sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") sampleListFile(para) <- sfile ## print the object: para @ The content of this sample list is shown below: <>= library(data.table) ss <- read.delim(para@sampleListFile) head(ss) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% MS Data \subsubsection{Input MS data files} If the user provides MS data (NetCDF, mzXML and so on), \Biocpkg{metaX} uses the \Biocpkg{XCMS} to perform peak picking, followed by the CAMERA \cite{kuhl2011camera} package to perform peak annotation. In this situation, the MS data must be placed in two subdirectories of a single folder like below: <>= list.files(system.file("cdf", package = "faahKO"), recursive = TRUE,full.names = TRUE) @ In the \Biocpkg{metaX} package, it uses a \Rclass{metaXpara-class} object to manage the file path information and other parameters for data processing. We can set the input files path like below: << eval=TRUE>>= ## create a metaXpara-class object #library("metaX") para <- new("metaXpara") ## set the MS data path dir.case(para) <- system.file("cdf/KO", package = "faahKO") dir.ctrl(para) <- system.file("cdf/WT", package = "faahKO") @ After setting the MS data file path, the user also need to set the sample list file: <>= ## set the sample list file path sampleFile <- system.file("extdata/faahKO_sampleList.txt", package = "metaX") sampleListFile(para) <- sampleFile samList <- read.delim(sampleFile) print(samList) @ In general, the user also needs to set several other parameters for peak picking and annotation. Several parameters related to peak picking and annotation must be set. \begin{itemize} \item Peak picking. The peak picking related parameters can be set as below: <>= ## set parameters for peak picking xcmsSet.peakwidth(para) <- c(20,50) xcmsSet.snthresh(para) <- 10 xcmsSet.prefilter(para) <- c(3,100) xcmsSet.noise(para) <- 0 xcmsSet.nSlaves(para) <- 4 @ \item Peak annotation. The peak annotation related parameters can be set as below: <>= ## set parameters for peak annotation group.bw(para) <- 5 group.minfrac(para) <- 0.3 group.mzwid(para) <- 0.015 group.max(para) <- 1000 @ \end{itemize} For the complete parameters, please see the help page of metaXpara-class. In \Biocpkg{metaX}, there is a function \Rfunction{peakFinder()}, which can be used to do the peak picking and annotation. <>= p <- peakFinder(para) @ For the complete parameters, please see the help page of metaXpara-class. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% \subsection{Integrated function \Rfunction{metaXpipe()}} The function metaXpipe automates the whole data analysis process. In general, the user only need to use this function to do most of the analysis. It includes the following steps: \begin{enumerate} \item Peak picking and annotation when input MS data. \item Data pre-processing: Firstly, if an metabolite feature is detected in \textless 50\% of QC samples or detected in \textless 20\% of experimental samples, it is removed from the rest data analysis. \item Missing value imputation. \item Removal of outliers (sample). \item Normalization. \item Feature filter according to the CV (30\%) of features in QC sample. It only works when there are QC samples. \item Univariate statistical analysis, such as \textbf{t test}, \textbf{Mann-Whitney U test} and \textbf{ROC} analysis. \item PCA (score plot, loading plot) \item PLS-DA (score plot, loading plot, permutation test, Q2, R2) \item Assessment of data quality: \begin{enumerate} \item The peak number distribution \item The number of missing value distribution \item The boxplot of peak intensity \item The total peak intensity distribution \item The correlation heatmap of QC samples if available \item Metabolite m/z (or mass) distribution \item Plot of m/z versus retention time \item PCA score or loading plot of all samples \end{enumerate} \end{enumerate} A complete example to show how to run the integrated analysis is shown below: <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile ratioPairs(para) <- "S:C" plsdaPara <- new("plsDAPara") para@outdir <- "output" p <- metaXpipe(para,plsdaPara=plsdaPara,cvFilter=0.3,remveOutlier = TRUE, outTol = 1.2, doQA = TRUE, doROC = TRUE, qcsc = FALSE, nor.method = "pqn", pclean = TRUE, t = 1, scale = "uv",) @ In the above example, the class \Rclass{plsDAPara} from \Biocpkg{metaX} is used to control the parameters for \textbf{PLS-DA} analysis. For the complete parameters, please see the help page of \Rclass{plsDAPara-class}. After the analysis has completed, the file "index.html" in the output directory can be opened in a web browser to access report generated. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% \subsection{Function modules} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% \subsubsection{Peak picking and annotation} In metaX, the function \Rfunction{peakFinder()} can be used to do the peak picking and annotation. It uses the \Biocpkg{XCMS} to perform peak picking, followed by the CAMERA \cite{kuhl2011camera} package to perform peak annotation. \subsubsection{Missing value imputation} Missing value imputation. Missing values is a common phenomenon in a typical quantitative metabolomics dataset. There are several methods provided by \Biocpkg{metaX} to process the missing value. Currently, we implemented a variety of methods which enable users to automatically perform missing value imputation by min, Probabilistic PCA (PPCA), Bayesian PCA (BPCA), k nearest-neighbor (KNN), missForest and Singular Value Decomposition Imputation (SVDImpute). When the user uses the function \Rfunction{metaXpipe} to do the analysis, the following code can be used to set the imputation method: <>= ## bpca, svdImpute, knn, rf, min missValueImputeMethod(para) <- "knn" @ Also, the user can use the function \Rfunction{missingValueImpute} to perform the missing value imputation: <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) ## bpca, svdImpute, knn, rf, min para <- missingValueImpute(para,method="knn") @ \subsubsection{Normalization} Currently, we implemented several methods to perform data normalization, such as the QC-robust spline batch correction (QC-RSC) \cite{dunn2011procedures}, sum, VSN, probabilistic quotient normalization (PQN) \cite{dieterle2006probabilistic}, quantiles and robust quantiles. If there are pooled QC samples in the experiment (the experiment is like figure~\ref{qcdesign}), the function \Rfunction{doQCRLSC} can be used to perform the QC-RSC normalization. This method is implemented to correct data within batch experiment analytical variation, and batch-to-batch variation in large-scale studies. <>= par(mar=c(0,0,0,0)) sname <- c(rep("QC",3),rep("S",3),"...",rep("QC",1),rep("S",3),"...",rep("QC",3)) cname <- ifelse(sname=="QC","green","yellow") xname <- 1:length(sname) plot(xname,rep(1,length(xname)),type="n",axes=FALSE,ylim=c(0.6,1.2)) points(xname,rep(1,length(xname)),pch=22,cex=4.5,col=cname,bg=cname) arrows(x0 = 1,x1 = 4,y0 = 0.8,y1 = 0.8,lwd = 3) text(xname,rep(1,length(xname)),labels = sname) text(1,0.85,labels = "Sample injection order",cex=1,adj = c(0,0)) @ <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE)[1:100,] sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) res <- doQCRLSC(para,cpu=1) @ Except the QC-RSC method, the user can use the other method (PQN, sum et al.) as below: <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) para <- metaX::normalize(para,method="pqn",valueID="value") @ \subsubsection{Pre-processing of raw peak data} In \Biocpkg{metaX}, two functions, \Rfunction{filterQCPeaks()} and \Rfunction{filterPeaks()} can be used to filter features. In general, an metabolite feature is detected in \textless 50\% of QC samples (by using \Rfunction{filterQCPeaks()}) or detected in \textless 20\% (by using \Rfunction{filterPeaks()}) of experimental samples, it is removed from the rest data analysis. <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) p <- filterPeaks(para,ratio=0.2) p <- filterQCPeaks(para,ratio=0.5) @ \subsubsection{Removal of outliers} \Biocpkg{metaX} provides function (Rfunction{autoRemoveOutlier()}) to automatically remove the outlier samples in the pre-processed data based on expansion of the Hotellings T2 distribution ellipse. A sample within the first and second component PCA score plot beyond the expanded ellipse is removed, then the PCA model is recalculated. In default, three rounds of outlier removal are performed. <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) rs <- autoRemoveOutlier(para,valueID="value") @ \subsubsection{Power analysis and sample size estimation} In \Biocpkg{metaX}, the function \Rfunction{powerAnalyst()} can be used to do power analysis and sample size estimation. <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) para <- metaX::normalize(para) para <- transformation(para,valueID = "value") para <- preProcess(para,scale = "pareto",valueID="value") res <- powerAnalyst(para,group=c("C","S"),log=FALSE, maxInd=200,showPlot = TRUE) print(res) @ \subsubsection{Metabolite correlation network analysis} In \Biocpkg{metaX}, the function \Rfunction{plotNetwork()} can be used to do correlation network analysis. <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) op <- par() par(mar=c(0,0,0,0)) gg <- plotNetwork(para,group=c("S","C"),degree.thr = 10, cor.thr = 0.85,showPlot=TRUE) par(op) @ \subsubsection{Metabolite identification} In \Biocpkg{metaX}, the function \Rfunction{metaboliteAnnotation()} can be used to do the metabolite identification. Currently, only HMDB \cite{wishart2012hmdb} is supported. \subsubsection{Functional analysis} At the present, metaX supports the function for metabolite pathway analysis based on IMPaLA \cite{kamburov2011integrated}. The function \Rfunction{pathwayAnalysis()} can be used to do the pathway analysis. <>= res <- pathwayAnalysis(id=c("HMDB00060","HMDB00056","HMDB00064"), outfile="pathway.csv") @ Part of the pathway analysis result is shown below: <>= xtable::xtable(head(res[,2:4])) @ \subsubsection{Biomarker analysis} \Biocpkg{metaX} uses the functions from the R package "caret" to perform the biomarker selection, model creation and performance evaluation. <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE)[1:100,] sampleListFile(para) <- sfile para <- reSetPeaksData(para) res <- featureSelection(para, group=c("S","C"), method = "rf", valueID = "value", fold = 5) @ The biomarker selection result is shown below: <>= xtable::xtable(head(res$results)) @ The best feature(s) is below: <>= print(res$optVariables) @ \subsubsection{Data transformation} There are two methods which can be used to do transformation, log transformation and cube root transformation. The function \Rfunction{transformation()} can be used to do transformation like below: <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) para <- transformation(para,valueID = "value",method=1) @ \subsubsection{Data scaling} There are three methods which can be used to do scaling, "pareto", "vector", "uv". The function \Rfunction{preProcess()} can be used to do scaling like below: <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) para <- metaX::preProcess(para,valueID = "value",scale="uv") @ \subsubsection{Univariate statistical analysis} For univariate statistical analysis, the parametric statistical test (t test), non-parametric statistical test (Mann-Whitney U test), and classical univariate receiver operating characteristic (ROC) curve analysis are implemented. The function \Rfunction{peakStat()} can be used to do these analysis. <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE)[1:50,] sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) ratioPairs(para) <- "S:C" addValueNorm(para) <- para plsdaPara <- new("plsDAPara") plsdaPara@nperm <- 10 res <- peakStat(para,plsdaPara,doROC = TRUE) @ The part of the analysis result is shown below: <>= head(res@quant) @ \subsubsection{PCA} The PCA analysis can be performed by the function \Rfunction{plotPCA()} in \Biocpkg{metaX}. The score plot (as shown in figure~\ref{pcaplot}) for 2D plot, figure~\ref{3dpcaplot}) for 3D plot) and loading plot (as shown in figure~\ref{pcaloadingplot})) are outputed. <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) para <- transformation(para,valueID = "value") metaX::plotPCA(para,valueID="value",scale="pareto",center=TRUE,rmQC = FALSE) @ <>= library(ggplot2) library(reshape2) library(dplyr) pca.res <- readRDS("metaX-pca.rds") x <- para@peaksData x$class <- as.character(x$class) x$class[is.na(x$class)] <- "QC" x<-dcast(x,ID~sample,value.var = "value") row.names(x) <- x$ID x$ID <- NULL plotData <- data.frame(x=pca.res@scores[,1], y=pca.res@scores[,2], z=pca.res@scores[,3], sample=names(x)) sampleList <- read.delim(para@sampleListFile) plotData <- merge(plotData,sampleList,by="sample",sort=FALSE) plotData$class <- as.character(plotData$class) plotData$class[is.na(plotData$class)] <- "QC" plotData$batch <- as.character(plotData$batch) ggobj <-ggplot(data = plotData,aes(x=x,y=y,colour=class))+ geom_hline(yintercept=0,colour="white",size=1)+ geom_vline(xintercept=0,colour="white",size=1)+ #geom_point()+ xlab(paste("PC1"," (",sprintf("%.2f%%",100*pca.res@R2[1]),") ",sep=""))+ ylab(paste("PC2"," (",sprintf("%.2f%%",100*pca.res@R2[2]),") ",sep=""))+ #theme_bw()+ theme(#legend.justification=c(1,1), #legend.position=c(1,1), legend.position="top", panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background=element_rect(fill="#E3E3EE"))+ #theme(legend.direction = 'horizontal', legend.position = 'top')+ stat_ellipse(geom = "polygon", type="euclid",alpha = 0.4, aes(fill = class))+ stat_ellipse(geom = "path") ggobj <- ggobj + geom_text(aes(label=order),size=4,hjust=-0.2) ggobj <- ggobj + geom_point(aes(shape=batch))+ scale_shape_manual(values=1:n_distinct(plotData$batch)) ##print(ggobj) @ <>= library(scatterplot3d) col <- as.numeric(as.factor(plotData$class)) s3d <- scatterplot3d(plotData$x,plotData$y,plotData$z,type="h",angle = 24, col.grid="lightblue",lty.hplot=2,pch="",color="gray", xlab = paste("PC1"," (", sprintf("%.2f%%",100*pca.res@R2[1]),") ",sep=""), ylab = paste("PC2"," (", sprintf("%.2f%%",100*pca.res@R2[2]),") ",sep=""), zlab = paste("PC3"," (", sprintf("%.2f%%",100*pca.res@R2[3]),") ",sep="") )#color = as.numeric(as.factor(plotData$class))) s3d$points(plotData$x,plotData$y,plotData$z, pch = 1,col = col) s3d.coords <- s3d$xyz.convert(plotData$x,plotData$y,plotData$z) text(s3d.coords$x, s3d.coords$y, labels = plotData$order, pos = 4,cex=0.5,col = col) classLabel <- levels(as.factor(plotData$class)) legend(s3d$xyz.convert(max(plotData$x)*0.7, max(plotData$y), min(plotData$z)), col=as.numeric(as.factor(classLabel)), yjust=0,pch=1, legend = classLabel, cex = 0.8) @ <>= loadings_data <- as.data.frame(pca.res@loadings) loadings_data <- as.matrix(loadings_data) HotEllipse<-abs(cbind(metaX:::HotE(loadings_data[,1],loadings_data[,2])))*0.9 outliers<-as.numeric() for (i in 1:nrow(loadings_data)){ sample<-abs(loadings_data[i,]) out.PC1<-which(HotEllipse[,1]0,1,0)#+out.PC1.PC3+out.PC2.PC3>0,1,0) outliers<-c(outliers,outlier) } dat <- as.data.frame(loadings_data) dat$label <- row.names(dat) dat$label[outliers==0] <- "" dat$col <- as.character(outliers) dat$alllabel <- row.names(dat) gg <- ggplot(data=dat,aes(x=PC1,y=PC2,colour=col))+ geom_hline(yintercept=0,linetype=2)+ geom_vline(xintercept=0,linetype=2)+ geom_point(alpha=0.5)+ theme_bw()+ theme(legend.position="none") gg <- gg + geom_text(aes(label=label),size=2.5,vjust=0,hjust=0) print(gg) @ \subsubsection{PLS-DA} The PLS-DA analysis can be performed by the function \Rfunction{runPLSDA()} in \Biocpkg{metaX}. The R2, Q2 and the permutation test p-value are calculated and outputed. The permutation test plot is shown in figure~\ref{permutationtest}. The score plot and loading plot are shown in figure~\ref{scoreplotplsda} and figure~\ref{loadingplotplsda}, respectively. <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) para <- metaX::normalize(para,method="pqn") plsdaPara <- new("plsDAPara") plsdaPara@nperm <- 100 plsda.res <- runPLSDA(para = para,plsdaPara = plsdaPara, sample = c("S","C"),valueID="value") cat("R2Y: ",plsda.res$plsda$res$R2,"\n") cat("Q2Y: ",plsda.res$plsda$res$Q2,"\n") ## permutation test cat("P-value R2Y: ",plsda.res$pvalue$R2,"\n") cat("P-value Q2Y: ",plsda.res$pvalue$Q2,"\n") @ <>= plotdat <- as.matrix(cbind(unlist(plsda.res$plsda$res),plsda.res$plsda$perm)) plotdat <- as.data.frame(t(plotdat)) x1 <- plotdat$cor[1] y1 <- plotdat$R2[1] y2 <- plotdat$Q2[1] plotdat$cor <- abs(plotdat$cor) plotdat <- plotdat[order(plotdat$cor),] par(mar=c(3,3,2,1),mgp=c(1.6,0.6,0),cex.lab=1.2,cex.main=0.9) plot(plotdat$cor,plotdat$R2,ylim=c(min(plotdat$R2,plotdat$Q2),1),pch=16, xlab="Cor",ylab="Value",col="blue") points(plotdat$cor,plotdat$Q2,pch=15,col="red") lm.r <- lm(I(R2-y1)~I(cor-x1)+0,data=plotdat) lm.q <- lm(I(Q2-y2)~I(cor-x1)+0,data=plotdat) #lines(plotdat$cor,predict(lm.r,data=plotdat$cor),col="blue",lty=2) #lines(plotdat$cor,predict(lm.q,data=plotdat$cor),col="red",lty=2) int.R <- predict(lm.r,newdata=list(cor=0))+y1 int.Q <- predict(lm.q,newdata=list(cor=0))+y2 abline(int.R,coef(lm.r),lty=2,col="blue") abline(int.Q,coef(lm.q),lty=2,col="red") #abline(lm.q,lty=2,col="red") legend("bottomright",pch=c(16,15),legend = c("R2","Q2"),col=c("blue","red")) title(main = paste("Intercepts:","R2=(0.0,",sprintf("%.4f",int.R), "), Q2=(0.0,",sprintf("%.4f",int.Q),")")) @ <>= library(pls) result <- plsda.res sampleList <- read.delim(para@sampleListFile) peaksData <- para@peaksData peaksData <- peaksData[peaksData$class %in% c("S","C"),] peaksData$class <- as.character(peaksData$class) plsData <- dcast(peaksData,sample+class~ID, value.var = "value") plsData$class[is.na(plsData$class)] <- "QC" plotData <- data.frame(x=result$model$scores[,1], y=result$model$scores[,2], sample=plsData$sample, class=result$class$rawLabel) plotData$class <- as.character(plotData$class) plotData$class[is.na(plotData$class)] <- "QC" sampleList$class <- NULL plotData <- merge(plotData,sampleList,by="sample",sort=FALSE) ggobj <-ggplot(data = plotData,aes(x=x,y=y,colour=class))+ geom_hline(yintercept=0,colour="white",size=1)+ geom_vline(xintercept=0,colour="white",size=1)+ geom_point()+ xlab(paste("PC1"," (",sprintf("%.2f%%",explvar(result$model)[1]),") ", sep=""))+ ylab(paste("PC2"," (",sprintf("%.2f%%",explvar(result$model)[2]),") ", sep=""))+ #theme_bw()+ theme(legend.justification=c(1,1), legend.position=c(1,1), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background=element_rect(fill="#E3E3EE"))+ #theme(legend.direction = 'horizontal', legend.position = 'top')+ stat_ellipse(geom = "polygon", type="euclid",alpha = 0.4, aes(fill = class))+ stat_ellipse(geom = "path",level=0.95) ggobj <- ggobj + geom_text(aes(label=order),size=4,hjust=-0.2) print(ggobj) @ <>= loadings_data <- as.data.frame(plsda.res$model$loadings[,1:2]) names(loadings_data) <- c("PC1","PC2") loadings_data <- as.matrix(loadings_data) HotEllipse<-abs(cbind(metaX:::HotE(loadings_data[,1],loadings_data[,2])))*0.9 outliers<-as.numeric() for (i in 1:nrow(loadings_data)){ sample<-abs(loadings_data[i,]) out.PC1<-which(HotEllipse[,1]0,1,0)#+out.PC1.PC3+out.PC2.PC3>0,1,0) outliers<-c(outliers,outlier) } dat <- as.data.frame(loadings_data) dat$label <- row.names(dat) dat$label[outliers==0] <- "" dat$col <- as.character(outliers) dat$alllabel <- row.names(dat) gg <- ggplot(data=dat,aes(x=PC1,y=PC2,colour=col))+ geom_hline(yintercept=0,linetype=2)+ geom_vline(xintercept=0,linetype=2)+ geom_point(alpha=0.5)+ theme_bw()+ theme(legend.position="none") gg <- gg + geom_text(aes(label=label),size=2.5,vjust=0,hjust=0) print(gg) @ \subsubsection{Assessment of data quality} In \Rfunction{metaXpipe}, pre- and post-normalization, the data quality is visually assessed in several aspects: \begin{enumerate} \item The peak number distribution \item The number of missing value distribution \item The boxplot of peak intensity \item The total peak intensity distribution \item The correlation heatmap of QC samples if available \item The metabolite m/z (or mass) distribution \item The plot of m/z versus retention time \item The PCA score or loading plot of all samples (only available for post-normalization). \end{enumerate} The example figures can be viewed in website \href{http://wenbostar.github.io/metaX/}{http://wenbostar.github.io/metaX/}. \section{Frequently asked questions} \subsection{How to select the best number of components for PLS-DA?} The function \Rfunction{selectBestComponent} can be used to select the best number of components for PLS-DA. This function calculates the R2 and Q2 for each component. <>= para <- new("metaXpara") pfile <- system.file("extdata/MTBLS79.txt",package = "metaX") sfile <- system.file("extdata/MTBLS79_sampleList.txt",package = "metaX") rawPeaks(para) <- read.delim(pfile,check.names = FALSE) sampleListFile(para) <- sfile para <- reSetPeaksData(para) para <- missingValueImpute(para) para <- metaX::normalize(para,method="pqn",valueID="value") selectBestComponent(para,np=10,sample=c("S","C"),scale="pareto",valueID="value",k=5) @ \subsection{How to set the comparison groups?} We can use the following method to set the comparison groups: <>= ratioPairs(para) <- "KO:WT" @ If multiple comparison groups must be set in a single analysis, the user can set the "para@ratioPairs" like "A:B;C:B;D:B", each comparison group is separated by semicolon. <>= ratioPairs(para) <- "A:B;C:B;D:B" @ \subsection{How to set the output file directory?} The user can set the output directory and the prefix of the output files as below: <>= ## set the output parameters outdir(para) <- "test" prefix(para) <- "metaX" @ %## set parameters for output %para@outdir <- "./" %para@prefix <- "test" % %## set comparison group %para@ratioPairs <- "KO:WT" %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{Session information}\label{sec:sessionInfo} All software and respective versions used to produce this document are listed below. <>= toLatex(sessionInfo()) @ \bibliography{metaX} \end{document}