---
title: 'AWAggregator Vignette'
author:
  - name: Jiahua Tan
    affiliation:
    - &1 "Department of Chemistry, University of British Columbia, Vancouver, BC, Canada"
  - name: Gian L. Negri
    affiliation:
    - &2 "Canada's Michael Smith Genome Sciences Centre, BC Cancer Research Institute, University of British Columbia, Vancouver, BC, Canada"
  - name: Gregg B. Morin
    affiliation:
    - *2
    - &3 "Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada"
  - name: David D. Y. Chen
    affiliation:
    - *1
date: '`r format(Sys.Date(), "%B %e, %Y")`'
package: AWAggregator
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{AWAggregator vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

The `AWAggregator` package implements an attribute-weighted aggregation algorithm that leverages peptide-spectrum match (PSM) attributes to estimate protein abundance more accurately than conventional aggregation methods. The algorithm uses pre-trained random forest models to predict the quantitative inaccuracy of PSMs from their attributes. PSMs are then aggregated to the protein level using a weighted average that takes the predicted inaccuracy into account. The package also lets users construct their own training sets that are more relevant to their specific experimental conditions.

`ExperimentHub` can only retrieve data from the `AWAggregatorData` package with Bioconductor version 3.21 or later. If you are using an earlier Bioconductor version, please use the legacy version of the `AWAggregator` package: https://github.com/Tan-Jiahua/AWAggregator-compat

## Overview of Package Functions

Functions available in the `AWAggregator` package:

- `getDistMetric()`: Calculates the distance metric for PSMs. The distance metric reflects whether the quantified ratio of each pair of samples of a PSM diverges from those of other PSMs in the same redundant/unique group.
  Redundant group, unique group and distance metric were originally defined in the iPQF method. Please refer to "iPQF: a new peptide-to-protein summarization method using peptide spectra characteristics to improve protein quantification" for more details.
- `getPSMAttributes()`: Retrieves attributes required for training or test sets.
- `getAvgScaledErrorOfLog2FC()`: Calculates the Average Scaled Error of log2FC values required for training sets.
- `mergeTrainingSets()`: Extracts a similar number of PSMs from each input dataset and merges them into a single training set.
- `fitQuantInaccuracyModel()`: Trains a random forest model to predict the level of quantitative inaccuracy of PSMs.
- `aggregateByAttributes()`: Aggregates PSMs using a random forest model.
- `convertPDFormat()`: Converts output from Proteome Discoverer into the input format required by `AWAggregator`.

Function available in the associated `AWAggregatorData` package:

- `loadQuantInaccuracyModel()`: Loads a pre-trained random forest model for predicting the level of quantitative inaccuracy of PSMs.

## Overview of Package Data

Data available in the `AWAggregator` package:

- `sample.PSM.FP`: represents sample PSMs mapped to the proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the `psm.tsv` output file generated by FragPipe. Columns unnecessary for the `AWAggregator` have been removed from the sample data.
- `sample.prot.PD`: represents sample proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the TXT export of the proteins page in the Proteome Discoverer search results. Columns unnecessary for the `AWAggregator` have been removed from the sample data.
- `sample.PSM.PD`: represents sample PSMs mapped to the proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the TXT export of the PSMs page in the Proteome Discoverer search results. Columns unnecessary for the `AWAggregator` have been removed from the sample data.


Data available in the associated `AWAggregatorData` package:

- `regr`: represents the pre-trained random forest model that incorporates the average coefficient of variation (CV) as a feature.
- `regr.no.CV`: represents the pre-trained random forest model that does not include the average CV as a feature.
- `benchmark.set.1`, `benchmark.set.2`, `benchmark.set.3`: represent the PSMs in Benchmark Sets 1 to 3, derived from the `psm.tsv` output files generated by FragPipe, which are used to train the random forest model. Columns unnecessary for the `AWAggregator` have been removed from these datasets.

# Installation

The `AWAggregator` package and the associated `AWAggregatorData` package can be installed from Bioconductor.

```{r install, eval=FALSE}
if (!requireNamespace('BiocManager', quietly=TRUE))
    install.packages('BiocManager')

BiocManager::install('AWAggregator')
BiocManager::install('AWAggregatorData')
```

# Workflow Examples

Load the `AWAggregator` package and the `AWAggregatorData` package.

```{r load package}
library(AWAggregator)
library(AWAggregatorData)
```

## Ex.1: Aggregate PSMs from FragPipe Using the Pre-Trained Model.

In this example, we aggregate the reporter ion intensities of PSMs to the protein level. We use the sample dataset `sample.PSM.FP`, included in the `AWAggregator` package and derived from the `psm.tsv` output file generated by FragPipe. This dataset includes reporter ion intensities from nine samples, labeled from `Sample 1` to `Sample 9`, without replicates. The PSMs are mapped to the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, with unnecessary columns removed for clarity. This example demonstrates the basic functionality of the `AWAggregator` package using the default pre-trained model.

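
Conceptually, the aggregation step is a weighted average in which PSMs with a larger predicted inaccuracy contribute less. The following base-R sketch illustrates that idea on a made-up table. It is not the package's implementation: the column `predictedInaccuracy`, the object names, and in particular the inverse-inaccuracy weighting are assumptions chosen for illustration.

```r
# Toy PSM table; every value here is made up for illustration
psmToy <- data.frame(
    Protein=c('P1', 'P1', 'P1', 'P2', 'P2'),
    S1=c(1.0, 1.2, 3.0, 0.8, 1.0),
    S2=c(2.0, 2.2, 0.5, 1.6, 1.4),
    # Hypothetical predicted inaccuracy (larger means less trustworthy)
    predictedInaccuracy=c(0.1, 0.2, 2.0, 0.5, 0.5)
)
samplesToy <- c('S1', 'S2')

# Assumed weighting scheme: inverse of the predicted inaccuracy
psmToy$weight <- 1 / psmToy$predictedInaccuracy

# Weighted average of each sample column within each protein;
# the result is a matrix with proteins in rows and samples in columns
aggregatedToy <- t(vapply(
    split(psmToy, psmToy$Protein),
    function(d) vapply(
        samplesToy,
        function(s) weighted.mean(d[[s]], d$weight),
        numeric(1)
    ),
    numeric(length(samplesToy))
))
aggregatedToy
```

In this toy table, the third PSM of `P1` disagrees with the other two and has a high predicted inaccuracy, so it has little influence on the aggregated value. The chunk that follows runs the actual package workflow with the pre-trained model.
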
```{r aggregate PSMs from FragPipe}
data(sample.PSM.FP)

# Load the pre-trained random forest model without the average CV feature.
# The average CV is the average coefficient of variation (in percent) of the
# processed PSM reporter ion intensities across replicate groups. Use the model
# with the average CV when replicates are available; otherwise, as here, use
# the model without it
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)

# Get sample column names (Sample 1 to Sample 9)
samples <- colnames(sample.PSM.FP)[grep('Sample', colnames(sample.PSM.FP))]
# No replicates: each sample forms its own group
groups <- samples

df <- getPSMAttributes(
    PSM=sample.PSM.FP,
    # TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as
    # fixed post-translational modifications (PTMs)
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)

aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)
```

The output data frame provides the estimated protein abundances.

```
Protein                Sample 1  Sample 2  Sample 3  Sample 4  ...
sp|A0AV96|RBM47_HUMAN  0.9292177 1.0111264 0.7933874 0.9606382 ...
sp|A0AVF1|IFT56_HUMAN  0.6646691 0.6600642 0.6696656 0.7984397 ...
sp|A0AVT1|UBA6_HUMAN   1.1883116 1.1752203 1.0482381 1.0910095 ...
sp|A0FGR8|ESYT2_HUMAN  0.9304190 0.8504465 1.0550898 0.7952998 ...
sp|A0M8Q6|IGLC7_HUMAN  0.4205675 0.6393757 0.7475482 0.6968704 ...
```

## Ex.2: Aggregate PSMs from Proteome Discoverer Using the Pre-Trained Model.

In this example, we convert the search results from Proteome Discoverer to the format required by `AWAggregator` and aggregate the reporter ion intensities of PSMs to the protein level. We use the sample dataset `sample.PSM.PD`, alongside its corresponding protein table `sample.prot.PD`, both included in the `AWAggregator` package. These tables are derived from the TXT exports of the proteins and PSMs pages in the search results from Proteome Discoverer.
This dataset includes reporter ion intensities from nine samples, labeled from `Sample 1` to `Sample 9`, without replicates. The PSM and protein tables contain the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, with unnecessary columns removed for clarity.

```{r aggregate PSMs from Proteome Discoverer}
data(sample.PSM.PD)
data(sample.prot.PD)

# Load the pre-trained random forest model without the average CV feature.
# The average CV is the average coefficient of variation (in percent) of the
# processed PSM reporter ion intensities across replicate groups. Use the model
# with the average CV when replicates are available; otherwise, as here, use
# the model without it
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)

# Get sample column names (Sample 1 to Sample 9)
samples <- colnames(sample.PSM.PD)[grep('Sample', colnames(sample.PSM.PD))]
# No replicates: each sample forms its own group
groups <- samples

# Convert the Proteome Discoverer tables to the AWAggregator input format
df <- convertPDFormat(
    PSM=sample.PSM.PD,
    protein=sample.prot.PD,
    colOfReporterIonInt=samples
)

df <- getPSMAttributes(
    PSM=df,
    # TMT tag and carbamidomethylation are applied as static PTMs
    fixedPTMs=c('TMT6plex', 'Carbamidomethyl'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)

aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)
```

The output data frame provides the estimated protein abundances.

```
Protein              Sample 1  Sample 2  Sample 3  Sample 4  ...
A0AV96_Homo sapiens  0.9392033 0.9514846 0.7096284 0.9393484 ...
A0AVF1_Homo sapiens  0.6591366 0.6534372 0.7121089 0.7741971 ...
A0AVT1_Homo sapiens  1.2035820 1.1647425 1.0494833 1.1121796 ...
A0FGR8_Homo sapiens  0.9664924 0.8391658 1.0946545 0.7832414 ...
A0M8Q6_Homo sapiens  0.3516833 0.4695273 0.7225070 0.6042526 ...
```

## Ex.3: Build a Merged Training Set and Retrain the Model.

Retraining the AWA model with additional spike-in datasets increases the number of quantified PSMs in the merged training set and can therefore improve the robustness of the trained model. Retraining with experiment-specific, in-house spike-in datasets may also benefit the model by better representing the hardware and acquisition modes used. In this example, we create a training set by merging three benchmark spike-in datasets (`benchmark.set.1`, `benchmark.set.2`, and `benchmark.set.3`), all provided by the `AWAggregatorData` package and derived from the `psm.tsv` output files generated by FragPipe. This combined training set is then used to train a random forest model.

### Step 1: Load Spike-in Datasets

We load the spike-in datasets using the `ExperimentHub` package. These datasets correspond to the sets described in the `AWAggregator` publication. You may substitute your own spike-in datasets if desired.

```{r load spike-in datasets}
library(ExperimentHub)

eh <- ExperimentHub()
benchmarkSet1 <- eh[['EH9637']]  # Benchmark Set 1
benchmarkSet2 <- eh[['EH9638']]  # Benchmark Set 2
benchmarkSet3 <- eh[['EH9639']]  # Benchmark Set 3
```

### Step 2: Calculate PSM Attributes and Average Scaled Error of log~2~FC

First, we calculate the attributes and the Average Scaled Error of log~2~FC values for `benchmark.set.1`.

```{r calculate X and y for benchmark.set.1}
library(stringr)

# Get sample column names (Sample 'H1+E1_1' to Sample 'H1+E6_3')
samples <- colnames(benchmarkSet1)[
    grep('H1[+]E[0-9]+_[1-4]', colnames(benchmarkSet1))
]
# Derive the replicate group of each sample, e.g. 'H1+E1' from 'H1+E1_1'
groups <- str_match(samples, 'H1[+]E[0-9]+')[, 1]

PSM1 <- getPSMAttributes(
    PSM=benchmarkSet1,
    # TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as
    # fixed PTMs
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups
)
PSM1 <- getAvgScaledErrorOfLog2FC(
    PSM=PSM1,
    colOfReporterIonInt=samples,
    groups=groups,
    # As the original work indicates, the actual protein fold change may
    # deviate from the intended value after TMT labelling when H1+E6 is
    # involved. H1+E6 is therefore set to NA and not used in the calculation
    # of the Average Scaled Error of log2FC
    expectedRelativeAbundance=list(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA),
    speciesAtConstLevel='HUMAN'
)
```

Second, we calculate the attributes and the Average Scaled Error of log~2~FC values for `benchmark.set.2`. `benchmark.set.2` consists of three separate mass spectrometry runs, indicated by the `Replicate` column. Because of potential run-specific differences, each run is processed individually with `lapply()`, and the results are merged with `bind_rows()`.

```{r calculate X and y for benchmark.set.2}
library(dplyr)

# Get sample column names (Sample 'H1+Y1_1' to Sample 'H1+Y10_3')
samples <- colnames(benchmarkSet2)[
    grep('H1[+]Y[0-9]+_[1-3]', colnames(benchmarkSet2))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]

# Process each replicate separately: lapply() loops over all unique replicate
# IDs in benchmarkSet2, and 'X' is the current replicate ID
tmp <- lapply(unique(benchmarkSet2$Replicate), FUN=function(X){
    # Select PSMs from the current replicate X
    df <- benchmarkSet2[benchmarkSet2$Replicate == X, ]
    df <- getPSMAttributes(
        PSM=df,
        fixedPTMs=c('229.1629', '57.0214'),
        colOfReporterIonInt=samples,
        groups=groups,
        setProgressBar=FALSE
    )
    df <- getAvgScaledErrorOfLog2FC(
        PSM=df,
        colOfReporterIonInt=samples,
        groups=groups,
        expectedRelativeAbundance=list(`H1+Y1`=1, `H1+Y4`=4, `H1+Y10`=10),
        speciesAtConstLevel='HUMAN'
    )
    # Return the processed PSMs from the current replicate
    return(df)
})

# Combine results from all replicates into one data frame
PSM2 <- bind_rows(tmp)
```

Third, we calculate the attributes and the Average Scaled Error of log~2~FC values for `benchmark.set.3`.

```{r calculate X and y for benchmark.set.3}
# Get sample column names (Sample 'H1+Y1_1' to Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]

PSM3 <- getPSMAttributes(
    PSM=benchmarkSet3,
    fixedPTMs=c('304.2071', '125.0476'),
    colOfReporterIonInt=samples,
    groups=groups,
    # The signals of yeast PSMs in group H1+Y0 consist entirely of noise, so
    # this group is not used for calculating the average CV
    groupsExcludedFromCV='H1+Y0'
)
PSM3 <- getAvgScaledErrorOfLog2FC(
    PSM=PSM3,
    colOfReporterIonInt=samples,
    groups=groups,
    expectedRelativeAbundance=list(
        `H1+Y0`=0, `H1+Y1`=1, `H1+Y5`=5, `H1+Y10`=10
    ),
    speciesAtConstLevel='HUMAN'
)
```

### Step 3: Merge Spike-in Datasets as a New Training Set

Next, we merge these three datasets into a new training set. The minimum number of PSMs to extract from each dataset is set to the number of PSMs in the smallest dataset. Because complete sets of PSMs mapped to the selected proteins are extracted, the final PSM count from each dataset is equal to or slightly larger than this preset value.

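
To make the extraction rule described above concrete, here is a minimal base-R sketch of drawing whole proteins at random until a target number of PSMs is reached. The helper `sampleWholeProteins()` and the toy data are hypothetical and are not part of `AWAggregator`; the way `mergeTrainingSets()` actually selects proteins may differ.

```r
# Hypothetical helper (not part of AWAggregator): draw whole proteins in random
# order until at least `numPSMs` PSMs have been collected
sampleWholeProteins <- function(psm, numPSMs) {
    counts <- table(psm$Protein)
    shuffled <- sample(names(counts))
    cumulative <- cumsum(as.integer(counts[shuffled]))
    # Index of the first protein at which the running total reaches the target
    nKeep <- which(cumulative >= numPSMs)[1]
    # If the target exceeds the dataset size, keep every protein
    if (is.na(nKeep)) nKeep <- length(shuffled)
    psm[psm$Protein %in% shuffled[seq_len(nKeep)], , drop=FALSE]
}

# Toy dataset with 14 PSMs mapped to four proteins (made-up values)
set.seed(1)
psmToy <- data.frame(
    Protein=rep(c('A', 'B', 'C', 'D'), times=c(3, 5, 2, 4)),
    Intensity=runif(14)
)

subsetToy <- sampleWholeProteins(psmToy, numPSMs=10)
nrow(subsetToy)
table(subsetToy$Protein)
```

Because proteins are never split, the number of rows returned can exceed `numPSMs` by up to one protein's worth of PSMs, which is why the final counts are only approximately equal across datasets. The package call for the real benchmark sets follows.
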
```{r merge spike-in datasets}
set.seed(1000)
PSM <- mergeTrainingSets(
    PSMList=list(
        `Benchmark Set 1`=PSM1,
        `Benchmark Set 2`=PSM2,
        `Benchmark Set 3`=PSM3
    ),
    numPSMs=min(nrow(PSM1), nrow(PSM2), nrow(PSM3))
)
```

### Step 4: Train a New Random Forest Model

Train a new random forest model using Average CV as an attribute.

```{r train new model with average CV}
regr <- fitQuantInaccuracyModel(PSM, useAvgCV=TRUE, seed=3979)
```

```{r session info}
sessionInfo()
```