---
title: 'AWAggregator Vignette'
author:
- name: Jiahua Tan
  affiliation:
  - &1 "Department of Chemistry, University of British Columbia, Vancouver, BC, 
  Canada"
- name: Gian L. Negri
  affiliation:
  - &2 "Canada's Michael Smith Genome Sciences Centre, BC Cancer Research 
  Institute, University of British Columbia, Vancouver, BC, Canada"
- name: Gregg B. Morin
  affiliation:
  - *2
  - &3 "Department of Medical Genetics, University of British Columbia, 
  Vancouver, BC, Canada"
- name: David D. Y. Chen
  affiliation:
  - *1
date: '`r format(Sys.Date(), "%B %e, %Y")`'
package: AWAggregator
output: 
  BiocStyle::html_document:
    toc: true
vignette: >
    %\VignetteIndexEntry{AWAggregator vignette}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

# Introduction

The `AWAggregator` package implements an attribute-weighted aggregation 
algorithm which leverages peptide-spectrum match (PSM) attributes to provide a 
more accurate estimate of protein abundance compared to conventional 
aggregation methods. This algorithm employs pre-trained random forest models to 
predict the quantitative inaccuracy of PSMs based on their attributes. PSMs are 
then aggregated to the protein level using a weighted average, taking the 
predicted inaccuracy into account. Additionally, the package allows users to 
construct their own training sets that are more relevant to their specific 
experimental conditions if desired.
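
The weighted-average idea can be illustrated with a minimal base R sketch. All values and `toy.*` names below are hypothetical, and the inverse-inaccuracy weighting is an assumption made for illustration only; the weighting actually applied by `aggregateByAttributes()` may differ.

```{r sketch weighted average, eval=FALSE}
# Toy reporter ion intensities for three PSMs of one protein (hypothetical)
toy.int <- matrix(
    c(100, 110,  90,
      200, 190, 210,
       50,  80,  20),
    nrow=3, byrow=TRUE,
    dimnames=list(paste0('PSM', 1:3), paste0('Sample ', 1:3))
)
# Hypothetical predicted inaccuracy for each PSM (larger = less reliable)
toy.inaccuracy <- c(0.1, 0.2, 0.8)
# Assumption: weights are inversely proportional to the predicted inaccuracy
toy.weights <- 1 / toy.inaccuracy
# Weighted average of each sample column across PSMs; the weight vector is
# recycled down the rows, so row i is multiplied by toy.weights[i]
toy.protein <- colSums(toy.int * toy.weights) / sum(toy.weights)
```

In this toy case the third PSM, which has the highest predicted inaccuracy, contributes least to the protein-level estimate.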

Since `ExperimentHub` can only retrieve data from the `AWAggregatorData` 
package with Bioconductor version 3.21 or later, please use the legacy version 
of the `AWAggregator` package if you are using an earlier Bioconductor version: 
https://github.com/Tan-Jiahua/AWAggregator-compat

## Overview of Package Functions

Functions available in the `AWAggregator` package:

-   `getDistMetric()`: Calculates the distance metric for PSMs. The distance 
metric reflects whether the quantified ratio between each pair of samples for a 
PSM diverges from those of other PSMs in the same redundant/unique group. The 
redundant group, unique group, and distance metric were originally defined in 
the iPQF method. Please refer to "iPQF: a new peptide-to-protein summarization 
method using peptide spectra characteristics to improve protein quantification" 
for more details.

-   `getPSMAttributes()`: Retrieves attributes required for training or test 
sets.

-   `getAvgScaledErrorOfLog2FC()`: Calculates the Average Scaled Error of 
log2FC values required for training sets.

-   `mergeTrainingSets()`: Extracts a similar number of PSMs from each input 
dataset and merges them into a single training set.

-   `fitQuantInaccuracyModel()`: Trains a random forest model to predict the 
level of quantitative inaccuracy of PSMs.

-   `aggregateByAttributes()`: Aggregates PSMs using a random forest model.

-   `convertPDFormat()`: Converts output from Proteome Discoverer into the 
input format required by `AWAggregator`.

Function available in the associated `AWAggregatorData` package:

-   `loadQuantInaccuracyModel()`: Loads a pre-trained random forest model for 
predicting the level of quantitative inaccuracy of PSMs.

## Overview of Package Data

Data available in the `AWAggregator` package:

-   `sample.PSM.FP`: sample PSMs mapped to the proteins A0AV96, A0AVF1, A0AVT1, 
A0FGR8, and A0M8Q6, obtained from the `psm.tsv` output file generated by 
FragPipe. Columns not needed by `AWAggregator` have been removed from the 
sample data.

-   `sample.prot.PD`: sample proteins A0AV96, A0AVF1, A0AVT1, A0FGR8, and 
A0M8Q6, obtained from the TXT export of the proteins page in the Proteome 
Discoverer search results. Columns not needed by `AWAggregator` have been 
removed from the sample data.

-   `sample.PSM.PD`: sample PSMs mapped to the proteins A0AV96, A0AVF1, A0AVT1, 
A0FGR8, and A0M8Q6, obtained from the TXT export of the PSMs page in the 
Proteome Discoverer search results. Columns not needed by `AWAggregator` have 
been removed from the sample data.

Data available in the associated `AWAggregatorData` package:

-   `regr`: the pre-trained random forest model that incorporates the average 
coefficient of variation (CV) as a feature.

-   `regr.no.CV`: the pre-trained random forest model that does not include the 
average CV as a feature.

-   `benchmark.set.1`, `benchmark.set.2`, `benchmark.set.3`: PSMs in Benchmark 
Sets 1 to 3, derived from the `psm.tsv` output files generated by FragPipe and 
used to train the random forest model. Columns not needed by `AWAggregator` 
have been removed from the data.

# Installation

The `AWAggregator` package and the associated `AWAggregatorData` package can be 
installed from Bioconductor.

```{r install, eval=FALSE}
if (!requireNamespace('BiocManager', quietly=TRUE))
    install.packages('BiocManager')

BiocManager::install('AWAggregator')
BiocManager::install('AWAggregatorData')
```

# Workflow Examples

Load the `AWAggregator` package and the `AWAggregatorData` package.

```{r load package}
library(AWAggregator)
library(AWAggregatorData)
```

## Ex.1: Aggregate PSMs from FragPipe Using the Pre-Trained Model.

In this example, we aggregate the reporter ion intensities of PSMs to the 
protein level. We use the sample dataset `sample.PSM.FP`, included in the 
`AWAggregator` package and derived from the `psm.tsv` output file generated by 
FragPipe. This dataset includes reporter ion intensities from nine samples, 
labeled from `Sample 1` to `Sample 9`, without replicates. The PSMs are mapped 
to the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, with 
unnecessary columns removed for clarity.

This example demonstrates the basic functionality of the `AWAggregator` package 
using the default pre-trained model.

```{r aggregate PSMs from FragPipe}
# Load the sample PSMs and the pre-trained random forest model without the 
# average CV feature. The average CV is the average CV (in percent) of 
# processed PSM reporter ion intensities across replicate groups. Use the model 
# with the average CV when replicates are available; otherwise, as here, use 
# the model without it
data(sample.PSM.FP)
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.FP)[grep('Sample', colnames(sample.PSM.FP))]
groups <- samples
df <- getPSMAttributes(
    PSM=sample.PSM.FP,
    # TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as 
    # fixed post-translational modifications (PTMs)
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)
aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)
```

The output dataframe will provide estimates of protein abundance.

```         
Protein               Sample 1   Sample 2   Sample 3   Sample 4   ...
sp|A0AV96|RBM47_HUMAN 0.9292177  1.0111264  0.7933874  0.9606382  ...
sp|A0AVF1|IFT56_HUMAN 0.6646691  0.6600642  0.6696656  0.7984397  ...
sp|A0AVT1|UBA6_HUMAN  1.1883116  1.1752203  1.0482381  1.0910095  ...
sp|A0FGR8|ESYT2_HUMAN 0.9304190  0.8504465  1.0550898  0.7952998  ...
sp|A0M8Q6|IGLC7_HUMAN 0.4205675  0.6393757  0.7475482  0.6968704  ...
```
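
The example above has no replicates, so each sample forms its own group. When replicates are available, `groups` should instead assign the same label to replicate samples. A minimal sketch, assuming hypothetical column names of the form `<condition>_<replicate>` (these names are not part of the sample data):

```{r sketch replicate groups, eval=FALSE}
# Hypothetical reporter ion intensity column names with replicates
toy.samples <- c('Ctrl_1', 'Ctrl_2', 'Ctrl_3', 'Treat_1', 'Treat_2', 'Treat_3')
# Strip the trailing replicate index so that replicates share one group label
toy.groups <- sub('_[0-9]+$', '', toy.samples)
toy.groups
# "Ctrl" "Ctrl" "Ctrl" "Treat" "Treat" "Treat"

# With replicate groups defined, the model that uses the average CV would be
# the recommended choice (assumes useAvgCV=TRUE selects that model):
# regr <- loadQuantInaccuracyModel(useAvgCV=TRUE)
```

The resulting vector has one entry per sample column, in the same order as the columns passed to `colOfReporterIonInt`.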

## Ex.2: Aggregate PSMs from Proteome Discoverer Using the Pre-Trained Model.

In this example, we convert the search result from Proteome Discoverer to the 
format required by `AWAggregator` and aggregate the reporter ion intensities of 
PSMs to the protein level. We use the sample dataset `sample.PSM.PD`, alongside 
its corresponding protein table `sample.prot.PD`, both included in the 
`AWAggregator` package. These files are derived from the TXT exports of the 
proteins and PSMs pages in the search results from Proteome Discoverer. This 
dataset includes reporter ion intensities from nine samples, labeled from 
`Sample 1` to `Sample 9`, without replicates. The PSM and protein tables 
contain the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, 
with unnecessary columns removed for clarity.

```{r aggregate PSMs from Proteome Discoverer}
# Load the sample PSM and protein tables and the pre-trained random forest 
# model without the average CV feature. The average CV is the average CV (in 
# percent) of processed PSM reporter ion intensities across replicate groups. 
# Use the model with the average CV when replicates are available; otherwise, 
# as here, use the model without it
data(sample.PSM.PD)
data(sample.prot.PD)
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.PD)[grep('Sample', colnames(sample.PSM.PD))]
groups <- samples
df <- convertPDFormat(
    PSM=sample.PSM.PD,
    protein=sample.prot.PD,
    colOfReporterIonInt=samples
)
df <- getPSMAttributes(
    PSM=df,
    # TMT tag and carbamidomethylation are applied as static PTMs
    fixedPTMs=c('TMT6plex', 'Carbamidomethyl'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)
aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)
```

The output dataframe will provide estimates of protein abundance.

```         
Protein             Sample 1   Sample 2   Sample 3   Sample 4   ...
A0AV96_Homo sapiens 0.9392033  0.9514846  0.7096284  0.9393484  ...
A0AVF1_Homo sapiens 0.6591366  0.6534372  0.7121089  0.7741971  ...
A0AVT1_Homo sapiens 1.2035820  1.1647425  1.0494833  1.1121796  ...
A0FGR8_Homo sapiens 0.9664924  0.8391658  1.0946545  0.7832414  ...
A0M8Q6_Homo sapiens 0.3516833  0.4695273  0.7225070  0.6042526  ...
```

## Ex.3: Build a Merged Training Set and Retrain the Model.

Retraining the AWA model using additional spike-in datasets can increase the 
number of quantified PSMs in the merged training set, and hence the robustness 
of the correlation. In addition, retraining on experiment-specific in-house 
spike-in datasets may benefit the machine learning model by better representing 
the hardware and acquisition modes employed.

In this example, we create a training set by merging three benchmark spike-in 
datasets (`benchmark.set.1`, `benchmark.set.2`, and `benchmark.set.3`), all 
included in the `AWAggregatorData` package and derived from the `psm.tsv` 
output files generated by FragPipe. This combined training set is then used to 
train a random forest model.

### Step 1: Load Spike-in Datasets

We load the spike-in datasets using the `ExperimentHub` package. These datasets 
correspond to the sets described in the `AWAggregator` publication. You may 
substitute your own spike-in datasets if desired.

```{r load spike-in datasets}
library(ExperimentHub)
eh <- ExperimentHub()
benchmarkSet1 <- eh[['EH9637']] # Benchmark Set 1
benchmarkSet2 <- eh[['EH9638']] # Benchmark Set 2
benchmarkSet3 <- eh[['EH9639']] # Benchmark Set 3
```

### Step 2: Calculate PSM Attributes and Average Scaled Error of log~2~FC

Firstly, we calculate the attributes and the values of Average Scaled Error of 
log~2~FC in `benchmark.set.1`.

```{r calculate X and y for benchmark.set.1}
library(stringr)

# Load sample names (Sample 'H1+E1_1' ~ Sample 'H1+E6_3')
samples <- colnames(benchmarkSet1)[
    grep('H1[+]E[0-9]+_[1-4]', colnames(benchmarkSet1))
]
groups <- str_match(samples, 'H1[+]E[0-9]+')[, 1]
PSM1 <- getPSMAttributes(
    PSM=benchmarkSet1,
    # TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as 
    # fixed PTMs
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups
)
PSM1 <- getAvgScaledErrorOfLog2FC(
    PSM=PSM1,
    colOfReporterIonInt=samples,
    groups=groups,
    # As the original work indicates, the actual protein fold change may 
    # deviate from the intended value after TMT labelling when H1+E6 is 
    # involved. H1+E6 is therefore set to NA and not used in the calculation 
    # of the Average Scaled Error of log2FC
    expectedRelativeAbundance=list(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA),
    speciesAtConstLevel='HUMAN'
)
```
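
To make the role of `expectedRelativeAbundance` more concrete, the sketch below derives the expected log~2~FC between groups from the values passed above. It is an illustration in base R only, not the package's implementation; the `toy.*` names are hypothetical, and the exact scaling used for the Average Scaled Error is not reproduced here.

```{r sketch expected log2FC, eval=FALSE}
# Expected relative abundance of the spiked-in species in each group
# (values taken from the call above; NA marks a group to be skipped)
toy.abundance <- c(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA)
# Keep only groups with a known expected abundance
toy.known <- toy.abundance[!is.na(toy.abundance)]
# Enumerate all pairs of the remaining groups (one column per pair)
toy.pairs <- combn(names(toy.known), 2)
# Expected log2 fold change for each pair (second group over first)
toy.log2FC <- log2(toy.known[toy.pairs[2, ]] / toy.known[toy.pairs[1, ]])
names(toy.log2FC) <- paste(toy.pairs[2, ], toy.pairs[1, ], sep=' / ')
toy.log2FC
```

Here only the pair `H1+E2 / H1+E1` remains, with an expected log~2~FC of 1. Proteins from the species given in `speciesAtConstLevel` are presumably expected to show a log~2~FC of 0 between any two groups.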

Secondly, we calculate the attributes and the values of Average Scaled Error of 
log~2~FC in `benchmark.set.2`. `benchmark.set.2` consists of three separate 
mass spectrometry runs, indicated by the `Replicate` column. Because of 
potential run-specific differences, each run is processed individually with 
`lapply()`, and the results are merged with `bind_rows()`.

```{r calculate X and y for benchmark.set.2}
library(dplyr)

# Load sample names (Sample 'H1+Y1_1' ~ Sample 'H1+Y10_3')
samples <- colnames(benchmarkSet2)[
    grep('H1[+]Y[0-9]+_[1-3]', colnames(benchmarkSet2))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]

# Process each replicate separately using lapply()
# lapply() loops over all unique replicate IDs in benchmarkSet2.
# 'X' is the current replicate ID.
tmp <- lapply(unique(benchmarkSet2$Replicate), FUN=function(X){
    # Select PSMs from the current replicate X
    df <- benchmarkSet2[benchmarkSet2$Replicate == X, ]
    df <- getPSMAttributes(
        PSM=df,
        fixedPTMs=c('229.1629', '57.0214'),
        colOfReporterIonInt=samples,
        groups=groups,
        setProgressBar=FALSE
    )
    df <- getAvgScaledErrorOfLog2FC(
        PSM=df,
        colOfReporterIonInt=samples,
        groups=groups,
        expectedRelativeAbundance=list(`H1+Y1`=1, `H1+Y4`=4, `H1+Y10`=10),
        speciesAtConstLevel='HUMAN'
    )
    # Return the processed PSMs from the current replicate
    return(df)
})
# Combine results from all replicates into one dataframe
PSM2 <- bind_rows(tmp)
```
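
The per-run processing above follows a general split-apply-combine pattern. A minimal base R sketch on a toy table shows the same idea; the `Intensity` column, the median scaling step, and all `toy.*` names are hypothetical stand-ins for the package calls.

```{r sketch per-run processing, eval=FALSE}
# Toy PSM table with a Replicate column identifying the run (hypothetical)
toy.PSM <- data.frame(
    Replicate=c(1, 1, 2, 2, 3),
    Intensity=c(10, 30, 5, 15, 8)
)
# Split the table by run
toy.runs <- split(toy.PSM, toy.PSM$Replicate)
# Process each run on its own
toy.processed <- lapply(toy.runs, function(run) {
    # Run-specific step: scale intensities by the median of the same run
    run$Scaled <- run$Intensity / median(run$Intensity)
    run
})
# Stack the processed runs back into one data frame
toy.merged <- do.call(rbind, toy.processed)
```

Because the scaling uses only values from the same run, a run with systematically higher or lower intensities does not influence the others.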

Thirdly, we calculate the attributes and the values of Average Scaled Error of 
log~2~FC in `benchmark.set.3`.

```{r calculate X and y for benchmark.set.3}
# Load sample names (Sample 'H1+Y1_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
PSM3 <- getPSMAttributes(
    PSM=benchmarkSet3,
    fixedPTMs=c('304.2071', '125.0476'),
    colOfReporterIonInt=samples,
    groups=groups,
    # The signals for yeast PSMs in group H1+Y0 are entirely noise, so this 
    # group is not used for calculating the average CV
    groupsExcludedFromCV='H1+Y0'
)
PSM3 <- getAvgScaledErrorOfLog2FC(
    PSM=PSM3,
    colOfReporterIonInt=samples,
    groups=groups,
    expectedRelativeAbundance=list(
        `H1+Y0`=0, `H1+Y1`=1, `H1+Y5`=5, `H1+Y10`=10
    ),
    speciesAtConstLevel='HUMAN'
)
```

### Step 3: Merge Spike-in Datasets as a New Training Set

Next, we merge these three datasets into a new training set. The minimum number 
of PSMs to extract from each dataset is set to the number of PSMs in the 
smallest set. Because complete sets of PSMs mapped to the selected proteins are 
extracted, the final PSM count from each set is equal to or slightly larger 
than the preset value.

```{r merge spike-in datasets}
set.seed(1000)
PSM <- mergeTrainingSets(
    PSMList=list(
        `Benchmark Set 1`=PSM1,
        `Benchmark Set 2`=PSM2,
        `Benchmark Set 3`=PSM3
    ),
    numPSMs=min(nrow(PSM1), nrow(PSM2), nrow(PSM3))
)
```
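
The reason the extracted counts can slightly exceed the preset value is that PSMs are taken protein by protein. The sketch below illustrates one way such protein-wise sampling could work; it is an assumption for illustration and not the implementation of `mergeTrainingSets()`. All `toy.*` names and values are hypothetical.

```{r sketch protein-wise sampling, eval=FALSE}
set.seed(1)
# Toy PSM table: protein P1 has 3 PSMs, P2 has 2, P3 has 4 (hypothetical)
toy.set <- data.frame(Protein=rep(c('P1', 'P2', 'P3'), times=c(3, 2, 4)))
# Preset minimum number of PSMs to extract
toy.target <- 4
# Visit proteins in random order and count their PSMs cumulatively
toy.order <- sample(unique(toy.set$Protein))
toy.counts <- table(toy.set$Protein)[toy.order]
toy.cum <- cumsum(as.integer(toy.counts))
# Keep proteins up to and including the first one that reaches the target
toy.keep <- toy.order[seq_len(which(toy.cum >= toy.target)[1])]
# Extract every PSM of the kept proteins, never a partial protein
toy.subset <- toy.set[toy.set$Protein %in% toy.keep, , drop=FALSE]
nrow(toy.subset)
```

Whatever the random order, the subset contains at least `toy.target` PSMs, and overshoots only by the PSMs of the last protein added.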

### Step 4: Train a New Random Forest Model

Train a new random forest model using Average CV as an attribute.

```{r train new model with average CV}
regr <- fitQuantInaccuracyModel(PSM, useAvgCV=TRUE, seed=3979)
```

```{r session info}
sessionInfo()
```