---
title: "Visualization and exploration"
author: 
    - name: Tram Nguyen
      affiliation: Department of Biomedical Informatics, Harvard Medical School
      email: Tram_Nguyen@hms.harvard.edu
    - name: Pascal Notin
      affiliation: Department of Systems Biology, Harvard Medical School
    - name: Aaron W Kollasch
      affiliation: Department of Systems Biology, Harvard Medical School
    - name: Debora Marks
      affiliation: Department of Systems Biology, Harvard Medical School
    - name: Ludwig Geistlinger
      affiliation: Department of Biomedical Informatics, Harvard Medical School
package: ProteinGymR
output:
    BiocStyle::html_document:
      self_contained: yes 
      toc: true
      toc_float: true
      toc_depth: 2
      code_folding: show
date: "June 18, 2025"
bibliography: references.bib
vignette: >
    %\VignetteIndexEntry{Visualization and exploration"}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
editor_options: 
    markdown: 
      wrap: 80
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE,
                      message = FALSE)
```
[CH]: https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html
[r3d]: https://cran.r-project.org/web/packages/r3dmol/vignettes/r3dmol.html
[pg_publication]: https://proceedings.neurips.cc/paper_files/paper/2023/hash/cac723e5ff29f65e3fcbb0739ae91bee-Abstract-Datasets_and_Benchmarks.html
[physiochem]: https://biology.stackexchange.com/questions/105321/arrangement-of-amino-acids-in-the-protein-alphabet
# Setup
```{r, message=FALSE, warning=FALSE}
library(ProteinGymR)
library(ComplexHeatmap)
library(stringr)
library(dplyr)
library(ggplot2)
library(ggExtra)
```
# Introduction 
This vignette demonstrates how to explore and visualize experimental deep 
mutational scanning (DMS) fitness scores and variant effect prediction model 
scores from the ProteinGym database [@Notin2023]. Specifically, it walks through
the functionality to generate heatmaps of DMS scores for all possible amino acid
substitutions and projects these scores onto 3D protein structures. 
`ProteinGymR` uses functionality from the Bioconductor package 
[ComplexHeatmap][CH] and [r3dmol][r3d] from CRAN under the hood to generate the
heatmaps and 3D protein structures, respectively. Finally, this vignette 
demonstrates how to contrast variant effects predictions with experimental 
measurements. 
# Visualize DMS scores along a protein sequence
Here, we explore the "ACE2_HUMAN" DMS assay from @Chan2020 and create a heatmap
of the DMS scores with `plot_dms_heatmap()`. If the argument `dms_data` is not 
specified, the default will load the most recent DMS substitution data from 
ProteinGym with `dms_substitutions()`. This function only requires
a specific assay name. To obtain all assay names, run: 
`names(dms_substitutions())`. By default, the function also plots the full range
of positions where DMS data is available for this assay. To plot a specific 
region of interest, use the arguments `start_pos` and `end_pos` which takes
in an integer for the first and last residue position to plot in the protein.
```{r ACE default heatmap, fig.wide = TRUE}
ace2_dms <- plot_dms_heatmap(
    assay_name = "ACE2_HUMAN_Chan_2020",
    start_pos = 1, 
    end_pos = 100)
ace2_dms
```
The heatmap shows the DMS score at each position along the given protein 
sequence (x-axis) where a residue was mutated (y-axis: substituted amino acid, 
top; x-axis: reference amino acid). For demonstration, we subset to the first 
1-100 positions and grouped the amino acids by their physiochemical properties 
(DE,KRH,NQ,ST,PGAVIL,MC,FYW). See [here][physiochem] for more information. Note 
that not all positions along the protein sequence may be subjected to mutation 
for every DMS assay. This results from the specific research objectives, 
prioritization choices of the investigators, or technical constraints inherent 
to the experimental design. 
A low DMS score indicates low fitness, while a higher DMS score indicates high 
fitness. We can think of higher DMS scores as being more benign, while lower DMS
score indicates more pathogenic regions.
Based on the "ACE2_HUMAN_Chan_2020" assay, virtually all possible amino acid 
changes at positions 90 and 92 lead to higher fitness; possibly 
suggestive of a benign region of the protein. However, several mutations at 
position 48 resulted in low fitness. This could represent an important region 
for protein function where any perturbation would likely be deleterious.
Let's plot another assay, specifying a region and invoking the `ComplexHeatmap`
row clustering under the hood. For more details about this clustering method or
to view more function parameters, refer to the documentation of the 
`plot_dms_heatmap` function.
```{r SHOC2 heatmap, fig.wide = TRUE}
shoc2_dms <- plot_dms_heatmap(assay_name = "SHOC2_HUMAN_Kwon_2022", 
    start_pos = 10,
    end_pos = 60,
    cluster_rows = TRUE)
shoc2_dms
```
For example, in this region of the SHOC2_HUMAN protein, mutating to a Lysine (K, y-axis) resulted more frequently in higher fitness.
# Visualize model scores along a protein
ProteinGymR provides functionality to generate heatmaps of zero-shot
mode scores for 79 variant effect prediction models and 11 semi-supervised 
models with the function `plot_zeroshot_heatmap()`. The required arguments 
for this function are the assay name to plot (same as for the DMS heatmap), 
and a model to plot. For a complete list of models, run `available_models()` for
zero-shot models, and `supervised_available_models()` for the 11 semi-supervised
models. If `model_data` is not provided, the default model scores from 
ProteinGym will be loaded in from default model scores from 
`zeroshot_substitutions()`.
```{r, fig.width = 9, fig.height = 7}
ace2_model <- plot_zeroshot_heatmap(
    assay_name = "ACE2_HUMAN_Chan_2020", 
    model = "GEMME",
    start_pos = 1,
    end_pos = 100)
ComplexHeatmap::draw(ace2_dms %v% ace2_model)
```
As for DMS scores, we are plotting the GEMME zero-shot scores for 
positions 1 to 100 in the assay "ACE2_HUMAN_Chan_2020". At first glance, both
the DMS data and GEMME model reveal position 48 to be quite pathogenic across 
amino acid substitutions. Note that the model scores here are mostly negative; 
however because these are model prediction scores, negative values do not 
necessarily indicate lower fitness after mutation as with DMS scores.
Thus, model scores are always represented with another color palette to 
distinguish from experimental scores. Note further that model scores are not
rescaled or normalized across the 79 models, and comparison of the predicted 
scores between models is thus not straightforward. See [@Notin2023] for more 
information on model scores and how to interpret them.
It can be useful to visualize the DMS and model scores side by side for a given 
assay to compare the experimental DMS scores and predicted zero-shot scores 
outputted from the model. This is easily done with `%v%` which stacks the 
heatmaps in one column, while `+` will display them in two columns, side by 
side. This functionality is available for all Heatmap objects generated with 
the ComplexHeatmap package.
# 3D protein structure
This section demonstrates how to explore and visualize DMS or model scores on 
a 3D protein structure using the package r3dmol under the hood. The function 
requires DMS or model assay to aggregate scores that will be projected onto the
3D structure.
By default, if no `data_scores` argument is provided, the DMS substitutions from 
`dms_substitutions()` are loaded in, or if viewing model scores, 
set this argument to any model available in ProteinGym v1.2. Get a list of 
zero-shot and semi-supervised models with `available_models()` and 
`supervised_available_model()`.
If a model is chosen, a helper function is invoked which normalizes the 
model prediction scores using a rank-based normal quantile transformation. 
The result is a set of normalized scores that preserve the rank order of the 
models scores, while standardizing the distribution. Transformed values 
typically fall between -3 and 3. This normalization ensures the scores are 
approximately standard normally distributed (mean = 0, SD = 1), allowing 
comparisons across models.
The user may also specify what aggregation method to use for 
calculating the summary statistic at each residue position. By default, 
the mean DMS score/model prediction score is calculated for each position. 
See the function documentation for details: `?plot_structure()`
First, let's use all the default settings. The only required arguments are
the `assay_name`.
Importantly, because the plot shows one protein structure, all DMS fitness 
scores across amino acids are aggregated within a position. By default this 
aggregation function is just the average of all the DMS scores at that position.
However, it is possible to set any user-defined aggregation function with the 
`aggregation_func` argument. 
For DMS assays, a score of zero will always be represented as white, 
corresponding to the biological interpretation of neutral fitness effect.
```{r, fig.wide = TRUE}
plot_structure(assay_name = "ACE2_HUMAN_Chan_2020")
```
In this example, we are plotting the 3D structure of the ACE2_HUMAN protein and
overlaying the mean DMS score across all mutants in a given position. @Chan2020 
who generated the DMS assay data only experimentally assessed a subset
of the entire ACE2_HUMAN protein. By default, the function only colors the 
regions where there is information available in the assay. Red colors represent
more pathogenic (lower DMS scores) and blue colors show more benign positions
(higher DMS scores). Regions that appear white indicate closer to no change 
before and after the DMS perturbation. Grey regions represent the range of the 
protein assessed in the assay; however, only the colored regions include DMS 
data. Finally, by default, regions of the protein itself outside the range of 
the experimental assay have the "ball and stick" representation.
We can also overlap model scores from the any of our zero-shot or 
semi-supervised models. Do this by setting `data_scores` argument to any model 
string matching `available_models()` or `supervised_available_models()`. 
Here, let's demonstrate plotting the "Kermut" model and also allowing the 
full visualization of the complete protein structure, rather than just the
"ball and stick" representation. This can be done by setting the argument `full_structure` to `TRUE`.
```{r, fig.wide = TRUE}
plot_structure(assay_name = "ACE2_HUMAN_Chan_2020", 
    data_scores = "Kermut",
    full_structure = TRUE)
```
Now we can more clearly see the entire protein structure for ACE2_HUMAN in the 
ribbon representation, and we have overlaid the model prediction scores from the
Kermut model.
Some assays extensively assessed nearly every position of the complete protein, 
for example: the C6KNH7_9INFA protein from [@Lee2018]. Let's visualize 
this protein and set the aggregation method to view the minimum DMS score across
all mutants at each position by setting `aggregation_fun = min`. To view a 
specific region in detail: use `start_pos` and `end_pos`.
```{r, fig.wide = TRUE}
plot_structure(assay_name = "C6KNH7_9INFA_Lee_2018", 
    aggregate_fun = min)
```
As we might expect, the minimum DMS value (more pathogenic) is almost always a 
negative number across all positions of this protein. Therefore, there seems to 
be at least one amino acid mutation that could severely disrupt the fitness at
any position of this protein.
Finally, it is possible to use the same color scheme as the [popEVE 
mutation portal](https://pop.evemodel.org/). We can do this for any of 
the heatmaps or protein structure plots. This can be done by setting the argument `color_scheme` to "EVE".
# Correlate DMS scores with model scores
The `dms_corr_plot()` function allows the user to evaluate the correlation 
between experimental and model prediction scores. By default, it takes in a
protein UniProt ID and runs a Spearman correlation between the ProteinGym DMS 
assay scores and AlphaMissense predicted pathogenicity scores. However, as with `plot_structure()`, other models benchmarked in ProteinGym can also be specified via the `model` argument.
```{r, warning = FALSE, fig.wide = TRUE}
dms_corr_plot(uniprotId = "Q9NV35")
```
By default, the `dms_corr_plot()` function gathers any of the 217 DMS assays of 
the chosen UniProt ID and correlates the average DMS score across relevant
assays and the AlphaMissense model predictions.
Although the default uses the AlphaMissense scores, it is simple to correlate 
DMS experimental scores with predictions from any of the 79 zero-shot 
or 11 supervised models. Below is an example of the workflow to accomplish this for the UniProt ID "Q9NV35" as before.
# Correlate prediction scores between two models. 
Similar to the above, we can also explore the correlation between two different 
models for a given protein instead of looking at the DMS experimental data. 
We can do this for the protein "P04637" and the `model_corr_plot()` function.
By default, the function only requires a UniProt ID, and uses "AlphaMissense"
and "EVE_single" models as defaults. Let's change that to 
"Kermut" and "ProteinNPT" for our demonstration.
```{r, fig.wide = TRUE}
model_corr_plot(
     uniprotId = "P04637",
     model1 = "Kermut",
     model2 = "ProteinNPT"
)
```
There seems to be good correlation between
the model predictions for all variants in assays assessing the "P04637" protein.
# Session Info
```{r}
sessionInfo()
```
# References