---
title: "Analyzing OpenArray Gene Expression Data with OAtools"
author: "Aidan Shea"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_document:
    highlight: pygments
    toc: true
    toc_float: true
    fig_width: 5
    df_print: kable
vignette: >
  %\VignetteIndexEntry{Analyzing OpenArray Gene Expression Data with OAtools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

## Introduction

Highly multiplex PCR platforms enable drastically greater throughput compared 
to traditional PCR methods, making them critical for pathogen detection and 
gene expression experiments in public health and clinical research environments.
One such platform is ThermoFisher Scientific's OpenArray Technology, which our 
lab uses to screen patient swabs for respiratory viral infections. With 3000+ 
qPCR reactions taking place on each OpenArray plate, the sheer volume of data 
produced requires an informatic solution for downstream analyses and reporting.

ThermoFisher provides the proprietary QuantStudio 12K Flex Software and the 
Relative Quantification Application for data analysis. However, no all-in-one
open-source software exists to derive reportable results from raw fluorescence 
values measured on OpenArray.

To bridge the gap, OAtools streamlines the analysis process including data 
import, plotting and quality control, analysis, and communication through an 
R Markdown document. Our goal in sharing OAtools is to support open-source and 
shareable analyses of OpenArray gene expression data, especially in clinical 
research environments.

## Installation

### GitHub Install

To install the development version of *OAtools* directly from the 
uwvirology-ngs public GitHub, (<https://github.com/uwvirology-ngs/OAtools>) 
run the commands below in your R console:

```{r installation-dev, eval=FALSE}
# Install devtools from CRAN
if (!require("devtools", quietly = TRUE)) {
    install.packages("devtools")
}

# Install the development version of OAtools from the UW Virology NGS GitHub
devtools::install_github(
    repo = "uwvirology-ngs/OAtools", 
    dependencies = TRUE, 
    build_vignettes = TRUE
)
```

### Bioconductor Install

Our lab plans to share this project as a Bioconductor package. Once available, 
this package may be installed from Bioconductor by running the following 
commands in the R console:

```{r installation-bioconductor, eval=FALSE}
# install Bioconductor from CRAN
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# install OAtools from Bioconductor
BiocManager::install("OAtools")
```

### Environment Setup

If following along with the vignette by running code blocks in the R console, 
attach the `OAtools` and `SummarizedExperiment` libraries to the environment.
The `SummarizedExperiment` package should already be installed as a dependency.

```{r libraries, message=FALSE}
# attach OAtools and SummarizedExperiment libraries
library(OAtools)
library(SummarizedExperiment)
```

If it is not installed, install dplyr with the following:
```{r install-dplyr, eval=FALSE}
# install dplyr from CRAN
if (!require("dplyr", quietly = TRUE)) {
    install.packages("dplyr")
}
```

## Workflow Overview

*OAtools* supports two core methods for analyzing OpenArray qPCR data:

1.) The first method is to simply carry over the relative threshold analysis 
native to the QuantStudio 12K Flex Software.

2.) The second method is to fit model curves to the fluorescence vs. cycle 
data for each well individually, then determine PCR results which depend on 
features of the curve.

## Import to SummarizedExperiment

To start, we specify the file path to some example OpenArray qPCR run data, 
which is included with the package. This data is a sample from a gene expression
experiment our lab, the University of Washington Virology Lab, ran on human 
nasal swabs to screen for respiratory pathogens. 

For detailed documentation on the example data used in this vignette, run 
`?oa_gene_expression_1` in the R console with OAtools attached.

With the file path specified, we use the `excelToSE()` 
function to read the Excel file containing the run data and load it into a 
SummarizedExperiment object. SummarizedExperiment is an S4 Class from 
Bioconductor used to store matrix-like assay data and associated metadata. 
This function takes optional parameters for the number of header rows, which 
store run metadata, and the number of rows to skip before reading assay data, 
though the defaults will work in this case.

```{r data_import}
# save filepath to example OpenArray gene expression run data
path = system.file(
    "extdata", 
    "oa_gene_expression_1.xlsx", 
    package = "OAtools"
)

# transform the run data into a SummarizedExperiment
se <- excelToSE(excel_path = path)
```

## Experiment Structure

### Assay Matrices

With the data loaded into a SummarizedExperiment container, we can review its 
structure. The assay matrices store observed fluorescence values measured
by the QuantStudio 12K Flex thermocycler by cycle number, where each column 
stores PCR reactions from through-holes on the OpenArray plate separately. 

The `excelToSE()` function generates two assay matrices by default, which are
`fluo_normalized` and `fluo_reporter`. The former refers to baseline-corrected,
normalized fluorescence imported from the 'Amplification Data' tab, while the 
latter refers to the spectral contribution of the reporter dye imported from
the 'Multicomponent Data' tab. We may access the assay matrix corresponding to
the multicomponent data, for example, by invoking `assays(se)$fluo_reporter` 
in the R console, then coercing into a data frame with `as.data.frame()` as 
shown below.

Here we display a slice of the multicomponent fluorescence data from cycles 
11 to 20, where the exponential phase of amplification tends to occur. The rows
of each assay matrix denote the cycle number and the columns denote the 
individual PCR reactions on the OpenArray plate. 

```{r structure-assay_matrix}
# retrieve the assay matrix and display a subset as a data frame
as.data.frame(assays(se)$fluo_reporter) |> 
    dplyr::select(well_2321:well_2386) |>
    dplyr::slice(11:20)
```

### Coldata

Well number, sample, gene, quality control metrics, and various metadata 
associated with each PCR reaction are stored as a DataFrame accessible with
`colData(se)`. The Crt (relative threshold value) and amplification status 
determined by QuantStudio 12K Flex Software are saved here as well. The 
relative threshold value is analogous to a Ct value reported in a typical qPCR
assay. 

Here we display a subset of the column annotation DataFrame associated with 
the first 10 wells. 

```{r structure-coldata}
# retrieve the coldata and render a subset as a data frame
as.data.frame(colData(se)) |> 
    dplyr::select(
        well, sample_name:target_name, reporter, crt, amp_score:amp_status
    ) |> 
    dplyr::slice(1:10)
```

### Rowdata

By contrast, the row annotation DataFrame stores only the cycle numbers, 
which is less interesting by itself. 

```{r structure-rowdata}
# retrieve the rowdata and render a subset as a data frame
as.data.frame(rowData(se)) |> 
    dplyr::slice(1:10)
```

### Metadata

Finally, the file path of the imported data and run-level metadata are stored 
in a list accessible with `metadata(se)`. Here we display the run information. 

```{r structure-metadata}
# render the experiment level metadata
metadata(se)$run_info
```

Model data generated in downstream curve-fitting steps is saved to the 
experiment metadata as well.

## Reporting Results

If we intend to use the relative threshold analysis native to the instrument 
software, we can create a run result report straight away. We can call the 
`generateReport()` function directly on the Summarized Experiment, which then 
dynamically generates a .html document summarizing the results. 

Note that the package *kableExtra* is required to generate the PCR report. 
If it is not installed, install with:

```{r install-kableExtra, eval=FALSE}
# install kableExtra from CRAN
if (!require("kableExtra", quietly = TRUE)) {
    install.packages("kableExtra")
}
```

The following code block generates the PCR report assuming that data has been
loaded in using `excelToSE()`. Note that this code chunk is not wired to 
execute when constructing the vignette as `generateReport()` calls code to 
save the report directly to the user's machine, though it defaults to a 
temporary file.

```{r reporting-results, eval=FALSE}
# generate an HTML report from the run data
generateReport(se = se)
```

At present, this report pulls from the analysis provided by the QuantStudio 12K
Flex software, though an option to build the report from data generated by the
curve-fitting method is planned for future development.

## Analysis by Logistic Regression

The second method of data analysis, as mentioned in the overview, is to fit 
model curves to each PCR reaction curve individually, then use computed 
features to differentiate positive and negative PCR results. In particular, 
the `computeModels()` package function calls an algorithm from SciPy and 
attempts to fit 5-parameter logistic regressions to the fluorescence curves. 
The 5PL model is implemented for its favorable ability to handle asymmetries 
common to amplification curves. 

This approach has the advantages of transparency and reproducibility, 
circumventing the need to rely on propriety algorithms to interpret PCR
results.

### Data Import

Here we start the data analysis again from the point where we have exported 
the gene expression experiment from QuantStudio 12K Flex software. As in the 
previous example, we load the experiment into a SummarizedExperiment object. 

```{r regression-data_import}
# clear the environment
rm(list = ls())

# save filepath to example OpenArray gene expression run data
path = system.file(
    "extdata", 
    "oa_gene_expression_1.xlsx", 
    package = "OAtools"
)

# transform the run data into a SummarizedExperiment
se <- excelToSE(excel_path = path)
```

### Optimizing Models to PCR Curves

Next, we run the optimizer and attempt to fit a 5PL model curve to each PCR 
reaction on the OpenArray plate. As mentioned above, OAtools exports the 
`computeModels()` function, which iterates over the experiment and does
this programmatically. `computeModels()` then saves the parameters of each 
model curve, along with features computed from them, as a list in the 
experiment metadata. If the `fluo_reporter` assay matrix is used, for example, 
the models would be accessible with `metadata(se)$fluo_reporter_models`. 

Under the hood, `computeModels()` invokes a thin R wrapper, `runFitCurve()`, 
around the internal `fit_curve()` function implemented in python3. 
`computeModels()` handles the logic for iterating over the OpenArray plate, 
while `fit_curve()` interfaces with SciPy and actually runs the optimization, 
returning a model curve when provided PCR data (observed fluorescence vs. 
cycle number) for a single reaction. 

A linear_threshold parameter is optionally provided when calling 
`computeModels()`. PCR curves with overall change in fluorescence below this
threshold are marked as negative and fit to linear models instead. The 
rationale is that such PCR curves are trivially negative for the amplicon of 
interest, but add a great deal of compute time to the optimization process and 
are better filtered out. 

```{r regression-fit_curve}
# optimize model curves to each PCR reaction
se <- computeModels(
    se = se,
    assay_name = "fluo_reporter"
)

# display an example model, fit to the PCR reaction in well 2345
metadata(se)$fluo_reporter_models$well_2345
```

### Deriving PCR Results from the Model

Finally, we can leverage the model data we've generated to derive PCR results. 
OAtools exports the `determinePCRResults()` function, which we can invoke on
our SummarizedExperiment to generate such results. `determinePCRResults()` 
requires as input an Excel key. This key associates each gene target with 
threshold values for model features used to separate positive and negative 
PCR results. In particular, values are specified for the cycle threshold, 
midpoint slope, and overall change-in-fluorescence. 

Different thresholds may be specified for distinct target genes to account for 
differences in primer efficiency across assays on the same OpenArray plate.

For documentation and details on the key, run `?OAtools::target_threshold_key` 
in the R console. 

`determinePCRResults()` computes PASS/FAIL values for the cycle threshold, 
midpoint slope, and change-in-fluorescence, marking PCR reactions which pass
on all accounts as positive and all others as negative. These results are saved
to the column annotation DataFrame.

```{r regression-derive_results}
# save filepath to target-threshold-key
key_path = system.file(
    "extdata", 
    "target_threshold_key.xlsx", 
    package = "OAtools"
)

# assign a PCR result according to the key
se <- determinePCRResults(
    se = se, 
    key_path = key_path
)

# render a snapshot including the logic for determining the result
as.data.frame(SummarizedExperiment::colData(se)) |> 
    dplyr::select(
        well, crt, midpoint_slope, delta_fluo, 
        crt_threshold, slope_threshold, delta_threshold, result
    ) |> 
    dplyr::slice_head(n = 10)
```

## Interoperability

### NormqPCR

*OAtools* supports interoperability with the NormqPCR package from Bioconductor,
which provides functions to normalize RT-qPCR data from gene expression 
platforms like OpenArray. *NormqPCR* reads from the qPCRBatch class object, 
which requires the *ReadqPCR* package. We can set up our environment as below:

```{r interop-NormqPCR_setup, message=FALSE, eval=FALSE}
# Install ReadqPCR from Bioconductor
if (!require("ReadqPCR", quietly = TRUE)) {
    BiocManager::install("ReadqPCR")
}

# Install NormqPCR from Bioconductor
if (!require("NormqPCR", quietly = TRUE)) {
    BiocManager::install("NormqPCR")
}
```

Then, to convert the SummarizedExperiment container into a *NormqPCR*-friendly 
instance of the qPCRBatch class, simply invoke the `seToQPCRBatch()` function
on the SummarizedExperiment container. With our qPCRBatch object, we can, for
example, normalize the cycle thresholds against a housekeeping gene.

```{r interop-NormqPCR}
# convert SummarizedExperiment container to qPCRBatch
qpcr <- seToQPCRBatch(se)

# choose housekeeping gene (in this case, the RNAse P control)
housekeeping_gene = "RNAse_P_Pa04930436_g1"

# run delta-Cq calculation
norm <- NormqPCR::deltaCq(
    qPCRBatch = qpcr,
    hkgs = housekeeping_gene
)

# display expression matrix
as.data.frame(Biobase::exprs(norm)) |> 
    dplyr::select(`Pos_Control_A`:`Sample-106`) |> 
    knitr::kable(digits = 2)
    
```

Please refer to the official documentation for *NormqPCR* for more information.
(<https://bioconductor.org/packages/release/bioc/html/NormqPCR.html>) 


## Session Info

```{r sessionInfo}
sessionInfo()
```