--- title: "Analyzing OpenArray Gene Expression Data with OAtools" author: "Aidan Shea" date: "`r Sys.Date()`" output: rmarkdown::html_document: highlight: pygments toc: true toc_float: true fig_width: 5 df_print: kable vignette: > %\VignetteIndexEntry{Analyzing OpenArray Gene Expression Data with OAtools} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction Highly multiplex PCR platforms enable drastically greater throughput compared to traditional PCR methods, making them critical for pathogen detection and gene expression experiments in public health and clinical research environments. One such platform is ThermoFisher Scientific's OpenArray Technology, which our lab uses to screen patient swabs for respiratory viral infections. With 3000+ qPCR reactions taking place on each OpenArray plate, the sheer volume of data produced requires an informatic solution for downstream analyses and reporting. ThermoFisher provides the proprietary QuantStudio 12K Flex Software and the Relative Quantification Application for data analysis. However, no all-in-one open-source software exists to derive reportable results from raw fluorescence values measured on OpenArray. To bridge the gap, OAtools streamlines the analysis process including data import, plotting and quality control, analysis, and communication through an R Markdown document. Our goal in sharing OAtools is to support open-source and shareable analyses of OpenArray gene expression data, especially in clinical research environments. ## Installation ### GitHub Install To install the development version of *OAtools* directly from the uwvirology-ngs public GitHub, () run the commands below in your R console: ```{r installation-dev, eval=FALSE} # Install devtools from CRAN if (!require("devtools", quietly = TRUE)) { install.packages("devtools") } # Install the development version of OAtools from the UW Virology NGS GitHub devtools::install_github( repo = "uwvirology-ngs/OAtools", dependencies = TRUE, build_vignettes = TRUE ) ``` ### Bioconductor Install Our lab plans to share this project as a Bioconductor package. Once available, this package may be installed from Bioconductor by running the following commands in the R console: ```{r installation-bioconductor, eval=FALSE} # install Bioconductor from CRAN if (!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } # install OAtools from Bioconductor BiocManager::install("OAtools") ``` ### Environment Setup If following along with the vignette by running code blocks in the R console, attach the `OAtools` and `SummarizedExperiment` libraries to the environment. The `SummarizedExperiment` package should already be installed as a dependency. ```{r libraries, message=FALSE} # attach OAtools and SummarizedExperiment libraries library(OAtools) library(SummarizedExperiment) ``` If it is not installed, install dplyr with the following: ```{r install-dplyr, eval=FALSE} # install dplyr from CRAN if (!require("dplyr", quietly = TRUE)) { install.packages("dplyr") } ``` ## Workflow Overview *OAtools* supports two core methods for analyzing OpenArray qPCR data: 1.) The first method is to simply carry over the relative threshold analysis native to the QuantStudio 12K Flex Software. 2.) The second method is to fit model curves to the fluorescence vs. cycle data for each well individually, then determine PCR results which depend on features of the curve. ## Import to SummarizedExperiment To start, we specify the file path to some example OpenArray qPCR run data, which is included with the package. This data is a sample from a gene expression experiment our lab, the University of Washington Virology Lab, ran on human nasal swabs to screen for respiratory pathogens. For detailed documentation on the example data used in this vignette, run `?oa_gene_expression_1` in the R console with OAtools attached. With the file path specified, we use the `excelToSE()` function to read the Excel file containing the run data and load it into a SummarizedExperiment object. SummarizedExperiment is an S4 Class from Bioconductor used to store matrix-like assay data and associated metadata. This function takes optional parameters for the number of header rows, which store run metadata, and the number of rows to skip before reading assay data, though the defaults will work in this case. ```{r data_import} # save filepath to example OpenArray gene expression run data path = system.file( "extdata", "oa_gene_expression_1.xlsx", package = "OAtools" ) # transform the run data into a SummarizedExperiment se <- excelToSE(excel_path = path) ``` ## Experiment Structure ### Assay Matrices With the data loaded into a SummarizedExperiment container, we can review its structure. The assay matrices store observed fluorescence values measured by the QuantStudio 12K Flex thermocycler by cycle number, where each column stores PCR reactions from through-holes on the OpenArray plate separately. The `excelToSE()` function generates two assay matrices by default, which are `fluo_normalized` and `fluo_reporter`. The former refers to baseline-corrected, normalized fluorescence imported from the 'Amplification Data' tab, while the latter refers to the spectral contribution of the reporter dye imported from the 'Multicomponent Data' tab. We may access the assay matrix corresponding to the multicomponent data, for example, by invoking `assays(se)$fluo_reporter` in the R console, then coercing into a data frame with `as.data.frame()` as shown below. Here we display a slice of the multicomponent fluorescence data from cycles 11 to 20, where the exponential phase of amplification tends to occur. The rows of each assay matrix denote the cycle number and the columns denote the individual PCR reactions on the OpenArray plate. ```{r structure-assay_matrix} # retrieve the assay matrix and display a subset as a data frame as.data.frame(assays(se)$fluo_reporter) |> dplyr::select(well_2321:well_2386) |> dplyr::slice(11:20) ``` ### Coldata Well number, sample, gene, quality control metrics, and various metadata associated with each PCR reaction are stored as a DataFrame accessible with `colData(se)`. The Crt (relative threshold value) and amplification status determined by QuantStudio 12K Flex Software are saved here as well. The relative threshold value is analogous to a Ct value reported in a typical qPCR assay. Here we display a subset of the column annotation DataFrame associated with the first 10 wells. ```{r structure-coldata} # retrieve the coldata and render a subset as a data frame as.data.frame(colData(se)) |> dplyr::select( well, sample_name:target_name, reporter, crt, amp_score:amp_status ) |> dplyr::slice(1:10) ``` ### Rowdata By contrast, the row annotation DataFrame stores only the cycle numbers, which is less interesting by itself. ```{r structure-rowdata} # retrieve the rowdata and render a subset as a data frame as.data.frame(rowData(se)) |> dplyr::slice(1:10) ``` ### Metadata Finally, the file path of the imported data and run-level metadata are stored in a list accessible with `metadata(se)`. Here we display the run information. ```{r structure-metadata} # render the experiment level metadata metadata(se)$run_info ``` Model data generated in downstream curve-fitting steps is saved to the experiment metadata as well. ## Reporting Results If we intend to use the relative threshold analysis native to the instrument software, we can create a run result report straight away. We can call the `generateReport()` function directly on the Summarized Experiment, which then dynamically generates a .html document summarizing the results. Note that the package *kableExtra* is required to generate the PCR report. If it is not installed, install with: ```{r install-kableExtra, eval=FALSE} # install kableExtra from CRAN if (!require("kableExtra", quietly = TRUE)) { install.packages("kableExtra") } ``` The following code block generates the PCR report assuming that data has been loaded in using `excelToSE()`. Note that this code chunk is not wired to execute when constructing the vignette as `generateReport()` calls code to save the report directly to the user's machine, though it defaults to a temporary file. ```{r reporting-results, eval=FALSE} # generate an HTML report from the run data generateReport(se = se) ``` At present, this report pulls from the analysis provided by the QuantStudio 12K Flex software, though an option to build the report from data generated by the curve-fitting method is planned for future development. ## Analysis by Logistic Regression The second method of data analysis, as mentioned in the overview, is to fit model curves to each PCR reaction curve individually, then use computed features to differentiate positive and negative PCR results. In particular, the `computeModels()` package function calls an algorithm from SciPy and attempts to fit 5-parameter logistic regressions to the fluorescence curves. The 5PL model is implemented for its favorable ability to handle asymmetries common to amplification curves. This approach has the advantages of transparency and reproducibility, circumventing the need to rely on propriety algorithms to interpret PCR results. ### Data Import Here we start the data analysis again from the point where we have exported the gene expression experiment from QuantStudio 12K Flex software. As in the previous example, we load the experiment into a SummarizedExperiment object. ```{r regression-data_import} # clear the environment rm(list = ls()) # save filepath to example OpenArray gene expression run data path = system.file( "extdata", "oa_gene_expression_1.xlsx", package = "OAtools" ) # transform the run data into a SummarizedExperiment se <- excelToSE(excel_path = path) ``` ### Optimizing Models to PCR Curves Next, we run the optimizer and attempt to fit a 5PL model curve to each PCR reaction on the OpenArray plate. As mentioned above, OAtools exports the `computeModels()` function, which iterates over the experiment and does this programmatically. `computeModels()` then saves the parameters of each model curve, along with features computed from them, as a list in the experiment metadata. If the `fluo_reporter` assay matrix is used, for example, the models would be accessible with `metadata(se)$fluo_reporter_models`. Under the hood, `computeModels()` invokes a thin R wrapper, `runFitCurve()`, around the internal `fit_curve()` function implemented in python3. `computeModels()` handles the logic for iterating over the OpenArray plate, while `fit_curve()` interfaces with SciPy and actually runs the optimization, returning a model curve when provided PCR data (observed fluorescence vs. cycle number) for a single reaction. A linear_threshold parameter is optionally provided when calling `computeModels()`. PCR curves with overall change in fluorescence below this threshold are marked as negative and fit to linear models instead. The rationale is that such PCR curves are trivially negative for the amplicon of interest, but add a great deal of compute time to the optimization process and are better filtered out. ```{r regression-fit_curve} # optimize model curves to each PCR reaction se <- computeModels( se = se, assay_name = "fluo_reporter" ) # display an example model, fit to the PCR reaction in well 2345 metadata(se)$fluo_reporter_models$well_2345 ``` ### Deriving PCR Results from the Model Finally, we can leverage the model data we've generated to derive PCR results. OAtools exports the `determinePCRResults()` function, which we can invoke on our SummarizedExperiment to generate such results. `determinePCRResults()` requires as input an Excel key. This key associates each gene target with threshold values for model features used to separate positive and negative PCR results. In particular, values are specified for the cycle threshold, midpoint slope, and overall change-in-fluorescence. Different thresholds may be specified for distinct target genes to account for differences in primer efficiency across assays on the same OpenArray plate. For documentation and details on the key, run `?OAtools::target_threshold_key` in the R console. `determinePCRResults()` computes PASS/FAIL values for the cycle threshold, midpoint slope, and change-in-fluorescence, marking PCR reactions which pass on all accounts as positive and all others as negative. These results are saved to the column annotation DataFrame. ```{r regression-derive_results} # save filepath to target-threshold-key key_path = system.file( "extdata", "target_threshold_key.xlsx", package = "OAtools" ) # assign a PCR result according to the key se <- determinePCRResults( se = se, key_path = key_path ) # render a snapshot including the logic for determining the result as.data.frame(SummarizedExperiment::colData(se)) |> dplyr::select( well, crt, midpoint_slope, delta_fluo, crt_threshold, slope_threshold, delta_threshold, result ) |> dplyr::slice_head(n = 10) ``` ## Interoperability ### NormqPCR *OAtools* supports interoperability with the NormqPCR package from Bioconductor, which provides functions to normalize RT-qPCR data from gene expression platforms like OpenArray. *NormqPCR* reads from the qPCRBatch class object, which requires the *ReadqPCR* package. We can set up our environment as below: ```{r interop-NormqPCR_setup, message=FALSE, eval=FALSE} # Install ReadqPCR from Bioconductor if (!require("ReadqPCR", quietly = TRUE)) { BiocManager::install("ReadqPCR") } # Install NormqPCR from Bioconductor if (!require("NormqPCR", quietly = TRUE)) { BiocManager::install("NormqPCR") } ``` Then, to convert the SummarizedExperiment container into a *NormqPCR*-friendly instance of the qPCRBatch class, simply invoke the `seToQPCRBatch()` function on the SummarizedExperiment container. With our qPCRBatch object, we can, for example, normalize the cycle thresholds against a housekeeping gene. ```{r interop-NormqPCR} # convert SummarizedExperiment container to qPCRBatch qpcr <- seToQPCRBatch(se) # choose housekeeping gene (in this case, the RNAse P control) housekeeping_gene = "RNAse_P_Pa04930436_g1" # run delta-Cq calculation norm <- NormqPCR::deltaCq( qPCRBatch = qpcr, hkgs = housekeeping_gene ) # display expression matrix as.data.frame(Biobase::exprs(norm)) |> dplyr::select(`Pos_Control_A`:`Sample-106`) |> knitr::kable(digits = 2) ``` Please refer to the official documentation for *NormqPCR* for more information. () ## Session Info ```{r sessionInfo} sessionInfo() ```