Post-transcriptional regulation is a major determinant of gene expression, occurring through processes such as altered mRNA translation efficiency and/or stability (Sonenberg and Hinnebusch 2009; Medina-Muñoz et al. 2021; Tushev et al. 2018). mRNAs encode intrinsic regulatory features (cis-acting elements) that influence their post-transcriptional fate, often in a context-dependent manner through interactions with trans-acting factors that integrate cellular and environmental signalling cues. For example, features in 5’ untranslated regions (UTRs) such as 5′ terminal oligopyrimidine (TOP) motifs, and upstream open reading frames (uORFs) can determine the translation efficiency of transcripts under conditions of cellular stress where mTOR signalling is suppressed and the integrated stress response (ISR) is activated (Philippe et al. 2020; Young et al. 2015). Furthermore, features of coding sequences (CDSs) and 3’UTRs also influence the post-transcriptional fate of mRNA molecules, for example via codon composition determining demand for tRNAs (Lorent et al. 2019), and 3’UTR-encoded sequences dictating mRNA localization and stability (Tushev et al. 2018). One of the main challenges in understanding sequence determinants of selective post-transcriptional regulation is that individual mRNAs often contain multiple regulatory features whose effects may depend on cellular context, making it challenging to pinpoint which features drive observed regulatory outcomes, and if features act independently or in a combinatorial manner. Moreover, many RNA features covary across transcripts, complicating statistical analyses and interpretation.
To address this challenge, we developed postNet, an R/Bioconductor package for transcriptome-wide identification, quantification, and modelling of RNA features that associate with post-transcriptional regulation. PostNet provides a suite of flexible tools to perform sequence feature analysis and statistical comparisons across gene sets of interest, including nucleotide content, sequence length, folding energy, motifs, uORFs, and codon composition. PostNet also performs hierarchical statistical modelling to decipher regulatory networks, revealing independent and combinatorial effects of mRNA features on post-transcriptional regulation of gene expression. The package also supports machine learning approaches, including Random Forest models, to classify modes of post-transcriptional regulation based on feature composition. These models can be applied across datasets to predict regulatory mechanisms and to test generalizability across experimental contexts. Finally, postNet also allows users to explore relationships between regulatory effects and features using network analyses and UMAP visualizations. In summary, postNet provides an integrated computational framework for dissecting how mRNA sequence features and/or other regulatory elements interact to shape gene expression.
PostNet was designed to accommodate a variety of inputs and biological contexts. A postNet analysis can consist of the following steps:
postNetData object with the
postNetStart() function, containing curated reference mRNA
sequences, gene sets of interest to be compared, and a regulatory effect
measurement that will be used in downstream analyses. See Setting up a postNet analysis for details.featureIntegration() function, which implements either
forward stepwise regression or Random Forest classification approaches.
See Modelling post-transcriptional regulation
for details.rfPred() function. See Network analysis and prediction for
details.plotFeaturesMap() function. See Visualize and explore relationships between mRNA
features and regulation for details.Here, we demonstrate a minimal example of the steps of a standard postNet analysis workflow using an example dataset illustrating changes in mRNA translation, with default parameters. This includes:
# Load the package
library(postNet)
# Load the vignette example data
data("postNetVignette", package = "postNet")Step 1: A postNetData object containing
regulated gene sets of interest along with the background set, a
regulatory effect measurement (in this case, log2 fold changes in
translation efficiency derived from ribosome profiling and RNA-seq
comparing cells responding to osmotic stress against controls (Krokowski et al. 2022)), and reference mRNA
sequence annotations is initialized using the
postNetStart() function.
# Prepare custom gene lists and regulatory effect measurement:
myGenes <- postNetVignette$geneList
str(myGenes)## List of 2
## $ translationUp : chr [1:1445] "DDX6" "CLK1" "DUSP1" "TXNIP" ...
## $ translationDown: chr [1:1378] "PPDPF" "NCKAP1" "AP2M1" "ABHD12" ...
## chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" "AAGAB" "AAK1" "AAMDC" "AAMP" ...
## Named num [1:9555] -2.095 0.275 -0.57 -0.608 0.576 ...
## - attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
# Initialize a postNetData object:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
selection = "random",
setSeed = 123,
source = "load",
species = "human"
)Step 2: mRNA features that may influence post-transcriptional regulation of gene expression are enumerated. In this minimal example, the length and nucleotide content of 5’UTR regions are calculated and compared between gene sets. However, numerous other features across all sequence regions can be examined (see Analysis of mRNA sequence features). These analyses allow both statistical comparisons of mRNA features between gene sets of interest, and generate outputs that can be used to assess the contribution of a given mRNA feature to the observed regulatory effect in the modelling step.
# Quantify the length and nucleotide content of the 5'UTR and compare between regulated gene sets
len <- lengthAnalysis(
ptn = ptn,
region = c("UTR5"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
plotOut = TRUE,
plotType = "boxplot",
pdfName = "Example"
)
str(len)## List of 1
## $ UTR5_length: Named num [1:9555] 8.34 7.18 7.23 7.71 4.91 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
content_UTR5 <- contentAnalysis(
ptn = ptn,
region = c("UTR5"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
contentIn = c("GC"),
plotOut = TRUE,
plotType = "ecdf",
pdfName = "Example"
)
str(content_UTR5)## List of 1
## $ UTR5_GC: Named num [1:9555] 62 63.4 80 64.1 73.3 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
Comparisons of 5’UTR length and GC content between mRNAs belonging to regulated gene sets.
Step 3: Changes in translation efficiency are
modelled using the featureIntegration() function which
performs stepwise regression to identify the collection of features that
best explain the observed regulation. In this minimal example, a set of
pre-calculated features is used including both cis-acting
(enumerated mRNA features) and trans-acting (signatures of
upstream regulatory pathways) factors.
# Prepare the list of pre-calculated features to be used in modelling (for example,
# the 'len' and 'content_UTR' variables defined in the previous step can be included)
features <- postNetVignette$features
str(features)
# Signatures of trans-acting factors can also be included, and some are provided with the package
humanSignatures <- get_signatures(species = "human")
str(humanSignatures)
# Signatures can be converted into feature inputs using the signCalc function
trans_signatures <- signCalc(ptn, humanSignatures)
# Compile the final feature input:
features <- c(features, trans_signatures)
# Group the features according to category (optional)
group <- c(
"UTR5", "UTR3", "UTR5", "CDS", "UTR5", "UTR5", "UTR5", "CDS", "CDS",
rep("Pathway", 2), "UTR5", rep("Pathway", 2)
)
groupColour <- c("#834b62", "#6699cc", "#e9724c", "#fff299")
names(groupColour) <- c("UTR5", "CDS", "UTR3", "Pathway")
# Run feature integration modelling using forward stepwise regression
ptn <- featureIntegration(
ptn = ptn,
features = features,
pdfName = "omnibus",
regOnly = TRUE,
allFeat = FALSE,
analysis_type = "lm",
covarFilt = 20,
comparisons = list(c(1, 2)),
lmfeatGroup = group,
lmfeatGroupColour = groupColour,
NetModelSel = "omnibus"
)Network plot of the results of the stepwise regression omnibus model.
Step 4: The plotFeaturesMap() function
is used to visualize the relationship between translation efficiency and
different features that were identified in the modelling step as
significantly contributing to the observed post-transcriptional
regulation. These visualizations are also useful for exploring the
overlaps between different features within the same mRNAs.
ptn <- plotFeaturesMap(ptn,
regOnly = TRUE,
comparisons = list(c(1, 2)),
featSel = names(ptn_selectedFeatures(ptn,
analysis_type = "lm",
comparison = 1
)),
remBinary = TRUE,
featCol = "UTR5_SCSCGS",
scaled = FALSE,
remExtreme = 0.1,
pdfName = "Example"
)UMAP visualizing changes in translation efficiency (Effect) generated based on features identified in the omnibus model (left), with mRNAs containing SCSCGS motifs in the 5´UTR coloured (right).
This example provided an overview of a minimal postNet workflow. Full details of alternative workflows and additional inputs and options are discussed in the following sections.
The basic workflow of a postNet analysis aims to explain the post-transcriptional regulatory fate of mRNAs based on their repertoire of cis-acting sequence features and/or interactions with trans-acting factors. Three basic inputs are required to run a postNet analysis:
These inputs are compiled at the start of the analysis by
initializing a postNetData object using the
postNetStart() function. This class of object serves as the
container for storing inputs and some outputs generated during the
analysis, and is designated by ptn in the example code
provided here. The following sections describe the different options and
customizations available, along with examples illustrating how to set up
a postNet analysis.
Throughout this vignette, we will use example data provided with the package to illustrate the implementation of different analyses. These data are published in the study by Krokowski et al. (Krokowski et al. 2022), where changes in translation efficiency were measured using ribosome profiling and RNA-seq in immortalized human corneal epithelial cells responding to osmotic stress (500 mOsm, NaCl) for 1 h, along with controls. Genes that were translationally activated or suppressed under osmotic stress were identified using the anota2seq algorithm (Oertlin et al. 2019).
The most flexible implementation of postNet allows gene lists of
interest and regulatory effect measurements to be supplied from custom
sources. Here, the postNetStart() function will be used
with the geneList, geneListcolours,
customBg, and effectMeasure parameters to
initialize the postNetData object.
Depending on the aim of the analysis, different inputs can be
provided. In the case where the goal is simply to enumerate sequence
features in genes of interest without statistical comparisons,
geneList can be a single list of gene IDs with no
background gene set. To perform statistical comparisons, one or more
gene sets of interest must be provided, either with or without a custom
background gene set (customBg parameter). Although
optional, it is strongly recommended to carefully consider that the
appropriate background is used. Typically, this would be all genes in
the given dataset passing expression and/or reproducibility thresholds.
When providing custom gene lists, it is also required to specify colours
(geneListcolours parameter) that will be used in
visualizations generated at various steps of the analysis.
For each gene provided as part of a gene set of interest or background set, you must provide a regulatory effect measurement. As postNet is designed for examining post-transcriptional regulation, this would typically be the log2 fold changes in mRNA translation efficiency or expression between two conditions. However, other continuous numeric variables representing a change between two conditions could also be used, allowing a high degree of flexibility in potential applications.
The example below illustrates the set-up of a postNet analysis allowing identification of mRNA features and statistical comparisons between sets of translationally activated and suppressed genes, as well as modelling and network analysis to identify features associated with changes in translational regulation and their interdependence.
# Genes of interest should be provided in a named list.
myGenes <- postNetVignette$geneList
str(myGenes)## List of 2
## $ translationUp : chr [1:1445] "DDX6" "CLK1" "DUSP1" "TXNIP" ...
## $ translationDown: chr [1:1378] "PPDPF" "NCKAP1" "AP2M1" "ABHD12" ...
# All gene IDs in the list should be present in the background.
myBg <- postNetVignette$background
str(myBg)## chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" "AAGAB" "AAK1" "AAMDC" "AAMP" ...
# The regulatory effect measurement must be named with the same gene IDs
# present in the background (or the gene list if no custom background is provided).
myEffect <- postNetVignette$effect
str(myEffect)## Named num [1:9555] -2.095 0.275 -0.57 -0.608 0.576 ...
## - attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
A postNetData object can also be constructed directly
downstream of running an analysis using the Anota2seq package
to identify differentially translated genes. Here, the
postNetStart() function will be used with the
ads, regulation, contrast,
regulationGen, and contrastSel parameters to
initialize the postNetData object.
Anota2seq can be applied using data from both polysome and ribosome profiling techniques, and categorizes genes into three different regulatory modes: 1) Abundance, where changes in mRNA levels can be explained by transcription or mRNA stability. 2) Translation, where polysome-association is altered without a change in total mRNA level (this mode is expected to alter protein levels). 3) Buffering/translational offsetting, where polysome-association is maintained while total mRNA level changes (this mode is not expected to alter protein levels).
An Anota2seqDataSet can be provided using the
ads parameter of postNetStart(). The
regulation parameter allows the user to select which sets
of regulated genes identified using anota2seq will be used in
statistical comparisons and downstream feature integration modelling and
network analysis. In addition to the three regulatory modes listed
above, where changes in both total and translated mRNA are taken into
account, it is also possible to examine differences in total and
translated mRNA independently. Gene sets can be selected from multiple
contrasts in anota2seq specified with the contrast
parameter (see the anota2seq
vignette for more details).
The regulatory effect measurement is taken directly from the
Anota2seqDataSet object, and is specified by the
regulationGen parameter. Anota2seq calculates sets
of fold changes for each gene, in each regulatory mode. Although gene
sets from all regulatory modes and contrasts can be selected for
statistical comparisons, only one “general” regulatory effect
measurement can be supplied (i.e., fold changes in translation
efficiency, or buffering, etc.), so consideration should be
given to downstream modelling where the mRNA features (and/or other
input signatures) quantified in the gene sets specified by
regulation will be used to explain changes in the
regulatory effect measurement specified by regulationGen.
Similarly to the gene sets, the regulatory effect measurement can be
selected from any contrast from the anota2seq analysis using
the contrastSel parameter.
When an Anota2seqDataSet is supplied, the appropriate
background gene set, and colours for each gene list will be
automatically retrieved.
The example below illustrates the set-up of a postNet analysis using
the output of anota2seq, ads.
This postNet analysis will allow identification of mRNA features and comparison between translationally activated and suppressed genes, as well as buffered (total mRNA up and down) genes from contrast 1. Feature integration and network analysis will model the fold changes in translation efficiency from contrast 1. Note that in this example, if the buffering gene sets are included in modelling comparisons, the regulatory effect measurement values for these genes will be the fold changes in translation efficiency.
# Initialize the postNetData object:
ptn_withAds <- postNetStart(
ads = ads,
regulation = c("translationUp", "translationDown", "bufferingmRNAUp", "bufferingmRNADown"),
contrast = c(1, 1, 1, 1),
regulationGen = "translation",
contrastSel = 1,
selection = "random",
setSeed = 123, # ensures reproducibility of random isoform selection
source = "load",
species = "human"
)Reference sequence annotations in postNet are divided according to
different regions of mRNA molecules (5’UTR, CDS, and 3’UTR) to allow
regions to be compared separately. The source parameter of
the postNetStart() function allows the user to select from
several in-built sequence annotations provided with the package,
retrieve sequence annotations directly from the NCBI RefSeq database
(O’Leary et al. 2016), or provide custom
sequence annotations. Optionally, UTR sequences can also be adjusted if
more precise sequences are available, for example those experimentally
determined using approaches like CAGE, QuantSeq, or long-read
sequencing, etc.
It is highly recommended to use reference sequence annotations that
correspond to those that were used in generating the input gene lists.
For example, if RNA sequencing reads were counted using Ensembl
gene/transcript annotations, these may not always correspond well to the
sequences defined by RefSeq annotations, even if identifiers have been
converted to be compatible. When running postNetStart(), a
warning message describing differences in gene identifiers between the
input gene lists and the reference sequence annotations will be printed
to the console. Minor differences may be acceptable depending on the
application. However, larger differences may warrant either reprocessing
of input data, or selecting a more compatible reference sequence
annotation before proceeding with the analysis.
By default, source = "load" meaning
postNetStart() will load one of the in-built reference
sequence annotations provided with the package when it is first
needed.The data are downloaded from the postNet GitHub releases and
stored in a user-specific cache directory managed by BiocFileCache. Once
cached, the reference data are reused in subsequent sessions and do not
require re-downloading. This option is available when
species = "human" or "mouse". In-built
annotations are based on NCBI RefSeq GRCh38 (human) and GRCm39 (mouse)
genome assemblies and corresponding transcript annotations.The following
reference annotation versions are currently available:
Human (RefSeq GRCh38): ver_40.202408 Mouse (RefSeq GRCm39): ver_27.202402
A postNetData object can then be initialized using the
in-built reference sequence annotations, in this case using the RefSeq
ver_40.202408 release version for human.
# Prepare custom gene lists and regulatory effect measurement using example data:
myGenes <- postNetVignette$geneList
myBg <- postNetVignette$background
myEffect <- postNetVignette$effect
# Initialize a postNetData object using in-built annotations for human:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "load",
species = "human",
version = "ver_40.202408"
)In addition to the in-built annotations, it is also possible to either retrieve or construct reference sequences for any RefSeq release version. Currently, only human and mouse are supported with these options.
By specifying source = "create" in
postNetStart(), annotation files from the most recent
RefSeq release version for the indicated species will be automatically
downloaded and used to construct a new reference sequence annotation
locally. Note that downloads may take several minutes, and files will be
stored in the working directory. Using this option requires an internet
connection.
# Initialize a postNetData object creating new RefSeq reference sequence annotations for human:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "create",
species = "human",
)Alternatively, two additional offline methods are available to
construct reference sequence annotations if files have already been
downloaded from RefSeq and are available
locally. This can be done by specifying
source = "createFromSourceFiles" in
postNetStart(), and providing the RNA GBFF (“rna.gbff”),
RNA FASTA (“rna.fna”), and genomic GFF (“genomic.gff”) files. These
files must be provided using the rna_gbff_file,
rna_fa_file, and genomic_gff_file parameters
of postNetStart().
# The required RefSeq annotation files downloaded from the NCBI database will
# have the following naming:
# - "example_rna.gbff.gz"
# - "example_rna.fa.gz"
# - "example_genomic.gff.gz"
# Initialize a postNetData object creating a new RefSeq reference sequence annotation
# from source files:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "createFromSourceFiles",
species = "human",
rna_gbff_file = "example_rna.gbff.gz",
rna_fa_file = "example_rna.fa.gz",
genomic_gff_file = "example_genomic.gff.gz"
)For added flexibility in defining sequence regions, it is also
possible to assemble reference sequence annotations from a FASTA file
using source = "createFromFasta" in
postNetStart(). This option requires that a FASTA file be
provided using the fastaFile parameter, along with an
additional file specifying the coordinates of the mRNA sequence regions
(provided with the posFile parameter). This position file
must be tab delimited and have columns indicating the transcript id,
5’UTR length, end of the coding sequence, and the total transcript
length (see example below). Optionally, a genomic GFF file can also be
provided using the genomic_gff_file parameter. However, if
no GFF is provided the latest version for the indicated species will be
automatically downloaded from the RefSeq database.
# An example of the required format for the posFile parameter ("positions.txt"):
# id UTR5_len cds_stop total_length
# NM_000014 70 4495 4610
# NM_000015 70 943 1285
# NM_000016 79 1345 2261
# Initialize a postNetData object creating a new RefSeq reference sequence annotation
# from source files:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "createFromFasta",
species = "human",
fastaFile = "_rna.fa.gz",
posFile = "positions.txt",
genomic_gff_file = "_genomic.gff.gz"
)The transcriptome is highly diverse, with mRNA isoform expression patterns varying across cell types (FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al. 2014), and between normal and disease states (Qamra et al. 2017). The sequences of mRNA molecules can also be dynamically regulated through processes such as alternative splicing (Wright, Smith, and Jiggins 2022), altered transcription start site usage (Watt et al. 2025), and alternative polyadenylation (Navickas et al. 2023). Analyses relating mRNA sequence features to post-transcriptional regulation will benefit from more precise sequence annotations if these are available.
Custom reference sequence annotations can be used with the
postNetStart() function and may be desirable in cases when
working with data from species not currently supported by the options
described above, with annotations other than those in the NCBI RefSeq
database, or if sequences have been experimentally determined. A
pre-prepared reference sequence annotation file can be provided using
the customFile parameter. This file must be tab delimited
and contain the columns: transcriptID, geneID, UTR5_seq, CDS_seq, and
UTR3_seq.
# An example of the required format for the customFile parameter ("customSequences.txt"):
# id geneID UTR5_seq CDS_seq UTR3_seq
# NM_000014 A2M GGGACCAG... ATGGGGAA... AGACCACA...
# Initialize a postNetData object with a custom reference sequence annotation file:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "custom",
species = NULL,
customFile = "customSequences.txt"
)Using sequence annotations provided with the package or built from
the RefSeq database, postNet will perform gene-level analyses. For genes
with multiple mRNA isoforms, the selection parameter of the
postNetStart() function can be used to specify which
isoforms will be considered in analyses. Depending on the application,
it may sometimes be of interest to consider the longest or shortest mRNA
isoforms for each gene. However, selecting the extremes for all genes
may skew the results for some types of analysis. By default, isoforms
for each gene will be selected at random to prevent a systematic length
bias. However, it is important to note that each time the
postNetStart() function is run, different isoforms may be
selected leading to slight variations in results. If it is necessary to
ensure reproducibility of isoform selection between different runs of
the postNetStart() function, the setSeed
parameter can be used. Alternatively, high-confidence representative
protein-coding transcript isoforms can be selected based on the Match
Annotation from NCBI and EMBL-EBI (MANE) initiative (Morales et al. 2022).
Although it is also possible to perform isoform-level analyses of
mRNA features with postNet by supplying custom sequence
annotations as described above, careful consideration should be given to
the approach used when analyzing isoform versus gene-level data. If
differential regulation of unique isoforms can be confidently
quantified, it may be possible to treat these similarly to gene-level
data when performing featureIntegration() modelling.
However, it should be considered that including multiple isoforms for
the same genes that contain shared sequences may have confounding
effects, giving results that may be misleading or difficult to
interpret. An alternative option is to enumerate mRNA features at the
isoform level, and then summarize these at the gene-level prior to
modelling (Watt et al. 2025).
Numerous sequencing approaches are available to specifically map UTRs, such as nanoCAGE (Poulain et al. 2017) and QuantSeq (Moll et al. 2014). For this reason, it is also possible to adjust only UTR sequences if more specific data is available for your experimental condition or model of interest.
Custom UTR sequences will replace UTRs in an existing annotation file
when using the adjObj and region_adj
parameters of the postNetStart() function. A list of custom
sequences must be provided, specifying which UTR region(s) should be
replaced. If custom UTR sequences are available for some, but not all
genes in the existing sequence annotation, whether genes and isoforms
without custom sequences are kept or discarded from the analysis can be
controlled using the excl parameter. The
keepAll parameter can also be used to control how custom
UTR isoform sequences are stored for genes with multiple isoforms.
# Create the adjObj list with custom 5'UTR sequences to replace those in the loaded annotation file
myUTR5seqs <- ptn_sequences(ptn, "UTR5")
myIDd <- ptn_id(ptn, "UTR5")
names(myUTR5seqs) <- myIDd
customUTR5s <- list(UTR5 = myUTR5seqs)
str(customUTR5s)## List of 1
## $ UTR5: Named chr [1:9555] "AGCAAATAACTGGAATTCCGCAAACACTGGGGTGATGCAGGCAATGCCCTTAGCACGGTCCCGGCAGGAGGCGGTGGGATCAGGCCGACCCGGGTCCCAGGGGTGACAGCG"| __truncated__ "ACTTCCGTTGAGTTCCGCCTCGCCGTTTGTCCCTTGCGGTACCCGTCCGCATACGAATCTAGCCCGGGAACCGAGTTGCGGGAGTGCGGTCTGTGCCGTTCCGGCCAGGAG"| __truncated__ "AGTCCCTTGTTCCCCGCCGCCGCCGTCGCTGACCCAGCCCGCCAGGCGCTCCTGACCGTCGCTTCCTCCGGTCCCAGGTCCCCGGCCCTCGCCTCAGCCCCGGCCCCTGGT"| __truncated__ "AAAATGCACAGTCCCAGTGACTGAGAGGAGGCCAGCGGAGCGCGCGGGAACGAGGGTGGGGCCGCTCCCGGCTGGGCGCAGCTCGGGAGCCCCCTGCAGTGGACCCAAGTC"| __truncated__ ...
## ..- attr(*, "names")= chr [1:9555] "NM_001318038" "NM_001173466" "NM_001414675" "NM_001286683" ...
# Initialize a postNetData object with custom 5'UTR sequences, replacing the 5'UTR sequences
# in the loaded annotation file with custom ones where available, and discarding all isoforms
# without custom sequences. Genes with no custom sequence will retain all isoforms from the original annotation:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "load",
species = "human",
selection = "random",
setSeed = 123,
adjObj = customUTR5s,
region_adj = "UTR5",
excl = FALSE,
keepAll = FALSE
)PostNet includes a variety of tools for identifying and
enumerating sequence features of mRNA. These tools can be used for
stand-alone analyses to visualize and perform statistical comparisons of
mRNA features between sets of regulated genes (described below).
However, the outputs of these tools can also be compiled into per-gene
catalogs of cis-acting regulatory features and used with the
featureIntegration() function to model changes in
post-transcriptional regulation, and dissect networks of interactions
between features.
For the tools described in the following sections, where possible,
statistical analyses are performed by providing the
comparisons parameter, where the input is a list of numeric
vectors specifying the pairwise comparisons to be made between gene sets
defined by geneList (if using custom inputs), or
regulation (if using the results of an anota2seq
analysis) parameters in the postNetStart() function.
Comparisons can be made between gene sets of interest, and between gene
sets and the background set, which is always denoted by 0. For example,
list(c(1,2), c(0,1), c(0,2)) would produce three sets of
statistics for the comparisons between gene sets 1 and 2, and for each
against the background set. Unless otherwise described, significant
differences in enumerated mRNA features between pairs of gene sets are
evaluated using two-sided Wilcoxon rank sum tests.
Some regulatory sequence features of mRNA are highly positional, such
as 5’ terminal oligopyrimidine (TOP) motifs (Philippe et al. 2020; Cockman, Anderson, and Ivanov
2020), or codon “ramp” sequences (Verma et
al. 2019). Therefore, for some tools described below, users may
wish to search for and/or enumerate mRNA sequence features within more
specific sequence sub-regions. This can be accomplished using the
subregion, and subregionSel parameters. The
subregion parameter takes an integer indicating the number
of nucleotides to include or exclude from either the start of the
sequence region if positive, or the end if negative. For example,
subregion = 50 denotes the first 50 nucleotides of the
sequence region(s) specified by the region parameter
(5’UTR, CDS, or 3’UTR), while subregion = -50, denotes the
last 50 nucleotides of the sequence region(s). The
subregionSel parameter can then be applied to allow the
user to either examine only this selected sub-region, or to exclude it
from the analysis.
In most cases, three options for visualizations are available with
the plotType parameter, including box plots, violin plots,
and empirical cumulative distribution functions (eCDFs). If the
plotType selected is ecdf, then differences
between distributions at the 25th, 50th, and 75th percentiles will also
be reported.
The length of different mRNA sequence regions can have important
implications for post-transcriptional regulation (Mayr 2016; Gandin 2016). For each gene, the
lengthAnalysis() function computes and returns a list of
the log2 length of each mRNA sequence region specified by the
region parameter. Statistical
comparisons and visualizations can be
generated using the parameters described above.
The example below will compute a list (len) of the log2
length for each gene and sequence region, and produce three PDF files
with eCDFs and statistics comparing the length of each sequence region
between gene sets and background.
# Calculate the length of the 5'UTR, 3'UTR and coding sequence for each gene,
# and compare lengths between translationUp genes vs. background,
# translationDown genes vs. background, and translationUp vs. translationDown genes
len <- lengthAnalysis(
ptn = ptn,
region = c("UTR5", "CDS", "UTR3"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
plotOut = TRUE,
plotType = "ecdf",
pdfName = "Example"
)
str(len)## List of 3
## $ UTR5_length: Named num [1:9555] 8.34 7.18 7.23 7.71 4.91 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ CDS_length : Named num [1:9555] 10.05 10.59 10.94 10.32 9.89 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ UTR3_length: Named num [1:9555] 9.59 4.86 11.96 9.46 11.07 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
eCDFs visualizing differences in Log2(Length) of mRNA sequence regions between different gene sets. 5’UTR (left), 3’UTR (middle), CDS (right). Statistical comparisons are made between translationally activated (blue), suppressed (red), and background (grey) gene sets. Visualizations can also be generated as either box plots, or violin plots
Nucleotide content is associated with the post-transcriptional fate
of mRNAs. For each gene, the contentAnalysis() function
computes and returns a list of the percentage of nucleotide content of
each mRNA sequence region specified by the region
parameter. It is also possible to select more specific sequence sub-regions to either examine or exclude from the
analysis. Statistical comparisons and visualizations can be generated using the
parameters described above.
The contentIn parameter is used to specify one or more
nucleotide(s) or nucleotide combinations to quantify. These can be any
selection or combination of A, T, G, or C. For example
c("GC") to obtain the combined percentage of G and C, or
c("G", "C") to obtain the individual percentages. Within
the coding sequence, it may also be of interest to calculate nucleotide
content at specific codon positions, for example the GC content at the
3rd codon position (“GC3”) is often used as an indicator of codon bias
(Aota and Ikemura 1986). For this reason,
when the input for the region parameter includes
"CDS", nucleotide content at specific codon positions can
also be quantified by supplying the nucleotide or combination of
nucleotides, along with the codon position to be examined (1, 2, or 3).
For example, c("GC3") calculates percentage of GC content
in the third codon position, while c("A12") would provide
the percentage of A content in the first and second codon positions.
The example below will compute a list (GC_content) of
the percentage of GC content for each gene and sequence region, and
produce three PDF files with violin plots and statistics comparing the
GC content of each sequence region between gene sets and background.
# Calculate the GC content of each sequence region for each gene, and compare between
# translationUp genes vs. background, translationDown genes vs. background, and
# translationUp vs. translationDown genes
GC_content <- contentAnalysis(
ptn = ptn,
region = c("UTR5", "CDS", "UTR3"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
contentIn = c("GC"),
plotOut = TRUE,
plotType = "violin",
pdfName = "Example"
)Violin plots visualizing differences in the percentage of GC content of mRNA sequence regions between different gene sets. 5’UTR (left), 3’UTR (middle), CDS (right). Statistical comparisons are made between translationally activated (blue), suppressed (red), and background (grey) gene sets. Visualizations can also be generated as either box plots, or eCDFs
In addition to calculating the percentage of GC content for each gene and sequence region (as above), the code below will return the GC content at the third codon position (GC3) for the coding sequence.
# Calculate the GC of each sequence region for each gene, as well as the GC3 content of
# the coding region and compare between translationUp genes vs. background, translationDown
# genes vs. background, and translationUp vs. translationDown genes
GC3_content <- contentAnalysis(
ptn = ptn,
region = c("CDS"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
contentIn = c("GC3"),
plotOut = TRUE,
plotType = "violin",
pdfName = "Example"
)
str(GC_content)## List of 3
## $ UTR5_GC: Named num [1:9555] 62 63.4 80 64.1 73.3 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ CDS_GC : Named num [1:9555] 63.7 57.5 56 41.4 45.3 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ UTR3_GC: Named num [1:9555] 63.1 24.1 52.5 33.6 41.4 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
Finally, the code below provides an example of how the
subregion and subregionSel parameters can be
used to quantify the percentage of G content in the first and last 15
nucleotides of the 5’UTR.
# Calculate the G content of the first and last 15 nucleotides of the 5'UTR for each gene, and
# compare between translationUp genes vs. background, translationDown genes vs. background, and
# translationUp vs. translationDown genes
content_UTR5_first15 <- contentAnalysis(
ptn = ptn,
region = c("UTR5"),
subregion = 15,
subregionSel = "select",
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
contentIn = c("G"),
plotOut = TRUE,
plotType = "ecdf",
pdfName = "Example_First15utr5"
)
content_UTR5_last15 <- contentAnalysis(
ptn = ptn,
region = c("UTR5"),
subregion = -15,
subregionSel = "select",
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
contentIn = c("G"),
plotOut = TRUE,
plotType = "ecdf",
pdfName = "Example_Last15utr5"
)The folding energy of mRNA molecules is an indication of the
thermodynamic stability of higher-order structures, and has implications
for post-transcriptional regulation (Faure et al.
2016; Zur and Tuller 2012). For each gene, the
foldingEnergyAnalysis() function returns a list of free
energy measurements for each mRNA sequence region specified by the
region parameter. Statistical
comparisons and visualizations can be
generated using the parameters described above.
The foldingEnergyAnalysis() function does not perform
folding energy calculations for sequences, but allows comparisons
between gene sets of interest using pre-calculated values. The source of
these values is specified using the sourceFE parameter,
where pre-calculated values provided with the package can either be
loaded, or custom values can be supplied using the
customFileFE parameter. Pre-calculated folding energies
supplied with the package are currently available for human and mouse
RefSeq releases “rel_109.20201120”, and “rel_109.20200923”. These values
were calculated for all sequence regions using the mfold algorithm (Zuker 2003), available here. Mfold reports the Gibbs free
energy (ΔG) for the most probable RNA structures, where lower
values indicate more stable structures and larger ones indicate less
stable, more flexible mRNA molecules.
To use custom free energy values with the customFileFE
parameter, the user must supply a TAB delimited file with three columns
corresponding to: transcript ID, folding energy, and length of the
sequence region. Note that folding energies for different sequence
regions (e.g. 5’UTR and 3’UTR) must be provided separately.
The residFE parameter also offers the option of
correcting the folding energies for the length of the sequences, which
may sometimes be desirable to facilitate comparisons between gene sets.
Here, instead of ΔG, the values returned are the residuals from
the linear model: FE ~ log2(sequence region length). Note that if
folding energies will be used in downstream modelling analysis with
featureIntegration(), consideration should be given to the
residFE parameter. If the length of the sequence regions
will also be included in modelling, it is not recommended to correct the
folding energies for the sequence length.
The example below will compute a list (FE) of the
folding energies for 5’UTR regions, and produce a PDF file with box
plots and statistics comparing the folding energy between regulated gene
sets and background.
# Compare the folding energy of the 5'UTRs between translationUp
# genes vs. background, translationDown genes vs. background,
# and translationUp vs. translationDown genes.
FE <- foldingEnergyAnalysis(
ptn = ptn,
region = c("UTR5"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
residFE = FALSE,
plotType = "boxplot",
sourceFE = "load",
plotOut = TRUE,
pdfName = "Example"
)
str(FE)## List of 1
## $ UTR5_foldingEnergy: Named num [1:9130] -13.3 -51 -80.7 -58.8 -23.6 ...
## ..- attr(*, "names")= chr [1:9130] "ACADS" "PSEN1" "ADRB2" "ALAD" ...
Box plots visualizing differences in the folding energy of mRNA 5’UTRs between different gene sets. Statistical comparisons are made between translationally activated (light red), suppressed (dark red), and background (grey) gene sets. Visualizations can also be generated as either violin plots, or eCDFs
In the example code below, a custom file with pre-calculated folding energies is provided for 5’UTR sequences.
# An example of the required input format for the customFile parameter ("custom_UTR5FE.txt"):
# id fold_energy length
# NM_018117 -16.3 34
# NM_001025603 -36.84 162
# NM_001003891 -28.4 81
# NM_021107 -120.14 350
# Compare the folding energy of the 5'UTR for each gene between translationUp genes vs. background,
# translationDown genes vs. background, and translationUp vs. translationDown genes.
customFE <- foldingEnergyAnalysis(
ptn = ptn,
region = c("UTR5"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
residFE = FALSE,
plotType = "violin",
sourceFE = "custom",
customFileFE = "~/Path/To/CustomFile/custom_UTR5FE.txt",
plotOut = TRUE,
pdfName = "Example_customFE"
)Upstream open reading frames (uORFs) are common regulatory elements
found in the 5’UTRs of many transcripts, imparting context-dependent
post-transcriptional regulatory effects (Barbosa,
Peixeiro, and Romão 2013). For each gene, the
uorfAnalysis() function returns either the number, or the
coordinates of the positions of uORFs in the 5’UTR. Statistical comparisons can be generated using
the parameters described above.
By default, the uorfAnalysis() function will search for
uORFs with canonical start codons (“AUG”) in a strong Kozak context.
However, it is also possible to identify uORFs with non-canonical start
codons using the startCodon parameter, and to specify the
Kozak context of the start codon using the KozakContext
parameter, where:
"strong" = [AG][ATGC][ATGC] [Start codon] G"adequate1" = [AG][ATGC][ATGC] [Start codon] [ATC]"adequate2" = [TC][ATGC][ATGC] [Start codon] G"weak" = [TC][ATGC][ATGC] [Start codon] [ATC]"any" = [ATGC][ATGC][ATGC] [Start codon] [ATGC]Note that if KozakContext = "any", the nucleotide
context of the start codon is not considered.
Some uORFs are completely contained within the 5’UTR, as is the case
for the uORF found in the transcript CHOP (DDIT3) (Jousse et al. 2001). However, in other cases
the start codon for the uORF may be in the 5’UTR, but the stop codon may
be found downstream of the start codon of the main ORF, as for ATF4
(Vattem and Wek 2004). Using the
onlyUTR5 parameter, it is possible to specifically identify
uORFs completely contained in 5’UTRs, instead of all uORFs.
Finally, in addition to quantifying the number of uORFs for a given
transcript, it is also often of interest to know the positions in the
sequence where the uORFs occur. The unitOut parameter can
be used to specify whether the number, or coordinates (nucleotide
position of the start and stop codons) of the detected uORFs will be
returned.
The example below will compute a list (uORFs) of the
number of uORFs with canonical start codons in a strong Kozak context
for each gene, and produce a PDF file with a bar plot and statistics
comparing between gene sets.
# Identify all uORFs with canonical start codons in a strong Kozak context (including
# those not fully contained within 5'UTRs, and compare
# between translationUp vs. translationDown genes
uORFs <- uorfAnalysis(
ptn = ptn,
comparisons = list(c(1, 2)),
startCodon = "ATG",
KozakContext = c("strong"),
onlyUTR5 = FALSE,
unitOut = "number",
plotOut = TRUE,
pdfName = "Example"
)
str(uORFs)## List of 1
## $ uORFs_ATG_strong: Named num [1:9555] 2 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
Bar plot visualizing differences in the proportion of transcripts containing uORFs with canonical start codons in a strong Kozak context between different gene sets. Statistical comparisons are made between translationally activated (blue), and suppressed (red) gene sets.
De novo motif discovery in postNet is carried out using the
motifAnalysis() function, which implements STREME (Bailey 2021) from the MEME-Suite (Bailey et al. 2015) to identify ungapped motifs
(recurring, fixed-length patterns) that are enriched in the selected
sequence region(s) of gene sets of interest relative to control
sequences (background). It is also possible to select more specific
sequence sub-regions to either examine or
exclude from the analysis.
The motifAnalysis() function applies
runStreme() from the memes
package (Nystrom 2021). This requires
that the MEME Suite software tools be installed locally, and the path to
the executables must be provided (memePath parameter). The
MEME Suite can be downloaded here following
the installation
guide. It is possible to specify the minimum width (in nucleotides)
and the enrichment p-value threshold for selecting motifs using the
minwidth and stremeThreshold parameters. If
custom reference sequence annotations have been used, the sequence type
(RNA, DNA, protein) can also be adjusted with the seqType
parameter. It is important to note that when using postNet with a custom
gene list, if no background is provided,
postNet defines the background as all genes in the gene list. In the
scenario where only one gene set is provided in the list, it is not
advisable to use motifAnalysis() to identify enriched
motifs, as the input and background will be the same.
The motifAnalysis() function returns an updated
postNetData object where results are stored in the “motifs”
slot. Identified motifs can be enumerated in transcripts and included in
downstream modelling with featureIntegration() by using the
contentMotifs() function. All results, including the list
of significantly enriched motifs can be extracted using the
ptn_motifSelection() and ptn_motifGeneList()
functions.
Note: If you use motifAnalysis() in
your work, please appropriately cite the memes R package, The MEME
Suite, and the STREME tool. Licensing for The MEME Suite is free for
non-profit use. However, for-profit users should contact OIC-MEMESuite@ucsd.edu and purchase a license. See http://meme-suite.org/doc/copyright.html for details
In the example below, de novo motifs enriched in translationally activated and suppressed gene sets compared to background will be identified for all sequence regions. Motifs passing an enrichment p-value threshold of 0.05 will be extracted, along with the full motif analysis results for the translationally activated gene set.
# Note that as users must provide the path to the MEME Suite executables. An example of
# how to run this function is provided below, and can be updated with the correct memePath argument.
ptn <- motifAnalysis(
ptn = ptn,
stremeThreshold = 0.05,
minwidth = 6,
memePath = "/meme/bin",
region = c("UTR5", "CDS", "UTR3")
)
# Extract the significantly enriched motifs:
denovo_UTR5motifs <- ptn_motifSelection(
ptn = ptn,
region = "UTR5"
)
# Extract full motif results from one gene set of interest:
denovo_UTR5motifs_TransUp <- ptn_motifGeneList(
ptn = ptn,
region = "UTR5",
geneList = "translationUp"
)Sequence motifs can contribute to the structure and stability of RNA
molecules, as well as mediate RNA-protein interactions. Numerous motifs
play an important role in post-transcriptional regulation, including
5’TOP motifs, which render translation sensitive to regulation via mTOR
(Philippe et al. 2020; Cockman, Anderson, and
Ivanov 2020), and G-quadruplexes that can impact the translation
and stability of the mRNAs where they occur (Bugaut and Balasubramanian 2012). For each
gene, the contentMotifs() function identifies user-supplied
sequence motifs and returns a list of enumerations and/or positions for
each mRNA sequence region specified by the region
parameter. It is also possible to select more specific sequence sub-regions to either examine or exclude from the
analysis. Statistical comparisons between
gene sets can be specified using the parameters described above.
The motifsIn parameter is used to provide the motif
sequences to be detected and enumerated. Ambiguities can be specified
using IUPAC
codes or [ ] (bracket) annotations. The input motifs provided should
match the type of reference sequences to be searched in your analysis
(i.e., RNA, DNA, or protein), and can be specified using the
seqType parameter. The contentMotifs()
function can also implement pqsfinder
to identify G-quadruplexes when motifsIn = "G4", using the
min_score to adjust the prediction threshold. If you use
the “G4” option in your work, please cite Hon et al. (Hon et al. 2017). The ATtRACT database, is a database of
RNA binding proteins and associated motifs (Giudice et al. 2016), and may also be useful
for identifying motifs of interest and potential mechanisms of
post-transcriptional regulation in your data. Note that often you will
first run the motifAnalysis() function to identify the
enriched motifs to enumerate. In the example code above,
denovo_UTR5motifs, extracted using the
ptn_motifSelection() function can be supplied directly as
input to contentMotifs() using the motifsIn
parameter.
It is also possible to identify overlapping, or non-overlapping
motifs, as well as dictate spatial constraints between motifs using the
dist parameter to dictate the minimum nucleotide distance
between motifs. Similarly to uorfAnalysis(), it is also
often of interest to know the positions in the sequence where the motifs
occur, and the unitOut parameter can be used to specify
whether the number, or coordinates (nucleotide position of the beginning
and end of the motif) will be returned.
As for foldingEnergyAnalysis(), the resid
parameter of contentMotifs() offers the option of
correcting the number of motifs per transcript for the length of the
sequences (as longer sequences have more opportunity to contain motifs).
Here, the values returned are the residuals from the linear model:
number of motifs ~ log2(sequence region length). Note that if motifs
will be used in downstream modelling analysis with
featureIntegration(), and the length of the sequence
regions will also be included in the same models, it is not recommended
to correct the number of motifs for the sequence length.
It is important to note that the contentMotifs()
function uses a more strict method of sequence matching to count motifs
than the MEME-Suite (Bailey et al. 2015),
and for this reason some motifs identified with
motifAnalysis(), which implements STREME (Bailey 2021), may have divergent quantification
between methods. Using a more stringent threshold with the
stremeThreshold parameter of motifAnalysis()
may remedy this. Alternatively, motifs identified with STREME can be
counted using the FIMO (Grant, Bailey, and Noble 2011) tool provided
with the MEME-Suite.
The example below will compute a list (UTR5_SCSCGS_num)
of the number of SCSCGS in the 5’UTR for each gene, and produce a PDF
file with an eCDF plot and statistics comparing enrichment of the motif
between gene sets. A second list (UTR5_SCSCGS_pos) will
contain the coordinates of the positions of the motifs within the 5’UTR
for each gene.
# Detect and quantify a given motif within 5'UTRs, and compare between translationUp vs.
# translationDown genes
UTR5_SCSCGS_num <- contentMotifs(
ptn = ptn,
motifsIn = "SCSCGS",
region = c("UTR5"),
comparisons = list(c(1, 2)),
dist = 1,
unitOut = "number",
pdfName = "Example",
plotOut = TRUE
)
str(UTR5_SCSCGS_num)## List of 1
## $ UTR5_SCSCGS: Named num [1:9555] 0 1 6 1 0 3 0 0 0 1 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
# Now find the coordinates of positions where the motif occurs in 5'UTRs
UTR5_SCSCGS_pos <- contentMotifs(
ptn = ptn,
motifsIn = "SCSCGS",
region = c("UTR5"),
comparisons = list(c(1, 2)),
dist = 1,
unitOut = "position"
)
str(UTR5_SCSCGS_pos$UTR5_SCSCGS[1:3])## List of 3
## $ A4GALT: logi NA
## $ AAAS :List of 2
## ..$ start: num 62
## ..$ end : num 67
## $ AACS :List of 2
## ..$ start: num [1:6] 12 37 80 98 130 138
## ..$ end : num [1:6] 17 42 85 103 135 143
eCDFs visualizing differences in the presence of the SCSCGS motif in the 5’UTR of different gene sets. Statistical comparisons are made between translationally activated (blue) and suppressed (red) gene sets. Background genes are shown in grey.
PostNet includes functionality for enumerating and comparing codon
usage across gene sets, implemented with the codonUsage(),
and codonCalc() functions. In this workflow, the
codonUsage() function is first used to enumerate usage of
single codons (or multiple, e.g. dicodons) or amino acids (AAs), and
identify those that are enriched or depleted between different gene sets
of interest. The codonCalc() function can then be applied
to obtain the number or frequency of selected codons or AAs for each
gene, and perform statistical comparisons between gene sets. The outputs
of codonCalc() can be used as input features for modelling
analyses using featureIntegration().
The codonUsage() function can be used to enumerate
either codon or AA usage, specified with the analysis
parameter. Analyses can be performed using the coding region of
reference sequence annotations stored in the postNetData
object. Alternatively, coding sequences can be either automatically
created from the NCBI
Consensus CDS database or loaded using the annotType
and sourceSeq parameters. If loaded, the data are
downloaded from the postNet GitHub releases and stored in a
user-specific cache directory managed by BiocFileCache. Once cached, the
reference data are reused in subsequent sessions and do not require
re-downloading. This option is available when
species = "human" or "mouse".The following
reference annotation versions are currently available:
CCDS Release 25 (2022)
It is also possible to enumerate multiple codon combinations, for
example dicodons, using the codonN parameter. Similarly to
contentAnalysis() and contentMotifs(), it is
also possible to select more specific sequence sub-regions for analysis, for example to
specifically examine the codon “ramp” downstream of the start codon
(Verma et al. 2019).
The example below demonstrates an implementation of
codonUsage() generating a summary table of codon counts and
frequencies for all genes. Note that “frequency” corresponds to the
number of codons relative to all codons in the gene, while “relative
frequency” corresponds to the number of codons relative to all
synonymous codons in the gene. This table can then be retrieved using
the ptn_codonAnalysis() function. In this case, statistical comparisons will be performed
between translationally activated and suppressed gene sets.
# Examine the composition of all genes:
ptn <- codonUsage(
ptn = ptn,
annotType = "ptnCDS",
sourceSeq = "load",
analysis = "codon",
codonN = 1,
comparisons = list(c(1, 2))
)
# Retrieve the codon usage summary table
codonUsageTable <- ptn_codonAnalysis(ptn)In addition to the summary table, the example code above also
generates plots comparing the average codon usage and frequency between
the gene sets of interest specified by the comparisons
parameter.
Scatter plots comparing average codon usage and frequency between gene sets of interest. Codons encoding the same amino acid are coloured and connected by lines.
The codonUsage() function also computes several codon
indexes and performs statistical comparisons between gene sets of
interest. These include the CAI (codon adaptation index) (Sharp and Li 1987), CBI (Codon Bias Index)
(Bennetzen and Hall 1982), FOP (frequency
of optimal codons) (Ikemura 1981), L_aa
(Number of AAs in protein), and tAI (tRNA adaptation index) (Reis, Savva, and Wernisch 2004). The type of Visualizations generated can be controlled as
described above, using the plotType_index parameter.
Below are several examples produced by the example code above.
Violin plots of several codon usage indexes calculated by the codonUsage() function. Shown are several examples including CAI (left), CBI (middle), and Fop (right). Statistical comparisons are made between translationally activated (blue), suppressed (red), and background (grey) gene sets. Visualizations can also be generated as either box plots, or eCDFs
The codonUsage() function can also be used to identify
enriched or depleted codons or AAs between gene sets of interest. In the
first step of the analysis, for each gene set comparison, a Chi-square
test is applied to assess significant differences in codon or AA usage.
To visualize the results of this test, a clustered heatmap of the
standardized Chi-square residuals for each codon or AA between the gene
sets of interest can be produced using the plotHeatmap
parameter. Then, for codons or AAs passing significance thresholds
(specified using the pAdj parameter), the odds ratio for
each desired comparison pair is calculated and plotted against the
frequency. Codons or AAs of interest with differential usage between
gene sets are usually identified as those with a high or low odds ratio,
and high frequency. These thresholds can be defined using the
thresOddsUp, thresFreqUp,
thresOddsDown and thresFreqDown
parameters.
In the example below, enriched and depleted codons will be identified between translationally activated and suppressed gene sets. These are defined as those with an adjusted p-value less than 0.01 in a Chi-square test, and are within the top 30% of enrichment or depletion based on odds ratios and average frequencies.
# Identify enriched and depleted codons between
# translationally activated and suppressed genes
codonUsage(
ptn = ptn,
annotType = "ptnCDS",
sourceSeq = "load",
analysis = "codon",
codonN = 1,
pAdj = 0.01,
plotHeatmap = TRUE,
thresOddsUp = 0.3,
thresFreqUp = 0.3,
thresOddsDown = 0.3,
thresFreqDown = 0.3,
comparisons = list(c(1, 2)),
pdfName = "Example"
)Visualizations of differential codon usage between different gene sets of interest. A clustered heatmap of standardized residuals for each codon from a Chi-square test (left), and a plot of the log2 odds ratio for each codon between translationally activated and suppressed gene sets, vs. the average codon frequency. Those passing thresholds for enrichment or depletion between gene sets are highlighted in red and blue, respectively.
After identifying codons or AAs with differential usage between gene
sets using the codonUsage() function, these can be
enumerated across genes using the codonSelection() function
for use in modelling using featureIntegration(). Statistical comparisons and visualizations can also be generated using the
parameters described above.
Codons or AAs to examine are specified using the featsel
parameter, and the output can be calculated as either the number of
occurrences per gene, or the frequency (the count per gene relative to
all other codons or AAs in the gene) specified by the unit
parameter.
In the example below, codons identified as enriched or depleted between gene sets based on Chi-square and odds ratio analyses are enumerated for each gene, and these are compared between translationally activated and suppressed gene sets.
# Select codons of interest with high frequency, and the highest and lowest odds ratios
# between translationally activated and suppressed genes
codons <- ptn_codonSelection(ptn,
comparison = 1
)
# Calculate and plot eCDFs of codon frequencies, and compare between gene sets
codonCounts <- codonCalc(
ptn = ptn,
analysis = "codon",
featsel = codons,
unit = "freq",
comparisons = list(c(1, 2)),
plotType = "ecdf"
)eCDFs visualizing differences in the frequency of enriched (left) and depleted (right) codons between gene sets. Statistical comparisons are made between translationally activated (blue), suppressed (red) gene sets, with background shown in grey. Visualizations can also be generated as either box plots, or violin plots.
The featureIntegration() function is used to model
changes in post-transcriptional regulation and identify cis-
and/or trans-acting factors (features of mRNA molecules,
signatures of upstream regulators, etc.) that can explain the observed
regulatory effects. The method integrates mRNA features and signatures
obtained in previous steps of the postNet workflow (or from custom
sources), and has several options for implementation, including forward
stepwise linear regression modelling and network analysis, and Random
Forest feature selection and classification, which allows prediction of
post-transcriptional regulation in new datasets. These topics and
considerations for use will be discussed in the following sections.
Both stepwise regression and Random Forest modelling implementations produce statistics and plots visualizing the relationship between regulatory effects and features. These relationships can be further explored using UMAP visualizations, described in later sections.
Many different types of features can be included in modelling with
the featureIntegration() function. The input features must
be provided as a list summarizing the properties of each gene, that will
be used to explain the observed changes in regulation (using forward
stepwise linear regression modelling), or to predict the regulatory
outcome for a given gene (using a Random Forest classifier). Prior to
modelling, careful consideration should be given to the features that
are included in the models. These considerations will largely depend on
the individual datasets being examined and on the biological questions
being addressed. However, some general advice to keep in mind is
outlined here.
Most analyses of post-transcriptional regulation using postNet will
likely be interested in identifying sequence-encoded features of mRNA
molecules that may be responsible for mediating selective regulation
(cis-acting elements). From the postNet workflow, the outputs
of the lengthAnalysis(), contentAnalysis(),
uorfAnalysis(), foldingEnergyAnalysis(),
contentMotifs(), and codonCalc() functions
generate per-gene summary values that can be directly supplied as
features in modelling. In addition to the features that can be
enumerated using postNet, additional cis-acting features could
be included from custom sources, or based on experimental measurements.
For example, the length of the poly-A tail, the presence of RNA
modifications such as m6A, or measurements of RNA stability or
structures.
In addition to cis-acting features, it is also possible to include features describing upstream regulators (trans-acting features). These can be gene signatures of known regulation downstream of particular pathways or factors, for example genes translationally activated or suppressed by an inhibitor treatment or knockdown of an RNA binding protein. Including such gene signatures in modelling can serve several hypothesis-generating purposes. For example, when examining post-transcriptional regulation in a complex physiological condition, including signatures of upstream regulators may help to identify factors or pathways that may be modulated in that condition. For example, including known signatures of translational regulation when modelling the translatome under hypoxia confirmed that suppression of mTOR signalling and activation of the ISR play key roles in the observed translational control, and suggested the potential involvement of other factors such as DAP5 that were not previously appreciated (Watt et al. 2025). In addition, including trans-acting feature signatures can also allow assessment of the relationship between upstream regulators and sequence-encoded features. For example, stepwise regression modelling can identify covariance between features. If a cis-acting sequence feature co-varies substantially with an upstream signalling pathway (meaning the element is present in the same genes regulated by this pathway), one may hypothesize that this sequence element could be involved in mediating the observed post-transcriptional regulation exerted by that pathway. Several example gene signatures, including mTOR and ISR-sensitive translation, are included with the package.
In general, features that can be included in modelling can be either numeric, or categorical. Numeric features can be both continuous (for example, sequence region nucleotide content or length), or discrete (for example, the number of uORFs in 5’UTRs). Categorical features can also be included by converting to binary variables. This is useful for evaluating gene signatures in relation to regulatory outcomes, but can also be applied in many scenarios where mRNA can be divided into two classes (for example those with a certain property would be coded as 1, while those without would be coded as 0, etc.). It is also possible to supply categorical features with a directionality. For example, a feature that can be increased, decreased, or unchanged could be coded as 1, -1, or 0 for each gene to be included in modelling.
Any gene signature constituting a list of genes, including those
provided with the package (stored in the humanSignatures
and mouseSignatures objects) can be converted to feature
inputs compatible with modelling using the signCalc()
function. In the example code below, signatures of mTOR and
ISR-sensitive translation, as well as mRNAs known to contain classical
TOP motifs will be converted into binary variables that can be used as
feature inputs in featureIntegration() modelling.
## List of 5
## $ Gandin_etal_2016_mTOR_transUp : chr [1:1569] "ZNF467" "PHLDA2" "PYCARD" "FHOD1" ...
## $ Gandin_etal_2016_mTOR_transDown: chr [1:1721] "DST" "OTULINL" "MME" "BRD8" ...
## $ Cockman_etal_2020_classicTOP : chr [1:92] "RPSA" "RPS2" "RPS3" "RPS3A" ...
## $ Guan_etal_2017_Tg1_transUp : chr [1:1192] "AR" "ATP7A" "BRCA2" "LYST" ...
## $ Guan_etal_2017_Tg1_transDown : chr [1:731] "AGA" "BCKDHB" "CST3" "CYBA" ...
# Convert the gene signatures to a binary format for featureIntegration:
signatureFeatures <- signCalc(ptn = ptn, signatures = humanSignatures)
str(signatureFeatures)## List of 5
## $ Gandin_etal_2016_mTOR_transUp : Named num [1:9555] 1 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ Gandin_etal_2016_mTOR_transDown: Named num [1:9555] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ Cockman_etal_2020_classicTOP : Named num [1:9555] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ Guan_etal_2017_Tg1_transUp : Named num [1:9555] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ Guan_etal_2017_Tg1_transDown : Named num [1:9555] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
Features that will be included in modelling are passed to the
featureIntegration() function as a named list of numeric
vectors. Each vector element in the list must have names corresponding
to the gene IDs in your postNetData object. Ideally, all
genes in the dataset should be represented in each feature vector as any
genes missing data for one or more features will be excluded from the
analysis.
In the example code below, a list of input features for modelling is compiled using the outputs of previous steps of the analysis.
After compiling an appropriate list of features, it is now possible
to implement the featureIntegration() function to model the
observed post-transcriptional regulation. The following sections will
describe modelling using stepwise linear regression and network
analysis. This analysis uses hierarchical multiple regression to
identify features explaining changes in the regulatory effect, rank them
according to their importance, and reveals independent and combinatorial
effects (for example, co-occurrence of regulatory features in the same
mRNA molecules). The results of this analysis can be visualized using
network plots showing the association of features to regulatory effects,
and the relationships between features. Stepwise linear regression
modelling is carried out in three phases to generate univariate,
omnibus, and adjusted models (described below).
In phase 1 of the analysis, each candidate feature is evaluated separately in univariate linear models to identify significant associations with the regulatory effect measurement.
The results of these univariate models are summarized in an output table where for each feature, the P-value, and FDR are provided, along with the proportion of variance in the regulatory effect measurement that can be explained by that feature.
Output of univariate linear models from featureIntegration.
The results of the univariate models are used to prioritize the features included in the second phase of stepwise regression modelling.
In phase 2, starting with the feature that best explained changes in the regulatory effect from univariate models (in the example above this would be 5’UTR GC content), forward stepwise regression is performed by adding features to the model in an iterative fashion, keeping covariance assigned to the most influential feature (a rank-based greedy model). In each step, the best performing model (the feature with the strongest association with the regulatory effect) is retained, and features that fail to explain additional variance are discarded. Whether a feature is retained in the model is based on the size of the F-statistic, and a P-value below a selected threshold (by default < 0.05). The resulting omnibus model represents the smallest set of features that can explain the greatest proportion of variance in the regulatory effect measurement. This step both ranks features according to their relative importance, and reveals relationships between them.
Output of stepwise regression summarizing F-values at each step of modelling.
In the output table above, the F-statistics are summarized at each step, where features are iteratively added to the model. The table is coloured according to a “traffic-light” system, where cells marked in green indicate features that were significant and included in the final omnibus model. Cells marked in red/purple indicate features that did not improve the ability to explain additional variance in the regulatory effect measurement, and were therefore discarded from the model. For example, after the addition of CDS GC3 in step 2, the set of codons identified as enriched (“Codons_up”) is dropped from the model (the F-statistic decreases from more than 204.6 to 1.3) as it is not able to significantly explain additional variance in the regulatory effect. Finally, cells marked in orange indicate a substantial change in the F-statistic for a given feature when another feature is added to the model (an increase or decrease of at least 50%), indicating a relationship between features. For example, we can see a relationship between the 5’UTR SCSCGS motif and 5’UTR length, as the F-statistic for the motif decreased by ~88.5% (from 132.5 to 15.2) between steps 3 and 4 when 5’UTR length was added to the model. These relationships can be explored in detail in the resulting network analyses, with correlations, and using UMAP visualizations (all described below).
The features included in the final omnibus model are summarized in an output table:Summary table of features included in the final omnibus and adjusted models after forward stepwise regression.
Here, the total proportion of variance in the regulatory effect that can be explained by features included in the omnibus model is ~45%. The P-value for each feature from the stepwise regression model is provided, along with the variance explained. Note that since covariance is assigned to the most influential feature, the proportion of variance explained assigned to 5’UTR GC content is 30.4%, the same as in the univariate analysis. However, CDS GC3 content now explains 6.6% of the variance as opposed to 21.4% in the univariate analysis. This is because of the rank-based greedy nature of the model, as since 5’UTR GC and CDS GC3 content co-vary, this covariance is assigned to the more influential of the two features. This relationship is also apparent when examining the F-value table above, as there was a large decrease in the F-statistic for CDS GC3 between steps 1 and 2 indicating a relationship with 5’UTR GC.
In the 3rd and final phase, the independent contribution of each feature identified in phase 2 is determined. This is done by removing covariance from the omnibus model, which as described above, was previously assigned to the most influential features. This provides the adjusted contribution of each feature. The output of phase 3 is provided in the summary table above (Figure 15), where the last column is the percentage of variance in the regulatory effect explained by a given feature, independent of all other features in the omnibus model.
These adjusted values are useful when evaluating whether particular features tend to co-occur in the same mRNAs along with other features included in the model. This may suggest combinatorial effects between features, or reveal whether more specific features have impacts that are independent of more general features. For example, in the table above we see that after adjusting for all other features, the 5’UTR GC content still explains 10% of the regulatory effect independently, whereas the CDS GC3 content can independently explain only 0.3%, meaning that this feature co-varies very substantially with other features in the model. Note that in some instances the proportion of variance explained by a feature may increase after the adjustment, such as for TOP mRNAs (Cockman_etal_2020_classicTOP). This may suggest that this feature is anti-correlated with other features in the model. Indeed, in Figure 14 we can see that the F-statistic for the TOP mRNA signature increases between steps 1 and 2, and steps 4 and 5, suggesting that there is some relationship between TOP mRNAs, and 5’UTR GC content and length.
In addition to the “traffic-light” colouring of the F-value table to
highlight substantial relationships between features based on changes in
F-value, when the useCorel parameter is TRUE,
an additional table of Pearson’s correlation coefficients between
features will also be displayed. A threshold for the strength of
association between features in order for correlation coefficients to be
calculated can be provided using the covarFilt parameter.
This threshold is based on an integer representing the absolute change
in F-value. The default value is 20, but this may not be
appropriate for all datasets and research questions and users can adjust
this to see more or fewer correlations (this is discussed further below
for network visualizations). Note that the orange “traffic-light”
highlighting indicating substantial relationships in the F-value table
(Figure 14) is independent of the threshold set with
covarFilt, and will always represent an increase or
decrease in F-value of at least 50%.
Summary table of Pearson’s correlation coefficients between features.
Stepwise regression modelling and network analysis are specified in
the featureIntegration() function by setting the
analysis_type parameter to lm. A number of
parameters can be used to control the behavior of the modelling.
Similarly to the functions previously described, statistical comparisons can be specified using
the comparisons parameter. It is possible to perform
multiple pair-wise comparisons with stepwise regression modelling. By
default, regression modelling is performed using only the set of
regulated genes (i.e., those included in geneList, or
regulation if starting from an
Anota2seqDataSet object). However, it is also possible to
perform the modelling with all genes included in the dataset by setting
the regOnly parameter to FALSE.
By default, only the set of input features that are significantly
associated with the regulatory effect in univariate models (phase 1)
will be included in stepwise regression (phase 2). However, in some
instances you may be interested in how various features interact with
each other, so it is possible to bypass this filtering by setting the
allFeat parameter to TRUE, which will allow
all features provided as input with the features parameter
to be used in stepwise regression modelling. In addition, the phase 1
filtering is based on an FDR threshold of 0.05 for univariate
models, however, this can be adjusted using the fdrUni
parameter. Furthermore, features are retained or discarded in stepwise
regression analysis based on a P-value threshold of 0.05. This
can also be adjusted using the stepP parameter.
In the example code below the three phases of modelling will be performed using the list of input features defined above. In this analysis, only the sets of regulated genes will be included, comparing translationally activated and suppressed genes. Those features passing the FDR threshold of 0.05 in univariate models will be included in stepwise regression.
# Run feature integration modelling using forward stepwise regression
ptn <- featureIntegration(
ptn = ptn,
features = myFeatureList,
pdfName = "Example",
regOnly = TRUE,
allFeat = FALSE,
analysis_type = "lm",
comparisons = list(c(1, 2))
)The results of the modelling will be stored in the
postNetData object, and a series of PDFs will be produced.
The PDF with the suffix “FinalModel” will contain the summary tables
shown above in Figures 13-16. It is possible to retrieve the full
results from each phase of modelling using the ptn_model()
function, as well as the set of features that were selected in the
omnibus model that significantly explain changes in the regulatory
effect with ptn_selectedFeatures(). Furthermore, if
modelling using multiple pairwise comparisons was performed, you can
check which modelling results are available using the
ptn_check_models() function. See the example code
below:
# Check which models are available in the postNetData object
ptn_check_models(ptn = ptn, analysis_type = "lm")
# Extract the set of significant features selected in the omnibus model
selectedFeaturesLM <- ptn_selectedFeatures(
ptn = ptn,
analysis_type = "lm",
comparison = 1
)
# Retrieve the stepwise regression results (other options include "stepwiseModel" and "finalModel"):
univariateModel <- ptn_model(ptn, analysis_type = "lm", model = "univariateModel", comparison = 1)The other PDF files produced will contain the network visualization of the features included in the modelling, as well is plots of individual features, described below.
The results of feature integration using forward stepwise regression
modelling can be visualized using network analyses. Each feature
selected in the omnibus model that can significantly explain a
proportion of variance in the regulatory effect appear in the networks
as nodes. The size of the nodes corresponds to the proportion of
variance explained by the given feature, and can represent either the
values from the omnibus model, or the adjusted model (selected using the
NetModelSel parameter). If NetModelSel is
omnibus, the size of each node representing features will
be proportional to the variance explained, where covariance is assigned
to the most influential feature. If NetModelSel is
adjusted, the size of the nodes will reflect the adjusted
total variance explained, and therefore the independent contribution of
each feature to the overall observed regulation.
The edges connecting nodes can either represent the Pearson
correlation coefficients between features if the parameter
useCorel = TRUE. However, if FALSE, the edges
will instead represent the magnitude of the change in F-value for a
given feature between sequential steps of stepwise regression,
indicative of the strength of the associations between features. Note
that while changes in F-value are indicative of the strength of
relationships between features, this metric is not always easy to
interpret and may not readily reveal whether features are positively or
negatively correlated. Therefore, by default, edges will represent
correlation coefficients. As described above, covarFilt
parameter can be used to control the threshold for displaying edges
between feature nodes. This threshold is based on the minimum change in
F-value between sequential steps of stepwise regression, and when larger
values are provided, only the strongest correlations will be displayed
as edges in the network plot.
Finally, some features may not be significant in the omnibus model but have substantial relationships with one or more features that are. These features will not be assigned a node, but will appear in pink with edges connecting to the feature nodes they are substantially correlated with. For example, in the network displayed below, 5’UTR folding energy did not independently explain changes in the regulatory effect, but as expected, was substantially negatively correlated with 5’UTR length and GC content. This example was included to highlight this functionality. However, it is likely not usually informative to include folding energy in modelling alongside length and GC content of 5’UTRs, as it is known to be highly dependent on these two features. Furthermore, the signature of genes translationally suppressed upon mTOR activation was not independently significant, but was substantially negatively correlated with the 5’UTR GC content and the GC3 codon index.
The network plot below displays the results of the example in the
section above, where by default, NetModelSel = "omnibus",
useCorel = TRUE, and covarFilt = 20.
Network visualization of the results of forward stepwise regression modelling. Nodes represent features that were selected in the omnibus model, with the size proportional to the total variance in the regulatory effect explained by that feature. Edges represent the Pearson correlation coefficient between features. Wider lines indicate stronger positive or negative correlations. Features without a node indicated in pink represent those that were not selected in the omnibus model, but correlate substantially with one or more features that were significant in the model.
When many features are selected in the stepwise regression modelling
or when there are many substantial correlations between features, it may
be desirable to organize the network plots to be more easily
interpreted. The lmfeatGroup and
lmfeatGroupColour parameters can be used to group features
in space, and assign coloured circles to the groupings. For example, it
is possible to organize the layout of features in the network plot
according to regions of the mRNA sequence they are found in. However,
grouping is highly customizable and can be tailored to the dataset and
biological context.
In the example below, the same feature integration modelling as above will be performed adding grouping variables and colours to the network plot. Note that the total proportion of variance explained by the group of features is also displayed on the network plot for each category.
# Examine the input features
names(myFeatureList)
# Make a vector to assign groups to each feature according to categories:
group <- c(
"UTR5", "UTR3", "UTR5", "CDS", "UTR5", "UTR5", "UTR5", "CDS", "CDS",
rep("Pathway", 2), "UTR5", rep("Pathway", 2)
)
# Assign colours to each grouping category:
groupColour <- c("#834b62", "#6699cc", "#e9724c", "#fff299")
names(groupColour) <- c("UTR5", "CDS", "UTR3", "Pathway")
# Run feature integration modelling using forward stepwise regression
ptn <- featureIntegration(
ptn = ptn,
features = myFeatureList,
pdfName = "Example_grouped",
regOnly = TRUE,
allFeat = FALSE,
analysis_type = "lm",
comparisons = list(c(1, 2)),
lmfeatGroup = group,
lmfeatGroupColour = groupColour
)Network visualization of the results of forward stepwise regression modelling. Selected features are grouped according to which region of the mRNA they correspond to, as well as upstream regulatory pathways known to alter translation efficiency.
Finally, in addition to producing a PDF file, the results of the
network analysis are stored in the postNetData object as an
igraph object
(Csárdi and Nepusz 2006; Antonov et al. 2023;
Csárdi et al. 2025), and can be retrieved using the
ptn_networkGraph function. See the example code below:
In addition to forward stepwise regression, the
featureIntegration() function can also apply Breiman’s
Random Forest algorithm (as implemented in the randomForest
package (Liaw and Wiener 2002)) to
classify genes according to their regulation, and identify features
associated with regulatory classes. Unlike with forward stepwise
regression modelling, the relationships between features will not
necessarily be revealed using feature selection and Random Forest
classification, and network analysis cannot be implemented. However,
Random Forest classifiers can be constructed and trained in one dataset,
and then applied to predict post-transcriptional regulation in new
contexts using the selected features. Random Forest classification with
featureIntegration() is implemented in two steps described
below.
In the first step, genes are divided into training (70%) and validation (30%) sets. A pre-modelling step is carried out where the Breiman’s Random Forest algorithm (Liaw and Wiener 2002; Breiman 2019) is applied to perform supervised classification of genes into regulatory classes. Next, feature selection is performed using Boruta (Miron B. Kursa 2010). Boruta uses feature importance measurements from Random Forest classification to iteratively select features based on the comparison of importance attributes to randomly shuffled “shadow” attributes. This feature selection process finds all relevant features associated with the prediction, meaning features may be redundant with others already selected (see the Boruta Vignette for details). Although not directly comparable, this differs somewhat in principle from forward stepwise regression modelling, where features will only be selected if they independently explain a proportion of variance in the regulatory effect beyond what is explained by other features in the model. In this example, in agreement with forward stepwise regression modelling, 5’UTR GC content is found to be the most important feature predicting post-transcriptional regulation in this context.
Boxplots of Z-Scores of Importance from feature selection using Boruta. Features coloured in green indicate retained features. Shadow feature metrics (randomly shuffled attributes) are indicated in blue.
After performing feature selection, the Random Forest classification of genes into different regulatory classes is repeated using the refined set of retained features. The performance of the model is assessed using Receiver Operating Characteristic (ROC) curves (implemented with ROCR) (Sing et al. 2005). The ROC curve plots the rate of false positives against the rate of true positives and provides several metrics to evaluate the predictive power of the model. The area under the curve (AUC) describes the probability of ranking a randomly selected positive higher than a randomly selected negative. Also provided are the sensitivity (true positive rate), and the specificity (true negative rate). The feature importance from the final model is also reported as Mean Decreased Accuracy (see randomForest for details), where higher values indicate more important features.
Final modelling results using Random Forest Classification. Left panel: Feature importance (mean decrease accuracy) for selected features included in the final model. Right panel: Receiver Operating Characteristic (ROC) curve for Random Forest classification of translationally activated vs. suppressed genes.
The featureIntegration() function can also be used to
implement Random Forest classification. To do so, the function is run as
described above changing the input for the analysis_type
parameter to rf. In the example code below, a Random Forest
classification model will predict membership of regulated genes to
translationally activated or suppressed categories using the list of
input features defined above.
# Run feature integration modelling using Random Forest classification:
ptn <- featureIntegration(
ptn = ptn,
features = myFeatureList,
pdfName = "Example",
regOnly = TRUE,
allFeat = FALSE,
analysis_type = "rf",
comparisons = list(c(1, 2))
)The results of the modelling will be stored in the
postNetData object, and a series of PDFs will be produced.
The PDF with the suffix “featureImportance.pdf” will contain a summary
plot of feature importance from the feature selection step (Figure 19).
The PDF with the suffix “FinalModel” will contain a summary of the
feature importance for the selected features retained in the final
model, as well as the ROC curve evaluating the performance of the model
predictions (Figure 20). As for the forward stepwise regression
implementation, it is possible to retrieve the full results from each
step of modelling using the ptn_model() function, as well
as the set of features that were retained during feature selection using
ptn_selectedFeatures(). See the example code below:
# Check which models are available in the postNetData object:
ptn_check_models(ptn = ptn, analysis_type = "rf")
# Extract the set of selected features and their importance:
selectedFeaturesRF <- ptn_selectedFeatures(
ptn = ptn,
analysis_type = "rf",
comparison = 1
)
# Retrieve the final model results (other options include "preModel" or "borutaModel"):
finalModel <- ptn_model(ptn, analysis_type = "rf", model = "finalModel", comparison = 1)Both implementations of featureIntegration() output PDF
files with scatterplots for each individual feature identified in either
the final stepwise regression or Random Forest models. The Pearson’s
correlation coefficient between the feature and regulatory effect
(two-sided test) is displayed, along with the linear trend line and
cubic smoothing spline (allowing assessment of linearity). For binary
features, these metrics are not displayed.
Individual feature visualizations. Scatterplots of 5’UTR GC content (left), 5’UTR SCSCGS motif content (middle), and signature of genes translationally activated by mTOR (right) versus translation efficiency (effM). Pearson’s correlation coefficients, linear trend lines, and cubic smoothing splines indicated.
Feature-based models trained using featureIntegration()
can be used not only to interpret factors contributing to
post-transcriptional regulation within a single dataset, but also test
whether the same features generalize across biological contexts. When
featureIntegration() is implemented using Random Forest
classification (when analysis_type = "rf"), the final
trained model can be exported and applied to new gene sets using the
rfPred() function. This enables evaluation of whether
features that explain post-transcriptional regulation in one system can
accurately predict regulation in an independent dataset, condition, or
experimental model.
Running rfPred() requires a postNetData
object that already contains a trained Random Forest model. By providing
a single integer for the comparison argument, the user can
select between models trained for different comparisons
during featureIntegration() modelling. The trained model
will then be applied to a new list of genes provided with the
predGeneList parameter. The list must have the same
regulatory classes as the original dataset used to train the model
(e.g., translationUp, translationDown). For
the genes included in predGeneList, features must be
enumerated and compiled into a list, as described above for
featureIntegration(). This list is provided with the
predFeatures parameter, and must include exactly the same
features as those in the final Random Forest model. These can be easily
retrieved using the ptn_selectedFeatures() function as
described above. The output of the rfPred() function is a
PDF file containing the ROC curve evaluating how well the original
features can explain the regulatory effect in the new dataset.
In the example below, the Random Forest model trained in the example above will be applied to a new simulated dataset created by randomly sampling the original data. Note that in this example, due to the randomization, the model is expected to have poor predictive power.
# Simulate a new gene list by randomly sampling genes from the existing background:
newGenes <- sample(postNetVignette$background, size = 100)
newGeneList <- list(translationUp = newGenes[1:50], translationDown = newGenes[51:100])
# Select the features that were used to train the final Random Forest model in the original dataset:
predFeatureNames <- names(ptn_selectedFeatures(ptn,
analysis_type = "rf",
comparison = 1
))
# Prepare the predictive features. In this case the same list of input features will be used
# to predict as the new gene lists are taken from the same dataset the model was trained on.
# However, usually the input for the rfPred() function would be taken from a postNetData
# object from a separate analysis on a distinct dataset.
newFeatures <- myFeatureList[predFeatureNames]
# Evaluate the model in the new simulated dataset:
ptn <- rfPred(
ptn = ptn,
comparison = 1,
predGeneList = newGeneList,
predFeatures = newFeatures,
pdfName = "Example"
)Receiver Operating Characteristic (ROC) curve for the Random Forest classification model trained in the example above, applied to a new simulated dataset of randomized genes to predict translationally activated vs. suppressed genes.
After identifying putative regulatory features using
featureIntegration(), it is often helpful to explore how
these features co-occur across genes and how they relate to regulatory
effects. The plotFeaturesMap() function uses umap to generate
Uniform Manifold Approximation and Projection (UMAP) visualizations
based on user-selected features, allowing you to overlay regulatory
effects and visually inspect how individual features vary across the
same embedding. This provides an intuitive tool to identify transcripts
with shared or differing regulatory behaviour and feature
compositions.
When plotting UMAPs, each point will represent a gene. By default,
only regulated genes will be included (those defined by the
geneList or regulation parameters of the
postNetStart() function). However, similarly to
featureIntegration(), all genes can be included by setting
the regOnly parameter to FALSE. The UMAP
embedding is constructed from a user-defined set of features provided
using the featSel parameter, often starting with those that
were identified as informative during featureIntegration()
using either stepwise regression or Random Forest modelling. However,
additional features of interest can also be included. The regulatory
effect and feature values can then be overlaid on the embedding. Which
features are coloured can be controlled using the featCol
parameter. One PDF file with two panels will be generated for each
feature provided with featCol. The left panel will be
overlaid with the regulatory effect from the selected
comparisons, and the right panel will be overlaid with the
given feature.
UMAP is sensitive to input parameters and data pre-processing (Huang Haiyang 2022). Therefore, several options
are provided that can be applied to feature values prior to
dimensionality reduction. The remBinary parameter will
remove binary features (0 or 1) which will often have a strong influence
on gene clustering (default is TRUE). There are also
options to scale and/or center the feature data using the
scaled and centered parameters. Finally, the
remExtreme parameter can be used to filter outlier values
prior to generating the colour scales for features and effect
measurements overlaid on UMAP embeddings. The default settings are a
suggested starting point but may not be appropriate for all inputs, so
it is recommended to test the impact of different input parameters.
Users should exercise caution when interpreting spatial clustering of
genes using this approach. These visualizations are intended to be used
as a tool for data exploration and understanding relationships between
different features and regulatory effects, particularly in cases where
multiple features are co-occurring in the same mRNA.
In the example below, the set of features identified using forward
stepwise regression modelling will be used to generate a UMAP embedding.
PDF files will be generated with the regulatory effect and the values
for 5’UTR GC content overlaid. The resulting UMAP coordinates are stored
in the featuresMap slot of the postNetData
object.
# Plot UMAP visualizations using the features identified in stepwise regression modelling
ptn <- plotFeaturesMap(
ptn = ptn,
regOnly = TRUE,
comparisons = list(c(1, 2)),
featSel = c(names(ptn_selectedFeatures(ptn,
analysis_type = "lm",
comparison = 1
))),
featCol = c("UTR5_GC"),
remBinary = TRUE,
scaled = FALSE,
centered = TRUE,
remExtreme = 0.1,
pdfName = "Example"
)UMAP visualizing changes in translation efficiency (Effect) generated based on features identified in the omnibus model using forward stepwise regression (left), with the 5’UTR GC content of mRNAs coloured (right).
In addition to tools for identifying and enumerating sequence features of mRNA and modelling networks of post-transcriptional regulation, postNet can also be used to perform enrichment analyses including implementations of GSEA, GAGE, and GO term analysis to provide functional insights into gene sets of interest. It is also possible to examine the regulation of gene signatures in your dataset in a threshold-independent manner, providing insights into possible mechanisms explaining observed regulation and allowing comparisons between datasets.
When performing GSEA, GAGE, or GO term analysis using the output of
an anota2seq analysis, it is often necessary to filter the
input genes and log2 fold changes for the “translation” and “buffering”
regulatory modes prior to performing enrichment analyses. This is
because the slopes fitted by the anota2seq APV models can
sometimes have unrealistic values, or suggest unlikely translational
regulation, impacting the analysis of changes in translation or
translational buffering (or offsetting). Filtering out genes with these
unrealistic slopes is especially important for GSEA and GAGE analyses,
which rely on rankings. For analyses relying on hypergeometric tests,
such as GO term enrichment, the impact of filtering on the analysis is
likely to be more negligible. However, slope filtering is still
recommended. The slopeFilt function identifies these genes
with unrealistic slopes, allowing them to be removed from downstream
analyses. Note that in high-quality datasets few genes will require
slope filtering, but filtering thresholds can be adjusted with the
minSlope and maxSlope parameters.
# Get the genes to be filtered out of downstream enrichment analyses
# using the buffering regulatory mode:
filtOutGenes <- slopeFilt(ads,
regulationGen = "buffering",
contrastSel = 1
)The output of the slopeFilt() function can be directly
supplied to downstream enrichment analysis functions using the
genesSlopeFiltOut parameter.
PostNet performs Gene Set Enrichment Analysis (GSEA) (Mootha et al. 2003; Subramanian et al. 2005),
using the gseaAnalysis() function and the regulatory effect
measurement contained in a postNetData object. The analysis
implements the fgsea package (Korotkevich et al. 2016), and can be applied
using gene sets from The Molecular Signatures Database MSigDB. Gene
expression signatures to be considered in the analysis can be selected
from MSigDB using the collection and
subcollection parameters, and specific gene sets can be
specified using subsetNames. Gene sets can be further
refined using the maxSize and minSize
parameters to set upper and lower thresholds for the number of genes
included in each gene set.
The code below provides an example of how to run GSEA analysis on
genes ranked according to log2 fold changes in translation efficiency,
using specific gene set terms from the MSigDB. If you are using the
results of an anota2seq analysis, it would also be necessary to
perform slope filtering, and supply the
output using the genesSlopeFiltOut parameter in both the
gseaAnalysis() and gseaPlot() functions.
# If using a GSEA collection from MSigDB, check the available versions:
version <- msigdb::getMsigdbVersions()
# Retrieve the MSigDB data for "human":
msigdbOut <- msigdb::getMsigdb(
org = "hs",
id = "SYM",
version = version[1]
)
# Check the available collections or subcollections:
msigdb::listCollections(msigdbOut)
msigdb::listSubCollections(msigdbOut)
# Run GSEA on the C5 collection with GO:BP and specific terms:
ptn <- gseaAnalysis(
ptn = ptn,
collection = "c5",
subcollection = "GO:BP",
subsetNames = c(
"GOBP_CELL-CELL_SIGNALING_BY_WNT",
"GOBP_ENDOCYTOSIS"
),
name = "c5_Example"
)In addition to the collections in MSigDB, GSEA can also be run using
custom gene sets using the geneSet parameter.
# Create example custom gene sets for GSEA:
set1 <- sample(myGenes[[1]], 100)
set2 <- sample(myGenes[[2]], 100)
inSet <- list(Set1 = set1, Set2 = set2)
# Run GSEA on custom gene sets:
ptn <- gseaAnalysis(
ptn = ptn,
geneSet = inSet,
name = "Example"
)GSEA results can be extracted from the postNetData
object and plotted using the ptn_GSEA() and
gseaPlot functions.
# Extract the significant enrichment results from the postNetData object:
gseaOut <- ptn_GSEA(ptn,
threshold = 0.05
)
# Plot GSEA results:
gseaPlot(
ptn = ptn,
termNames = gseaOut$Term[1]
)A plot visualizing GSEA results for the enrichment of the gene Set1 in the example dataset.
Generally Applicable Gene-set Enrichment (GAGE) analysis can also be
performed using the regulatory effect measurement contained in a
postNetData object. The gageAnalysis()
function implements the gage package (Luo et al. 2009), and can be used with the
Biological Process “BP”, Molecular Function “MF”, or Cellular Component
“CC” Gene Ontology categories, as well as with ‘KEGG’ pathways. These
are specified using the category parameter. Similarly to
GSEA, terms/pathways can be filtered using the maxSize and
minSize parameters to set upper and lower thresholds for
the number of genes included in each term or pathway.
The results of the analysis can be retrieved using the
ptn_GAGE function. As GAGE uses a two-directional test, the
directionality of the results (“greater”, “less”) can be specified with
the direction parameter.
The example code below performs GAGE analysis using the “MF” GO term
category, and extracts terms that are significantly associated with
increased log2 fold changes in translation efficiency. Note that if you
are using the results of an anota2seq analysis, it would also
be necessary to perform slope filtering,
and supply the output using the genesSlopeFiltOut parameter
in the gageAnalysis() function.
Within postNet, GAGE can also be applied to identify enrichments in
microRNA (miRNA) targets in regulated gene sets of interest. This is
done using the miRNAanalysis() function which implements
the gage package (Luo et al. 2009) and uses miRNA target
information from the TargetScan
database. Note that this approach does not identify or predict miRNA
binding sites in mRNA sequences, but rather assesses whether gene sets
of interest may be regulated by particular miRNAs.
To perform the analysis, it is necessary to download a miRNA target
prediction file from the TargetScan
database. This file must contain the headings ‘Cumulative weighted
context++ score’, ‘Aggregate PCT’, ‘Gene Symbol’, and ‘Representative
miRNA’. Importantly, predictions for multiple species are contained
within the file, so it is essential to subset to retain information for
only the desired species prior to running the analysis. Once downloaded
and subset, this file is supplied using the
miRNATargetScanFile parameter.
Prior to running GAGE analysis, miRNA-mRNA targeting predictions
should be filtered to retain high-confidence predictions, or adjusted
depending on the biological question. This filtering is performed by
specifying thresholds in the cumulative weighted context++ score (CWCS)
using the contextScore parameter, and/or the Aggregate PCT
using the Pct parameter. The CWCS provides a measure of the
efficacy of predicted miRNA targeting, where larger negative values
indicate more repression in response to a miRNA (Agarwal et al. 2015; Grimson et al. 2007). This
metric is useful as it captures all types of interactions and can be
used in cases where miRNAs are not highly conserved across species. The
Aggregate PCT is a measurement of how well conserved the predicted
miRNA-mRNA targeting is across species [Agarwal et al., 2015, Friedman
et al., 2009]. Values closer to 1 indicate a greater probability of
evolutionary conservation due to maintenance of miRNA targeting (i.e.,
biological function). Depending on the biological question, filtering
can be performed on one or both of these metrics (see the TargetScan FAQ for further details
regarding target prediction).
After the appropriate filtering has been applied, GAGE is implemented
to identify enrichments in miRNAs predicted to target genes in gene sets
of interest based on rankings (genes are ranked by the regulatory effect
measurement). Similarly to GSEA and GAGE analyses, sets of miRNA target
predictions can be filtered using the maxSize and
minSize parameters to set upper and lower thresholds for
the number of genes included.
The example code below performs GAGE miRNA target enrichment analysis
using human mRNA-miRNA target predictions. Target predictions are
filtered according to both the CWCS and the Aggregate PCT in this
example, but these may vary depending on what the user is interested in.
Note that if you are using the results of an anota2seq
analysis, it would also be necessary to perform slope filtering, and supply the output using
the genesSlopeFiltOut parameter in the
miRNAanalysis() function.
# An example of the required format for the input for miRNATargetScanFile
# parameter ("miRNA_predictions_hsa.txt"):
# Transcript.ID Gene.Symbol miRNA.family Species.ID Total.num.conserved.sites
# Number.of.conserved.8mer.sites Number.of.conserved.7mer.m8.sites
# Number.of.conserved.7mer.1a.sites Total.num.nonconserved.sites
# Number.of.nonconserved.8mer.sites Number.of.nonconserved.7mer.m8.sites
# Number.of.nonconserved.7mer.1a.sites #Number.of.6mer.sites Representative.miRNA
# Total.context...score Cumulative.weighted.context...score # Aggregate.PCT
# Predicted.occupancy...low.miRNA Predicted.occupancy...high.miRNA
# Predicted.occupancy...transfected.miRNA
# ENST00000055682.6 KIAA2022 UGGCACU 9606 3 0 0 3 0 0 0 0 0
# hsa-miR-183-5p.2 -0.604 -0.604 1 0.0807 0.5089 1.8669
# ENST00000207157.3 TBX15 UGGCACU 9606 1 1 0 0 0 0 0 0 1
# hsa-miR-183-5p.2 -0.589 -0.552 1 0.1287 0.5218 0.8900
# ENST00000215832.6 MAPK1 ACAGUAC 9606 2 0 1 1 3 0 2 1 3
# hsa-miR-101-3p.1 -0.509 -0.279 1 0.0244 0.1671 0.8723
# Note that this target prediction file has been filtered to include only human miRNA ('hsa')
# Run GAGE miRNA target enrichment analysis:
ptn <- miRNAanalysis(ptn,
genesSlopeFiltOut = filtOutGenes,
miRNATargetScanFile = "miRNA_predictions_hsa.txt",
contextScore = -0.2,
Pct = 0.7
)After running the analysis, the results can be retrieved using either
the ptn_miRNA_analysis, or ptn_miRNA_to_gene
functions.
The ptn_miRNA_analysis function will return the results
of the GAGE analysis. When interpreting the results, note that
enrichments in miRNA predicted to target genes that are upregulated
(e.g., if the regulatory effect measurement is log2 fold change) that
appear in the results table labelled “greater” can be interpreted as
those miRNA that may be downregulated or otherwise not active in the
experimental condition. Likewise, enrichments in miRNA predicted to
target genes that are downregulated that appear in the results table
labelled “less” can be interpreted as those miRNA that may be
upregulated or active in the experimental condition.
The ptn_miRNA_to_gene function will return the lists of
genes that are predicted to be targeted by the miRNA passing the
filtering threshold set by contextScore and
Pct parameters. These may be desirable if you plan to
include specific miRNA in modelling analyses using the
featureIntegration() function. These gene lists can be
converted to signatures using the signCalc() function and
be included as features in the models.
# Extract the significant miRNA target enrichment results from the postNetData object:
miRNAout <- ptn_miRNA_analysis(
ptn = ptn,
direction = "less",
threshold = 0.05
)
# Extract the lists of genes for miRNA that were found to have targets significantly
# enriched in the downregulated gene set of interest:
miRNAtargets <- ptn_miRNA_to_gene(
ptn = ptn,
miRNAs = c("hsa-miR-138-5p", "hsa-miR-182-5p")
)Gene Ontology (GO) term enrichment analysis can also be performed
using the gene lists of interest in a postNetData object.
The goAnalysis() function implements clusterProfiler
(Xu et al. 2024), and similarly to GAGE
analysis can be used with the Biological Process “BP”, Molecular
Function “MF”, or Cellular Component “CC” Gene Ontology categories, and
with ‘KEGG’ pathways specified using the category
parameter, with optional filtering using the maxSize and
minSize parameters to set upper and lower thresholds for
the number of genes included. In addition, enriched terms can be
filtered according to FDR threshold, and for the number of genes in the
gene sets of interest that are included in the term using the
FDR and counts parameters, respectively.
The example code below performs GO term enrichment analysis using the
“BP” GO term category, and “KEGG” pathways. Note that if you are using
the results of an anota2seq analysis, it would also be
necessary to perform slope filtering, and
supply the output using the genesSlopeFiltOut parameter in
the goAnalysis() function.
# Run GO term analysis using a postNetData object:
ptn <- goAnalysis(
ptn = ptn,
category = c("BP"), # Note that multiple terms can be run simultaneously
name = "Example"
)The results of the analysis can be retrieved using the
ptn_GO function and visualized using the
goDotplot() function. The pool parameter of
the goDotplot function controls if separate plots will be
generated for each gene list of interest, or whether results for all
gene lists will be pooled into a single plot. It is also possible to
specify which terms will be plotted using the termSel
parameter. Finally, the metric used to determine the size of the dot for
each enriched term can be controlled using the size
parameter to reflect either the number of genes that were included in
the enriched term, or the ratio of genes included to total number of
genes in the term.
# Extract the significant enrichment results for Biological Process:
goOut <- ptn_GO(ptn,
category = "BP",
geneList = "translationUp",
threshold = 0.05
)
# Plot GO term analysis results:
goDotplot(
ptn = ptn,
category = "BP",
pool = TRUE,
size = "geneRatio",
pdfName = "Example"
)A dot plot visualizing GO term enrichment analysis results. Dot size corresponds to the ratio of genes included, to the total number of genes in the term.
In addition to the various forms of gene set enrichment analysis described above, postNet also provides functionality for assessing the regulation of gene signatures using empirical cumulative distribution functions (eCDFs). This approach allows visualizations of changes in the regulatory effect measurement (often log2 fold changes). ECDFs for genes belonging to the gene signatures of interest are compared to those for all other genes (background). Differences in the regulatory effect measurement distributions are calculated at the quantiles, and significant directional shifts in the distributions for gene signature versus background are identified using a Wilcoxon rank-sum test.
Gene signature analysis in postNet can be implemented in two ways depending on whether you are using custom gene lists and regulatory effect measurements, or if your analysis is based on the output of anota2seq. These are described below.
Using both approaches, gene signatures are supplied as a list using
the signatureList parameter. It is possible to assess
multiple signatures at the same time, and it is important to note that
the background gene set is automatically determined based on the gene
signatures provided. When there is no overlap between genes in the
signatures, the background set is determined to be all genes not
included in any signature (i.e., mutually exclusive signatures will be
compared to the same background gene set). However, when there is
overlap between the genes in the signatures provided, the background set
is determined separately for each signature (i.e., overlapping
signatures will each be compared to separate backgrounds). For example,
you may be interested in examining how the sets of upregulated and
downregulated genes taken from a particular study or condition behave in
your dataset. In this case, the gene signatures will be mutually
exclusive as a gene cannot be both up and downregulated, so the
background is created such that the comparison will be between
“regulated” vs. “unregulated” genes. In another example, you may wish to
compare two signatures of mTOR-activated genes taken from different
studies. As the signatures come from the same condition they will likely
overlap considerably, and therefore a separate set of background genes
will be created for each signature. Consideration should be given to
what is the appropriate background gene set when deciding whether gene
signatures should be compared together, or individually.
Several gene signatures relevant to post-transcriptional regulation are included with the package for both human and mouse.
When starting from custom gene lists and regulatory effect
measurement the regulation of gene signatures can be assessed using the
plotSignatures function. All signatures supplied to the
signatureList parameter will be included on the same plot,
where the background gene set is indicated in grey.
The example code below will load the set of signatures included with the package, and assess the regulation of the signatures for genes whose translation is either activated or suppressed by mTOR signalling.
# load human signature data:
data("humanSignatures")
# Examine the regulation of mTOR-sensitive transcripts
plotSignatures(
ptn = ptn,
signatureList = humanSignatures[c(
"Gandin_etal_2016_mTOR_transUp",
"Gandin_etal_2016_mTOR_transDown"
)],
signature_colours = c("red", "blue"),
dataName = "Example",
generalName = "mTOR_sensitive_translation",
xlim = c(-3, 3),
tableCex = 0.7,
pdfName = "mTORsignatures"
)eCDF of log2 fold changes in translation efficiency for signatures of mTOR-sensitive translation. Grey curve indicates the background gene set. Wilcoxon p-values and differences between the distributions at each quantile are indicated. A shift to the right indicates an increase compared to background, while shift to the left indicates a decrease.
When starting from the results of an anota2seq analysis,
gene signatures can be assessed using the
plotSignatures_ads function. The key difference from the
custom gene list workflow is that instead of providing the
postNetData object, the anota2seqDataSet
object (the ads parameter) must be provided instead. Four
eCDFs will be generated examining the distribution of log2 fold changes
in total mRNA, polysome-associated mRNA (or ribosome-protected
fragments; RPFs), translation, and buffering (also known as offsetting)
for each gene signature provided. In addition to eCDFs, a scatter plot
of total mRNA versus polysome-associated mRNA will also be generated,
with the genes corresponding to the signatures provided coloured.
The example code below will load the set of signatures included with the package, and assess the regulation of the signatures for genes whose translation is either activated or suppressed by activation of the integrated stress response with thapsigargin. Note that a very minimal example dataset was used to illustrate the anota2seq workflow, so very few genes are present in the outputs.
# Examine the regulation of ISR-sensitive transcripts in the results of an anota2seq analysis:
plotSignatures_ads(
ads = ads,
contrast = 1,
dataName = "Osmosis_1h",
signatureList = humanSignatures[c(
"Guan_etal_2017_Tg1_transUp",
"Guan_etal_2017_Tg1_transDown"
)],
generalName = c("ISR_activated", "ISR_suppressed"),
signature_colours = c("red", "blue"),
xlim = c(-3, 3),
scatterXY = 4,
tableCex = 1,
pdfName = "Example_ISRsignatures"
)Gene signature analysis starting from an anota2seqDataSet. Scatterplot of total vs. translated mRNA from an example anota2seq analysis with the location of genes belonging to signatures of interest coloured (left panel). Subsequent panels show the eCDFs of log2 fold changes in translation and buffering regulatory modes, as well as for translated and total mRNA for the signatures of interest. Grey curves indicate the background gene set. Wilcoxon p-values and differences between the distributions at each quantile are indicated. A shift to the right indicates an increase compared to background, while shift to the left indicates a decrease.
Finally, it is also possible to generate a heatmap comparing the
regulation of multiple gene signatures in your dataset of interest using
the signaturesHeatmap function. The values displayed in the
heatmap are specified using the unit parameter, and can
either be based on significance or the magnitude of the regulation for
each given gene signature. For significance, the values displayed in the
heatmap are obtained from a two-sided Wilcoxon Rank Sum test comparing
the regulatory effect measurement values for the gene signature of
interest against the background. P-values are corrected for multiple
testing using the Benjamini & Hochberg method, and the -log10(FDR)
value is multiplied by either 1 or -1 corresponding to up- or
down-regulation of the gene signature compared to background,
respectively. For magnitude, the values displayed in the heatmap are
obtained by taking the difference between the eCDFs of the regulatory
effect measurements for the gene signature and the background, at the
percentile specified.
The example code below will examine the regulation of all gene signatures included with the package, based on the significance and directionality of the regulation.
# Examine the regulation of all gene signatures:
signaturesHeatmap(ptn,
signatureList = humanSignatures,
unit = "FDR",
pdfName = "ExampleSignatures"
)A heatmap of -log10(FDRs) for regulation of various gene signatures in the example dataset. Colours indicate the directionality of regulation.
If you use postNet in your work, please cite:
Watt, K., Dauber, B., Szkop, K.J. et al. Epigenetic alterations facilitate transcriptional and translational programs in hypoxia. Nat Cell Biol 27, 1965–1981 (2025). DOI: 10.1038/s41556-025-01786-8.
Additional citations:
If you use the motifAnalysis() function in your work,
please also cite:
Bailey, T. L., Johnson, J., Grant, C. E., & Noble, W. S. (2015). The MEME Suite. Nucleic acids research, 43(W1), W39–W49. DOI: 10.1093/nar/gkv416.
Bailey T. L. (2021). STREME: accurate and versatile sequence motif discovery. Bioinformatics (Oxford, England), 37(18), 2834–2840. DOI: 10.1093/bioinformatics/btab203.
Nystrom SL, McKay DJ (2021) Memes: A motif analysis environment in R using tools from the MEME Suite. PLoS Comput Biol 17(9): e1008991. DOI: 10.1371/journal.pcbi.1008991.
If you use the G4 option with the
contentMotifs() function in your work, please also
cite:
Hon, J., Martínek, T., Zendulka, J., & Lexa, M. (2017). pqsfinder: an exhaustive and imperfection-tolerant search tool for potential quadruplex-forming sequences in R. Bioinformatics (Oxford, England), 33(21), 3373–3379. DOI: 10.1093/bioinformatics/btx413.
If you use the Random Forest implementation of the
featureIntegration() function in your work, please
cite:
Kursa, Miron B., and Witold R. Rudnicki. 2010. “Feature Selection with the Boruta Package.” Journal of Statistical Software 36 (11): 1–13. https://doi.org/10.18637/jss.v036.i11.
Sing, T., Sander, O., Beerenwinkel, N., & Lengauer, T. (2005). ROCR: visualizing classifier performance in R. Bioinformatics (Oxford, England), 21(20), 3940–3941. https://doi.org/10.1093/bioinformatics/bti623.
If you use the gseaAnalysis() function in your work,
please also cite:
Korotkevich G, Sukhov V, Sergushichev A (2019). “Fast gene set enrichment analysis.” bioRxiv. doi:10.1101/060012, http://biorxiv.org/content/early/2016/06/20/060012.
If you use the gageAnalysis() function in your work,
please also cite:
Luo, Weijun, Friedman, Michael, Shedden, Kerby, Hankenson, Kurt, Woolf, Peter (2009). “GAGE: generally applicable gene set enrichment for pathway analysis.” BMC Bioinformatics, 10, 161.
If you use the goAnalysis() function in your work,
please also cite:
Xu S, Hu E, Cai Y, Xie Z, Luo X, Zhan L, Tang W, Wang Q, Liu B, Wang R, Xie W, Wu T, Xie L, Yu G (2024). “Using clusterProfiler to characterize multiomics data.” Nature Protocols, 19(11), 3292-3320. doi:10.1038/s41596-024-01020-z. # Session info
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] postNet_0.99.9 data.table_1.18.2.1 dplyr_1.2.1
## [4] BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 DBI_1.3.0
## [3] gridExtra_2.3 httr2_1.2.2
## [5] rlang_1.2.0 magrittr_2.0.5
## [7] ade4_1.7-24 vioplot_0.5.1
## [9] matrixStats_1.5.0 compiler_4.5.3
## [11] RSQLite_2.4.6 png_0.1-9
## [13] vctrs_0.7.3 reshape2_1.4.5
## [15] stringr_1.6.0 pkgconfig_2.0.3
## [17] crayon_1.5.3 fastmap_1.2.0
## [19] dbplyr_2.5.2 XVector_0.51.0
## [21] labeling_0.4.3 caTools_1.18.3
## [23] rmarkdown_2.31 purrr_1.2.2
## [25] bit_4.6.0 xfun_0.57
## [27] cachem_1.1.0 seqinr_4.2-36
## [29] jsonlite_2.0.0 blob_1.3.0
## [31] DelayedArray_0.37.1 BiocParallel_1.45.0
## [33] parallel_4.5.3 R6_2.6.1
## [35] bslib_0.10.0 stringi_1.8.7
## [37] RColorBrewer_1.1-3 limma_3.67.3
## [39] reticulate_1.46.0 GenomicRanges_1.63.2
## [41] jquerylib_0.1.4 Rcpp_1.1.1-1
## [43] Seqinfo_1.1.0 SummarizedExperiment_1.41.1
## [45] knitr_1.51 zoo_1.8-15
## [47] IRanges_2.45.0 Matrix_1.7-5
## [49] splines_4.5.3 igraph_2.3.0
## [51] tidyselect_1.2.1 qvalue_2.43.0
## [53] abind_1.4-8 yaml_2.3.12
## [55] gplots_3.3.0 codetools_0.2-20
## [57] curl_7.1.0 lattice_0.22-9
## [59] tibble_3.3.1 plyr_1.8.9
## [61] Biobase_2.71.0 withr_3.0.2
## [63] S7_0.2.2 askpass_1.2.1
## [65] evaluate_1.0.5 survival_3.8-6
## [67] BiocFileCache_3.1.0 Biostrings_2.79.5
## [69] pillar_1.11.1 BiocManager_1.30.27
## [71] filelock_1.0.3 MatrixGenerics_1.23.0
## [73] KernSmooth_2.23-26 stats4_4.5.3
## [75] sm_2.2-6.0 generics_0.1.4
## [77] S4Vectors_0.49.2 ggplot2_4.0.3
## [79] scales_1.4.0 gtools_3.9.5
## [81] anota2seq_1.33.0 glue_1.8.1
## [83] maketools_1.3.2 tools_4.5.3
## [85] sys_3.4.3 RSpectra_0.16-2
## [87] fgsea_1.37.4 locfit_1.5-9.12
## [89] buildtools_1.0.0 fastmatch_1.1-8
## [91] cowplot_1.2.0 grid_4.5.3
## [93] plotrix_3.8-14 umap_0.2.10.0
## [95] edgeR_4.9.9 cli_3.6.6
## [97] rappdirs_0.3.4 S4Arrays_1.11.1
## [99] gtable_0.3.6 DESeq2_1.51.7
## [101] sass_0.4.10 digest_0.6.39
## [103] BiocGenerics_0.57.1 SparseArray_1.11.13
## [105] farver_2.1.2 memoise_2.0.1
## [107] htmltools_0.5.9 multtest_2.67.0
## [109] lifecycle_1.0.5 statmod_1.5.1
## [111] openssl_2.4.0 bit64_4.8.0
## [113] MASS_7.3-65