dinoR is available on GitHub and can be installed by running:
Then we can load dinoR and other necessary packages:
suppressPackageStartupMessages({
library(dinoR)
library(ggplot2)
library(dplyr)
library(SummarizedExperiment)
})
We use biscuit to map 300 bp paired-end reads to the genome, UMI-tools to remove duplicated UMIs, and the fetchNOMe package to get the protection from GCH methylation calls for each read pair (fragment) overlapping a region of interest (ROI). The ROIs provided to fetchNOMe should all be centered on a transcription factor motif, with the strand of the motif indicated. This ensures that the genomic positions around the motif are sorted according to motif strand, which allows the user to observe potential asymmetries in protection from methylation relative to the TF motif. Note that we use protection from methylation calls (0 = methylated, 1 = not methylated). We then use the R package NOMeConverteR to convert the resulting tibble into a RangedSummarizedExperiment object, which is an efficient way of sharing NOMe-seq data.
NomeData <- readRDS(system.file("extdata", "NOMeSeqData.rds", package = "dinoR"))
NomeData
#> class: RangedSummarizedExperiment
#> dim: 219 4
#> metadata(0):
#> assays(5): nFragsFetched nFragsNonUnique nFragsBisFailed nFragsAnalyzed
#> reads
#> rownames(219): Adnp_chr8_47978653_47979275
#> Adnp_chr6_119394879_119395501 ... Rest_chr4_140283342_140283964
#> Rest_chr7_64704080_64704702
#> rowData names(1): motif
#> colnames(4): AdnpKO_1 AdnpKO_2 WT_1 WT_2
#> colData names(2): samples group
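The strand-aware ordering of positions described above can be illustrated with a small base R sketch. The helper `relative_position()` is hypothetical and not part of dinoR or fetchNOMe; it only shows the idea that positions around a minus-strand motif are mirrored, so that "upstream" always means upstream of the motif.

```r
# Hypothetical helper (not a dinoR function): position of each base
# relative to the motif center, oriented by motif strand.
relative_position <- function(pos, center, strand) {
  offset <- pos - center
  # For minus-strand motifs the coordinate axis is mirrored.
  if (strand == "-") -offset else offset
}

relative_position(c(100, 105, 110), center = 105, strand = "+")
relative_position(c(100, 105, 110), center = 105, strand = "-")
```

With this orientation, a footprint that sits on the same side of the motif in every ROI lines up at the same relative positions, regardless of the genomic strand of each motif.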
The reads assay contains GPos objects with the GCH methylation data in two sparse logical matrices, one for protection from methylation and one for methylation.
assays(NomeData)[["reads"]][1,1]
#> [[1]]
#> UnstitchedGPos object with 623 positions and 2 metadata columns:
#> seqnames pos strand | protection methylation
#> <Rle> <integer> <Rle> | <lgCMatrix> <lgCMatrix>
#> [1] chr8 47978653 + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#> [2] chr8 47978654 + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#> [3] chr8 47978655 + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#> [4] chr8 47978656 + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#> [5] chr8 47978657 + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#> ... ... ... ... . ... ...
#> [619] chr8 47979271 + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#> [620] chr8 47979272 + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#> [621] chr8 47979273 + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#> [622] chr8 47979274 + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#> [623] chr8 47979275 + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#> -------
#> seqinfo: 53 sequences from an unspecified genome; no seqlengths
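One reason to store two logical matrices rather than one is that a FALSE in the protection matrix alone cannot distinguish "methylated" from "no data". The toy sketch below uses made-up calls and assumes that rows are positions and columns are fragments, as the printed object suggests. It shows how the percentage of protected fragments per position could be derived from such a pair of matrices. It is an illustration, not dinoR's internal code.

```r
library(Matrix)

# Toy calls: 4 positions (rows) x 3 fragments (columns).
# 1 = protected, 0 = methylated, NA = no call at that position.
calls <- matrix(c( 1,  0, NA,
                   1,  1,  0,
                  NA,  0,  0,
                  NA, NA, NA),
                nrow = 4, byrow = TRUE)

# Two sparse logical matrices, analogous to the protection and
# methylation metadata columns shown above.
protection  <- Matrix(!is.na(calls) & calls == 1, sparse = TRUE)
methylation <- Matrix(!is.na(calls) & calls == 0, sparse = TRUE)

# Number of fragments with any call at each position.
covered <- rowSums(protection) + rowSums(methylation)

# Percent protection per position; NA where no fragment has a call.
percent_protected <- ifelse(covered > 0,
                            100 * rowSums(protection) / covered,
                            NA)
percent_protected
```

In this toy example the first position is protected in one of two covered fragments (50%), while the last position has no calls at all and therefore yields NA rather than 0.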
We generate metaplots, grouping our ROIs into those that have Rest, Ctcf, or Adnp bound to the motifs in their center. We use two samples from WT mouse ES cells and two samples from Adnp KO mouse ES cells. We exclude any ROI-sample combinations that contain fewer than 10 reads (nr=10).
avePlotData <- metaPlots(NomeData=NomeData,nr=10,ROIgroup = "motif")
#plot average plots
ggplot(avePlotData, aes(x=position,y=protection)) + geom_point(alpha=0.5) +
geom_line(aes(x=position,y=loess),col="darkblue",lwd=2) +
theme_classic() + facet_grid(rows = vars(type),cols= vars(sample), scales = "free") +
ylim(c(0,100)) + geom_hline(yintercept = c(10,20,30,40,50,60,70,80,90),
alpha=0.5,color="grey",linetype="dashed")
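To make the effect of the nr cutoff concrete, here is a minimal sketch on made-up fragment counts. The matrix `n_frags` and its values are hypothetical, and the assumption that a combination with exactly nr reads is kept follows from the "fewer than 10 reads" wording above rather than from the metaPlots source.

```r
# Hypothetical fragment counts: 3 ROIs (rows) x 4 samples (columns).
n_frags <- matrix(c(25, 31,  8, 40,
                    12,  9, 15, 11,
                     3,  5,  2,  7),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("roi1", "roi2", "roi3"),
                                  c("AdnpKO_1", "AdnpKO_2", "WT_1", "WT_2")))

nr <- 10

# TRUE where the ROI-sample combination has at least nr reads (assumed inclusive).
keep <- n_frags >= nr
keep

# Number of ROIs that would contribute to each sample's metaplot.
colSums(keep)
```

Note that the filter acts on ROI-sample combinations, so the same ROI can contribute to one sample's average and be excluded from another's.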
We can already see that while the NOMe footprints around Rest and Ctcf bound motifs don’t change, there are clear differences between WT and Adnp KO cells around the Adnp bound motifs.
To quantify the differences visible in the metaplots above, we adopted and slightly modified the approach of Sönmezer et al., 2021. We classify each fragment into one of five footprint types: transcription factor bound (TF), open chromatin (open), and three nucleosome types, namely upstream positioned nucleosome (upNuc), downstream positioned nucleosome (downNuc), and all other nucleosome footprints (Nuc). To do this we use three windows (-50:-25, -8:8, 25:50) around the motif center (which should correspond to the center of the provided ROIs). Then we count the number of fragments in each sample-ROI combination supporting each footprint category.
NOMe patterns
NomeData <- footprintCalc(NomeData)
NomeData <- footprintQuant(NomeData)
NomeData
#> class: RangedSummarizedExperiment
#> dim: 219 4
#> metadata(0):
#> assays(12): nFragsFetched nFragsNonUnique ... downNuc all
#> rownames(219): Adnp_chr8_47978653_47979275
#> Adnp_chr6_119394879_119395501 ... Rest_chr4_140283342_140283964
#> Rest_chr7_64704080_64704702
#> rowData names(1): motif
#> colnames(4): AdnpKO_1 AdnpKO_2 WT_1 WT_2
#> colData names(2): samples group
Note that if a fragment does not have methylation protection data in all three windows needed for classification, the fragment will not be used.
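The three-window classification can be sketched in base R as follows. This is an illustration under stated assumptions, not the code of footprintCalc: the function `classify_fragment()`, the 0.5 threshold on mean protection per window, and the mapping of window patterns to footprint types are all assumptions made for this example.

```r
# Hypothetical classifier for a single fragment (not a dinoR function).
# protection: numeric vector of calls (1 = protected, 0 = methylated, NA = no call)
# positions:  positions relative to the motif center, same length as protection
classify_fragment <- function(protection, positions, threshold = 0.5) {
  windows <- list(up = c(-50, -25), center = c(-8, 8), down = c(25, 50))

  # Mean protection per window; NA if the window has no calls at all.
  win_mean <- vapply(windows, function(w) {
    vals <- protection[positions >= w[1] & positions <= w[2]]
    if (all(is.na(vals))) NA_real_ else mean(vals, na.rm = TRUE)
  }, numeric(1))

  # Fragments lacking data in any of the three windows are not classified.
  if (anyNA(win_mean)) return(NA_character_)

  # Binary pattern in the order upstream, center, downstream.
  pattern <- paste(as.integer(win_mean >= threshold), collapse = "")

  # Assumed mapping of patterns to footprint types.
  switch(pattern,
         "000" = "open",
         "010" = "TF",
         "100" = , "110" = "upNuc",
         "001" = , "011" = "downNuc",
         "Nuc")  # remaining patterns: 111 and 101
}

positions <- -60:60

# Fragment with accessible flanks and a protected motif.
tf_like <- rep(NA_real_, length(positions))
tf_like[positions %in% c(-40, -30)] <- 0
tf_like[positions %in% c(-5, 0, 4)] <- 1
tf_like[positions %in% c(30, 45)] <- 0
classify_fragment(tf_like, positions)

# Same fragment without any calls in the downstream window.
no_downstream <- tf_like
no_downstream[positions > 8] <- NA
classify_fragment(no_downstream, positions)
```

Under these assumptions the first fragment is classified as TF, and the second returns NA because it has no data in the downstream window, mirroring the note above that such fragments are not used.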
Next we can test for differential abundance of footprints between Adnp KO and WT samples.
We use edgeR to test, for each footprint type, whether its fragment count relative to the total fragment count differs between wild type and Adnp KO samples. Library sizes for TMM normalization are calculated on the total fragment counts.
res <- diNOMeTest(NomeData,WTsamples = c("WT_1","WT_2"),
KOsamples = c("AdnpKO_1","AdnpKO_2"))
res
#> # A tibble: 1,040 × 10
#> logFC logCPM F PValue FDR contrasts ROI motif logadjPval regulated
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 1.69 10.9 15.8 2.10e-4 0.0436 open_vs_… Adnp… Adnp 1.36 up
#> 2 1.53 10.4 10.2 2.35e-3 0.242 open_vs_… Adnp… Adnp 0.617 no
#> 3 1.22 9.34 9.12 3.75e-3 0.242 open_vs_… Adnp… Adnp 0.617 no
#> 4 1.05 10.6 8.73 4.65e-3 0.242 open_vs_… Adnp… Adnp 0.617 no
#> 5 -1.46 10.5 7.71 7.44e-3 0.310 open_vs_… Ctcf… Adnp 0.509 no
#> 6 1.13 10.2 7.28 9.13e-3 0.317 open_vs_… Adnp… Adnp 0.500 no
#> 7 0.757 10.5 5.01 2.91e-2 0.666 open_vs_… Adnp… Adnp 0.177 no
#> 8 1.00 10.8 4.87 3.16e-2 0.666 open_vs_… Adnp… Adnp 0.177 no
#> 9 -1.00 9.62 4.54 3.76e-2 0.666 open_vs_… Rest… Adnp 0.177 no
#> 10 -1.12 10.9 4.50 3.82e-2 0.666 open_vs_… Ctcf… Adnp 0.177 no
#> # ℹ 1,030 more rows
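The logadjPval and regulated columns can be related to FDR and logFC with a few lines of base R. The values below are copied (rounded) from rows 1, 2, and 5 of the table above. The cutoff `fdr_cut` and the "down" label are assumptions for illustration. The thresholds that diNOMeTest actually applies are not stated here and may include an additional fold change cutoff.

```r
# Rows 1, 2 and 5 of the printed result table (rounded values).
toy <- data.frame(logFC = c(1.69, 1.53, -1.46),
                  FDR   = c(0.0436, 0.242, 0.310))

# logadjPval in the table is consistent with -log10 of the FDR.
toy$logadjPval <- -log10(toy$FDR)

# Assumed significance cutoff, for illustration only.
fdr_cut <- 0.05
toy$regulated <- ifelse(toy$FDR < fdr_cut & toy$logFC > 0, "up",
                 ifelse(toy$FDR < fdr_cut & toy$logFC < 0, "down", "no"))
toy
```

Small differences in the last digit compared to the table (for example 0.616 versus 0.617) come from using the rounded FDR values as input.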
We can then simply plot the number of regulated ROIs within each ROI type…
res %>% group_by(contrasts,motif,regulated) %>% summarize(n=n()) %>%
ggplot(aes(x=motif,y=n,fill=regulated)) + geom_bar(stat="identity") +
facet_grid(~contrasts) + theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_fill_manual(values=c("orange","grey","blue3"))
#> `summarise()` has grouped output by 'contrasts', 'motif'. You can override
#> using the `.groups` argument.
…or display the results in MA plots.
ggplot(res,aes(y=logFC,x=logCPM,col=regulated)) + geom_point() +
facet_grid(~contrasts) + theme_bw() +
scale_color_manual(values=c("orange","grey","blue3"))
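# calculate the percentage of fragments in each footprint category
# (needed by compareFootprints)
footprint_percentages <- footprintPerc(NomeData)
compareFootprints(footprint_percentages,res,WTsamples = c("WT_1","WT_2"),
KOsamples = c("AdnpKO_1","AdnpKO_2"),plotcols = c("#f03b20", "#a8ddb5", "#bdbdbd"))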
We can see that in Adnp KO samples, transcription factor footprints significantly increase around Adnp motifs, while nucleosome footprints decrease.
In case we are not interested in the upstream and downstream nucleosome patterns, but would rather keep all nucleosome pattern fragments within the nucleosome group, we can do that using the option combineNucCounts=TRUE.
res <- diNOMeTest(NomeData,WTsamples = c("WT_1","WT_2"),
KOsamples = c("AdnpKO_1","AdnpKO_2"),combineNucCounts = TRUE)
footprint_percentages <- footprintPerc(NomeData,combineNucCounts = TRUE)
#fpPercHeatmap(footprint_percentages,plotcols = c("#236467","#AA9B39","#822B26"))
compareFootprints(footprint_percentages,res,WTsamples = c("WT_1","WT_2"),
KOsamples = c("AdnpKO_1","AdnpKO_2"),plotcols = c("#f03b20", "#a8ddb5", "#bdbdbd"))
sessionInfo()
#> R Under development (unstable) (2024-01-16 r85808)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.33.3 Biobase_2.63.0
#> [3] GenomicRanges_1.55.2 GenomeInfoDb_1.39.5
#> [5] IRanges_2.37.1 S4Vectors_0.41.3
#> [7] BiocGenerics_0.49.1 MatrixGenerics_1.15.0
#> [9] matrixStats_1.2.0 dplyr_1.1.4
#> [11] ggplot2_3.4.4 dinoR_0.99.6
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.0 farver_2.1.1 bitops_1.0-7
#> [4] fastmap_1.1.1 RCurl_1.98-1.14 digest_0.6.34
#> [7] lifecycle_1.0.4 cluster_2.1.6 Cairo_1.6-2
#> [10] statmod_1.5.0 magrittr_2.0.3 compiler_4.4.0
#> [13] rlang_1.1.3 sass_0.4.8 tools_4.4.0
#> [16] utf8_1.2.4 yaml_2.3.8 knitr_1.45
#> [19] S4Arrays_1.3.3 labeling_0.4.3 DelayedArray_0.29.1
#> [22] RColorBrewer_1.1-3 abind_1.4-5 withr_3.0.0
#> [25] purrr_1.0.2 grid_4.4.0 fansi_1.0.6
#> [28] colorspace_2.1-0 edgeR_4.1.15 scales_1.3.0
#> [31] iterators_1.0.14 cli_3.6.2 rmarkdown_2.25
#> [34] crayon_1.5.2 generics_0.1.3 rjson_0.2.21
#> [37] cachem_1.0.8 stringr_1.5.1 splines_4.4.0
#> [40] zlibbioc_1.49.0 parallel_4.4.0 XVector_0.43.1
#> [43] vctrs_0.6.5 Matrix_1.6-5 jsonlite_1.8.8
#> [46] GetoptLong_1.0.5 clue_0.3-65 magick_2.8.2
#> [49] locfit_1.5-9.8 foreach_1.5.2 limma_3.59.1
#> [52] jquerylib_0.1.4 tidyr_1.3.1 glue_1.7.0
#> [55] codetools_0.2-19 cowplot_1.1.3 stringi_1.8.3
#> [58] shape_1.4.6 gtable_0.3.4 ComplexHeatmap_2.19.0
#> [61] munsell_0.5.0 tibble_3.2.1 pillar_1.9.0
#> [64] htmltools_0.5.7 GenomeInfoDbData_1.2.11 circlize_0.4.15
#> [67] R6_2.5.1 doParallel_1.0.17 evaluate_0.23
#> [70] lattice_0.22-5 highr_0.10 png_0.1-8
#> [73] bslib_0.6.1 Rcpp_1.0.12 SparseArray_1.3.3
#> [76] xfun_0.41 pkgconfig_2.0.3 GlobalOptions_0.1.2