library(crisprDesign)
We illustrate the CRISPR base editing (CRISPRbe) functionalities
of crisprDesign by designing and characterizing gRNAs targeting the gene
IQSEC3 using the cytidine base editor BE4max (Koblan et al. 2018).
We first load the BE4max BaseEditor object available in crisprBase:
data(BE4max, package="crisprBase")
BE4max
## Class: BaseEditor
## CRISPR Nuclease name: SpCas9
## Target type: DNA
## Metadata: list of length 2
## PAMs: NGG, NAG, NGA
## Weights: 1, 0.2593, 0.0694
## Spacer length: 20
## PAM side: 3prime
## Distance from PAM: 0
## Prototype protospacers: 5'--SSSSSSSSSSSSSSSSSSSS[NGG]--3', 5'--SSSSSSSSSSSSSSSSSSSS[NAG]--3', 5'--SSSSSSSSSSSSSSSSSSSS[NGA]--3'
## Base editor name: BE4max
## Editing strand: original
## Maximum editing weight: C2T at position -15
The editing probabilities of the base editor BE4max are stored in a matrix
where rows correspond to the different nucleotide substitutions, and columns
correspond to the genomic coordinate relative to the PAM site.
The editingWeights function from crisprBase allows us to retrieve
those probabilities. One can see that C to T editing is optimal
around 15 nucleotides upstream of the PAM site for the BE4max base editor:
crisprBase::editingWeights(BE4max)["C2T",]
## -36 -35 -34 -33 -32 -31 -30 -29 -28 -27 -26 -25 -24
## 0.007 0.007 0.008 0.018 0.010 0.020 0.014 0.012 0.023 0.013 0.024 0.022 0.034
## -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11
## 0.022 0.021 0.035 0.058 0.162 0.318 0.632 0.903 1.000 0.870 0.620 0.314 0.163
## -10 -9 -8 -7 -6 -5 -4 -3 -2 -1
## 0.100 0.056 0.033 0.019 0.018 0.024 0.017 0.005 0.002 0.001
To learn how to build a object for your own base editor enzyme, see the package vignette.
Next, we load the
grListExample object in crisprDesign that contains gene
coordinates in hg38 for exons of all human IQSEC3 isoforms. The object was
obtained by converting an Ensembl TxDb object into a GRangesList
object using the TxDb2GRangesList convenience function in crisprDesign.
data(grListExample, package="crisprDesign")
The queryTxObject function allows us to query such objects for a specific
gene and feature. Here, we obtain a GRanges object containing the first exon
of the IQSEC3 gene:
gr <- queryTxObject(txObject=grListExample,
featureType="cds",
queryColumn="gene_symbol",
queryValue="IQSEC3")
and retain the first exon only:
gr <- gr[1]
findSpacers is the main function to obtain a list of all
possible spacer sequences targeting protospacers located in the target
DNA sequence(s). If a GRanges object is provided as input, a BSgenome
object (object containing sequences of a reference genome) will need to be
provided as well:
library(BSgenome.Hsapiens.UCSC.hg38)
bsgenome <- BSgenome.Hsapiens.UCSC.hg38
guideSet <- findSpacers(gr,
bsgenome=bsgenome,
crisprNuclease=BE4max,
strict_overlap=FALSE)
guideSet
## GuideSet object with 129 ranges and 5 metadata columns:
## seqnames ranges strand | protospacer pam
## <Rle> <IRanges> <Rle> | <DNAStringSet> <DNAStringSet>
## spacer_1 chr12 66884 + | GCAGGCGGCGGCGGGCGGCA TGG
## spacer_2 chr12 66893 - | CGCGCACCGGATTCTCCAGC AGG
## spacer_3 chr12 66896 + | GGGCGGCATGGAGAGCCTGC TGG
## spacer_4 chr12 66905 + | GGAGAGCCTGCTGGAGAATC CGG
## spacer_5 chr12 66906 - | AGGTAGAGCACGGCGCGCAC CGG
## ... ... ... ... . ... ...
## spacer_125 chr12 67434 - | TCTCCTCTCCCCGCTCACTC AGG
## spacer_126 chr12 67442 + | GAGCAGGAGACCTGAGTGAG CGG
## spacer_127 chr12 67443 + | AGCAGGAGACCTGAGTGAGC GGG
## spacer_128 chr12 67444 + | GCAGGAGACCTGAGTGAGCG GGG
## spacer_129 chr12 67449 + | AGACCTGAGTGAGCGGGGAG AGG
## pam_site cut_site region
## <numeric> <numeric> <character>
## spacer_1 66884 66881 region_1
## spacer_2 66893 66896 region_1
## spacer_3 66896 66893 region_1
## spacer_4 66905 66902 region_1
## spacer_5 66906 66909 region_1
## ... ... ... ...
## spacer_125 67434 67437 region_1
## spacer_126 67442 67439 region_1
## spacer_127 67443 67440 region_1
## spacer_128 67444 67441 region_1
## spacer_129 67449 67446 region_1
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
## crisprNuclease: SpCas9
The argument set to FALSE enables spacer sequences to be
not-strictly overlapping with the CDS region; this is useful for base editing
design as the editing window can extend beyond the protospacer sequence region.
The function returns a GuideSet object that stores genomic coordinates for all spacer
sequences found in the regions provided by gr. The GuideSet object is an
extension of a GenomicRanges object that stores additional information about
gRNAs. For the sake of time, we will retain only two gRNAs:
guideSet <- guideSet[c(50,51)]
print(guideSet)
## GuideSet object with 2 ranges and 5 metadata columns:
## seqnames ranges strand | protospacer pam
## <Rle> <IRanges> <Rle> | <DNAStringSet> <DNAStringSet>
## spacer_50 chr12 67138 - | TTCCTGGGCCCTGGCGGCGG CGG
## spacer_51 chr12 67141 - | AGGTTCCTGGGCCCTGGCGG CGG
## pam_site cut_site region
## <numeric> <numeric> <character>
## spacer_50 67138 67141 region_1
## spacer_51 67141 67144 region_1
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
## crisprNuclease: SpCas9
The function addEditedAlleles finds, characterizes, and scores predicted
edited alleles for each gRNA, for a chosen transcript. It requires a
transcript-specific annotation that can be obtained using the
function getTxInfoDataFrame. Here, we will perform the
analysis using the main isoform of IQSEC3 (transcript id ENST00000538872).
We first get the transcript table for ENST00000538872,
txid <- "ENST00000538872"
txTable <- getTxInfoDataFrame(tx_id=txid,
txObject=grListExample,
bsgenome=bsgenome,
extend=30)
head(txTable)
## DataFrame with 6 rows and 10 columns
## chr pos nuc aa aa_number exon pos_plot
## <character> <numeric> <character> <character> <integer> <integer> <integer>
## 1 chr12 66737 G NA NA NA 1
## 2 chr12 66738 G NA NA NA 2
## 3 chr12 66739 C NA NA NA 3
## 4 chr12 66740 A NA NA NA 4
## 5 chr12 66741 G NA NA NA 5
## 6 chr12 66742 T NA NA NA 6
## pos_mrna pos_cds region
## <integer> <integer> <character>
## 1 NA NA Upstream
## 2 NA NA Upstream
## 3 NA NA Upstream
## 4 NA NA Upstream
## 5 NA NA Upstream
## 6 NA NA Upstream
The argument extend specifies the number of nucleotides upstream and downstream
of the exons to include. This is useful to characterize gRNAs overlapping
splice junctions. The region column indicates where the location of the
nucleotide: 3UTR, 5UTR, CDS, Intron, and Upstream and Downstream of the CDS.
We are ready to add predicted alleles to the GuideSet object:
editingWindow <- c(-20,-8)
guideSet <- addEditedAlleles(guideSet,
baseEditor=BE4max,
txTable=txTable,
editingWindow=editingWindow,
minEditingWeight = 0,
minMutationScore = 0.3)
## [addEditedAlleles] Obtaining edited alleles at each gRNA target site.
## [addEditedAlleles] Adding functional consequences to alleles.
## [addEditedAlleles] Adding summary to GuideSet.
The editingWindow argument specifies the window of editing that
we are interested in. When not provided, it uses the default window
provided in the BaseEditor object. Note that providing large windows
can exponentially increase computing time as the number of possible
alleles grows exponentially.
The minEditingWeight specifies the minimum editing weight required for an
allele to be listed as a predicted allele. Default is 0. A higher threshold can
be used to omit alleles with low probabilities.
The mutationScore specifies a minimum predicted probability for labeling an
allele with a predicted variant. Alleles with scores lower than this threshold
will be labeled as “not_editing”. Default of 0.3.
For each gRNA, a predicted alleles annotation is stored and can be retrieved
using the editedAlleles accessor function. As an example, let’s retrieve
the predicted alleles for the first gRNA:
alleles <- editedAlleles(guideSet[1])
print(alleles)
## DataFrame with 99 rows and 9 columns
## seq score variant positions changes
## <DNAStringSet> <numeric> <character> <character> <character>
## spacer_50 TTCTTGGGCCCTG 0.2021425 silent 91 NA
## spacer_50 TTTTTGGGCCCTG 0.0975437 missense 92 E92K
## spacer_50 TTCTTGGGTCCTG 0.0960253 missense 90 A90T
## spacer_50 TTTCTGGGCCCTG 0.0495436 missense 92 E92K
## spacer_50 TTCCTGGGTCCTG 0.0487724 missense 90 A90T
## ... ... ... ... ... ...
## spacer_50 TTCCTGGGCCATG 0.000345692 missense 89 R89M
## spacer_50 TTATTGGGCCTTG 0.000344266 nonsense 89;92 R89K;E92*
## spacer_50 TTCATGGGTCTTG 0.000341052 missense_multi 89;90;91 R89K;A90T;Q91H
## spacer_50 TTTTTGGGTACTG 0.000336996 missense_multi 89;90;92 R89S;A90T;E92K
## spacer_50 TTTTTGGGTGCTG 0.000336996 missense_multi 89;90;92 R89S;A90T;E92K
## aa n_mismatches n_nonsense n_missense
## <character> <integer> <numeric> <numeric>
## spacer_50 ARRRAAAQQQEEE 0 0 0
## spacer_50 ARRRAAAQQQKKK 1 0 1
## spacer_50 ARRRTTTQQQEEE 1 0 1
## spacer_50 ARRRAAAQQQKKK 1 0 1
## spacer_50 ARRRTTTQQQEEE 1 0 1
## ... ... ... ... ...
## spacer_50 AMMMAAAQQQEEE 1 0 1
## spacer_50 AKKKAAAQQQ*** 2 1 1
## spacer_50 AKKKTTTHHHEEE 3 0 3
## spacer_50 ASSSTTTQQQKKK 3 0 3
## spacer_50 ASSSTTTQQQKKK 3 0 3
The resulting DataFrame is ordered so that the top predicted alleles
(based on the score column) are shown first. The score
represents the likelihood of the edited allele to occur relative
to all possible edited alleles, and is calculated using the editing
weights stored in the BE4max object. The seq column represents
the edited nucleotide sequences. They are always reported from the 5’ to 3’
direction on the strand corresponding to the gRNA strand.
n_mismatches column indicates the number of amino acid that differs
between the edited allele and the wildtype allele.n_nonsense and n_missense columns indicate the number of mismatches
that are nonsense and missense mutation, respectively. Those two columns sum to
the n_mismatches column.variant column indicates the functional consequence of the allele.
There are 7 possible choices:
silent: single or multiple silent mutationsmissense: single missense mutationnonsense: single nonsense mutationmissense_multi: multiple missense mutationsnonsense_multi: multiple nonsense mutationssplice_junction: mutation in a splice junctionnot_targeting: no mutations found in CDS or splice junctions
In case an edited allele leads to multiple editing events that have different
variant label, the most detrimental mutation
(splice junction over nonsense, nonsense over missense,
missense over silent) is reported.positions column lists the amino acid numbers where mutations occur.
For alleles that are labeled as splice_junction, it lists the closest amino
acid.aa column reports the result edited amino acid sequence.The alleles object also contains useful metadata information that can be
accessed using the metadata accessor function:
metadata(alleles)
## $wildtypeAllele
## spacer_50
## "TTCCTGGGCCCTG"
##
## $start
## [1] 67146
##
## $end
## [1] 67158
##
## $chr
## [1] "chr12"
##
## $strand
## [1] "-"
##
## $editingWindow
## [1] -20 -8
##
## $wildtypeAmino
## [1] "ARRRAAAQQQEEE"
The wildtypeAllele reports the unedited nucleotide sequence of the
region specified by the editing window (with respect to the gRNA PAM site).
It is always reported from the 5’ to 3’ direction on the strand corresponding
to the gRNA strand. The start and end specify the corresponding
coordinates on the transcript.
For both analysis and visualization purposes, it is often useful to label gRNAs
with a score and label that represents the most likely functional consequence
of that gRNA, for a given base editor. The addEditedAlleles function described
above also implements a gRNA-level aggregate scoring scheme that adds
several gRNA-level aggregate scores to the GuideSet object:
head(guideSet)
## GuideSet object with 2 ranges and 18 metadata columns:
## seqnames ranges strand | protospacer pam
## <Rle> <IRanges> <Rle> | <DNAStringSet> <DNAStringSet>
## spacer_50 chr12 67138 - | TTCCTGGGCCCTGGCGGCGG CGG
## spacer_51 chr12 67141 - | AGGTTCCTGGGCCCTGGCGG CGG
## pam_site cut_site region
## <numeric> <numeric> <character>
## spacer_50 67138 67141 region_1
## spacer_51 67141 67144 region_1
## editedAlleles
## <list>
## spacer_50 TTCTTGGGCCCTG:0.2021425:silent:...,TTTTTGGGCCCTG:0.0975437:missense:...,TTCTTGGGTCCTG:0.0960253:missense:...,...
## spacer_51 AGGTTTTTGGGCC:0.7819826:missense:...,AGGTTTGTGGGCC:0.0503345:missense_multi:...,AGGTTTTTGGGTC:0.0465862:missense_multi:...,...
## score_missense_single score_missense_multi score_nonsense_multi
## <numeric> <numeric> <numeric>
## spacer_50 0.436801 0.173223 0
## spacer_51 0.900358 0.153410 0
## score_nonsense_single score_silent score_splice_junction
## <numeric> <numeric> <numeric>
## spacer_50 0.00777293 0.262366 0
## spacer_51 0.02991694 0.000000 0
## score_missense score_nonsense maxVariant maxVariantScore
## <numeric> <numeric> <character> <numeric>
## spacer_50 0.610024 0.00777293 missense 0.610024
## spacer_51 1.053768 0.02991694 missense 1.053768
## aaPos aaChanges
## <character> <character>
## spacer_50 92 E92K(0.31);A90T(0.31..
## spacer_51 92 E92K(1);Q91H(0.09);A..
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
## crisprNuclease: SpCas9
score_missense_single: sum of probability scores across all alleles with a single missense mutationscore_nonsense_single: sum of probability scores across all alleles with a single nonsense mutationscore_missense_multi: sum of probability scores across all alleles with multiple missense mutationsscore_nonsense_multi: sum of probability scores across all alleles with multiple nonsense mutationsscore_silent: sum of probability scores across all alleles with only silent mutationsscore_splice_junction: sum of probability scores across all alleles with a splice junction mutationscore_missense: score_missense_single and score_missense_multiple added togetherscore_nonsense: score_nonsense_single and score_nonsense_multiple added togethermaxVariantScore: maximum score across all score columnsmaxVariant: variant label for the maximum scoreaapos: amino acid position for the top predicted allele for the variant category that has the maximum scoreFor both gRNAs, the highest scores are missense, and therefore the maxVariant is set to missense.
sessionInfo()
## R Under development (unstable) (2025-10-20 r88955)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.79.1
## [3] rtracklayer_1.71.0 BiocIO_1.21.0
## [5] Biostrings_2.79.2 XVector_0.51.0
## [7] GenomicRanges_1.63.0 GenomeInfoDb_1.47.0
## [9] Seqinfo_1.1.0 IRanges_2.45.0
## [11] S4Vectors_0.49.0 BiocGenerics_0.57.0
## [13] generics_0.1.4 crisprDesign_1.13.1
## [15] crisprBase_1.15.0 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9
## [3] httr2_1.2.1 biomaRt_2.67.0
## [5] rlang_1.1.6 magrittr_2.0.4
## [7] Rbowtie_1.51.0 matrixStats_1.5.0
## [9] compiler_4.6.0 RSQLite_2.4.3
## [11] GenomicFeatures_1.63.1 dir.expiry_1.19.0
## [13] png_0.1-8 vctrs_0.6.5
## [15] txdbmaker_1.7.1 stringr_1.6.0
## [17] pkgconfig_2.0.3 crayon_1.5.3
## [19] fastmap_1.2.0 dbplyr_2.5.1
## [21] Rsamtools_2.27.0 rmarkdown_2.30
## [23] tzdb_0.5.0 UCSC.utils_1.7.0
## [25] bit_4.6.0 xfun_0.54
## [27] randomForest_4.7-1.2 cachem_1.1.0
## [29] cigarillo_1.1.0 jsonlite_2.0.0
## [31] progress_1.2.3 blob_1.2.4
## [33] DelayedArray_0.37.0 BiocParallel_1.45.0
## [35] prettyunits_1.2.0 parallel_4.6.0
## [37] R6_2.6.1 VariantAnnotation_1.57.0
## [39] bslib_0.9.0 stringi_1.8.7
## [41] reticulate_1.44.0 jquerylib_0.1.4
## [43] Rcpp_1.1.0 bookdown_0.45
## [45] SummarizedExperiment_1.41.0 knitr_1.50
## [47] readr_2.1.5 Matrix_1.7-4
## [49] tidyselect_1.2.1 abind_1.4-8
## [51] yaml_2.3.10 codetools_0.2-20
## [53] curl_7.0.0 lattice_0.22-7
## [55] tibble_3.3.0 Biobase_2.71.0
## [57] KEGGREST_1.51.0 evaluate_1.0.5
## [59] crisprScoreData_1.15.0 BiocFileCache_3.1.0
## [61] ExperimentHub_3.1.0 pillar_1.11.1
## [63] BiocManager_1.30.26 filelock_1.0.3
## [65] MatrixGenerics_1.23.0 crisprScore_1.15.0
## [67] RCurl_1.98-1.17 BiocVersion_3.23.1
## [69] hms_1.1.4 glue_1.8.0
## [71] tools_4.6.0 AnnotationHub_4.1.0
## [73] GenomicAlignments_1.47.0 XML_3.99-0.20
## [75] grid_4.6.0 AnnotationDbi_1.73.0
## [77] basilisk_1.23.0 restfulr_0.0.16
## [79] cli_3.6.5 rappdirs_0.3.3
## [81] S4Arrays_1.11.0 dplyr_1.1.4
## [83] crisprBowtie_1.15.0 sass_0.4.10
## [85] digest_0.6.37 SparseArray_1.11.1
## [87] rjson_0.2.23 memoise_2.0.1
## [89] htmltools_0.5.8.1 lifecycle_1.0.4
## [91] httr_1.4.7 bit64_4.6.0-1
Koblan, Luke W, Jordan L Doman, Christopher Wilson, Jonathan M Levy, Tristan Tay, Gregory A Newby, Juan Pablo Maianti, Aditya Raguram, and David R Liu. 2018. “Improving Cytidine and Adenine Base Editors by Expression Optimization and Ancestral Reconstruction.” Nature Biotechnology 36 (9): 843–46.