1 Introduction

library(crisprDesign)

2 Defining the base editor object

We illustrate the CRISPR base editing (CRISPRbe) functionalities of crisprDesign by designing and characterizing gRNAs targeting the gene IQSEC3 using the cytidine base editor BE4max (Koblan et al. 2018).

We first load the BE4max BaseEditor object available in crisprBase:

data(BE4max, package="crisprBase")
BE4max
## Class: BaseEditor
##   CRISPR Nuclease name: SpCas9
##       Target type: DNA
##       Metadata: list of length 2
##       PAMs: NGG, NAG, NGA
##       Weights: 1, 0.2593, 0.0694
##       Spacer length: 20
##       PAM side: 3prime
##         Distance from PAM: 0
##       Prototype protospacers: 5'--SSSSSSSSSSSSSSSSSSSS[NGG]--3', 5'--SSSSSSSSSSSSSSSSSSSS[NAG]--3', 5'--SSSSSSSSSSSSSSSSSSSS[NGA]--3'
##   Base editor name: BE4max
##       Editing strand: original
##       Maximum editing weight: C2T at position -15

The editing probabilities of the base editor BE4max are stored in a matrix where rows correspond to the different nucleotide substitutions, and columns correspond to the genomic coordinate relative to the PAM site. The editingWeights function from crisprBase allows us to retrieve those probabilities. One can see that C to T editing is optimal around 15 nucleotides upstream of the PAM site for the BE4max base editor:

crisprBase::editingWeights(BE4max)["C2T",]
##   -36   -35   -34   -33   -32   -31   -30   -29   -28   -27   -26   -25   -24 
## 0.007 0.007 0.008 0.018 0.010 0.020 0.014 0.012 0.023 0.013 0.024 0.022 0.034 
##   -23   -22   -21   -20   -19   -18   -17   -16   -15   -14   -13   -12   -11 
## 0.022 0.021 0.035 0.058 0.162 0.318 0.632 0.903 1.000 0.870 0.620 0.314 0.163 
##   -10    -9    -8    -7    -6    -5    -4    -3    -2    -1 
## 0.100 0.056 0.033 0.019 0.018 0.024 0.017 0.005 0.002 0.001

To learn how to build a object for your own base editor enzyme, see the package vignette.

3 Designing spacer sequences

Next, we load the grListExample object in crisprDesign that contains gene coordinates in hg38 for exons of all human IQSEC3 isoforms. The object was obtained by converting an Ensembl TxDb object into a GRangesList object using the TxDb2GRangesList convenience function in crisprDesign.

data(grListExample, package="crisprDesign")

The queryTxObject function allows us to query such objects for a specific gene and feature. Here, we obtain a GRanges object containing the first exon of the IQSEC3 gene:

gr <- queryTxObject(txObject=grListExample,
                    featureType="cds",
                    queryColumn="gene_symbol",
                    queryValue="IQSEC3")

and retain the first exon only:

gr <- gr[1]

findSpacers is the main function to obtain a list of all possible spacer sequences targeting protospacers located in the target DNA sequence(s). If a GRanges object is provided as input, a BSgenome object (object containing sequences of a reference genome) will need to be provided as well:

library(BSgenome.Hsapiens.UCSC.hg38)
bsgenome <- BSgenome.Hsapiens.UCSC.hg38
guideSet <- findSpacers(gr,
                        bsgenome=bsgenome,
                        crisprNuclease=BE4max,
                        strict_overlap=FALSE)
guideSet
## GuideSet object with 129 ranges and 5 metadata columns:
##              seqnames    ranges strand |          protospacer            pam
##                 <Rle> <IRanges>  <Rle> |       <DNAStringSet> <DNAStringSet>
##     spacer_1    chr12     66884      + | GCAGGCGGCGGCGGGCGGCA            TGG
##     spacer_2    chr12     66893      - | CGCGCACCGGATTCTCCAGC            AGG
##     spacer_3    chr12     66896      + | GGGCGGCATGGAGAGCCTGC            TGG
##     spacer_4    chr12     66905      + | GGAGAGCCTGCTGGAGAATC            CGG
##     spacer_5    chr12     66906      - | AGGTAGAGCACGGCGCGCAC            CGG
##          ...      ...       ...    ... .                  ...            ...
##   spacer_125    chr12     67434      - | TCTCCTCTCCCCGCTCACTC            AGG
##   spacer_126    chr12     67442      + | GAGCAGGAGACCTGAGTGAG            CGG
##   spacer_127    chr12     67443      + | AGCAGGAGACCTGAGTGAGC            GGG
##   spacer_128    chr12     67444      + | GCAGGAGACCTGAGTGAGCG            GGG
##   spacer_129    chr12     67449      + | AGACCTGAGTGAGCGGGGAG            AGG
##               pam_site  cut_site      region
##              <numeric> <numeric> <character>
##     spacer_1     66884     66881    region_1
##     spacer_2     66893     66896    region_1
##     spacer_3     66896     66893    region_1
##     spacer_4     66905     66902    region_1
##     spacer_5     66906     66909    region_1
##          ...       ...       ...         ...
##   spacer_125     67434     67437    region_1
##   spacer_126     67442     67439    region_1
##   spacer_127     67443     67440    region_1
##   spacer_128     67444     67441    region_1
##   spacer_129     67449     67446    region_1
##   -------
##   seqinfo: 711 sequences (1 circular) from hg38 genome
##   crisprNuclease: SpCas9

The argument set to FALSE enables spacer sequences to be not-strictly overlapping with the CDS region; this is useful for base editing design as the editing window can extend beyond the protospacer sequence region. The function returns a GuideSet object that stores genomic coordinates for all spacer sequences found in the regions provided by gr. The GuideSet object is an extension of a GenomicRanges object that stores additional information about gRNAs. For the sake of time, we will retain only two gRNAs:

guideSet <- guideSet[c(50,51)]
print(guideSet)
## GuideSet object with 2 ranges and 5 metadata columns:
##             seqnames    ranges strand |          protospacer            pam
##                <Rle> <IRanges>  <Rle> |       <DNAStringSet> <DNAStringSet>
##   spacer_50    chr12     67138      - | TTCCTGGGCCCTGGCGGCGG            CGG
##   spacer_51    chr12     67141      - | AGGTTCCTGGGCCCTGGCGG            CGG
##              pam_site  cut_site      region
##             <numeric> <numeric> <character>
##   spacer_50     67138     67141    region_1
##   spacer_51     67141     67144    region_1
##   -------
##   seqinfo: 711 sequences (1 circular) from hg38 genome
##   crisprNuclease: SpCas9

4 Allele prediction

The function addEditedAlleles finds, characterizes, and scores predicted edited alleles for each gRNA, for a chosen transcript. It requires a transcript-specific annotation that can be obtained using the function getTxInfoDataFrame. Here, we will perform the analysis using the main isoform of IQSEC3 (transcript id ENST00000538872).

We first get the transcript table for ENST00000538872,

txid <- "ENST00000538872"
txTable <- getTxInfoDataFrame(tx_id=txid,
                              txObject=grListExample,
                              bsgenome=bsgenome,
                              extend=30)
head(txTable)
## DataFrame with 6 rows and 10 columns
##           chr       pos         nuc          aa aa_number      exon  pos_plot
##   <character> <numeric> <character> <character> <integer> <integer> <integer>
## 1       chr12     66737           G          NA        NA        NA         1
## 2       chr12     66738           G          NA        NA        NA         2
## 3       chr12     66739           C          NA        NA        NA         3
## 4       chr12     66740           A          NA        NA        NA         4
## 5       chr12     66741           G          NA        NA        NA         5
## 6       chr12     66742           T          NA        NA        NA         6
##    pos_mrna   pos_cds      region
##   <integer> <integer> <character>
## 1        NA        NA    Upstream
## 2        NA        NA    Upstream
## 3        NA        NA    Upstream
## 4        NA        NA    Upstream
## 5        NA        NA    Upstream
## 6        NA        NA    Upstream

The argument extend specifies the number of nucleotides upstream and downstream of the exons to include. This is useful to characterize gRNAs overlapping splice junctions. The region column indicates where the location of the nucleotide: 3UTR, 5UTR, CDS, Intron, and Upstream and Downstream of the CDS.

We are ready to add predicted alleles to the GuideSet object:

editingWindow <- c(-20,-8)
guideSet <- addEditedAlleles(guideSet,
                             baseEditor=BE4max,
                             txTable=txTable,
                             editingWindow=editingWindow,
                             minEditingWeight = 0,
                             minMutationScore = 0.3)
## [addEditedAlleles] Obtaining edited alleles at each gRNA target site.
## [addEditedAlleles] Adding functional consequences to alleles.
## [addEditedAlleles] Adding summary to GuideSet.
  • The editingWindow argument specifies the window of editing that we are interested in. When not provided, it uses the default window provided in the BaseEditor object. Note that providing large windows can exponentially increase computing time as the number of possible alleles grows exponentially.

  • The minEditingWeight specifies the minimum editing weight required for an allele to be listed as a predicted allele. Default is 0. A higher threshold can be used to omit alleles with low probabilities.

  • The mutationScore specifies a minimum predicted probability for labeling an allele with a predicted variant. Alleles with scores lower than this threshold will be labeled as “not_editing”. Default of 0.3.

For each gRNA, a predicted alleles annotation is stored and can be retrieved using the editedAlleles accessor function. As an example, let’s retrieve the predicted alleles for the first gRNA:

alleles <- editedAlleles(guideSet[1])
print(alleles)
## DataFrame with 99 rows and 9 columns
##                      seq       score        variant   positions        changes
##           <DNAStringSet>   <numeric>    <character> <character>    <character>
## spacer_50  TTCTTGGGCCCTG   0.2021425         silent          91             NA
## spacer_50  TTTTTGGGCCCTG   0.0975437       missense          92           E92K
## spacer_50  TTCTTGGGTCCTG   0.0960253       missense          90           A90T
## spacer_50  TTTCTGGGCCCTG   0.0495436       missense          92           E92K
## spacer_50  TTCCTGGGTCCTG   0.0487724       missense          90           A90T
## ...                  ...         ...            ...         ...            ...
## spacer_50  TTCCTGGGCCATG 0.000345692       missense          89           R89M
## spacer_50  TTATTGGGCCTTG 0.000344266       nonsense       89;92      R89K;E92*
## spacer_50  TTCATGGGTCTTG 0.000341052 missense_multi    89;90;91 R89K;A90T;Q91H
## spacer_50  TTTTTGGGTACTG 0.000336996 missense_multi    89;90;92 R89S;A90T;E92K
## spacer_50  TTTTTGGGTGCTG 0.000336996 missense_multi    89;90;92 R89S;A90T;E92K
##                      aa n_mismatches n_nonsense n_missense
##             <character>    <integer>  <numeric>  <numeric>
## spacer_50 ARRRAAAQQQEEE            0          0          0
## spacer_50 ARRRAAAQQQKKK            1          0          1
## spacer_50 ARRRTTTQQQEEE            1          0          1
## spacer_50 ARRRAAAQQQKKK            1          0          1
## spacer_50 ARRRTTTQQQEEE            1          0          1
## ...                 ...          ...        ...        ...
## spacer_50 AMMMAAAQQQEEE            1          0          1
## spacer_50 AKKKAAAQQQ***            2          1          1
## spacer_50 AKKKTTTHHHEEE            3          0          3
## spacer_50 ASSSTTTQQQKKK            3          0          3
## spacer_50 ASSSTTTQQQKKK            3          0          3

The resulting DataFrame is ordered so that the top predicted alleles (based on the score column) are shown first. The score represents the likelihood of the edited allele to occur relative to all possible edited alleles, and is calculated using the editing weights stored in the BE4max object. The seq column represents the edited nucleotide sequences. They are always reported from the 5’ to 3’ direction on the strand corresponding to the gRNA strand.

  • The n_mismatches column indicates the number of amino acid that differs between the edited allele and the wildtype allele.
  • The n_nonsense and n_missense columns indicate the number of mismatches that are nonsense and missense mutation, respectively. Those two columns sum to the n_mismatches column.
  • The variant column indicates the functional consequence of the allele. There are 7 possible choices:
    • silent: single or multiple silent mutations
    • missense: single missense mutation
    • nonsense: single nonsense mutation
    • missense_multi: multiple missense mutations
    • nonsense_multi: multiple nonsense mutations
    • splice_junction: mutation in a splice junction
    • not_targeting: no mutations found in CDS or splice junctions In case an edited allele leads to multiple editing events that have different variant label, the most detrimental mutation (splice junction over nonsense, nonsense over missense, missense over silent) is reported.
  • The positions column lists the amino acid numbers where mutations occur. For alleles that are labeled as splice_junction, it lists the closest amino acid.
  • The aa column reports the result edited amino acid sequence.

The alleles object also contains useful metadata information that can be accessed using the metadata accessor function:

metadata(alleles)
## $wildtypeAllele
##       spacer_50 
## "TTCCTGGGCCCTG" 
## 
## $start
## [1] 67146
## 
## $end
## [1] 67158
## 
## $chr
## [1] "chr12"
## 
## $strand
## [1] "-"
## 
## $editingWindow
## [1] -20  -8
## 
## $wildtypeAmino
## [1] "ARRRAAAQQQEEE"

The wildtypeAllele reports the unedited nucleotide sequence of the region specified by the editing window (with respect to the gRNA PAM site). It is always reported from the 5’ to 3’ direction on the strand corresponding to the gRNA strand. The start and end specify the corresponding coordinates on the transcript.

5 gRNA-level aggregate variant scores

For both analysis and visualization purposes, it is often useful to label gRNAs with a score and label that represents the most likely functional consequence of that gRNA, for a given base editor. The addEditedAlleles function described above also implements a gRNA-level aggregate scoring scheme that adds several gRNA-level aggregate scores to the GuideSet object:

head(guideSet)
## GuideSet object with 2 ranges and 18 metadata columns:
##             seqnames    ranges strand |          protospacer            pam
##                <Rle> <IRanges>  <Rle> |       <DNAStringSet> <DNAStringSet>
##   spacer_50    chr12     67138      - | TTCCTGGGCCCTGGCGGCGG            CGG
##   spacer_51    chr12     67141      - | AGGTTCCTGGGCCCTGGCGG            CGG
##              pam_site  cut_site      region
##             <numeric> <numeric> <character>
##   spacer_50     67138     67141    region_1
##   spacer_51     67141     67144    region_1
##                                                                                                                              editedAlleles
##                                                                                                                                     <list>
##   spacer_50               TTCTTGGGCCCTG:0.2021425:silent:...,TTTTTGGGCCCTG:0.0975437:missense:...,TTCTTGGGTCCTG:0.0960253:missense:...,...
##   spacer_51 AGGTTTTTGGGCC:0.7819826:missense:...,AGGTTTGTGGGCC:0.0503345:missense_multi:...,AGGTTTTTGGGTC:0.0465862:missense_multi:...,...
##             score_missense_single score_missense_multi score_nonsense_multi
##                         <numeric>            <numeric>            <numeric>
##   spacer_50              0.436801             0.173223                    0
##   spacer_51              0.900358             0.153410                    0
##             score_nonsense_single score_silent score_splice_junction
##                         <numeric>    <numeric>             <numeric>
##   spacer_50            0.00777293     0.262366                     0
##   spacer_51            0.02991694     0.000000                     0
##             score_missense score_nonsense  maxVariant maxVariantScore
##                  <numeric>      <numeric> <character>       <numeric>
##   spacer_50       0.610024     0.00777293    missense        0.610024
##   spacer_51       1.053768     0.02991694    missense        1.053768
##                   aaPos              aaChanges
##             <character>            <character>
##   spacer_50          92 E92K(0.31);A90T(0.31..
##   spacer_51          92 E92K(1);Q91H(0.09);A..
##   -------
##   seqinfo: 711 sequences (1 circular) from hg38 genome
##   crisprNuclease: SpCas9
  • score_missense_single: sum of probability scores across all alleles with a single missense mutation
  • score_nonsense_single: sum of probability scores across all alleles with a single nonsense mutation
  • score_missense_multi: sum of probability scores across all alleles with multiple missense mutations
  • score_nonsense_multi: sum of probability scores across all alleles with multiple nonsense mutations
  • score_silent: sum of probability scores across all alleles with only silent mutations
  • score_splice_junction: sum of probability scores across all alleles with a splice junction mutation
  • score_missense: score_missense_single and score_missense_multiple added together
  • score_nonsense: score_nonsense_single and score_nonsense_multiple added together
  • maxVariantScore: maximum score across all score columns
  • maxVariant: variant label for the maximum score
  • aapos: amino acid position for the top predicted allele for the variant category that has the maximum score

For both gRNAs, the highest scores are missense, and therefore the maxVariant is set to missense.

6 Session Info

sessionInfo()
## R Under development (unstable) (2025-10-20 r88955)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.79.1                  
##  [3] rtracklayer_1.71.0                BiocIO_1.21.0                    
##  [5] Biostrings_2.79.2                 XVector_0.51.0                   
##  [7] GenomicRanges_1.63.0              GenomeInfoDb_1.47.0              
##  [9] Seqinfo_1.1.0                     IRanges_2.45.0                   
## [11] S4Vectors_0.49.0                  BiocGenerics_0.57.0              
## [13] generics_0.1.4                    crisprDesign_1.13.1              
## [15] crisprBase_1.15.0                 BiocStyle_2.39.0                 
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3                   bitops_1.0-9               
##  [3] httr2_1.2.1                 biomaRt_2.67.0             
##  [5] rlang_1.1.6                 magrittr_2.0.4             
##  [7] Rbowtie_1.51.0              matrixStats_1.5.0          
##  [9] compiler_4.6.0              RSQLite_2.4.3              
## [11] GenomicFeatures_1.63.1      dir.expiry_1.19.0          
## [13] png_0.1-8                   vctrs_0.6.5                
## [15] txdbmaker_1.7.1             stringr_1.6.0              
## [17] pkgconfig_2.0.3             crayon_1.5.3               
## [19] fastmap_1.2.0               dbplyr_2.5.1               
## [21] Rsamtools_2.27.0            rmarkdown_2.30             
## [23] tzdb_0.5.0                  UCSC.utils_1.7.0           
## [25] bit_4.6.0                   xfun_0.54                  
## [27] randomForest_4.7-1.2        cachem_1.1.0               
## [29] cigarillo_1.1.0             jsonlite_2.0.0             
## [31] progress_1.2.3              blob_1.2.4                 
## [33] DelayedArray_0.37.0         BiocParallel_1.45.0        
## [35] prettyunits_1.2.0           parallel_4.6.0             
## [37] R6_2.6.1                    VariantAnnotation_1.57.0   
## [39] bslib_0.9.0                 stringi_1.8.7              
## [41] reticulate_1.44.0           jquerylib_0.1.4            
## [43] Rcpp_1.1.0                  bookdown_0.45              
## [45] SummarizedExperiment_1.41.0 knitr_1.50                 
## [47] readr_2.1.5                 Matrix_1.7-4               
## [49] tidyselect_1.2.1            abind_1.4-8                
## [51] yaml_2.3.10                 codetools_0.2-20           
## [53] curl_7.0.0                  lattice_0.22-7             
## [55] tibble_3.3.0                Biobase_2.71.0             
## [57] KEGGREST_1.51.0             evaluate_1.0.5             
## [59] crisprScoreData_1.15.0      BiocFileCache_3.1.0        
## [61] ExperimentHub_3.1.0         pillar_1.11.1              
## [63] BiocManager_1.30.26         filelock_1.0.3             
## [65] MatrixGenerics_1.23.0       crisprScore_1.15.0         
## [67] RCurl_1.98-1.17             BiocVersion_3.23.1         
## [69] hms_1.1.4                   glue_1.8.0                 
## [71] tools_4.6.0                 AnnotationHub_4.1.0        
## [73] GenomicAlignments_1.47.0    XML_3.99-0.20              
## [75] grid_4.6.0                  AnnotationDbi_1.73.0       
## [77] basilisk_1.23.0             restfulr_0.0.16            
## [79] cli_3.6.5                   rappdirs_0.3.3             
## [81] S4Arrays_1.11.0             dplyr_1.1.4                
## [83] crisprBowtie_1.15.0         sass_0.4.10                
## [85] digest_0.6.37               SparseArray_1.11.1         
## [87] rjson_0.2.23                memoise_2.0.1              
## [89] htmltools_0.5.8.1           lifecycle_1.0.4            
## [91] httr_1.4.7                  bit64_4.6.0-1

References

Koblan, Luke W, Jordan L Doman, Christopher Wilson, Jonathan M Levy, Tristan Tay, Gregory A Newby, Juan Pablo Maianti, Aditya Raguram, and David R Liu. 2018. “Improving Cytidine and Adenine Base Editors by Expression Optimization and Ancestral Reconstruction.” Nature Biotechnology 36 (9): 843–46.