library(comapr)
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'AnnotationDbi'
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
library(BiocParallel)
Install via BiocManager as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("comapr")
comapr lets you interrogate the genotyping results for a sequence of markers
across the chromosome and detect meiotic crossovers from genotype shifts. In
this document, we demonstrate how to calculate genetic distances from the genotyping
results of a group of samples using functions from comapr.
The package includes a small data object that contains 70 markers and their genotyping results for 22 samples. The 22 samples are the BC1F1 progenies generated by
Therefore, the genotype shifts detected in BC1F1 samples represent crossovers that have happened during meiosis for the F1 parents.
BiocParallel::register(SerialParam())
BiocParallel::bpparam()
#> class: SerialParam
#>   bpisup: FALSE; bpnworkers: 1; bptasks: 0; bpjobname: BPJOB
#>   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
#>   bpRNGseed: ; bptimeout: NA; bpprogressbar: FALSE
#>   bpexportglobals: FALSE; bpexportvariables: FALSE; bpforceGC: FALSE
#>   bpfallback: FALSE
#>   bplogdir: NA
#>   bpresultdir: NA
data(snp_geno_gr)
data(parents_geno)
To detect crossovers from the genotype results of the 22 BC1F1 samples,
we need to recode the results as Homo_ref, Homo_alt and Het.
This is done by comparing each sample's genotypes with the parents’ genotypes.
Genotypes noted as “Fail” are converted to NA. Note that we also supply “Homo_ref” as one of the fail options, because “Homo_ref” is not a possible genotype for markers in BC1F1 samples.
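The recoding idea can be sketched in base R on toy data. This is only an illustration under stated assumptions, not the implementation of correctGT: the genotype strings, the helper name label_gt, and the rule that any call matching neither parent is heterozygous are all hypothetical.

```r
# Toy data: hypothetical parental genotypes at four markers (not from comapr).
ref <- c("AA", "CC", "GG", "TT")       # genotype of the reference parent
alt <- c("GG", "TT", "AA", "CC")       # genotype of the alternative parent
sample_gt <- c("AG", "TT", "GG", "Fail")

# Hypothetical helper: label each call by comparing it with the parents.
label_gt <- function(gt, ref, alt, fail = "Fail", wrong_label = "Homo_ref") {
  labels <- ifelse(gt == fail, NA_character_,
            ifelse(gt == ref, "Homo_ref",
            ifelse(gt == alt, "Homo_alt", "Het")))  # simplification: anything else is Het
  # Labels that should be impossible for this cross are treated as failures.
  labels[labels %in% wrong_label] <- NA_character_
  labels
}

labelled <- label_gt(sample_gt, ref, alt)
print(labelled)
```

In this toy sample the third marker matches the reference parent, so it is labelled Homo_ref and then set to NA, just like the explicit "Fail" call at the fourth marker.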
corrected_geno <- correctGT(gt_matrix = mcols(snp_geno_gr),
                            ref = parents_geno$ref,
                            alt = parents_geno$alt,
                            fail = "Fail",
                            wrong_label = "Homo_ref")
mcols(snp_geno_gr) <- corrected_geno
A preview of the corrected_geno matrix:
head(mcols(snp_geno_gr)[,1:5])
#> DataFrame with 6 rows and 5 columns
#>           X92         X93         X94         X95         X96
#>   <character> <character> <character> <character> <character>
#> 1    Homo_alt    Homo_alt    Homo_alt    Homo_alt    Homo_alt
#> 2    Homo_alt    Homo_alt    Homo_alt    Homo_alt    Homo_alt
#> 3    Homo_alt    Homo_alt    Homo_alt    Homo_alt    Homo_alt
#> 4    Homo_alt    Homo_alt    Homo_alt    Homo_alt    Homo_alt
#> 5          NA    Homo_alt    Homo_alt    Homo_alt    Homo_alt
#> 6         Het    Homo_alt    Homo_alt    Homo_alt    Homo_alt
Note that the resulting matrix contains missing values (NA). As described above, these arise from genotypes noted as “Fail” and from genotypes labelled “Homo_ref”, which was supplied as a fail option.
In this step, we identify markers that have NA genotypes across many
samples, and samples in which many markers failed, so that they can be removed. We use the
countGT function to find bad markers/samples.
genotype_counts <- countGT(mcols(snp_geno_gr))
genotype_counts$plot
The number of markers per sample (n_markers) and the number of samples per marker (n_samples) are saved in the list returned by countGT:
genotype_counts$n_markers
#>  [1] 30 33 33 33 33 33 31 31 32 33 33 33 32 33 33 33 32 32 33 32 33 33
genotype_counts$n_samples
#>  [1] 22 22 22 22 18 22 22 22 22 22 22 21 22 22 22 22 22 22 22 22 22 22 22 22 22
#> [26] 22 22 22 16 22 22 21 22
We now filter out markers/samples using the filterGT function. min_markers
specifies the minimum number of markers a sample needs in order to be kept; likewise,
min_samples specifies the minimum number of samples a marker needs.
A printed message reports how many markers and samples have been filtered out.
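The counting and filtering logic can be sketched in base R on a small hypothetical genotype matrix (markers in rows, samples in columns). This is an assumption about the general idea, with counts taken once on the unfiltered matrix; it is not the code behind countGT or filterGT.

```r
# Hypothetical genotype matrix: 4 markers (rows) x 3 samples (columns).
geno <- cbind(
  s1 = c("Het", "Homo_alt", NA, "Het"),
  s2 = c("Het", NA, NA, "Homo_alt"),
  s3 = c("Homo_alt", "Homo_alt", NA, "Het")
)
rownames(geno) <- paste0("m", 1:4)

n_markers <- colSums(!is.na(geno))  # non-missing markers per sample
n_samples <- rowSums(!is.na(geno))  # non-missing samples per marker

# Keep samples with at least 3 markers and markers with at least 2 samples.
keep_samples <- n_markers >= 3
keep_markers <- n_samples >= 2
filtered <- geno[keep_markers, keep_samples, drop = FALSE]
print(filtered)
```

Here sample s2 (two usable markers) and marker m3 (missing in every sample) are dropped.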
corrected_geno <- filterGT(snp_geno_gr,
                           min_markers = 30,
                           min_samples = 2)
#> filter out 0 marker(s)
#> filter out 0 sample(s)
Sample duplicates are identified by finding samples that share the same genotypes
across all available markers. findDupSamples takes a threshold
value, which is used as a cut-off on the proportion of markers with identical genotypes
that duplicated samples should share.
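A minimal base R sketch of threshold-based duplicate detection follows. The toy matrix, the helper share_frac, and the choice to ignore markers that are missing in either sample are assumptions for illustration; findDupSamples may handle these details differently.

```r
# Hypothetical genotype matrix: 5 markers (rows) x 3 samples (columns).
geno <- cbind(
  s1 = c("Het", "Het", "Homo_alt", "Homo_alt", NA),
  s2 = c("Het", "Het", "Homo_alt", "Homo_alt", "Het"),
  s3 = c("Homo_alt", "Het", "Het", "Homo_alt", "Het")
)

# Fraction of markers with identical genotypes, using only markers
# that are non-missing in both samples (an assumption of this sketch).
share_frac <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  mean(a[ok] == b[ok])
}

pairs <- t(combn(colnames(geno), 2))  # all sample pairs, one per row
frac <- apply(pairs, 1, function(p) share_frac(geno[, p[1]], geno[, p[2]]))
dup_pairs <- pairs[frac >= 0.99, , drop = FALSE]
print(dup_pairs)
```

Only the pair s1/s2 agrees on every marker that both samples have genotyped, so it is the only pair reported at a 0.99 threshold.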
dups <- findDupSamples(mcols(corrected_geno),
                       threshold = 0.99)
dups
#>      X98  
#> [1,] "X98"
#> [2,] "X99"
Now we remove one of the duplicated samples (here X98).
mcols(corrected_geno) <- mcols(corrected_geno)[,colnames(mcols(corrected_geno))!="X98"]
#corrected_geno
Crossovers are detected and counted by examining the patterns of genotypes
along the chromosome. When there is a shift from one genotype block to another,
a crossover is observed. This is done by calling the countCOs function, which
returns a GRanges object with crossover counts for the list of marker intervals.
The crossover counts in the columns can be non-integer when an observed crossover cannot be assigned entirely to the marker interval in the corresponding row. The observed crossover is then distributed across the adjacent intervals in proportion to their sizes in base pairs.
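The proportional distribution described above can be sketched for a single sample in base R. The marker positions and genotypes are hypothetical, and the loop is a simplified reading of the description rather than the countCOs implementation.

```r
# Hypothetical marker positions (bp) and one sample's genotypes along a chromosome.
pos <- c(100, 400, 500, 900)
gt  <- c("Het", NA, "Homo_alt", "Homo_alt")

interval_bp <- diff(pos)                 # sizes of the 3 marker intervals
co <- numeric(length(interval_bp))       # crossover count per interval

obs <- which(!is.na(gt))                 # markers with a usable genotype
for (k in seq_len(length(obs) - 1)) {
  i <- obs[k]
  j <- obs[k + 1]
  if (gt[i] != gt[j]) {
    # A genotype shift between markers i and j: one crossover, spread over
    # the intervals in between in proportion to their base pair sizes.
    spanned <- i:(j - 1)
    co[spanned] <- co[spanned] + interval_bp[spanned] / sum(interval_bp[spanned])
  }
}
print(co)
```

Because the second marker is missing, the shift from Het to Homo_alt cannot be placed in a single interval, so the first two intervals (300 bp and 100 bp) receive 0.75 and 0.25 of the crossover.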
marker_gr_cos <- countCOs(corrected_geno)
marker_gr_cos[1:5,1:5]
#> GRanges object with 5 ranges and 5 metadata columns:
#>       seqnames            ranges strand |       X92       X93       X94
#>          <Rle>         <IRanges>  <Rle> | <numeric> <numeric> <numeric>
#>   [1]        1  6655965-21638463      * | 0.9358742         0         0
#>   [2]        1 21638465-22665059      * | 0.0641258         0         0
#>   [3]        1 22665061-34590735      * | 0.0000000         0         0
#>   [4]        1 35033642-38996025      * | 0.0000000         0         0
#>   [5]        2   4248665-5348752      * | 0.0000000         0         0
#>             X95       X96
#>       <numeric> <numeric>
#>   [1]         0         0
#>   [2]         0         0
#>   [3]         0         0
#>   [4]         0         0
#>   [5]         0         0
#>   -------
#>   seqinfo: 4 sequences from an unspecified genome; no seqlengths
The genetic distances of marker intervals are calculated from the crossover
rates by applying a mapping function, Kosambi or Haldane, through the
calGeneticDist function. The returned genetic distances are in
centiMorgans.
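For reference, the standard Kosambi and Haldane mapping functions can be written in a few lines of base R. The function names below are hypothetical, and treating the crossover rate as the crossover count divided by the number of samples is an assumption of this sketch, not a statement about calGeneticDist internals.

```r
# Standard mapping functions: recombination fraction r -> distance in centiMorgans.
kosambi_cm <- function(r) 25 * log((1 + 2 * r) / (1 - 2 * r))
haldane_cm <- function(r) -50 * log(1 - 2 * r)

# Assumed crossover rate: crossovers observed in an interval / number of samples.
n_samples <- 21
r <- c(1, 2) / n_samples

k <- kosambi_cm(r)
h <- haldane_cm(r)
print(round(k, 5))
print(round(h, 5))
```

Under this assumption, one or two crossovers among 21 samples give Kosambi distances of about 4.776 and 9.642 cM, which appears consistent with values in the calGeneticDist output shown in this document. For the same rate, the Haldane distance is slightly larger because it assumes no crossover interference.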
dist_gr <- calGeneticDist(marker_gr_cos,
                          mapping_fun = "k")
dist_gr[1:5,]
#> GRanges object with 5 ranges and 1 metadata column:
#>       seqnames            ranges strand | kosambi_cm
#>          <Rle>         <IRanges>  <Rle> |  <numeric>
#>   [1]        1  6655965-21638463      * |   14.36279
#>   [2]        1 21638465-22665059      * |    5.08472
#>   [3]        1 22665061-34590735      * |    9.64156
#>   [4]        1 35033642-38996025      * |    9.64156
#>   [5]        2   4248665-5348752      * |    4.77638
#>   -------
#>   seqinfo: 4 sequences from an unspecified genome; no seqlengths
Alternatively, instead of returning the genetic distances for the supplied marker
intervals, we can specify a bin_size, which tells calGeneticDist to return
genetic distances calculated for equally sized chromosome bins.
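One plausible way to move from marker intervals to equally sized bins is to allocate each interval's distance to the bins it overlaps, in proportion to the overlap. The sketch below uses hypothetical intervals and continuous coordinates; whether calGeneticDist bins in exactly this way is an assumption.

```r
# Hypothetical marker intervals with genetic distances (cM) on one chromosome.
intervals <- data.frame(start = c(150, 400), end = c(400, 900), cm = c(5, 10))

bin_size <- 250
breaks <- seq(0, 1000, by = bin_size)
bins <- data.frame(start = head(breaks, -1), end = tail(breaks, -1))

# Each bin receives, from every interval, the share of that interval's
# distance that corresponds to the overlapping base pairs.
bins$cm <- vapply(seq_len(nrow(bins)), function(b) {
  overlap <- pmax(0, pmin(intervals$end, bins$end[b]) -
                     pmax(intervals$start, bins$start[b]))
  sum(intervals$cm * overlap / (intervals$end - intervals$start))
}, numeric(1))

print(bins)
```

With proportional allocation the total distance is unchanged: the two intervals sum to 15 cM, and so do the four bins.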
dist_bin_gr <- calGeneticDist(marker_gr_cos,bin_size = 1e6)
dist_bin_gr[1:5,]
#> GRanges object with 5 ranges and 1 metadata column:
#>       seqnames          ranges strand | kosambi_cm
#>          <Rle>       <IRanges>  <Rle> |  <numeric>
#>   [1]        1        1-998753      * |          0
#>   [2]        1  998754-1997506      * |          0
#>   [3]        1 1997507-2996258      * |          0
#>   [4]        1 2996259-3995011      * |          0
#>   [5]        1 3995012-4993763      * |          0
#>   -------
#>   seqinfo: 4 sequences from mm10 genome
With the genetic distances calculated, we can sum them across all intervals. The total genetic distance is the same for the marker-based intervals and the equally sized bins.
sum(dist_bin_gr$kosambi_cm)
#> [1] 168.0926
sum(dist_gr$kosambi_cm)
#> [1] 168.0926
comapr also includes functions for visualising the genetic distances of marker
intervals or binned intervals.
plotGeneticDist(dist_bin_gr,chr = "1")
We can also plot the cumulative genetic distances of selected chromosomes:
plotGeneticDist(dist_bin_gr,chr=c("1"),cumulative = TRUE)
Multiple chromosomes:
plotGeneticDist(dist_bin_gr,cumulative = TRUE)
comapr also implements a whole-genome plot function that takes all chromosomes
available in the result and plots the cumulative genetic distance obtained by
summing all intervals across all chromosomes.
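The difference between per-chromosome and whole-genome cumulative distances can be illustrated with a small hypothetical table in base R. This only shows the arithmetic behind the two kinds of curve, not the plotting code in comapr.

```r
# Hypothetical per-interval distances (cM) on two chromosomes, in genomic order.
d <- data.frame(chr = c("1", "1", "2", "2"), cm = c(2, 3, 1, 4))

# Cumulative distance restarted on each chromosome.
d$cum_chr <- ave(d$cm, d$chr, FUN = cumsum)

# Cumulative distance carried across all chromosomes.
d$cum_genome <- cumsum(d$cm)

print(d)
```

The per-chromosome curve restarts at the first interval of chromosome 2, while the whole-genome curve keeps accumulating and ends at the total of 10 cM.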
plotWholeGenome(dist_bin_gr)
sessionInfo()
#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] BiocParallel_1.34.0  GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 
#> [4] IRanges_2.34.0       S4Vectors_0.38.0     BiocGenerics_0.46.0 
#> [7] comapr_1.4.0         BiocStyle_2.28.0    
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          rstudioapi_0.14            
#>   [3] jsonlite_1.8.4              shape_1.4.6                
#>   [5] magrittr_2.0.3              magick_2.7.4               
#>   [7] GenomicFeatures_1.52.0      farver_2.1.1               
#>   [9] rmarkdown_2.21              GlobalOptions_0.1.2        
#>  [11] BiocIO_1.10.0               zlibbioc_1.46.0            
#>  [13] vctrs_0.6.2                 memoise_2.0.1              
#>  [15] Rsamtools_2.16.0            RCurl_1.98-1.12            
#>  [17] base64enc_0.1-3             htmltools_0.5.5            
#>  [19] progress_1.2.2              curl_5.0.0                 
#>  [21] Formula_1.2-5               sass_0.4.5                 
#>  [23] bslib_0.4.2                 htmlwidgets_1.6.2          
#>  [25] plyr_1.8.8                  Gviz_1.44.0                
#>  [27] plotly_4.10.1               cachem_1.0.7               
#>  [29] GenomicAlignments_1.36.0    lifecycle_1.0.3            
#>  [31] iterators_1.0.14            pkgconfig_2.0.3            
#>  [33] Matrix_1.5-4                R6_2.5.1                   
#>  [35] fastmap_1.1.1               GenomeInfoDbData_1.2.10    
#>  [37] MatrixGenerics_1.12.0       digest_0.6.31              
#>  [39] colorspace_2.1-0            AnnotationDbi_1.62.0       
#>  [41] Hmisc_5.0-1                 RSQLite_2.3.1              
#>  [43] labeling_0.4.2              filelock_1.0.2             
#>  [45] fansi_1.0.4                 httr_1.4.5                 
#>  [47] compiler_4.3.0              withr_2.5.0                
#>  [49] bit64_4.0.5                 htmlTable_2.4.1            
#>  [51] backports_1.4.1             DBI_1.1.3                  
#>  [53] highr_0.10                  biomaRt_2.56.0             
#>  [55] rappdirs_0.3.3              DelayedArray_0.26.0        
#>  [57] rjson_0.2.21                tools_4.3.0                
#>  [59] foreign_0.8-84              nnet_7.3-18                
#>  [61] glue_1.6.2                  restfulr_0.0.15            
#>  [63] grid_4.3.0                  checkmate_2.1.0            
#>  [65] reshape2_1.4.4              cluster_2.1.4              
#>  [67] generics_0.1.3              gtable_0.3.3               
#>  [69] BSgenome_1.68.0             tidyr_1.3.0                
#>  [71] ensembldb_2.24.0            data.table_1.14.8          
#>  [73] hms_1.1.3                   xml2_1.3.3                 
#>  [75] utf8_1.2.3                  XVector_0.40.0             
#>  [77] foreach_1.5.2               pillar_1.9.0               
#>  [79] stringr_1.5.0               circlize_0.4.15            
#>  [81] dplyr_1.1.2                 BiocFileCache_2.8.0        
#>  [83] lattice_0.21-8              rtracklayer_1.60.0         
#>  [85] bit_4.0.5                   deldir_1.0-6               
#>  [87] biovizBase_1.48.0           tidyselect_1.2.0           
#>  [89] Biostrings_2.68.0           knitr_1.42                 
#>  [91] gridExtra_2.3               bookdown_0.33              
#>  [93] ProtGenerics_1.32.0         SummarizedExperiment_1.30.0
#>  [95] xfun_0.39                   Biobase_2.60.0             
#>  [97] matrixStats_0.63.0          stringi_1.7.12             
#>  [99] lazyeval_0.2.2              yaml_2.3.7                 
#> [101] evaluate_0.20               codetools_0.2-19           
#> [103] interp_1.1-4                tibble_3.2.1               
#> [105] BiocManager_1.30.20         cli_3.6.1                  
#> [107] rpart_4.1.19                munsell_0.5.0              
#> [109] jquerylib_0.1.4             dichromat_2.0-0.1          
#> [111] Rcpp_1.0.10                 dbplyr_2.3.2               
#> [113] png_0.1-8                   XML_3.99-0.14              
#> [115] parallel_4.3.0              ggplot2_3.4.2              
#> [117] blob_1.2.4                  prettyunits_1.1.1          
#> [119] latticeExtra_0.6-30         jpeg_0.1-10                
#> [121] AnnotationFilter_1.24.0     bitops_1.0-7               
#> [123] viridisLite_0.4.1           VariantAnnotation_1.46.0   
#> [125] scales_1.2.1                purrr_1.0.1                
#> [127] crayon_1.5.2                rlang_1.1.0                
#> [129] KEGGREST_1.40.0