Instructions on using selectKSigs on selecting the number of mutational signatures using a perplexity-based measure and cross-validation
The R package HiLDA is developed under the Bayesian framework to select the number of mutational signatures based on perplexity and cross-validation. The mutation signature is defined based on the independent model proposed by Shiraishi’s et al.Â
Shiraishi et al. A simple model-based approach to inferring and visualizing cancer mutation signatures, bioRxiv, doi: http://dx.doi.org/10.1101/019901.
Zhi Yang, Paul Marjoram, Kimberly D. Siegmund. Selecting the number of mutational signatures using a perplexity-based measure and cross-validation.
selectKSigs requires several CRAN and Bioconductor R packages to be installed. Dependencies are usually handled automatically, when installing the package using the following commands:
install.packages("BiocManager")
BiocManager::install("selectKSigs")[NOTE: Ignore the first line if you already have installed the BiocManager.]
You can also download the newest version from the GitHub using devtools:
devtools::install_github("USCbiostats/selectKSigs")selectKSigs is a package built on some basic functions from pmsignature
including how to read the input data. Here is an example from pmsignature on
the input data, mutation features are elements used for categorizing mutations
such as:
sample1 chr1  100   A   C   
sample1 chr1    200 A   T   
sample1 chr2    100 G   T   
sample2 chr1    300 T   C   
sample3 chr3    400 T   C   Here, inputFile is the path for the input file. numBases is the number of flanking bases to consider including the central base (if you want to consider two 5’ and 3’ bases, then set 5). Also, you can add transcription direction information using trDir. numSig sets the number of mutation signatures estimated from the input data. You will see a warning message on some mutations are being removed.
library(HiLDA)## Loading required package: ggplot2library(tidyr)
library(ggplot2)
library(dplyr)## 
## Attaching package: 'dplyr'## The following objects are masked from 'package:stats':
## 
##     filter, lag## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, unioninputFile <- system.file("extdata/esophageal.mp.txt.gz", package="HiLDA")
G <- hildaReadMPFile(inputFile, numBases=5, trDir=TRUE)## Warning in hildaReadMPFile(inputFile, numBases = 5, trDir = TRUE): Out of 24861
## mutations, we could obtaintranscription direction information for 24728
## mutation. Other mutations are removed.Also, we also provided a small simulated dataset which contains 10 mutational catalogs and used it for demonstrating the key functions in selectKSigs. We start with loading the sample dataset G stored as extdata/sample.rdata.
library(selectKSigs)
load(system.file("extdata/sample.rdata", package = "selectKSigs"))After we read in the sample data G, we can run the process from selectKSigs. Here, we specify the inputG as G, the number of cross-validation folds, kfold to be 3, the number of replications, nRep, to be 3, and the upper limit of the K values for exploration to be 7.
After we obtained the results, we can plot each measure by the range of K values that were refitted during the calculation. The optimal value of K is achieved at its minimum value highlighted in grey.
results$Kvalue <- seq_len(nrow(results)) + 1
results_df <- gather(results, Method, value, -Kvalue) %>% 
  group_by(Method) %>%
  mutate(xmin = which.min(value) + 1 - 0.1,
         xmax = which.min(value) + 1 + 0.1)
ggplot(results_df) +
  geom_point(aes(x = Kvalue, y = value, color = Method), size = 2) +
  facet_wrap(~ Method, scales = "free") + 
  geom_rect(mapping = aes(xmin = xmin, xmax = xmax, 
                          ymin = -Inf, ymax = Inf),
            fill = 'grey', alpha = 0.05) +
  theme_bw()+
  xlab("Number of signatures")Here is the output of sessionInfo() for reproducibility in the future.
sessionInfo()## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] selectKSigs_1.20.0 dplyr_1.1.4        tidyr_1.3.1        HiLDA_1.22.0      
## [5] ggplot2_3.5.2      BiocStyle_2.36.0  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1                       
##  [2] farver_2.1.2                           
##  [3] blob_1.2.4                             
##  [4] rjags_4-17                             
##  [5] Biostrings_2.76.0                      
##  [6] bitops_1.0-9                           
##  [7] fastmap_1.2.0                          
##  [8] RCurl_1.98-1.17                        
##  [9] GenomicAlignments_1.44.0               
## [10] XML_3.99-0.18                          
## [11] digest_0.6.37                          
## [12] lifecycle_1.0.4                        
## [13] R2jags_0.8-9                           
## [14] KEGGREST_1.48.0                        
## [15] RSQLite_2.3.9                          
## [16] magrittr_2.0.3                         
## [17] compiler_4.5.0                         
## [18] rlang_1.1.6                            
## [19] sass_0.4.10                            
## [20] tools_4.5.0                            
## [21] yaml_2.3.10                            
## [22] rtracklayer_1.68.0                     
## [23] knitr_1.50                             
## [24] labeling_0.4.3                         
## [25] S4Arrays_1.8.0                         
## [26] bit_4.6.0                              
## [27] curl_6.2.2                             
## [28] DelayedArray_0.34.0                    
## [29] abind_1.4-8                            
## [30] BiocParallel_1.42.0                    
## [31] withr_3.0.2                            
## [32] purrr_1.0.4                            
## [33] BiocGenerics_0.54.0                    
## [34] grid_4.5.0                             
## [35] stats4_4.5.0                           
## [36] colorspace_2.1-1                       
## [37] gtools_3.9.5                           
## [38] scales_1.3.0                           
## [39] tinytex_0.57                           
## [40] SummarizedExperiment_1.38.0            
## [41] cli_3.6.4                              
## [42] rmarkdown_2.29                         
## [43] crayon_1.5.3                           
## [44] generics_0.1.3                         
## [45] httr_1.4.7                             
## [46] rjson_0.2.23                           
## [47] DBI_1.2.3                              
## [48] cachem_1.1.0                           
## [49] stringr_1.5.1                          
## [50] parallel_4.5.0                         
## [51] AnnotationDbi_1.70.0                   
## [52] BiocManager_1.30.25                    
## [53] XVector_0.48.0                         
## [54] restfulr_0.0.15                        
## [55] matrixStats_1.5.0                      
## [56] vctrs_0.6.5                            
## [57] boot_1.3-31                            
## [58] Matrix_1.7-3                           
## [59] jsonlite_2.0.0                         
## [60] bookdown_0.43                          
## [61] IRanges_2.42.0                         
## [62] S4Vectors_0.46.0                       
## [63] bit64_4.6.0-1                          
## [64] magick_2.8.6                           
## [65] GenomicFeatures_1.60.0                 
## [66] jquerylib_0.1.4                        
## [67] R2WinBUGS_2.1-22.1                     
## [68] glue_1.8.0                             
## [69] codetools_0.2-20                       
## [70] cowplot_1.1.3                          
## [71] stringi_1.8.7                          
## [72] gtable_0.3.6                           
## [73] GenomeInfoDb_1.44.0                    
## [74] BiocIO_1.18.0                          
## [75] GenomicRanges_1.60.0                   
## [76] UCSC.utils_1.4.0                       
## [77] munsell_0.5.1                          
## [78] tibble_3.2.1                           
## [79] pillar_1.10.2                          
## [80] htmltools_0.5.8.1                      
## [81] BSgenome.Hsapiens.UCSC.hg19_1.4.3      
## [82] GenomeInfoDbData_1.2.14                
## [83] BSgenome_1.76.0                        
## [84] R6_2.6.1                               
## [85] evaluate_1.0.3                         
## [86] lattice_0.22-7                         
## [87] Biobase_2.68.0                         
## [88] png_0.1-8                              
## [89] Rsamtools_2.24.0                       
## [90] memoise_2.0.1                          
## [91] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [92] bslib_0.9.0                            
## [93] Rcpp_1.0.14                            
## [94] coda_0.19-4.1                          
## [95] SparseArray_1.8.0                      
## [96] xfun_0.52                              
## [97] MatrixGenerics_1.20.0                  
## [98] forcats_1.0.0                          
## [99] pkgconfig_2.0.3