Title: Probabilistic Spelling Correction in a Character Vector
Version: 1.0.1
Description: Automatically replaces "misspelled" words in a character vector based on their string distance from a list of words sorted by their frequency in a corpus. The default word list provided in the package comes from the Corpus of Contemporary American English. Uses the Jaro-Winkler distance metric for string similarity as implemented in van der Loo (2014) <doi:10.32614/RJ-2014-011>. The word frequency data is derived from Davies (2008-) "The Corpus of Contemporary American English (COCA)" https://www.english-corpora.org/coca/.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2
Depends: R (≥ 2.10)
Imports: hunspell, stringr, stringdist, textclean
Suggests: rmarkdown, knitr, testthat (≥ 3.0.0)
VignetteBuilder: knitr
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2025-08-29 19:11:59 UTC; runner
Author: David Brown
Maintainer: David Brown <dwb2@andrew.cmu.edu>
Repository: CRAN
Date/Publication: 2025-09-03 21:10:02 UTC
spell.replacer: Probabilistic Spelling Correction
Description
This package provides functions for automatic spelling correction in character vectors using probabilistic methods based on string distance and word frequency data from the Corpus of Contemporary American English (COCA).
Author(s)
Maintainer: David Brown <dwb2@andrew.cmu.edu> (ORCID)
COCA Word List
Description
A character vector containing the 100,000 most frequent words from the Corpus of Contemporary American English (COCA), sorted by frequency from most to least frequent. This word list serves as the default reference for spelling correction in the spell_replace function.
Usage
coca_list
Format
A character vector with 100,000 elements:
Each element is a word from COCA, with the first element being the most frequent word ("the") and subsequent elements decreasing in frequency.
Source
Corpus of Contemporary American English (COCA) https://www.english-corpora.org/coca/
Examples
# View the first 10 most frequent words
head(coca_list, 10)
# Check if a word is in the list
"hello" %in% coca_list
# Find the rank of a specific word
which(coca_list == "hello")
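A list with the same ordering (most frequent word first) can be built from any tokenized text. The following is a minimal base R sketch of that idea; the toy corpus and the name custom_list are hypothetical, and this is not a description of how coca_list itself was produced.

```r
# Toy corpus (hypothetical); a real list would come from a much larger text collection.
corpus <- c("the cat sat on the mat", "the dog sat", "a cat ran")

# Split each string into lowercase tokens on whitespace.
tokens <- unlist(strsplit(tolower(corpus), "\\s+"))

# Count each token and sort the counts from most to least frequent.
counts <- sort(table(tokens), decreasing = TRUE)

# Keep only the words; position in the vector now encodes frequency rank.
custom_list <- names(counts)

# Rank lookup works the same way as with coca_list.
match("the", custom_list)
```

A vector built this way has the shape expected by the word_list argument of spell_replace, although a list derived from a small corpus will flag many valid words as unknown.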
Correct a Single Misspelled Word
Description
Finds the best correction for a single misspelled word using string distance and frequency-based ranking from a sorted word list.
Usage
correct(word, sorted_words, ignore_punct = FALSE, threshold = 0.12)
Arguments
word: A character string containing the misspelled word.
sorted_words: A character vector of correctly spelled words, sorted from most to least frequent.
ignore_punct: Logical. If TRUE, punctuation is ignored when calculating string distance (default: FALSE).
threshold: Numeric. Maximum string distance at which an entry of sorted_words is considered a correction candidate (default: 0.12).
Value
A character string with the corrected word
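One plausible reading of "string distance and frequency-based ranking" can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: pick_candidate is a hypothetical helper, the Jaro-Winkler prefix weight p = 0.1 is assumed, ignore_punct is not modeled, and whether correct() prefers the most frequent or the closest candidate is not stated in this manual. The sketch keeps every reference word within the threshold and returns the most frequent one.

```r
library(stringdist)

# Hypothetical helper for illustration; not the package's own correct().
pick_candidate <- function(word, sorted_words, threshold = 0.12) {
  # Jaro-Winkler distance from the input word to every reference word.
  # p = 0.1 is a commonly used prefix weight (an assumption here).
  d <- stringdist(word, sorted_words, method = "jw", p = 0.1)

  # Indices of reference words close enough to count as candidates.
  candidates <- which(d <= threshold)

  # No candidate within the threshold: return the word unchanged.
  if (length(candidates) == 0) {
    return(word)
  }

  # sorted_words is ordered by frequency, so the lowest index
  # is the most frequent candidate.
  sorted_words[min(candidates)]
}

# Toy frequency-sorted list (hypothetical).
toy_words <- c("the", "receive", "relieve")

pick_candidate("recieve", toy_words)
pick_candidate("xylophone", toy_words)
```

Choosing by frequency among near matches favors common words over rare ones that happen to be slightly closer; picking the minimum distance instead would be an equally simple variant.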
Probabilistic Spelling Correction
Description
Automatically replaces misspelled words in a character vector based on their string distance from a list of words sorted by frequency in a corpus.
Usage
spell_replace(
  txt,
  word_list = coca_list,
  ignore_names = TRUE,
  threshold = 0.12,
  ignore_punct = FALSE
)
Arguments
txt: A character vector containing the text to be spell-checked.
word_list: A character vector of correctly spelled words, sorted from most to least frequent (default: coca_list).
ignore_names: Logical. If TRUE, potential proper names (capitalized words that appear multiple times) are left unchanged (default: TRUE).
threshold: Numeric. Maximum string distance at which an entry of word_list is considered a correction candidate (default: 0.12).
ignore_punct: Logical. If TRUE, punctuation is ignored when calculating string distance (default: FALSE).
Value
A character vector with corrected spellings
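A minimal sketch of a call, assuming the spell.replacer package is installed. The input sentences are made up for illustration, the arguments shown are the documented defaults written out explicitly, and no output is shown because the corrections depend on the word list and threshold.

```r
library(spell.replacer)

# Hypothetical input: one sentence with misspellings, one without.
txt <- c("I recieve the mesage every morning.",
         "This sentence has no misspellings.")

# Spell-check against the bundled COCA word list with the default settings.
fixed <- spell_replace(
  txt,
  word_list = coca_list,
  ignore_names = TRUE,
  threshold = 0.12,
  ignore_punct = FALSE
)

# Inspect the corrected text next to the original.
data.frame(original = txt, corrected = fixed)
```

Lowering threshold should make replacements more conservative, since fewer word list entries qualify as candidates; raising it admits more distant matches.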