Class BPETokenizer
- All Implemented Interfaces:
opennlp.tools.tokenize.Tokenizer
Tokenizer implementation that performs subword tokenization
using Byte Pair Encoding (BPE).
BPE iteratively merges the most frequent pair of adjacent symbols, starting from a character-level representation of each word. This allows the tokenizer to handle out-of-vocabulary words by decomposing them into known subword units.
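The merge-learning loop described above can be sketched roughly as follows. This is a minimal illustration and not the code behind BPETokenizerTrainer: the class name BpeTrainingSketch, the method learnMerges, the marker value "</w>", and the first-seen tie-breaking rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the BPETokenizerTrainer implementation. */
final class BpeTrainingSketch {

    // Hypothetical marker value; the real END_OF_WORD constant may differ.
    static final String EOW = "</w>";

    /** Learns up to maxMerges merge rules; each rule is a {left, right} pair. */
    static List<String[]> learnMerges(List<String> words, int maxMerges) {
        // Start from a character-level representation of each word
        // (charAt is a simplification that ignores surrogate pairs).
        List<List<String>> corpus = new ArrayList<>();
        for (String w : words) {
            if (w.isEmpty()) {
                continue;
            }
            List<String> symbols = new ArrayList<>();
            for (int i = 0; i < w.length(); i++) {
                symbols.add(String.valueOf(w.charAt(i)));
            }
            int last = symbols.size() - 1;
            symbols.set(last, symbols.get(last) + EOW);
            corpus.add(symbols);
        }

        List<String[]> merges = new ArrayList<>();
        for (int step = 0; step < maxMerges; step++) {
            // Count adjacent pairs; insertion order breaks ties (an assumption).
            Map<List<String>, Integer> counts = new LinkedHashMap<>();
            for (List<String> symbols : corpus) {
                for (int i = 0; i + 1 < symbols.size(); i++) {
                    counts.merge(List.of(symbols.get(i), symbols.get(i + 1)), 1, Integer::sum);
                }
            }
            List<String> best = null;
            int bestCount = 0;
            for (Map.Entry<List<String>, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            if (best == null) {
                break; // no adjacent pairs left to merge
            }
            merges.add(new String[] {best.get(0), best.get(1)});

            // Replace every occurrence of the chosen pair, left to right.
            for (List<String> symbols : corpus) {
                for (int i = 0; i + 1 < symbols.size(); i++) {
                    if (symbols.get(i).equals(best.get(0)) && symbols.get(i + 1).equals(best.get(1))) {
                        symbols.set(i, best.get(0) + best.get(1));
                        symbols.remove(i + 1);
                    }
                }
            }
        }
        return merges;
    }
}
```

The order of the returned list is the priority order: earlier entries were more frequent at the time they were learned.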
Usage:
// Train a BPE model from a corpus
BPETokenizerTrainer trainer = new BPETokenizerTrainer();
BPEModel model = trainer.train(corpus, 10000, "en");
// Save the model for later reuse
model.serialize(Path.of("bpe-en.bin"));
// Load and tokenize
BPEModel loaded = new BPEModel(Path.of("bpe-en.bin"));
BPETokenizer tokenizer = new BPETokenizer(loaded);
String[] tokens = tokenizer.tokenize("unseen words are split into subwords");
The tokenizer first splits text on whitespace, then applies learned merge
operations to each word independently. Words are decomposed into characters
with an END_OF_WORD marker on the final character, then merges are
applied in priority order (as learned during training) until no more merges
are applicable. The resulting subword units are returned as tokens.
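The encoding steps above could look roughly like the following sketch. It is not the BPETokenizer source: the class BpeEncodeSketch, its method names, and the marker value "</w>" are hypothetical, and merge rules are modeled as plain {left, right} string arrays instead of BPETokenizer.SymbolPair. The sketch leaves the marker on word-final symbols; whether the real tokenizer strips it from returned tokens is not stated here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the BPETokenizer implementation. */
final class BpeEncodeSketch {

    // Hypothetical marker value; the real END_OF_WORD constant may differ.
    static final String EOW = "</w>";

    /** Applies merges (ordered by priority, highest first) to a single word. */
    static List<String> encodeWord(String word, List<String[]> merges) {
        // Lower rank means higher priority, i.e. learned earlier in training.
        Map<List<String>, Integer> rank = new HashMap<>();
        for (int i = 0; i < merges.size(); i++) {
            rank.putIfAbsent(List.of(merges.get(i)[0], merges.get(i)[1]), i);
        }

        // Decompose into characters and mark the final one.
        List<String> symbols = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            symbols.add(String.valueOf(word.charAt(i)));
        }
        if (symbols.isEmpty()) {
            return symbols;
        }
        int last = symbols.size() - 1;
        symbols.set(last, symbols.get(last) + EOW);

        // Repeatedly merge the highest-priority adjacent pair until none applies.
        while (symbols.size() > 1) {
            int bestIdx = -1;
            int bestRank = Integer.MAX_VALUE;
            for (int i = 0; i + 1 < symbols.size(); i++) {
                Integer r = rank.get(List.of(symbols.get(i), symbols.get(i + 1)));
                if (r != null && r < bestRank) {
                    bestRank = r;
                    bestIdx = i;
                }
            }
            if (bestIdx < 0) {
                break;
            }
            symbols.set(bestIdx, symbols.get(bestIdx) + symbols.get(bestIdx + 1));
            symbols.remove(bestIdx + 1);
        }
        return symbols;
    }

    /** Splits on whitespace and encodes each word independently. */
    static List<String> tokenize(String text, List<String[]> merges) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.addAll(encodeWord(word, merges));
            }
        }
        return tokens;
    }
}
```

With the two rules ("l", "o") and ("lo", "w</w>"), the word "low" collapses into a single symbol, while "lot" stops at "lo" plus a marked "t" because no rule covers that pair.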
For reference see:
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. https://arxiv.org/abs/1508.07909
-
Nested Class Summary
Nested Classes
Modifier and Type: static final record
Class: BPETokenizer.SymbolPair
Description: Represents a pair of adjacent symbols in BPE.
-
Field Summary
Fields
Modifier and Type: static final String
Field: END_OF_WORD
Description: Suffix appended to the last symbol of each word during BPE encoding to distinguish word-final characters from word-internal ones.
-
Constructor Summary
Constructors
Constructor: BPETokenizer(BPEModel model)
Description: Initializes a BPETokenizer from a trained BPEModel.
-
Method Summary
-
Field Details
-
END_OF_WORD
Suffix appended to the last symbol of each word during BPE encoding to distinguish word-final characters from word-internal ones. Users constructing BPETokenizer.SymbolPair merge rules manually must use this constant to mark word-final symbols (e.g., new SymbolPair("a", "b" + END_OF_WORD)).
-
-
Constructor Details
-
BPETokenizer
Initializes a BPETokenizer from a trained BPEModel.
- Parameters:
model - The trained BPE model. Must not be null.
- Throws:
IllegalArgumentException - if model is null.
-
-
Method Details
-
tokenize
Splits the input text on whitespace, then applies BPE merge operations to each word to produce subword tokens. Words not fully covered by learned merges are decomposed into individual characters.
- Specified by:
tokenize in interface opennlp.tools.tokenize.Tokenizer
-
tokenizePos
Returns Span offsets into the original text for each subword token. Each span maps back to the exact character range in the input string.
- Specified by:
tokenizePos in interface opennlp.tools.tokenize.Tokenizer
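One way to picture the offset mapping is the sketch below. It is an illustration under assumptions, not the tokenizePos implementation: the class SubwordSpanSketch and its spans method are hypothetical, offsets are returned as plain {start, end} int arrays (end exclusive) instead of Span objects, and the segmenter argument stands in for the BPE step, assumed to return pieces that concatenate back to the original word with no end-of-word marker attached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Illustrative sketch only; not the tokenizePos implementation. */
final class SubwordSpanSketch {

    /**
     * Returns {start, end} offsets (end exclusive) into text for every
     * subword piece. The segmenter must return pieces whose concatenation
     * equals the word it was given.
     */
    static List<int[]> spans(String text, Function<String, List<String>> segmenter) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip whitespace between words.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one whitespace-delimited word.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (start == i) {
                break; // only trailing whitespace remained
            }
            // Lay the pieces end to end, starting at the word's offset.
            int offset = start;
            for (String piece : segmenter.apply(text.substring(start, i))) {
                out.add(new int[] {offset, offset + piece.length()});
                offset += piece.length();
            }
        }
        return out;
    }
}
```

Because each piece starts where the previous one ends, text.substring(start, end) for any returned pair reproduces that piece exactly, including for input with leading or repeated whitespace.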
-