Class BPETokenizer

java.lang.Object
opennlp.tools.tokenize.BPETokenizer
All Implemented Interfaces:
opennlp.tools.tokenize.Tokenizer

public class BPETokenizer extends Object implements opennlp.tools.tokenize.Tokenizer
A Tokenizer implementation that performs subword tokenization using Byte Pair Encoding (BPE).

BPE iteratively merges the most frequent pair of adjacent symbols, starting from a character-level representation of each word. This allows the tokenizer to handle out-of-vocabulary words by decomposing them into known subword units.
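One training iteration of the merge-learning process described above can be sketched as follows. This is illustrative only, not the trainer's actual implementation: the class and method names are invented, and a real trainer would also weight each symbol sequence by its word frequency in the corpus.

```java
import java.util.*;

public class MergeStep {
    // One BPE training iteration over a tiny corpus of symbol sequences:
    // count adjacent pairs, then merge the most frequent pair everywhere.
    static List<List<String>> mergeMostFrequent(List<List<String>> words) {
        // Count every adjacent symbol pair (NUL-joined key keeps pairs unambiguous).
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> w : words) {
            for (int i = 0; i + 1 < w.size(); i++) {
                counts.merge(w.get(i) + "\u0000" + w.get(i + 1), 1, Integer::sum);
            }
        }
        String best = Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        String left = best.split("\u0000")[0];
        String right = best.split("\u0000")[1];
        // Rewrite every sequence, replacing each occurrence of the winning pair.
        List<List<String>> out = new ArrayList<>();
        for (List<String> w : words) {
            List<String> merged = new ArrayList<>();
            for (int i = 0; i < w.size(); i++) {
                if (i + 1 < w.size() && w.get(i).equals(left)
                        && w.get(i + 1).equals(right)) {
                    merged.add(left + right);
                    i++; // skip the consumed right-hand symbol
                } else {
                    merged.add(w.get(i));
                }
            }
            out.add(merged);
        }
        return out;
    }

    public static void main(String[] args) {
        // "l"+"o" occurs three times, more than any other pair, so it wins.
        List<List<String>> corpus = List.of(
                List.of("l", "o", "w"),
                List.of("l", "o", "w", "e", "r"),
                List.of("l", "o", "g"));
        System.out.println(mergeMostFrequent(corpus));
        // prints [[lo, w], [lo, w, e, r], [lo, g]]
    }
}
```

Repeating this step until the vocabulary reaches the target size yields the ordered merge list the tokenizer later replays.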

Usage:


 // Train a BPE model from a corpus
 BPETokenizerTrainer trainer = new BPETokenizerTrainer();
 BPEModel model = trainer.train(corpus, 10000, "en");

 // Save the model for later reuse
 model.serialize(Path.of("bpe-en.bin"));

 // Load and tokenize
 BPEModel loaded = new BPEModel(Path.of("bpe-en.bin"));
 BPETokenizer tokenizer = new BPETokenizer(loaded);
 String[] tokens = tokenizer.tokenize("unseen words are split into subwords");
 

The tokenizer first splits text on whitespace, then applies learned merge operations to each word independently. Words are decomposed into characters with an END_OF_WORD marker on the final character, then merges are applied in priority order (as learned during training) until no more merges are applicable. The resulting subword units are returned as tokens.
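The encoding loop described above can be sketched as a standalone method. This is a minimal illustration under stated assumptions: the `MergeRule` record, the `encode` method, and the `"</w>"` marker value are invented for the example (the class exposes the real marker as the END_OF_WORD constant), and it applies each rule in a single pass rather than rescanning for newly applicable higher-priority merges.

```java
import java.util.*;

public class BpeSketch {
    static final String END_OF_WORD = "</w>"; // assumed marker value

    // Illustrative stand-in for a learned merge rule; list order = priority order.
    record MergeRule(String left, String right) {}

    // Decompose a word into characters (marker on the last one), then
    // apply merges in priority order until no rule matches.
    static List<String> encode(String word, List<MergeRule> merges) {
        List<String> symbols = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            String s = String.valueOf(word.charAt(i));
            symbols.add(i == word.length() - 1 ? s + END_OF_WORD : s);
        }
        for (MergeRule rule : merges) {
            for (int i = 0; i + 1 < symbols.size(); ) {
                if (symbols.get(i).equals(rule.left())
                        && symbols.get(i + 1).equals(rule.right())) {
                    // Collapse the adjacent pair into one subword symbol.
                    symbols.set(i, rule.left() + rule.right());
                    symbols.remove(i + 1);
                } else {
                    i++;
                }
            }
        }
        return symbols;
    }

    public static void main(String[] args) {
        List<MergeRule> merges = List.of(
                new MergeRule("l", "o"),
                new MergeRule("lo", "w" + END_OF_WORD));
        System.out.println(encode("low", merges)); // prints [low</w>]
    }
}
```

A word for which no rule matches simply falls through the loop unchanged, leaving its character-level symbols as the tokens, which is how out-of-vocabulary input degrades gracefully.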

  • Field Details

    • END_OF_WORD

      public static final String END_OF_WORD
      Suffix appended to the last symbol of each word during BPE encoding to distinguish word-final characters from word-internal ones.

      Users constructing BPETokenizer.SymbolPair merge rules manually must use this constant to mark word-final symbols (e.g., new SymbolPair("a", "b" + END_OF_WORD)).

  • Constructor Details

  • Method Details

    • tokenize

      public String[] tokenize(String text)

      Splits the input text on whitespace, then applies BPE merge operations to each word to produce subword tokens. Words for which no learned merges apply remain decomposed into individual character tokens.

      Specified by:
      tokenize in interface opennlp.tools.tokenize.Tokenizer
    • tokenizePos

      public opennlp.tools.util.Span[] tokenizePos(String text)

      Returns Span offsets into the original text for each subword token. Each span maps back to the exact character range in the input string.

      Specified by:
      tokenizePos in interface opennlp.tools.tokenize.Tokenizer
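The offset bookkeeping behind span reconstruction can be sketched independently of the tokenizer. This hypothetical helper assumes the subword tokens for one whitespace-split word and its start offset, strips the assumed `"</w>"` marker, and accumulates character lengths to recover `[start, end)` ranges; it is an illustration of the mapping, not the actual implementation.

```java
import java.util.*;

public class SpanSketch {
    static final String END_OF_WORD = "</w>"; // assumed marker value

    // Given a word's start offset in the original text and its subword
    // tokens, recover [start, end) character offsets for each token.
    static int[][] spansFor(int wordStart, List<String> subwords) {
        int[][] spans = new int[subwords.size()][2];
        int pos = wordStart;
        for (int i = 0; i < subwords.size(); i++) {
            String t = subwords.get(i);
            // The marker is an encoding artifact, not part of the input text.
            if (t.endsWith(END_OF_WORD)) {
                t = t.substring(0, t.length() - END_OF_WORD.length());
            }
            spans[i][0] = pos;
            spans[i][1] = pos + t.length();
            pos += t.length();
        }
        return spans;
    }

    public static void main(String[] args) {
        // "unseen" at offset 0, hypothetically split into "un" + "seen</w>"
        int[][] s = spansFor(0, List.of("un", "seen</w>"));
        System.out.println(Arrays.deepToString(s)); // prints [[0, 2], [2, 6]]
    }
}
```

Because marker suffixes are stripped before lengths are summed, consecutive spans tile the word exactly, which is what lets each span map back to the precise character range in the input string.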