Package opennlp.tools.tokenize
Class WordpieceTokenizer
java.lang.Object
opennlp.tools.tokenize.WordpieceTokenizer
- All Implemented Interfaces:
Tokenizer
A Tokenizer implementation which performs tokenization using word pieces.
Adapted under the MIT license from https://github.com/robrua/easy-bert.
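The word-piece scheme splits words that are not in the vocabulary into known sub-word units, with every piece after the first carrying a "##" continuation prefix; a word with no matching pieces collapses to the unknown token. The following is a minimal, self-contained sketch of the greedy longest-match-first strategy, not the class's actual implementation; the vocabulary and helper class here are illustrative only:

```java
import java.util.*;

public class WordpieceSketch {
    // Greedy longest-match-first split of a single word into word pieces.
    // Pieces after the first carry the "##" continuation prefix.
    static List<String> wordpiece(String word, Set<String> vocab, String unknownToken) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate substring until it is found in the vocabulary.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matched: the whole word maps to the unknown token.
                return Collections.singletonList(unknownToken);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("token", "##ize", "##r"));
        System.out.println(wordpiece("tokenizer", vocab, "[UNK]")); // [token, ##ize, ##r]
        System.out.println(wordpiece("qwerty", vocab, "[UNK]"));    // [[UNK]]
    }
}
```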
Field Summary
Fields
Modifier and Type	Field	Description
static final String	BERT_CLS_TOKEN	BERT classification token: [CLS].
static final String	BERT_SEP_TOKEN	BERT separator token: [SEP].
static final String	BERT_UNK_TOKEN	BERT unknown token: [UNK].
static final String	ROBERTA_CLS_TOKEN	RoBERTa classification token: <s>.
static final String	ROBERTA_SEP_TOKEN	RoBERTa separator token.
static final String	ROBERTA_UNK_TOKEN	RoBERTa unknown token.
Constructor Summary
Constructors
Constructor	Description
WordpieceTokenizer(Set<String> vocabulary)
WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
WordpieceTokenizer(Set<String> vocabulary, String classificationToken, String separatorToken, String unknownToken)	Initializes a WordpieceTokenizer with a vocabulary and custom special tokens.
Method Summary
Modifier and Type	Method	Description
int	getMaxTokenLength()	Returns the maximum token length.
String[]	tokenize(String s)	Splits a string into its atomic parts.
Span[]	tokenizePos(String text)	Finds the boundaries of atomic parts in a string.
-
Field Details

BERT_CLS_TOKEN
public static final String BERT_CLS_TOKEN
BERT classification token: [CLS].

BERT_SEP_TOKEN
public static final String BERT_SEP_TOKEN
BERT separator token: [SEP].

BERT_UNK_TOKEN
public static final String BERT_UNK_TOKEN
BERT unknown token: [UNK].

ROBERTA_CLS_TOKEN
public static final String ROBERTA_CLS_TOKEN
RoBERTa classification token: <s>.

ROBERTA_SEP_TOKEN
public static final String ROBERTA_SEP_TOKEN
RoBERTa separator token.

ROBERTA_UNK_TOKEN
public static final String ROBERTA_UNK_TOKEN
RoBERTa unknown token.
-
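These constants exist so that callers can frame model input correctly: a single sequence is conventionally wrapped as classification token, word pieces, separator token. A small illustration, using literal values that mirror the fields documented above (real code would reference the WordpieceTokenizer constants instead of redeclaring them):

```java
import java.util.*;

public class SpecialTokensDemo {
    // Literal values mirroring the constants documented above.
    static final String BERT_CLS_TOKEN = "[CLS]";
    static final String BERT_SEP_TOKEN = "[SEP]";

    // Wraps already-split word pieces with the classification and separator tokens.
    static List<String> frame(List<String> pieces, String cls, String sep) {
        List<String> framed = new ArrayList<>();
        framed.add(cls);
        framed.addAll(pieces);
        framed.add(sep);
        return framed;
    }

    public static void main(String[] args) {
        List<String> pieces = Arrays.asList("token", "##ize", "##r");
        // RoBERTa-style framing would pass <s> and the RoBERTa separator instead.
        System.out.println(frame(pieces, BERT_CLS_TOKEN, BERT_SEP_TOKEN));
        // [[CLS], token, ##ize, ##r, [SEP]]
    }
}
```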
-
Constructor Details

WordpieceTokenizer
public WordpieceTokenizer(Set<String> vocabulary)
Parameters:
vocabulary - A set of tokens considered the vocabulary.

WordpieceTokenizer
public WordpieceTokenizer(Set<String> vocabulary, int maxTokenLength)
Parameters:
vocabulary - A set of tokens considered the vocabulary.
maxTokenLength - A non-negative number that is used as the maximum token length.

WordpieceTokenizer
public WordpieceTokenizer(Set<String> vocabulary, String classificationToken, String separatorToken, String unknownToken)
Initializes a WordpieceTokenizer with a vocabulary and custom special tokens. This allows support for models like RoBERTa that use different special tokens instead of the BERT defaults.
Parameters:
vocabulary - The vocabulary.
classificationToken - The CLS token.
separatorToken - The SEP token.
unknownToken - The UNK token.
-
Method Details

tokenizePos
public Span[] tokenizePos(String text)
Description copied from interface: Tokenizer
Finds the boundaries of atomic parts in a string.
Specified by:
tokenizePos in interface Tokenizer
Parameters:
text - The string to be tokenized.
Returns:
The Span[] with the spans (offsets into text) for each token as the individual array elements.
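tokenizePos reports each piece as character offsets into the input rather than as strings, and the "##" continuation prefix is markup that does not occupy characters in the underlying text. A sketch of how offsets line up with the pieces of a single word, using plain int[] pairs in place of opennlp.tools.util.Span (the helper below is illustrative, not the class's implementation):

```java
import java.util.*;

public class SpanSketch {
    // Maps word pieces back to [start, end) character offsets in the original word.
    // The "##" continuation prefix is not part of the underlying text.
    static List<int[]> piecePositions(String word, List<String> pieces) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        for (String piece : pieces) {
            String surface = piece.startsWith("##") ? piece.substring(2) : piece;
            spans.add(new int[] { start, start + surface.length() });
            start += surface.length();
        }
        return spans;
    }

    public static void main(String[] args) {
        List<int[]> spans = piecePositions("tokenizer", Arrays.asList("token", "##ize", "##r"));
        for (int[] s : spans) {
            System.out.println(s[0] + ".." + s[1]); // 0..5, then 5..8, then 8..9
        }
    }
}
```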
-
tokenize
public String[] tokenize(String s)
Description copied from interface: Tokenizer
Splits a string into its atomic parts.
Specified by:
tokenize in interface Tokenizer
-
getMaxTokenLength
public int getMaxTokenLength()
Returns:
The maximum token length.