Class TokenizerME
- All Implemented Interfaces:
opennlp.tools.ml.Probabilistic, opennlp.tools.tokenize.Tokenizer
Tokenizer for converting raw text into separated tokens. It uses
Maximum Entropy to make its decisions. The features are loosely
based on Jeff Reynar's UPenn thesis "Topic Segmentation:
Algorithms and Applications", which is available from his
homepage: http://www.cis.upenn.edu/~jcreynar.
This implementation needs a statistical model to tokenize text. The model reproduces
the tokenization observed in the training data used to create it.
The TokenizerModel class encapsulates that model and provides
methods to create it from the binary representation.
A tokenizer instance is thread-safe. One tokenizer can be shared across multiple threads to save memory.
Note: In container environments with classloader isolation (e.g. Jakarta EE), ensure instances do
not outlive the application's lifecycle, as underlying components use ThreadLocal state that may
pin the classloader.
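The cleanup pattern behind this note can be sketched in plain Java. The class below is a hypothetical stand-in, not OpenNLP code: `SCRATCH` plays the role of whatever per-thread state the tokenizer keeps internally, and `SCRATCH.remove()` plays the role of `clearThreadLocalState()`. It assumes work runs on pooled threads and shows the state being dropped in a `finally` block before the thread goes back to the pool.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadLocalCleanupSketch {
    // Counts how often the per-thread state had to be created from scratch.
    static final AtomicInteger INITIALIZATIONS = new AtomicInteger();

    // Hypothetical per-thread state (assumption: stands in for the tokenizer's internals).
    private static final ThreadLocal<StringBuilder> SCRATCH = ThreadLocal.withInitial(() -> {
        INITIALIZATIONS.incrementAndGet();
        return new StringBuilder();
    });

    // Uses the per-thread buffer, then removes it before the pooled thread is reused.
    static String runPooled(ExecutorService pool, String input) {
        try {
            return pool.submit(() -> {
                try {
                    StringBuilder sb = SCRATCH.get();
                    sb.setLength(0);
                    return sb.append(input).toString();
                } finally {
                    SCRATCH.remove(); // same role as clearThreadLocalState() on the tokenizer
                }
            }).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Runs the given number of tasks on one pooled thread and reports how many
    // times the state was re-created during those tasks.
    static int countInitializations(int tasks) {
        int before = INITIALIZATIONS.get();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (int i = 0; i < tasks; i++) {
                runPooled(pool, "task " + i);
            }
        } finally {
            pool.shutdown();
        }
        return INITIALIZATIONS.get() - before;
    }

    public static void main(String[] args) {
        System.out.println(countInitializations(2));
    }
}
```

Because every task removes its state, the single pooled thread re-creates the buffer once per task, so nothing outlives the task that used it. With a real tokenizer the `finally` block would call `clearThreadLocalState()` on the shared instance instead.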
To train a new model, use train(ObjectStream, TokenizerFactory, TrainingParameters).
Sample usage:
InputStream modelIn;
...
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("A sentence to be tokenized.");
-
Field Summary
Fields
SPLIT - Constant that indicates a token split.
NO_SPLIT - Constant that indicates no token split.
-
Constructor Summary
Constructors
TokenizerME(String language) - Initializes a TokenizerME by downloading a default model.
TokenizerME(TokenizerModel model) - Instantiates a TokenizerME with an existing TokenizerModel.
TokenizerME(TokenizerModel model, Dictionary abbDict) - Instantiates a TokenizerME with an existing TokenizerModel.
-
Method Summary
Modifier and Type, Method, Description
void clearThreadLocalState() - Removes thread-local state to prevent classloader leaks in container environments.
double[] getTokenProbabilities() - Deprecated, for removal: This API element is subject to removal in a future version.
double[] probs() - The sequence was determined based on the previous call to tokenizePos(String).
void setKeepNewLines(boolean arg0)
String[] tokenize(String)
opennlp.tools.util.Span[] tokenizePos(String d) - Tokenizes the string.
static TokenizerModel train(opennlp.tools.util.ObjectStream<opennlp.tools.tokenize.TokenSample> samples, TokenizerFactory factory, opennlp.tools.util.TrainingParameters mlParams) - Trains a model for the TokenizerME.
boolean useAlphaNumericOptimization()
-
Field Details
-
SPLIT
Constant that indicates a token split.
-
NO_SPLIT
Constant that indicates no token split.
-
-
Constructor Details
-
TokenizerME
Initializes a TokenizerME by downloading a default model.
- Parameters:
language - The language of the tokenizer.
- Throws:
IOException - Thrown if the model cannot be downloaded or saved.
-
TokenizerME
Instantiates a TokenizerME with an existing TokenizerModel.
- Parameters:
model - The TokenizerModel to be used.
-
TokenizerME
Instantiates a TokenizerME with an existing TokenizerModel.
- Parameters:
model - The TokenizerModel to be used.
abbDict - The Dictionary to be used. It must fit the language of the model.
-
-
Method Details
-
probs
public double[] probs()
Returns the probabilities for the token sequence determined by the previous call to tokenizePos(String).
- Specified by:
probs in interface opennlp.tools.ml.Probabilistic
- Returns:
An array with as many probabilities as there were tokens when tokenizePos(String) was last called; if not applicable, an empty array.
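As a rough illustration of how the returned array might be consumed, the sketch below pairs each token with its probability and collects the tokens that fall under a threshold. It assumes that `probs[i]` belongs to token `i`, which the description above suggests but does not spell out. The class name, method name, threshold, and sample values are all invented for the example, and plain arrays stand in for real tokenizer output.

```java
import java.util.ArrayList;
import java.util.List;

final class ProbsSketch {
    // Returns the tokens whose probability is below the threshold.
    // Assumption: probs[i] is the probability for tokens[i].
    static List<String> lowConfidence(String[] tokens, double[] probs, double threshold) {
        if (tokens.length != probs.length) {
            throw new IllegalArgumentException("expected one probability per token");
        }
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(tokens[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Hand-written values for illustration; a real run would take them
        // from the tokenizer after a tokenization call.
        String[] tokens = {"Mr.", "Smith", "left", "."};
        double[] probs = {0.62, 0.99, 0.98, 0.97};
        System.out.println(lowConfidence(tokens, probs, 0.9));
    }
}
```

Since the probabilities describe the most recent tokenization only, they should be read right after that call and before the same tokenizer processes another string.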
-
getTokenProbabilities
Deprecated, for removal: This API element is subject to removal in a future version.
Use probs() instead.
- Returns:
The probabilities associated with the most recent call to tokenizePos(String); if not applicable, an empty array.
-
tokenizePos
Tokenizes the string.
- Specified by:
tokenizePos in interface opennlp.tools.tokenize.Tokenizer
- Parameters:
d - The string to be tokenized.
- Returns:
A Span array containing individual tokens as elements.
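To show what working with offset-based results looks like, here is a small sketch that cuts token text out of the input string. `Offset` is a hypothetical stand-in for opennlp.tools.util.Span, assumed to carry an inclusive start and an exclusive end as character positions in the original string. The offsets are written by hand rather than produced by a tokenizer.

```java
final class SpanSketch {
    // Cuts the text covered by each offset pair out of the original string.
    static String[] coveredText(String text, Offset[] spans) {
        String[] out = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            out[i] = text.substring(spans[i].start, spans[i].end);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello, world.";
        // Hand-written offsets for illustration only.
        Offset[] spans = {
            new Offset(0, 5), new Offset(5, 6), new Offset(7, 12), new Offset(12, 13)
        };
        System.out.println(String.join("|", coveredText(text, spans))); // prints Hello|,|world|.
    }
}

// Hypothetical stand-in for opennlp.tools.util.Span: start inclusive, end exclusive.
final class Offset {
    final int start;
    final int end;

    Offset(int start, int end) {
        this.start = start;
        this.end = end;
    }
}
```

Offsets keep the link back to the source text, which is useful when tokens need to be highlighted or aligned with other annotations on the same string.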
-
clearThreadLocalState
public void clearThreadLocalState()
Removes thread-local state to prevent classloader leaks in container environments. Call when the thread is returned to a pool or the tokenizer is no longer needed.
-
train
public static TokenizerModel train(opennlp.tools.util.ObjectStream<opennlp.tools.tokenize.TokenSample> samples, TokenizerFactory factory, opennlp.tools.util.TrainingParameters mlParams) throws IOException
Trains a model for the TokenizerME.
- Parameters:
samples - The samples used for the training.
factory - A TokenizerFactory to get resources from.
mlParams - The machine learning train parameters.
- Returns:
A trained TokenizerModel.
- Throws:
IOException - Thrown during IO on a temp file created during training, or if reading from the ObjectStream fails.
-
useAlphaNumericOptimization
public boolean useAlphaNumericOptimization()
- Returns:
true if the tokenizer uses alphanumeric optimization, false otherwise.
-
tokenize
- Specified by:
tokenize in interface opennlp.tools.tokenize.Tokenizer
-
setKeepNewLines
public void setKeepNewLines(boolean arg0)
-