The Apache OpenNLP library is an open source machine learning based toolkit for the processing of natural language text.
It includes a sentence detector, a tokenizer, a name finder, a part-of-speech (POS) tagger, a chunker, and a parser. Its APIs are designed to be easily integrated into Java programs.
The goal of the OpenNLP project is to create a mature toolkit for the tasks listed above. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.
Key Features
- Tokenization. OpenNLP offers multiple tokenizer implementations:
  - Whitespace Tokenizer – a whitespace tokenizer; non-whitespace sequences are identified as tokens.
  - Simple Tokenizer – a character class tokenizer; sequences of the same character class are tokens.
  - Learnable Tokenizer – a maximum entropy tokenizer; detects token boundaries based on a probability model.
- Sentence segmentation.
- Part-of-speech tagging – marks tokens with their corresponding word type based on the token itself and the context of the token.
- Named entity extraction – the Name Finder can detect named entities and numbers in text.
- Chunking – divides a text into syntactically correlated groups of words, such as noun groups and verb groups, but does not specify their internal structure or their role in the main sentence.
- Parsing – offers two different parser implementations: the chunking parser and the treeinsert parser. OpenNLP also has a command line tool for training models on various corpora, such as the models available from the model download page.
- Coreference resolution – links multiple mentions of an entity in a document together. The OpenNLP implementation is currently limited to noun phrase mentions; other mention types cannot be resolved.
- Maximum entropy.
- Perceptron based machine learning.
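
To make the difference between the whitespace and character class tokenization strategies in the list above concrete, here is a minimal sketch in plain Java. It is an illustration only and not OpenNLP's own code: the class `TokenizerSketch`, its methods, and the four coarse character classes are hypothetical names chosen for this example, and the real Simple Tokenizer may classify characters differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the OpenNLP API.
public class TokenizerSketch {

    // Whitespace strategy: every maximal run of non-whitespace characters is one token.
    public static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Coarse character classes, assumed for this sketch.
    enum CharClass { LETTER, DIGIT, WHITESPACE, OTHER }

    static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    // Character class strategy: consecutive characters of the same class
    // form one token; whitespace runs only separate tokens and are dropped.
    public static List<String> charClassTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            CharClass cls = classify(text.charAt(start));
            int end = start + 1;
            while (end < text.length() && classify(text.charAt(end)) == cls) {
                end++;
            }
            if (cls != CharClass.WHITESPACE) {
                tokens.add(text.substring(start, end));
            }
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "OpenNLP costs $0, really!";
        System.out.println(whitespaceTokenize(text));
        System.out.println(charClassTokenize(text));
    }
}
```

On the sample sentence, the whitespace strategy keeps `$0,` as a single token, while the character class strategy splits it into `$`, `0`, and `,`. A learnable tokenizer would instead decide each boundary from a trained probability model, which is not shown here.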
Website: opennlp.apache.org
Support: Documentation, GitHub
Developer: The Apache Software Foundation
License: Apache License Version 2.0
Apache OpenNLP is written in Java.
Related Software
| Natural Language Processing | |
|---|---|
| PyTorch-Transformers | Library of state-of-the-art pre-trained models |
| Natural Language Toolkit | Suite of open source Python modules, data sets and tutorials |
| Stanford CoreNLP | Extensible annotation-based NLP pipeline |
| spaCy | Industrial strength natural language processing |
| scikit-learn | Machine learning library for Python |
| Gensim | Python-based vector space modeling and topic modeling toolkit |
| flair | Simple framework for state-of-the-art NLP |
| Apache OpenNLP | Machine learning based toolkit |
| DL4J | Deploy and train deep learning models |
| Apache Lucene | Full-featured information retrieval software library |
| UIMA | Implementation of the UIMA specification |
| tidytext | Text mining using dplyr, ggplot2, and other tidy tools |
| text2vec | Framework with API for text analysis and NLP |
| quanteda | R package for Quantitative Analysis of Textual Data |
| Moses | Statistical machine translation system |
Read our verdict in the software roundup.
| Java Natural Language Processing Tools | |
|---|---|
| CoreNLP | Annotation-based NLP pipeline that provides core natural language analysis |
| OpenNLP | Machine learning based toolkit |
| DL4J | Deploy and train deep learning models |
| Lucene | High-performance, full-featured information retrieval software library |
| UIMA | Open source implementation of the UIMA specification |
| Tika | Content analysis toolkit |
| MALLET | Statistical natural language processing, document classification and more |
| CogComp-NLP | State-of-the-art Natural Language Processing (NLP) tools |
| ReVerb | Automatically identifies and extracts binary relationships from sentences |
| NLP4J | NLP framework for JVM languages |
| GATE | Full-lifecycle solution for a broad range of NLP tasks |

