Creating and Testing a Custom OpenNLP Dictionary

Overview

Apache OpenNLP is an open source Java library for natural language processing. It provides an API for use cases such as named entity recognition, sentence detection, POS tagging, tokenization, and dictionaries.

In this post, we’ll look at how to create an OpenNLP dictionary and embed and use it on the Business Bot platform.

Target audience

This post is aimed at beginners who want to learn how to use OpenNLP dictionaries and build text processing services with the library.

Prerequisites

This post assumes basic familiarity with the OpenNLP library. To test the OpenNLP dictionary, you can download the free Business Bot Platform – Community Edition.

OpenNLP Dictionary

What is an OpenNLP dictionary?

An OpenNLP dictionary is a list of entries that a dictionary-based name finder uses to search text for known names. For example, a dictionary can be created for internal abbreviations, street names, company names or places. Apache does not ship dictionaries with the OpenNLP library by default. However, ready-made dictionaries can be downloaded from third-party vendors.

Why a custom dictionary?

The pre-built dictionaries may not be available for a desired language, may not recognize important entities or may not be available for a desired domain. These are the typical reasons for creating custom dictionaries.

Creating the OpenNLP dictionary

The following section demonstrates how to create a dictionary for OpenNLP's DictionaryNameFinder class, which acts as a dictionary-based named entity recognizer. As an example, we create a dictionary that contains US city names.

Create a new file (e.g. cities.dict). The file extension dict stands for dictionary. The dictionary is written in XML. In the XML file, the case_sensitive attribute specifies whether lookups should be case-sensitive.

<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="false">

  <entry>
    <token>Washington</token>
  </entry>

  <entry>
    <token>Waterloo</token>
  </entry>

  <entry>
    <token>Watford</token>
    <token>City</token>
  </entry>

</dictionary>

In the example, we have added three cities: Washington, Waterloo and Watford City. Each entry starts with <entry> and ends with </entry>. Between these tags, one or more <token> elements are defined. A token is the smallest unit of an entry. For the name finder to find the entry Watford City, the two words must be defined as two separate tokens, i.e. <token>Watford</token> and <token>City</token>. A single token such as <token>Watford City</token> is incorrect and will cause OpenNLP’s find() method to miss the entry, because the search text is tokenized before the lookup, so Watford City arrives as two separate tokens.
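To make the token rule concrete, here is a minimal sketch of a token-sequence lookup in plain Java. It is not OpenNLP's implementation; the class TokenDictionarySketch and its methods are hypothetical names chosen for illustration, and the text is assumed to be tokenized on single spaces. The comparison ignores case, mirroring case_sensitive="false".

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative token-sequence lookup; a hypothetical sketch, not OpenNLP's actual code. */
final class TokenDictionarySketch {

    /** Each entry is a sequence of tokens, mirroring the <token> elements of one <entry>. */
    private final List<String[]> entries = new ArrayList<>();

    void addEntry(String... tokens) {
        entries.add(tokens);
    }

    /** Returns every entry that occurs in the tokenized text, joined with single spaces. */
    List<String> find(String[] textTokens) {
        List<String> matches = new ArrayList<>();
        for (int start = 0; start < textTokens.length; start++) {
            for (String[] entry : entries) {
                if (matchesAt(textTokens, start, entry)) {
                    matches.add(String.join(" ", entry));
                }
            }
        }
        return matches;
    }

    /** True if all tokens of the entry equal the text tokens beginning at start (case ignored). */
    private static boolean matchesAt(String[] text, int start, String[] entry) {
        if (entry.length == 0 || start + entry.length > text.length) {
            return false;
        }
        for (int i = 0; i < entry.length; i++) {
            if (!text[start + i].equalsIgnoreCase(entry[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] text = "I moved to Watford City last year".split(" ");

        TokenDictionarySketch twoTokens = new TokenDictionarySketch();
        twoTokens.addEntry("Watford", "City");      // two <token> elements
        System.out.println(twoTokens.find(text));   // prints [Watford City]

        TokenDictionarySketch oneToken = new TokenDictionarySketch();
        oneToken.addEntry("Watford City");          // one <token> containing a space
        System.out.println(oneToken.find(text));    // prints []
    }
}
```

The second lookup finds nothing because no single text token ever equals "Watford City". This is the same effect that a one-token entry has in the dictionary file.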

Testing the OpenNLP dictionary on the Business Bot platform

To add your own dictionary, proceed as follows:

  1. Log in to the Business Bot Platform (see the installation instructions).
  2. In the navigation bar, click on Natural Language Processing > NLP Models > Add Custom Dictionary.
  3. Select the name, type and language of the dictionary and enter a short description.
  4. Upload the dictionary file by dragging and dropping it into the right panel and check the output in the panel (the file must not be larger than 100 MB and must have the file type *.dict). Then click on Submit to register the dictionary.

Once you have added the dictionary to the platform, you can use the NLP API Tester to check if entries are found in the dictionary. The NLP API Tester makes it easy to send HTTP requests to the NLP models and dictionaries and evaluate the response. HTTP requests can be made dynamic by inserting variables.

The result of the HTTP request shows that the city Waterloo was found in the query “Where do you live in Waterloo ?”. The match is listed under "dictionaries" with the type Location.

{
  "sentences": [
    {
      "chunks": [
        {
          "start": 0,
          "end": 1,
          "label": "ADVP",
          "body": "Where"
        },
        {
          "start": 2,
          "end": 3,
          "label": "NP",
          "body": "you"
        },
        {
          "start": 3,
          "end": 4,
          "label": "VP",
          "body": "live"
        },
        {
          "start": 4,
          "end": 5,
          "label": "PP",
          "body": "in"
        },
        {
          "start": 5,
          "end": 6,
          "label": "NP",
          "body": "Waterloo"
        }
      ],
      "namedEntities": [],
      "tokens": [
        {
          "probability": 0,
          "tag": "WRB",
          "body": "Where"
        },
        {
          "probability": 0,
          "tag": ".",
          "body": "?"
        },
        {
          "probability": 0,
          "tag": "VB",
          "body": "live"
        },
        {
          "probability": 0,
          "tag": "IN",
          "body": "in"
        },
        {
          "probability": 0,
          "tag": "VBP",
          "body": "do"
        },
        {
          "probability": 0,
          "tag": "PRP",
          "body": "you"
        },
        {
          "probability": 0,
          "tag": "NNP",
          "body": "Waterloo"
        }
      ],
      "body": "Where do you live in Waterloo ?",
      "dictionaries": [
        {
          "probability": 1,
          "name": "Waterloo",
          "type": "Location"
        }
      ]
    }
  ],
  "predictedLanguage": "en"
}

The platform allows you to add as many dictionaries as you want, so that places, names, abbreviations and many more can be recognized in the user query. If you already have a list of terms (e.g. in Excel or in a database) and want to transform it into the OpenNLP XML format, we can help you with our automated software tool.
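For small term lists, the conversion into the dictionary XML format can also be sketched by hand. The following plain Java example is only an illustration and not the tool mentioned above. The class name DictionaryXmlWriter and its methods are hypothetical, and it assumes that splitting each term on whitespace yields the same tokens that the platform's tokenizer would produce, which may not hold for terms containing punctuation.

```java
import java.util.List;

/** Hypothetical helper that turns a list of terms into OpenNLP dictionary XML. */
final class DictionaryXmlWriter {

    /** Builds the XML document; each term becomes one <entry> with one <token> per word. */
    static String toXml(List<String> terms, boolean caseSensitive) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<dictionary case_sensitive=\"").append(caseSensitive).append("\">\n");
        for (String term : terms) {
            String trimmed = term.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank rows, e.g. empty spreadsheet cells
            }
            xml.append("  <entry>\n");
            // Assumption: whitespace splitting matches the tokenizer used at lookup time.
            for (String token : trimmed.split("\\s+")) {
                xml.append("    <token>").append(escape(token)).append("</token>\n");
            }
            xml.append("  </entry>\n");
        }
        xml.append("</dictionary>\n");
        return xml.toString();
    }

    /** Escapes the characters that are not allowed in XML element text. */
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<String> cities = List.of("Washington", "Waterloo", "Watford City");
        System.out.print(toXml(cities, false));
    }
}
```

Run with the three cities from the example, this sketch writes one entry per city and splits Watford City into two tokens. The output can be saved as cities.dict and uploaded as described above.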

Conclusion

You can enhance the user experience if your chatbot understands what the user wants. Dictionaries help to better understand the context of the user’s request and formulate an appropriate response.

Would you like to use chatbots and NLP in your company? Contact us. We look forward to supporting you with NLP-based chatbots in the corporate environment.
