AI is a sought-after topic, but most developers face two hurdles that keep them from building anything with it:
- It is a complex field in which a lot of experience is needed to achieve good results.
- Even when good network topologies and models exist for a problem, there is often a lack of training data (corpora), without which most neural networks cannot achieve good results.
Especially in the up-and-coming natural language processing (NLP) sector, there is a lack of data in many areas. In this blog post we are going to discuss a simple yet powerful solution to this problem in the context of conversational AI.
Leon presented a simple solution on our blog a few weeks ago: With AI as a Service, reliable language processing systems can be developed in a short time without having to wrestle with datasets and neural networks. However, this type of technology has one significant drawback: dependence on the operator of the service. On the one hand, the service can incur costs; on the other hand, your own, possibly sensitive, data has to be passed on to the service operator. Especially for companies, this is usually a showstopper. That’s where Rasa enters the stage.
The Rasa Stack
Rasa is an open source (see GitHub) conversational AI that is completely free for everyone and can be used in-house. There is no dependence on a service from Rasa or any other company. It consists of a two-part stack whose individual parts seem to perform similar tasks at first glance, but on closer inspection each of them solves its own problem. Rasa NLU is the language understanding AI we are going to dig deeper into soon. It is used to understand what the user is trying to say and which additional information they provide. Rasa Core is the context-aware AI for conversational flow, which is used to build dialog systems, e.g. chatbots like this. It uses the information from Rasa NLU to find out what the user wants and what other information is needed to achieve it. For example, for a weather report you need both the date and the place.
Digging deeper into Rasa NLU
The following paragraphs deal with the development of language understanding. Its basics are already extensively documented, which is why I will keep this part brief and focus on the optimization options instead. If you have never coded anything using Rasa, it makes sense to work through the restaurant example (see also GitHub code template) to get a basic understanding of the framework.
The processing pipeline is the core element of Rasa NLU. The decisions you make there have a huge influence on the system’s quality. In the restaurant example the pipeline is already given: two frameworks, spaCy and scikit-learn, are used for text processing. Good results can be achieved with very little domain-specific training data (10 – 20 formulations per intent). You can collect this amount of data easily using Rasa Trainer. The amount can be this small because, in a form of transfer learning, your own training data is combined with spaCy’s high-quality pre-trained models to train the classifier. Besides spaCy, there are other ways to process your data, which we will explore now!
Unlock the full potential
Instead of spaCy you can also use MITIE (MIT Information Extraction), which likewise supports intent recognition and named entity recognition (NER). Both backends perform the same tasks and are therefore interchangeable. The difference lies in the algorithms and models they use. You are also not bound to spaCy or MITIE alone: scikit-learn can be used for intent classification as well.
Which backend works best depends on your project and should be tested. As you will see in the next paragraph, the pipeline offers a few gems that work particularly well. The built-in cross-validation should be used to evaluate the quality of the system.
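To make the idea of cross-validation concrete, here is a minimal, framework-independent sketch. It is not Rasa's built-in evaluation; the function names `stratified_folds` and `cross_validate`, as well as the `train_fn` and `predict_fn` callbacks, are assumptions made up for this illustration. The data is split into k folds with each intent spread evenly, and every fold is used once as the test set.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(examples, k, seed=0):
    """Split (text, intent) pairs into k folds, spreading each intent evenly."""
    by_intent = defaultdict(list)
    for text, intent in examples:
        by_intent[intent].append((text, intent))
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for intent in sorted(by_intent):
        items = by_intent[intent]
        rng.shuffle(items)
        for i, item in enumerate(items):
            folds[i % k].append(item)
    return folds


def cross_validate(examples, k, train_fn, predict_fn):
    """Return the mean accuracy over k train/test splits."""
    folds = stratified_folds(examples, k)
    scores = []
    for i, test_fold in enumerate(folds):
        if not test_fold:
            continue  # skip empty folds (fewer examples than folds)
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train_set)
        correct = sum(predict_fn(model, text) == intent for text, intent in test_fold)
        scores.append(correct / len(test_fold))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical baseline "model": always predict the most frequent intent.
def train_majority(train_set):
    return Counter(intent for _, intent in train_set).most_common(1)[0][0]


def predict_majority(model, text):
    return model


examples = [("greeting %d" % i, "greet") for i in range(6)]
examples += [("bye %d" % i, "goodbye") for i in range(3)]
print(cross_validate(examples, 3, train_majority, predict_majority))
```

A real backend should clearly beat such a majority baseline; if it does not, the pipeline or the training data needs another look.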
The processing pipeline
You should understand how the pipeline works to develop a good system for your specific problem.
- The tokenizer is used to transform input words, sentences or paragraphs into single word tokens. In the process, unnecessary punctuation is removed, and stop words can be removed as well.
- The featurizer is used to create input vectors from the tokens. They can be used as features for the neural net. The simplest form of an input vector is a one-hot encoding.
- The intent classifier is the part of the neural net that is responsible for decision making. It decides which intent is most likely meant by the user. This is called multiclass classification.
- Finally, named entity recognition can be used to extract information like e-mail addresses from a text. In terms of Rasa (and dialogue systems) this is called entity extraction.
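The four stages above can be sketched in a few lines of plain Python. This is a toy illustration and not Rasa's implementation: the training sentences are made up, the featurizer is a simple bag-of-words count (a sum of one-hot vectors), the classifier is a nearest-centroid comparison with cosine similarity that stands in for a real model, and the entity extractor is just a regex for e-mail addresses.

```python
import math
import re
from collections import Counter

# Hypothetical training data: a few formulations per intent.
TRAINING = {
    "greet": ["hello there", "hi", "good morning"],
    "restaurant_search": [
        "i am looking for chinese food",
        "find me a restaurant",
        "show me italian food",
    ],
}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def tokenize(text):
    """Tokenizer: lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def featurize(tokens, vocabulary):
    """Featurizer: bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(a, b):
    """Cosine similarity of two vectors, 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train(training):
    """Build the vocabulary and one averaged feature vector (centroid) per intent."""
    vocabulary = sorted(
        {tok for texts in training.values() for t in texts for tok in tokenize(t)}
    )
    centroids = {}
    for intent, texts in training.items():
        vectors = [featurize(tokenize(t), vocabulary) for t in texts]
        centroids[intent] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return vocabulary, centroids


def parse(text, vocabulary, centroids):
    """Run all stages: tokenize, featurize, classify the intent, extract entities."""
    vector = featurize(tokenize(text), vocabulary)
    ranking = sorted(
        ((cosine(vector, c), name) for name, c in centroids.items()), reverse=True
    )
    entities = [
        {"start": m.start(), "end": m.end(), "value": m.group(), "entity": "email"}
        for m in EMAIL_RE.finditer(text)
    ]
    return {"text": text, "intent": ranking[0][1], "entities": entities}


vocabulary, centroids = train(TRAINING)
print(parse("I am looking for italian food", vocabulary, centroids))
```

Note that the entity extractor works on the raw text rather than on the tokens, so the character offsets it reports refer to the original user message.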
In the following example (from Rasa) you can see how the single parts work together to provide information about intent and entity:
{
"text": "I am looking for Chinese food",
"entities": [
{"start": 8, "end": 15, "value": "chinese", "entity": "cuisine", "extractor": "ner_crf", "confidence": 0.864}
],
"intent": {"confidence": 0.6485910906220309, "name": "restaurant_search"},
"intent_ranking": [
{"confidence": 0.6485910906220309, "name": "restaurant_search"},
{"confidence": 0.1416153159565678, "name": "affirm"}
]
}
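On the application side, such a result is usually reduced to the winning intent plus a mapping of entity names to values. The following sketch parses the response shown above with Python's standard `json` module; the helper `summarize` and the threshold `MIN_CONFIDENCE` are hypothetical names chosen for this illustration, and the threshold value is an arbitrary assumption.

```python
import json

RESPONSE = """
{
  "text": "I am looking for Chinese food",
  "entities": [
    {"start": 8, "end": 15, "value": "chinese", "entity": "cuisine",
     "extractor": "ner_crf", "confidence": 0.864}
  ],
  "intent": {"confidence": 0.6485910906220309, "name": "restaurant_search"},
  "intent_ranking": [
    {"confidence": 0.6485910906220309, "name": "restaurant_search"},
    {"confidence": 0.1416153159565678, "name": "affirm"}
  ]
}
"""

MIN_CONFIDENCE = 0.5  # hypothetical threshold, tune it for your project


def summarize(result, min_confidence=MIN_CONFIDENCE):
    """Return (intent name or None, {entity name: value}) for a parse result."""
    intent = result["intent"]
    # Treat low-confidence predictions as "not understood".
    name = intent["name"] if intent["confidence"] >= min_confidence else None
    slots = {e["entity"]: e["value"] for e in result["entities"]}
    return name, slots


result = json.loads(RESPONSE)
print(summarize(result))  # ('restaurant_search', {'cuisine': 'chinese'})
```

Returning `None` below the threshold gives the dialog system a chance to ask the user to rephrase instead of acting on a guess.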
As Rasa itself recommends, intent_classifier_tensorflow_embedding can be used for intent classification. It is based on the paper “StarSpace: Embed All The Things!” published by Facebook Research. The authors present a new approach to measuring similarity of meaning, which generates awesome results!
For named entity recognition you have to make a decision: Either you use common pre-trained entities, or you use custom entities like “type_of_coffee”. Pre-trained entities are provided by the following components:
- ner_spacy: Places, Dates, People, Organisations
- ner_duckling: Dates, Amounts of Money, Durations, Distances, Ordinals
Those two components perform very well at recognizing the given types, but they perform rather poorly if you need custom entities. In that case you should use ner_mitie or ner_crf and collect somewhat more training data than usual. If your entities have a specific structure that can be parsed by a regex, make sure to integrate intent_entity_featurizer_regex into your pipeline! In this GitHub Gist I provided a short script that helps you create training samples for a custom entity. You just pass it some sentences for an intent together with sample values of your custom entity, and it creates training samples for each of your sample values.
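The idea behind such a generator can be sketched as follows. This is not the script from the Gist but a minimal illustration under my own assumptions: the function `make_examples`, the `{type_of_coffee}` placeholder syntax, the intent name `order_coffee` and the sample sentences are all hypothetical. The output is assumed to follow the JSON training data layout of Rasa NLU (`rasa_nlu_data` with `common_examples`), so check it against the version you use.

```python
import json

PLACEHOLDER = "{type_of_coffee}"  # hypothetical placeholder syntax


def make_examples(intent, entity, templates, values, placeholder=PLACEHOLDER):
    """Combine every sentence template with every sample value of the entity."""
    examples = []
    for template in templates:
        # Character offset at which the entity value will start.
        start = template.index(placeholder)
        for value in values:
            text = template.replace(placeholder, value, 1)
            examples.append({
                "text": text,
                "intent": intent,
                "entities": [{
                    "start": start,
                    "end": start + len(value),
                    "value": value,
                    "entity": entity,
                }],
            })
    return examples


templates = ["I would like a {type_of_coffee} please", "one {type_of_coffee} to go"]
values = ["espresso", "flat white"]
examples = make_examples("order_coffee", "type_of_coffee", templates, values)

training_data = {"rasa_nlu_data": {"common_examples": examples}}
print(json.dumps(training_data, indent=2))
```

Two templates and two values yield four training samples, each with start and end offsets that point exactly at the inserted entity value.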
That’s it 🙂 If you have any questions about Rasa or this blogpost don’t hesitate to contact me! Have a nice week and stay tuned for our next post.
Greets,
Domi