Python Archive - CraftCoders.app https://craftcoders.app/category/python/ Jira and Confluence apps Wed, 14 Aug 2024 12:27:54 +0000 en-US hourly 1 https://wordpress.org/?v=6.5.3 https://craftcoders.app/wp-content/uploads/2020/02/cropped-craftcoders-blue-logo-1-32x32.png Python Archive - CraftCoders.app https://craftcoders.app/category/python/ 32 32 Quickstart: Get ready for 3rd Spoken CALL Shared Task https://craftcoders.app/quickstart-get-ready-for-3rd-spoken-call-shared-task/ Mon, 03 Sep 2018 20:40:20 +0000 https://craftcoders.app/?p=575 Read More]]> These days is Interspeech conference 2018 where I’m invited as a speaker and as they write on their website

Interspeech is the world’s largest and most comprehensive conference on the science and technology of spoken language processing.

Coming Wednesday the results and systems of 2nd Spoken CALL Shared Task (ST2) are presented and discussed in a special session of the conference. Chances are that these discussions will lead to a third edition of the shared task.

With this blog post, I want to address all newcomers and provide a short comprehensible introduction to the most important challenges you will face if you want to participate at Spoken CALL shared task. If you like “hard fun”, take a look at my research group’s tutorial paper. There will be unexplained technical terms and many abbreviations combined with academical language in a condensed font for you 🙂

What is this all about?

The Spoken CALL Shared Task aims to create an automated prompt-response system for German-speaking children to learn English. The exercise for a student is to translate a request or sentence into English using voice. The automated system should ideally accept a correct response or reject a student response if faulty and offer relevant support or feedback. There are a number of prompts (given as text in German, preceded by a short animated clip in English), namely to make a statement or ask a question regarding a particular item. A baseline system (def) is provided by the website of the project. The final output of the system should be a judgment call as to the correctness of the utterance. A wide range of answers is to be allowed in response, adding to the difficulty of giving automated feedback. Incorrect responses are due to incorrect vocabulary usage, incorrect grammar, or bad pronunciation and quality of the recording.

How to get started

A day may come when you have to dig into papers when you have to understand how others built their systems, but it is not this day. As someone who is new to the field of natural language processing (NLP) you have to understand the basics of machine learning and scientific work first. Here are the things we will cover with this post:

  1. Machine learning concepts
  2. Running the baseline system
  3. Creating your own system

Machine learning concepts

When my research group and I first started to work on the shared task we read the papers and understood barely anything. So, we collected all the technical terms we didn’t understand and created a dictionary with short explanations. Furthermore,  we learned different concepts you should take a look at:

Training data usage

For the 2nd edition, there was a corpus (=training data) containing 12,916 data points (in our case speech-utterances), that we can use to create a system. A machine learning algorithm needs training data to extract features from it. These features can be used for classification and the more varying data you have, the better the classification will be.


But you can’t use all that data for training. You have to keep a part of your data points aside so you can validate that your system can classify data it has never seen before. This is called validation set and the other part is called training set. A rookie mistake (which we made) is to use the test set as validation set. The test set is a separate corpus, which you should use at the very end of development only to compare your system with others. For a more detailed explanation take a look at this blog post.

If you don’t have a separate validation set (like in our case) you can use cross-validation instead, which is explained here. Furthermore, you should try to have an equal distribution between correct and incorrect utterances in your sets. If this is not the case, e.g. if you have 75% correct utterances and 25% incorrect utterances in your training set, your system will tend to accept everything during validation.

Metrics

Metrics are used to measure how well a system performs. They are based on the system’s results, which generally displayed as a confusion matrix:

 

  • TP: True positive (a correct utterance has been classified as correct?)
  • FP: False positive (a faulty utterance has been classified as correct?)
  • TN: True negative (a faulty utterance has been classified as incorrect?)
  • FN: False negative (a correct utterance has been classified as incorrect?)

Based on the confusion matrix there are four often used metrics: Accuracy, Precision, Recall and F1. When to use which is explained thoroughly here. For the shared task, there’s a special metric called D-score. It is used to evaluate the system’s performance respecting a bias to penalize different classification mistakes differently. More details about D-score can be found in our tutorial paper.

Running the baseline system

If you open the data download page you can see an important differentiation: On the one hand you can download the speech processing system (also asr system or Kaldi system) and on the other hand you can download the text processing system. Basically, you have to independent baseline systems you can work on.

For the asr system to work you have to install several applications. This is one of the pain points. Be careful here! Kaldi is the speech processing framework our baseline system is built on. The things you need for Kaldi are Python, Tensorflow, CUDA, and cuDNN. The latter two are for Nvidia graphics cards. cuDNN depends on CUDA so check out that the versions you install match. Furthermore, Kaldi and Tensorflow should be able to use the installed Nvidia software versions. To find out if everything went well you can try Kaldi’s yes/no example as described in:

kaldi/egs/yesno/README.md

The text processing system can be run using just python and is pretty minimal 😉 At least during ST2 it was. You can either check if there’s a new official baseline system for text processing, or you can use one of our CSU-K systems as a basis:

https://github.com/Snow-White-Group/CSU-K-Toolkit

Creating your own system

To create your own system you first have to decide whether you want to start with text- or speech processing. If you are a complete beginner in the field, I would advise you to start which text processing because it is easier. If you want to start with speech processing take a look at Kaldi for dummies, which will teach you the basics.

The Kaldi system takes the training data audio files as input and produces text output which looks like this:

user-008_2014-05-11_17-57-14_utt_015 CAN I PAY BY CREDIT CARD 
user-023_2014-11-03_09-47-09_utt_009 I WANT A COFFEE 
user-023_2014-11-03_09-47-09_utt_010 I WOULD LIKE A COFFEE 
user-023_2014-11-03_09-47-09_utt_011 I WANT THE STEAK 

The asr output can be used as text processing input. The text processing system produces a classification (language: correct/incorrect, meaning: correct/incorrect) from the given sentence as output.

Now you should be at a point where you understand most of the things described in the papers, except for the concrete used architectures and algorithms. Read through them to collect ideas and dig deeper into the things that seem interesting to you 🙂 Furthermore here are some keywords you should have heard of:

  • POS Tagging, dependency parsing
  • NER Tagging
  • Word2Vec, Doc2Vec
  • speech recognition
  • information retrieval

I hope you are motivated to participate and if so, see you at the next conference 🙂

Greets,

Domi

 

 

]]>
Rasa Core & NLU: Conversational AI for dummies https://craftcoders.app/rasa-core-nlu-conversational-ai-for-dummies/ https://craftcoders.app/rasa-core-nlu-conversational-ai-for-dummies/#respond Mon, 23 Jul 2018 10:47:42 +0000 https://craftcoders.app/?p=395 Read More]]> AI is a sought-after topic, but most developers face two hurdles that prevent them from programming anything with it.

  1. It is a complex field in which a lot of experience is needed to achieve good results
  2. Although there are good network topologies and models for a problem, there is often a lack of training data (corpora) without which most neural networks cannot achieve good results

Especially in the up-and-coming natural language processing (nlp) sector, there is a lack of data in many areas. With this blogpost we are going to discuss a simple yet powerful solution to address this problem in the context of a conversational AI. ?

Leon presented a simple solution on our blog a few weeks ago: With AI as a Service reliable language processing systems can be developed in a short time whithout having to hassle around with datasets and neural networks. However, there is one significant drawback due to this type of technology: Dependence on the operator of the service. On one hand the service can be linked with costs, furthermore the own possibly sensitive data has to be passed on to the service operator. Especially for companies this is usually a show stopper. That’s where Rasa enters the stage.

The Rasa Stack

Rasa is an open source (see Github) conversational AI that is fully free for everyone and can be used in-house. There is no dependence on a service from Rasa or any other company. It consists of a two-part stack whose individual parts seem to perform similar tasks at first glance, but on a closer look you see that both try to solve their own problems. Rasa NLU is the language understanding AI we are going to dig deeper into soon. It is used to understand what the user is trying to say and which additional information he provides. Rasa Core is the context-aware AI for conversational flow, which is used to build dialog systems e.g. chatbots like this. It uses the information from Rasa NLU to find out what the user wants and what other information is needed to achieve it. For example, for a weather report you need both the date and the place.

Digging deeper into Rasa NLU

The following paragraphs deal with the development of language understanding. Its basics are already extensively documented, which is why I will keep this brief. Instead, the optimization possibilities are to be presented more extensively. If you have never coded something using Rasa, it makes sense to work through the restaurant example (see also Github code template) to get a basic understanding of the framework.

The processing pipeline is the core element of Rasa NLU. The decisions you make there have a huge influence on the system’s quality. In the restaurant example the pipeline is already given: Two NLU frameworks spaCy and skLearn are used for text processing. Good results can be achieved with very few domain-specific training data (10 – 20 formulations per intent). You can get this amount of data easily using Rasa Trainer. It is so small because transfer learning combines your own training data with spaCy’s own high-quality models to create a neural net. Besides spaCy, there are other ways to process your data, which we will discover now!

Unlock the full potential

Instead of spaCy you can also use MIT Information Extraction. MITIE can also be used for intent recognition and named entity recognition (NER). Both backends perform the same tasks and are therefore interchangeable. The difference lies in the algorithms and models they use. Therefore you are not bound to only spaCy or mitie, but you can also use scikit-learn for intent classification.

Which backend works best for your project is individual and should be tested. As you will see in the next paragraph, the pipeline offers some precious showpieces that work particularly well. The already included cross validation should be used to evaluate the quality of the system.

The processing pipeline

You should understand how the pipeline works to develop a good system for your special problem.

  1. The tokenizer: is used to transform input words, sentences or paragraphs into single word tokens. Hence, unnecessary punctuation is removed and stop words can also be removed.
  2. The featurizer is used to create input vectors from the tokens. They can be used as features for the neural net. The simplest form of an input vector is one-hot.
  3. The intent classifier is a part of the neural net, which is responsible for decision making. It decides which intent is most likely meant by the user. This is called multiclass classification.
  4. Finally named entity recognition can be used to extract information like e-mails from a text. In terms of Rasa (and dialogue systems) this is called entity extraction.

In the following example (from Rasa) you can see how the single parts work together to provide information about intent and entity:

{
    "text": "I am looking for Chinese food",
    "entities": [
        {"start": 8, "end": 15, "value": "chinese", "entity": "cuisine", "extractor": "ner_crf", "confidence": 0.864}
    ],
    "intent": {"confidence": 0.6485910906220309, "name": "restaurant_search"},
    "intent_ranking": [
        {"confidence": 0.6485910906220309, "name": "restaurant_search"},
        {"confidence": 0.1416153159565678, "name": "affirm"}
    ]
}

As mentioned by Rasa itself intent_classifier_tensorflow_embedding can be used for intent classification. It is based on the StarSpace: Embed All The Things! paper published by Facebook Research. They present a completely new way for meaning similarity, which generates awesome results! ?

For named entity recognition you have to make a decision: Either you use common pre-trained entities, or you use custom entities like “type_of_coffee”. Pre-trained entities can be one of the following:

  • ner_spaCy: Places, Dates, People, Organisations
  • ner_duckling: Dates, Amounts of Money, Durations, Distances, Ordinals

Those two algorithms perform very well in recognition of the given types, but if you need custom entities they perform rather bad. Instead you should use ner_mitie or ner_crf and collect some more training data than usual. If your entities have a specific structure, which is parsable by a regex make sure to integrate intent_entity_featurizer_regex to your pipeline! In this Github Gist I provided a short script, which helps you to create training samples for a custom entity. You can just pass some sentences for an intent into it and combine it with sample values of your custom entity. It will then create some training samples for each of your sample values.

That’s it 🙂 If you have any questions about Rasa or this blogpost don’t hesitate to contact me! Have a nice week and stay tuned for our next post.

Greets,
Domi

]]>
https://craftcoders.app/rasa-core-nlu-conversational-ai-for-dummies/feed/ 0
Analyzing a Telegram Groupchat with Machine Learning using an unsupervised NLP approach https://craftcoders.app/analyzing-a-telegram-groupchat-with-machine-learning-using-an-unsupervised-nlp-approach/ https://craftcoders.app/analyzing-a-telegram-groupchat-with-machine-learning-using-an-unsupervised-nlp-approach/#respond Mon, 04 Jun 2018 08:00:09 +0000 https://billigeplaetze.com/?p=91 Read More]]> Hey, guys!. When it came to finding a topic for my first blog post, I first thought of another topic… but then I remembered my student research project and thought it was actually fun! The student research project was about NLP (Natural language Processing), maybe I’ll tell you more about it in another blog post.
Almost as interesting as NLP is our Telegram group! Since last week we have our own stickers (based on photos of us). Well, this escalated quickly 😛 This kind of communication is interesting and I wonder if my basic knowledge of NLP is enough to get some interesting insights from our group conversations?…
That’s exactly what we’re about to find out! In this blog post, we will extract a telegram chat, create a domain-specific Word2Vec model and analyze it! You can follow me step by step!
But before we start, we should think about how we want to achieve our goal. Our goal is to find some interesting insights from a chat group without any mathematical shenanigans. To achieve our goal we have to do three things:

  • Convert a telegram chat to a format suitable for the computer,
  • train an own Word2Vec model based on the data set obtained in the first step and
  • use the Word2Vec model to find insights.

Exporting A Telegram Chat

That’s the easy part! First, we need a Unix/Linux system because we will use the Telegram messenger CLI for extracting the chat. I personally use a Windows 10 installation, so I installed an OpenSUSE WSL (Windows Subsystem for Linux) from the store. Once OpenSUSE is installed, we start OpenSUSE and clone the repo:

git clone --recursive https://github.com/vysheng/tg.git && cd tg

As you can see from the README of the project, we need additional libraries to build the project. That’s why we simply install all dependencies:

sudo zypper in lua-devel libconfig-devel readline-devel libevent-devel libjansson-devel python-devel libopenssl-devel

Then we can build the project:

./configure
make

Good job! We have already come quite a bit closer to our goal! Now we still have to
Install gram-history-dump. This is best done in another window so that we can run Telegram messenger CLI! Once you have installed telegram-history-dump you start the Telegram messenger CLI and follow these instructions:

telegram-cli --json -P 9009

Now we can execute the telegram-history-dump to get a JSON from our chats (ensure the Telegram messenger CLI is running):

ruby telegram-history-dump.rb

Very good. Our telegram chats are now saved as .jsonl. Personally, I prefer to work with .csv files. Therefore I adapted a small script which converts the .jsonl into a .csv file. You can download the script from our GitHub Repo and use it as follows:

python3 jsonl2csv.py <your>.jsonl chat.csv

Finished! Now we have gained a small dataset from Telegram. Time for the next step!

Train an own Word2Vec model

Word2Vec is an unsupervised machine learning algorithm developed by a team of researchers led by Tomas Mikolov at Google, which contains a word as input and returns a vector representation of this word as output. Other than a simple WordCount vector, a Word2Vec vector contains semantic characteristics of the word. The special thing about Word2Vec is that it is able to learn this representation completely unsupervised, all the algorithm needs is a corpus, which is the whole lexicon, on which it is trained. Word2Vec then tries to learn the meaning of a word based on the words that appear near that word. The construction of a vector based on Word2Vec can also be imagined as a weighting of dimensions of different meanings.

However, I will not go into the exact functionality of Word2Vec now, because that could fill an own blog post! But I think that I will soon write a blog post about it. Subscribe to our blog! Today we will only look at how we can create our own Word2Vec model with the help of gensim and then investigate it a little bit. For this, we will also apply typical steps of data preprocessing of NLP tasks. I will try to explain them as briefly as possible. However, I must ask you to read the details somewhere else or otherwise this blog post would be far too long. I prepared this blog post with Kaggle. I recommend using Kaggle if you want to try it yourself. Each code snippet corresponds to a cell in Kaggle.

So let’s start with the annoying part. First, we need to import a number of Python libraries. These include all the functions we need for our task

import pandas as pd
import csv
from nltk.tokenize.casual import TweetTokenizer
import nltk
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from nltk.corpus import stopwords

Good work… again :). Now we are finally ready and can get to the fun part. First, we have to read our previously created dataset. The perfect data type for this is pandas dataframe, which is considered to be the standard for data science tasks in python. The great thing: we can create a dataframe directly from a CSV file. Additionally, we need a list. In this list, we will store a list with the tokens of each set in our corpus. Simplified we can now think that every word is assigned to a token. For example, the word “hello” is mapped to the hello-token

df = pd.read_csv('../input/chat.csv',  delimiter=' ',  quotechar='|', quoting=csv.QUOTE_MINIMAL)
sentences = []

Now we have to break down each sentence or more specific each telegram message into tokens and add them to our corpus. Therefore we iterate over our dataframe and get the value in ‘text’. We can ignore the other columns for this example. Now we filter out all messages labeled with ‘no text’. These are mostly gifs we’ve sent.

One problem we have to deal with is the fact that we have a very very small dataset. Therefore we try to keep the variety of possible tokens as small as possible. First of all we remove all stopwords. In the context of NLP, stopwords are words that do not provide a semantic contribution to a statement. For example, articles. Articles are of great importance for grammar, but none for meaning. Word2Vec is an algorithm that maps the semantic meaning of a word into a vector, therefore we can remove the stopwords. To generate the tokens we use a ready-made Tokenizer from the NLTK package. This is specially designed to create tokens from tweets. Since tweets are probably very similar to chat messages, we use them to generate tokens.

deu_stops = stopwords.words('german')
sno = nltk.stem.SnowballStemmer('german')
for index, row in df.iterrows():
    text = row['text']
    if not text == 'no text':
        tokens = TweetTokenizer(preserve_case=True).tokenize(text)
        tokens_without_stopwords = [sno.stem(x) for x in tokens if x not in deu_stops]
        corpus.append(tokens_without_stopwords)

I’m afraid I must show you some magic now. The following lines define some parameters, which are needed for the training of our Word2Vec model. But since I haven’t explained the functionality of Word2Vec in detail in this blog post, it doesn’t make sense to explain these parameters now, see them as God-given 🙂

num_features = 300
min_word_count = 3
num_workers = 2
window_size = 1
subsampling = 1e-3

Ready! Okay, maybe not yet, but now we can generate our Word2Vec model with a single line of code:

model = Word2Vec(
corpus,
workers=num_workers,
size=num_features,
min_count=min_word_count,
window=window_size,
sample=subsampling)

Done! We have created our own Word2Vec model. Nice! So better save it quickly:

model.init_sims(replace=True)
model_name = "billigeplaetze_specific_word2vec_model"
model.save(model_name)

Finding insights

So finally it is time… We can look for interesting facts. The first thing we want to find out is how broad is our vocabulary? Allegedly the vocabulary correlates with the education so how educated are the Billige Plätze? I suspect nothing good :O.
Attentive readers will ask themselves why we first created a Word2Vec model to determine the vocabulary. Yes, we could have used the number of unique tokens earlier, but what interests us more is how many words our Word2Vec model contains. This will be less than the number of tokens, as many messages are too small to be used for training and words can get lost.

len(model.wv.vocab)
>>> 3517

Fascinating! The Word2Vec model we created based on the group chat of Billige Plätze uses only 3517 words. Attention, emojis also fall in here!

Now we are also able to answer the important questions in life! Who’s more of a Dude, Leon or me? What do you think? Ah, no matter we can use our Word2Vec model!

print(model.wv.distance('leon','dude'))
print(model.wv.distance('cem','dude'))
>>> 0.00175533877401
>>> 0.000874914626021

I always knew 🙂 Nice. But we can do more. In the Billige Plätze group, we often use the word ‘penis’ but with which other words is the word ‘penis’ associated? Let’s ask our word2Vec Model!

model.wv.most_similar('penis')
>>> [('gleich', 0.996833324432373),
 ('Marc', 0.9967494010925293),
 ('abend', 0.9967489242553711),
 ('hast', 0.9967312812805176),
 ('gemacht', 0.9967218637466431),
 ('gerade', 0.996717631816864),
 ('kannst', 0.9967092275619507),
 ('Ja', 0.9967082142829895),
 (':P', 0.9967063069343567),
 ('?', 0.9967035055160522)]

We are even able to visualize our Word2Vec model to make the “core language” visible! We use t-SNE for dimensionality reduction to get two-dimensional vectors that we can then visualize in a scatter plot. With sklearn, the code is very manageable!

vocab = list(model.wv.vocab)
X = model[vocab]
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y']).sample(1000)

and now we can plot it with ploty


plt.rcParams["figure.figsize"] = (20,20)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

ax.scatter(df['x'], df['y'])
for word, pos in df.iterrows():
    ax.annotate(word, pos)

We can see very clearly how certain words gather and others are very far away. So you can see which words are related to each other and which are not.

The End

So people, let’s recap what we’ve done: First we used the WSL (Windows Subsystem for Linux) to run a Linux program to export our telegram chats! We then used the NLTK (Natural Langue Toolkit) to convert the exported chat messages into tokens. With the help of gensim we trained our own Word2Vec model based on these tokens. Finally, we used Ploty and Sklearn to visualize our model!

I hope you guys enjoyed it as much as I did! See you soon on our blog again!

]]>
https://craftcoders.app/analyzing-a-telegram-groupchat-with-machine-learning-using-an-unsupervised-nlp-approach/feed/ 0