Elena Gagliardoni is a Deep Learning Scientist (NLP) at HARMAN International and part of the Italian team working on Samsung’s Bixby project. Bixby is Samsung’s virtual assistant that supports touch, tap, and voice commands in everything from mobile phones to smart fridges. She took a deep dive into the techniques and workflows of NLP at our recent Deep Learning Webinar. We’ll share some of her presentation here, but you’ll gain the full benefit from watching the video below.
What is NLP?
This is the era of unstructured data. More than 80% of data is unstructured in nature (think of social media, email, phone conversations, photos, videos, etc.), and it can be a potential gold mine of information. Natural Language Processing (NLP) is the set of algorithms, techniques and tools used to process and understand natural language data. NLP allows machines to understand, analyze and manipulate human language.
Elena shared a number of resources and tools throughout her presentation and notes that “There is a huge number of beautiful tools and libraries available to help with natural language processing. And the most used programming language at the moment is absolutely Python.”
What is the natural language processing pipeline?
Natural language processing requires a series of detailed steps which Elena walked us through:
Understand the problem and collect data
The first steps in NLP are to define the problem domain and select appropriate data (a corpus) for that domain. Elena notes that choosing the appropriate data is very important, as language is context-driven:
“For example, if you want to create a healthcare chatbot, you shouldn’t be using datasets that come from banking or finance, because the two applications belong to two different domains. When you collect the datasets you need data from the right domain, as you need this to create the vocabulary. The vocabulary is the set of words for the model, and building the right one is your goal. Thus, it is extremely important to define the domain and collect the right corpus.”
If you’re looking for domain-specific data, she suggested taking a look at data scraping libraries such as Beautiful Soup and Scrapy.
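As a rough illustration, a few lines of requests and Beautiful Soup are enough to pull raw paragraph text from a web page. The URL below is a hypothetical placeholder, not a source Elena mentioned:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical domain-specific page; swap in a real source for your domain
url = "https://example.com/healthcare-articles"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Keep the visible paragraph text as raw documents for the corpus
documents = [p.get_text(strip=True) for p in soup.find_all("p")]
print(documents[:3])
```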
Text preprocessing and text cleaning
The goal of text preprocessing and text cleaning is to clean your data in order to extract as much useful information as possible for the model while reducing the variance of the data.
Elena notes that most of your time will be spent on this step, and that “at the end of the day, the goal of this part of the pipeline is to end up with a vocabulary which is expressive but small.”
Resources: Elena suggests taking a look at Python libraries such as NLTK, a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
Lowercasing the text and removing punctuation are two common steps for reducing the variance of our corpus and thus reducing the vocabulary size.
Remove stopwords: stopwords are commonly occurring words such as the, a, I, is.
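A minimal sketch of these first cleaning steps (lowercasing, punctuation removal and stopword removal), using Python’s string module and NLTK’s English stopword list; the example sentence is invented:

```python
import string
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

text = "The patient reported a mild headache, but no fever."

# Lowercase and strip punctuation to reduce the variance of the corpus
clean = text.lower().translate(str.maketrans("", "", string.punctuation))

# Drop common stopwords such as "the", "a", "but"
stop_words = set(stopwords.words("english"))
tokens = [word for word in clean.split() if word not in stop_words]
print(tokens)  # ['patient', 'reported', 'mild', 'headache', 'fever']
```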
Tokenization: Tokenization is “a really unique step that is mandatory in all of the pre-processing steps.” It means splitting your text into a series of tokens. The token is the smallest unit that will be fed into your machine learning model.
Libraries such as spaCy and Keras are great resources here.
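For example, NLTK’s word_tokenize splits a sentence into tokens in one call (a quick sketch, assuming the punkt tokenizer data has been downloaded):

```python
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt") once

sentence = "Bixby supports touch, tap and voice commands."
print(word_tokenize(sentence))
# ['Bixby', 'supports', 'touch', ',', 'tap', 'and', 'voice', 'commands', '.']
```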
Stemming and Lemmatization: Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word in the language. Stemming follows an algorithm with steps to perform on the words, which makes it faster.
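The difference is easy to see with NLTK’s Porter stemmer and WordNet lemmatizer (a small sketch; the WordNet data must be downloaded once):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer  # requires nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# "studies" stems to "studi" (not a real word) but lemmatizes to "study"
for word in ["studies", "running"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
```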
Text normalisation: The last step in text preprocessing and cleaning is text normalisation. This involves recurring actions such as standardising numbers, correcting recurring typos, and writing regexes that recognise emails, phone numbers, credit cards, etc., which are then tagged in order to anonymise the data.
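A sketch of this kind of normalisation with Python’s re module, replacing emails and phone-like numbers with placeholder tags (the patterns are deliberately simplified, not production-grade):

```python
import re

text = "Contact me at jane.doe@example.com or +1 555 123 4567."

# Replace e-mail addresses and phone-like numbers with anonymous tags
text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", text)
text = re.sub(r"\+?\d[\d\s-]{7,}\d", "<PHONE>", text)
print(text)  # Contact me at <EMAIL> or <PHONE>.
```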
Text enrichment – POS-Tagging and NER
POS-Tagging: Part of speech tagging aims to assign a part of speech to each word of a given text (such as nouns, verbs and adjectives), based on its definition and its context.
POS tagging is very important as words often occur with different senses, e.g. “She saw a bear” vs “Your efforts will bear fruit”.
spaCy offers many pretrained POS taggers and NER models for dozens of languages.
NER (Named Entity Recognition) is the first step towards information extraction: it seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc. Some examples are London, Rome, and Apple.
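With spaCy, both POS tags and named entities come out of the same pipeline (a minimal sketch, assuming the small English model en_core_web_sm is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
doc = nlp("Samsung opened a new office in London in 2019.")

# Part-of-speech tag for each token
print([(token.text, token.pos_) for token in doc])

# Named entities and their predicted labels
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Samsung', 'ORG'), ('London', 'GPE'), ('2019', 'DATE')]
```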
Word embeddings
Word embedding has evolved over time:
Before 2013, words were represented by one-hot vectors. However, there is no natural notion of similarity for one-hot vectors – for example, comparing “Seattle hotel” and “Seattle motel”. The solution was to learn to encode similarity into the vectors themselves using a learning algorithm based on neural networks. This builds on the idea that a word’s meaning is given by the words nearby, known as distributional semantics.
In this kind of approach, we build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts.
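A tiny numerical illustration of the problem and the fix (the vectors are made up for the example):

```python
import numpy as np

# One-hot vectors: "hotel" and "motel" never share a dimension,
# so their dot product (similarity) is always zero.
hotel = np.array([0, 0, 1, 0, 0])
motel = np.array([0, 1, 0, 0, 0])
print(hotel @ motel)  # 0

# Dense, learned vectors can place related words close together.
hotel_d = np.array([0.8, -0.1, 0.4])
motel_d = np.array([0.7, -0.2, 0.5])
cosine = hotel_d @ motel_d / (np.linalg.norm(hotel_d) * np.linalg.norm(motel_d))
print(round(cosine, 2))  # ~0.98, i.e. very similar
```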
In 2013, Google proposed a new model for word embeddings called word2vec. Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus. While word2vec is not a deep neural network, it turns text into a numerical form that deep neural networks can understand. It has since been followed by GloVe and fastText.
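Training word2vec yourself is only a few lines with gensim (a sketch assuming gensim ≥ 4; the toy corpus is far too small for meaningful vectors):

```python
from gensim.models import Word2Vec

# Toy corpus of already-tokenized sentences
sentences = [
    ["the", "patient", "reported", "a", "mild", "headache"],
    ["the", "doctor", "prescribed", "a", "mild", "painkiller"],
    ["the", "patient", "thanked", "the", "doctor"],
]

# vector_size is the embedding dimension, window the context size, sg=1 selects skip-gram
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv["patient"].shape)         # (50,)
print(model.wv.most_similar("patient"))  # nearest words in this toy space
```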
In 2018, ELMo was introduced. Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding. It uses a bi-directional LSTM trained on a specific task to create those embeddings. These contextual word embeddings help achieve state-of-the-art (SOTA) results in several NLP tasks.
Congratulations! Now we can finally apply our machine learning algorithms! Want to learn more? Watch the video above!