Codemotion Magazine

BERT: how Google changed NLP (and how to benefit from this)

Natural Language Processing has evolved significantly over the years. This article gives a brief overview of the history of NLP, arrives at today's state-of-the-art algorithm, BERT, and demonstrates how to use it in Python.

May 4, 2020 by Paolo Caressa

Table Of Contents
  1. Introduction
  2. Understanding natural language: from linguistics to word embedding
    • Linguistic rules
    • Information Retrieval and Neural Networks
    • Unsupervised Algorithms: word2vec and BERT
  3. BERT comes into play
  4. BERT: let’s play with it
  5. Conclusions

Introduction

One very broad and highly active field of research in AI (artificial intelligence) is NLP: Natural Language Processing. Scientists have been trying to teach machines how to understand and even write natural languages (such as English or Chinese) since the very beginning of computer science and artificial intelligence. One of the founding fathers of artificial intelligence, Alan Turing, suggested this as a possible application for the “learning machines” he imagined as early as the late 1940s (as discussed in a previous article). Other pioneers, such as Claude Shannon, who founded the mathematical theory of information and communication, have also suggested natural languages as a playground for the application of information technology and computer science.

The world has moved on since the days of these early pioneers, and today we use NLP solutions without even realizing it. We live in the world Turing dreamt of, but are scarcely aware of doing so!

The history of NLP is long and complex, involving several techniques that were once considered state of the art and are now barely remembered. Certain turning points changed the field forever and focused the attention of thousands of researchers on a single path forward. In recent years, the resources required to experiment and forge new paths in NLP have been available largely outside academia, in private hi-tech companies: hardware and large groups of researchers are more easily allocated to a particular task by Google, Facebook and Amazon than by the average university, even in the United States. Consequently, more and more new ideas arise out of big companies rather than universities. In NLP, at least two such ideas followed this pattern: the word2vec and BERT algorithms.

The former is a word embedding algorithm devised by Tomas Mikolov and others at Google in 2013 (the original code, written in C, is publicly available). The latter, BERT, was published by Google researchers in 2018, and about a year later Google announced that it was using BERT in its search engine. In both cases, the researchers released their solutions as open source, disclosing results, datasets and, of course, the full code.

Such rapid progress and impact on widely-used products is amazing and worthy of deeper analysis. This article will offer hints for developers who wish to play with this new tool.

Understanding natural language: from linguistics to word embedding

Linguistic rules

Before machine learning methods became effective and popular in the AI community, i.e., before the 1980s, natural language processing usually involved taking advantage of linguistic and logical theories (and even of the philosophy of language). The distinction between syntax and semantics, for example, is typical of that approach, in which the primary concern was to represent grammar rules in an effective way. The software then used these rules to analyse and represent natural language texts, in order to classify or summarize them, find answers to questions, translate from one language to another, and so on.

The results achieved were generally poor when compared with the effort required to set up such implementations. The computational element was minimal, consisting of building logical representations of sentences (using languages like Prolog or Lisp), and then applying some expert systems to analyse them.

Information Retrieval and Neural Networks

However, a different type of representation had already been devised for natural language in the IR (Information Retrieval) community. IR was a hot topic in the 1960s, when computers were used mainly to store large archives of data and documents. Given that these computers were rather slow, clever techniques were needed to retrieve information from inside electronic documents. In particular, algebraic models called vector space models were developed, in which documents were associated with vectors in an N-dimensional space. This provided not only a practical indexing of documents, but also offered the chance to locate similar documents in the same region of the space. For example, imagine associating a document with a point on a plane. Ideally, nearby points would correspond with similar documents, so that Euclidean distance can be used to discriminate between similar documents, etc.
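The vector space idea can be illustrated with a minimal sketch. It assumes the simplest possible representation, a bag-of-words count vector over a shared vocabulary; the three toy documents and the helper names are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to a vector of word counts, one coordinate per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def euclidean(u, v):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy corpus: two similar documents and one unrelated document.
docs = {
    "a": "the cat sat on the mat",
    "b": "the cat lay on the mat",
    "c": "stock prices fell sharply today",
}

# The vocabulary fixes the dimension N of the space.
vocabulary = sorted({w for text in docs.values() for w in text.lower().split()})
vectors = {name: bag_of_words(text, vocabulary) for name, text in docs.items()}

dist_ab = euclidean(vectors["a"], vectors["b"])  # similar documents
dist_ac = euclidean(vectors["a"], vectors["c"])  # unrelated documents
print(dist_ab, dist_ac)
```

Here the two documents about the cat end up closer to each other than to the one about stock prices, which is the behaviour the vector space model relies on.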

These techniques allowed users to deal numerically with documents, sentences and words. Once a linguistic object is mapped to a vector, the powerful weapons of numerical analysis and optimization may be applied to it. Neural networks offer a good example of one of the more powerful of such weapons. Indeed, a neural network is simply an optimization algorithm that works on vectors, and thus on points in an N-dimensional space.

Nonetheless, generally speaking, neural networks are supervised learning algorithms: a net, newly set up, is a blank space – to be properly used it needs to be trained on a set of data, i.e., a dataset which contains both input records and the correct answers for those records.

If we are interested in NLP tasks more sophisticated than classification, e.g., summarization, translation, text generation, etc., the use of standard neural networks requires a lot of labelled training data, which are difficult to produce. For example, to create translations, examples of texts in both the source and target language are needed – and a lot of them, since natural languages have tens of thousands of words, all of which can be arranged in extremely variable sentence combinations, several orders of magnitude more in number. That’s even before one considers that deep learning algorithms require millions of training cases to be properly trained.

Unsupervised Algorithms: word2vec and BERT

To deal with texts, it is better to devise unsupervised algorithms, which can be fed with raw data taken from the Internet, for example. BERT was initially fed with Wikipedia pages.

How can we design an unsupervised neural network? Neural networks are intrinsically supervised, but if we make some assumptions, we can transform unsupervised datasets into supervised ones. This works for data arranged in series, such as texts, which may be considered series of words.

Consider a text; via some tokenizing process, imagine having the sequence of words which constitute it: w1,…,wN. If the text is long – an article, a short story, a book – most words will appear more than once, but in a certain order. The idea behind word embedding is to associate each word with a vector (this is different from the vector space model of IR, which associates a vector with an entire document). In this way, the position of vectors within the N-dimensional space should reflect the contextual relationships between words.

Another way to look at this idea: given a text and its series of consecutive words w1,…,wN, imagine the set of pairs (w1,w2),(w2,w3),…,(wN−1,wN) as a training set, by means of which the network is able to learn the function y = f(x) such that f(wi) = wi+1. Thus, the network learns to predict the word that will follow another given word. This is a non-linear generalization of certain simple Markov algorithms, such as those used to write bullshit generators.
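The construction of a supervised training set from raw text can be sketched as follows. This is only a counting-based stand-in for the function f (a simple Markov table rather than a neural network), and the sample sentence and function names are hypothetical.

```python
from collections import Counter, defaultdict

def training_pairs(words):
    """Turn w1,...,wN into the supervised pairs (w1,w2), (w2,w3), ..., (wN-1,wN)."""
    return list(zip(words, words[1:]))

def fit_next_word(pairs):
    """Count, for every word, which words follow it and how often."""
    table = defaultdict(Counter)
    for current, following in pairs:
        table[current][following] += 1
    return table

def predict_next(table, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in table:
        return None
    return table[word].most_common(1)[0][0]

# Hypothetical toy text; in practice this would be a large raw corpus.
words = "the cat sat on the mat and the cat slept".split()
pairs = training_pairs(words)
table = fit_next_word(pairs)

print(pairs[:3])
print(predict_next(table, "the"))
```

No labels were written by hand: the "correct answer" for each word is simply the word that follows it in the text.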

A generalization of this process consists of considering each word in a text as surrounded by a ‘window’ of neighbouring words, and teaching the network to guess the missing word from that window. This process is used in the word2vec algorithm to train a shallow neural network with a single hidden layer. Such algorithms are context-free, as they associate a single vector with each word, whatever sentence it appears in.
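The 'window' idea can be sketched as a function producing (context, target) training examples, in the spirit of word2vec's continuous bag-of-words variant. The function name and the `radius` parameter are hypothetical, and the actual word2vec training procedure involves more than this data preparation step.

```python
def window_examples(words, radius=2):
    """For each position, pair the surrounding words (the window) with the word to guess."""
    examples = []
    for i, target in enumerate(words):
        left = words[max(0, i - radius):i]       # up to `radius` words before
        right = words[i + 1:i + 1 + radius]      # up to `radius` words after
        examples.append((left + right, target))
    return examples

words = "the quick brown fox jumps".split()
examples = window_examples(words, radius=1)
for context, target in examples:
    print(context, "->", target)
```

Words at the edges of the text simply get a shorter context, which is one of several possible conventions.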

More sophisticated algorithms take context into account: the vector associated with a word varies according to context, rather than remaining the same regardless of the other words surrounding it in the sentence.

Google’s BERT is such an algorithm.

BERT comes into play

BERT is an acronym of Bidirectional Encoder Representations from Transformers. The term bidirectional means that the context of a word is given by both the words that follow it and by the words preceding it. This technique makes this algorithm hard to train but very effective. Exploring the surrounding text around words is computationally expensive but allows a deeper understanding of words and sentences.

Unidirectional context-oriented algorithms already exist. A neural network can be trained to predict which word will follow a sequence of given words, once trained on a huge dataset of sentences. However, predicting a word from both the previous and the following words is not an easy task. The only way to do so effectively is to mask some words in a sentence and predict them, e.g., the sentence “the quick brown fox jumps over the lazy dog” might be masked as “the X brown fox jumps over the Y dog” with label (X = quick, Y = lazy) to become a labelled record in a training set of sentences. One can easily derive a training set from a bundle of unlabelled texts by simply masking 15% of words (as BERT does), and training the neural network to deduce the missing words from the remaining ones.
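The masking step can be sketched in a few lines. This is a simplified illustration, not BERT's actual preprocessing: the real procedure works on sub-word tokens and does not always substitute the mask symbol for a selected token. The function name, the `seed` parameter and the choice to mask at least one token are assumptions made for the example.

```python
import random

def mask_tokens(tokens, rate=0.15, mask="[MASK]", seed=0):
    """Hide a fraction of the tokens and return the masked sequence plus the labels."""
    rng = random.Random(seed)                        # fixed seed for reproducibility
    n_masked = max(1, round(len(tokens) * rate))     # mask at least one token
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    labels = {}                                      # position -> original word
    for p in positions:
        labels[p] = tokens[p]
        masked[p] = mask
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

Each (masked sentence, labels) pair is a labelled training record obtained without any human annotation.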

Notice that BERT is truly a deep learning algorithm, while context-free algorithms such as word2vec, based on shallow networks, may not be. As such, however, BERT’s training is very expensive, owing to the depth and size of its transformer architecture. Training on a huge body of text – for example, all English-language Wikipedia pages – is a Herculean effort that requires decidedly nontrivial computational power.

As a result, BERT’s creators disentangled the training phase from the tuning phase required to properly apply the algorithm to a specific task. The algorithm has to be trained once overall, and then fine tuned specifically for each context.

Luckily, BERT comes with several pre-trained representations already computed. On the GitHub page of the project a number of pre-trained models can be found. However, such representations are not enough to solve a specific problem. The model must be fine tuned to the desired task. The following demonstrates how this is done, using a test example:

BERT: let’s play with it

For our purely explanatory purposes, we will use Python to play with a standard text dataset, the Stanford Sentiment Treebank (from the ‘Deeply Moving’ project maintained at Stanford University), which contains short movie reviews from the ‘Rotten Tomatoes’ website. The dataset can be downloaded from the project page.

We assume that the dataset is stored inside a directory named stanfordSentimentTreebank. The sentences are stored in the file datasetSentences.txt as pairs (index, sentence), one per line. The split into training and testing sets is given in the file datasetSplit.txt, whose rows contain pairs (index, ds), with ds = 1 for training and ds = 2 for testing sentences (the code below treats every sentence whose ds is not 1 as a testing sentence). The file sentiment_labels.txt contains the labels attached to each sentence in the dataset, and also to each phrase inside it, as pairs (phrase index, score), with a score of 0 being the most negative sentiment and 1 the most positive. The list of all phrases along with their indexes is stored in dictionary.txt.

To keep things simple, we’ll focus only on complete sentences: our datasets keep the scores attached to sentences and leave out all remaining phrases. The dataset will also need to be built from those files – a simple process that is demonstrated below. BERT will then be applied to perform a test sentiment analysis on this dataset.

Whatever the task, it is not necessary to pre-train the BERT model, but only to fine-tune a pre-trained model on the specific dataset that relates to the problem we want to use BERT to study. We will try to use such a pre-trained model to perform our simple classification task: more exciting use cases may be found on the GitHub page of the project mentioned above, as well as elsewhere on the Web.

First, we choose the pre-trained model. Several choices are available in the BERT GitHub repository; we will use the one known as ‘BERT-tiny’, aka bert_en_uncased_L-2_H-128_A-2.

This pre-trained representation has been obtained by converting training texts into lowercase, with accent markers stripped out. Moreover, the model is set up as a network with 2 layers and a hidden size of 128, for a total of about 4.4M parameters to train. This is the tiniest model; others include ‘BERT-mini’ (4 layers, 256 hidden), ‘BERT-small’ (4 layers, 512 hidden), ‘BERT-medium’ (8 layers, 512 hidden) and ‘BERT-base’ (12 layers, 768 hidden). Of course, the larger the network architecture, the more computational effort is needed to fine-tune these models. As the purpose of this article is purely explanatory, we’ll stick to the tiniest option to enable anyone to run the following code snippets, even if no GPUs are available.
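To get a feeling for how the number of layers and the hidden size drive the model size, here is a rough back-of-the-envelope sketch. It assumes the standard BERT encoder layout (word, position and segment embeddings, followed by identical transformer blocks), the 30,522-entry uncased vocabulary, 512 positions and a feed-forward size of four times the hidden size; the pooler and any task-specific head are left out, and the exact figure reported by a given library may differ slightly depending on which auxiliary weights it instantiates.

```python
def bert_encoder_params(layers, hidden, vocab_size=30522, max_positions=512,
                        type_vocab=2, intermediate=None):
    """Approximate parameter count of a BERT encoder (no pooler, no task head)."""
    if intermediate is None:
        intermediate = 4 * hidden              # usual feed-forward width

    def dense(n_in, n_out):
        return n_in * n_out + n_out            # weight matrix plus bias

    layer_norm = 2 * hidden                    # scale and shift vectors

    # Word, position and segment embeddings, followed by a layer normalization.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, then a layer normalization.
    attention = 4 * dense(hidden, hidden) + layer_norm
    # Two-layer feed-forward block, then a layer normalization.
    feed_forward = dense(hidden, intermediate) + dense(intermediate, hidden) + layer_norm

    return embeddings + layers * (attention + feed_forward)

tiny = bert_encoder_params(layers=2, hidden=128)
base = bert_encoder_params(layers=12, hidden=768)
print(f"BERT-tiny: {tiny:,} parameters")
print(f"BERT-base: {base:,} parameters")
```

Under these assumptions BERT-tiny comes out at about 4.4 million parameters, most of them in the embedding table, while BERT-base is roughly 25 times larger.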

The pre-trained model can be downloaded from the repository and extracted into a local folder. This folder will contain the following files:

  • bert_config.json
  • bert_model.ckpt.data-00000-of-00001
  • bert_model.ckpt.index
  • vocab.txt

The first file contains all the configuration necessary to build a network layer to use this BERT model, while the last one, vocab.txt, is needed to properly tokenize our texts. The largest file contains the model weights, which may be loaded from the BERT library.

To remain focused on the model, we assume that the pre-trained files listed above are stored in a local folder (named bert_layer in the code below), and that our code is run from a directory containing both that folder and the stanfordSentimentTreebank directory with our dataset. All of this must be in place before running the following programs.

Before setting up the model, our dataset is tokenized according to the format expected by the BERT layers; this can be done via the FullTokenizer class from the BERT package. The tokenizer is fed with each sentence in our dataset. Its result, a list of strings, is enclosed between “[CLS]” and “[SEP]”, as required by the BERT algorithm implementation, then converted to integer ids and truncated or zero-padded to a fixed length.

The output of our model will be simply a number between 0 and 1.

```python
import bert
import numpy as np
import os

BERT_PATH = "bert_layer"
VOCAB_TXT = os.path.join(BERT_PATH, "vocab.txt")
DATASET_DIR = "stanfordSentimentTreebank"
SENTENCES = os.path.join(DATASET_DIR, "datasetSentences.txt")
SCORES = os.path.join(DATASET_DIR, "sentiment_labels.txt")
SPLITTING = os.path.join(DATASET_DIR, "datasetSplit.txt")
DICTIONARY = os.path.join(DATASET_DIR, "dictionary.txt")

MAX_LENGTH = 64     # Length of word vectors which the model accepts as input

tokenizer = bert.bert_tokenization.FullTokenizer(VOCAB_TXT, do_lower_case = True)

training_set, training_scores = [], []
testing_set, testing_scores = [], []

# For each sentence in the dataset, tokenize it and stores it inside either
# the training or the testing set, according to the value in `datasetSplit.txt`

def read_file(filename):
    with open(filename) as f:
        f.readline()    # skips the heading line
        return f.readlines()

sentences = read_file(SENTENCES)
scores = read_file(SCORES)
splitting = read_file(SPLITTING)
dictionary = read_file(DICTIONARY)

# let scores[i] = score being i an int index and score a float score.
scores = {int(s[:s.index("|")]): float(s[s.index("|")+1:]) for s in scores}

# let splitting[i] = int denoting the kind of dataset (1=training, 2=testing).
splitting = {int(s[:s.index(",")]): int(s[s.index(",")+1:]) for s in splitting}

# let dictionary[s] = phrase index of the corresponding string
dictionary = {s[:s.index("|")]: int(s[s.index("|")+1:]) for s in dictionary}

# Now looks for each sentence inside the dictionary, retrieves the index and looks for
# the index in the scores, creating a list of sentences and scores
for s in sentences:
    i = int(s[:s.index("\t")])       # sentence index, to be matched in splitting
    s = s[s.index("\t") + 1:][:-1]   # extract the sentence (strip the ending "\n")
    if s not in dictionary:
        continue
    ph_i = dictionary[s]             # associated phrase index
    # Now tokenizes the sentence and put it into the BERT format
    s = tokenizer.tokenize(s)
    if len(s) > MAX_LENGTH - 2:
        s = s[:MAX_LENGTH - 2]
    s = tokenizer.convert_tokens_to_ids(["[CLS]"] + s + ["[SEP]"])
    if len(s) < MAX_LENGTH:
        s += [0] * (MAX_LENGTH - len(s))
    # Decides in which dataset to store the data
    if splitting[i] == 1:
        training_set.append(s)
        training_scores.append(scores[ph_i])
    else:
        testing_set.append(s)
        testing_scores.append(scores[ph_i])

training_set, training_scores = np.array(training_set), np.array(training_scores)
testing_set, testing_scores = np.array(testing_set), np.array(testing_scores)
```
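The truncation and padding step in the listing above can be isolated in a small helper, which makes its behaviour on short and long sentences easier to see. This is a sketch on plain integer ids: the helper name is hypothetical, and the ids used for “[CLS]” and “[SEP]” are illustrative values (in the real pipeline they come from vocab.txt via the tokenizer).

```python
def to_model_input(token_ids, cls_id, sep_id, max_length, pad_id=0):
    """Wrap ids between [CLS] and [SEP], truncating or zero-padding to max_length."""
    body = token_ids[:max_length - 2]            # leave room for the two markers
    ids = [cls_id] + body + [sep_id]
    return ids + [pad_id] * (max_length - len(ids))

CLS_ID, SEP_ID = 101, 102    # illustrative ids, assumed for this example

short = to_model_input([7, 8, 9], CLS_ID, SEP_ID, max_length=8)
long = to_model_input(list(range(10, 30)), CLS_ID, SEP_ID, max_length=8)
print(short)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(long)    # [101, 10, 11, 12, 13, 14, 15, 102]
```

Every sentence therefore becomes a vector of exactly the same length, which is what allows the whole dataset to be stacked into a single NumPy array.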

Let’s use the BERT model that we downloaded from the GitHub repository. As usual in these kinds of models, fine tuning requires setting some hyper-parameters, i.e., parameters external to the model, such as the learning rate, the batch size, the number of epochs. Finding the right combination is the nightmare of every ML practitioner, but in BERT’s case, we have some suggestions from its inventors:

  • Batch size: 8, 16, 32, 64, 128 (in general, the larger the BERT model, the smaller the size)
  • Learning rate: 3e-4, 1e-4, 5e-5, 3e-5
  • Epochs: 4

As usual, we opt for the Adam optimizer, even if it is more expensive computationally. Nothing special is added to the BERT network layer provided by Google: its output, a tensor with one 128-dimensional vector per token, is averaged over the token axis by a GlobalAveragePooling1D layer, a pooling technique that became popular in deep learning research during the 2010s. Next, the pooled output is provided to a fully connected layer, the result of which is turned into the output of the network. The summary method of the Keras Model class allows us to show the shapes of the layers, and the verbose option in the training method is turned on to see the performance across epochs.
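What the pooling step does to the shape of the BERT output can be reproduced with NumPy alone. The following is a minimal sketch of the averaging performed over the token axis, on a tiny made-up tensor; it is not the Keras implementation itself, and the function name is hypothetical.

```python
import numpy as np

def global_average_pooling_1d(x):
    """Average a (batch, steps, features) tensor over the steps axis."""
    return x.mean(axis=1)

# Made-up tensor: 2 sentences, 4 tokens each, 3 features per token.
batch = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
pooled = global_average_pooling_1d(batch)

print(batch.shape, "->", pooled.shape)   # (2, 4, 3) -> (2, 3)
print(pooled)
```

In the network described above the same operation turns the (None, 64, 128) output of the BERT layer into one 128-dimensional vector per sentence, which the dense layers can then consume.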

Of course, this is a very simple architecture, which may be made more elaborate at will to fit more complex purposes. However, if time and/or resources permit, it is better to start with a larger pre-trained BERT model.

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling1D, Input, Lambda
from tensorflow.keras.optimizers import Adam

# Loads the bert pre-trained layer to plug into our network
bert_params = bert.params_from_pretrained_ckpt(BERT_PATH)
bert_layer = bert.BertModelLayer.from_params(bert_params, name = "BERT")
bert_layer.apply_adapter_freeze()

# We arrange our layers by composing them as functions, with the input layer as inmost one
input_layer = Input(shape=(MAX_LENGTH,), dtype = 'int32', name = 'input_ids')
output_layer = bert_layer(input_layer)
#output_layer = Dense(128, activation = "tanh")(output_layer)
#output_layer = Dropout(0.5)(output_layer)
#output_layer = Lambda(lambda x: x[:, :, 0])(output_layer)  # we drop the second dimension
#output_layer = Dropout(0.5)(output_layer)
#output_layer = Dense(1)(output_layer)
#output_layer = Dropout(0.5)(output_layer)
output_layer = GlobalAveragePooling1D()(output_layer)
output_layer = Dense(128, activation = "relu")(output_layer)
output_layer = Dense(1, activation = "relu")(output_layer)

neural_network = Model(inputs = input_layer, outputs = output_layer)
neural_network.build(input_shape = (None, MAX_LENGTH))

# Note: the calls above build the layer from bert_config.json only. With the
# bert-for-tf2 package, the checkpoint weights are loaded by a separate call such
# as the one below, which was not part of the run whose output is reported here:
#bert.load_stock_weights(bert_layer, os.path.join(BERT_PATH, "bert_model.ckpt"))

neural_network.compile(loss = "mse", optimizer = Adam(learning_rate = 3e-5))
neural_network.summary()

neural_network.fit(
    training_set,
    training_scores,
    batch_size= 128,
    shuffle = True,
    epochs = 4,
    validation_data = (testing_set, testing_scores),
    verbose = 1
)
```

And here is the output:

```
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_ids (InputLayer)       [(None, 64)]              0
_________________________________________________________________
BERT (BertModelLayer)        (None, 64, 128)           4369152
_________________________________________________________________
global_average_pooling1d_2 ( (None, 128)               0
_________________________________________________________________
dense_4 (Dense)              (None, 128)               16512
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 129
=================================================================
Total params: 4,385,793
Trainable params: 4,385,793
Non-trainable params: 0
_________________________________________________________________
Train on 8117 samples, validate on 3169 samples
Epoch 1/4
8117/8117 [==============================] - 100s 12ms/sample - loss: 0.0766 - val_loss: 0.0661
Epoch 2/4
8117/8117 [==============================] - 99s 12ms/sample - loss: 0.0629 - val_loss: 0.0635
Epoch 3/4
8117/8117 [==============================] - 101s 12ms/sample - loss: 0.0592 - val_loss: 0.0594
Epoch 4/4
8117/8117 [==============================] - 94s 12ms/sample - loss: 0.0550 - val_loss: 0.0571
<tensorflow.python.keras.callbacks.History at 0x1ba0ef8cf48>
```

Now we have a simple pre-trained BERT model fine-tuned and trained on our dataset. Let’s use it to check some sentences about imaginary movies: the network will order them from the most negative to the most positive.

```python
some_sentences = [
    "The film is not bad but the actors should take acting lessons",
    "Another chiefwork by a master of western movies",
    "This film is just disappointing: do not waste time on it",
    "Well directed but poorly acted",
    "The movie is well directed and greatly acted",
    "A honest zombie movie with actually no new ideas",
]

print("The following sentences will be sorted from the most negative upward")
for s in some_sentences:
    print("\t", s)

ranked = []     # pairs (rank, sentence)
for s in some_sentences:
    t = tokenizer.tokenize(s)
    if len(t) > MAX_LENGTH - 2:
        t = t[:MAX_LENGTH - 2]
    t = tokenizer.convert_tokens_to_ids(["[CLS]"] + t + ["[SEP]"])
    if len(t) < MAX_LENGTH:
        t += [0] * (MAX_LENGTH - len(t))
    p = neural_network.predict(np.array([t]))[0][0]
    ranked.append((p, s))

print("Network ranking from negative to positive")
for r in sorted(ranked):
    print("\t", r[1])
```

Here is the output:

```
The following sentences will be sorted from the most negative upward
         The film is not bad but the actors should take acting lessons
         Another chiefwork by a master of western movies
         This film is just disappointing: do not waste time on it
         Well directed but poorly acted
         The movie is well directed and greatly acted
         A honest zombie movie with actually no new ideas
Network ranking from negative to positive
         This film is just disappointing: do not waste time on it
         A honest zombie movie with actually no new ideas
         The film is not bad but the actors should take acting lessons
         Another chiefwork by a master of western movies
         Well directed but poorly acted
         The movie is well directed and greatly acted
```

Not bad for the tiniest model with the minimum Keras wrapping…

Of course, this test exercise may be improved on in several respects. A categorical multiclass classification could substitute for our numerical guessing, e.g. by mapping the sentiment score x found in the dataset to a class using the following cut-offs:

  • very negative if 0 ≤ x ≤ 0.2
  • negative if 0.2 < x ≤ 0.4
  • neutral if 0.4 < x ≤ 0.6
  • positive if 0.6 < x ≤ 0.8
  • very positive if 0.8 < x ≤ 1

In this case, the output layer should be a softmax, and the compilation of the net should use categorical cross entropy, etc. One would also want to include more layers and dropouts, and so on.
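The cut-offs listed above can be turned into class labels with a few lines of code. This is a sketch only: the helper names are hypothetical, and clamping scores that fall outside the interval [0, 1] is a choice made here for the example. The one-hot vectors are the kind of target a softmax output trained with categorical cross entropy would expect.

```python
import bisect

LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]
CUTOFFS = [0.2, 0.4, 0.6, 0.8]     # upper bounds (inclusive) of the first four classes

def sentiment_class(x):
    """Index of the class a score x in [0, 1] falls into, according to the cut-offs."""
    x = min(max(x, 0.0), 1.0)                 # clamp out-of-range scores (a choice)
    return bisect.bisect_left(CUTOFFS, x)     # a score equal to a bound stays in the lower class

def one_hot(index, size=len(LABELS)):
    """One-hot target vector for a class index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

for score in (0.05, 0.4, 0.5, 0.79, 1.0):
    k = sentiment_class(score)
    print(score, LABELS[k], one_hot(k))
```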

Conclusions

Despite their simplicity of use, BERT models outperform previous NLP tools in several respects: they may be used not only to classify but also to predict, to translate, to summarize, and to improve the automatic understanding of natural languages. The availability of multiple pre-trained models, even in languages other than English, keeps improving, and, while fine-tuning tasks are much harder than one might imagine after reading this description, I hope that the importance of this class of algorithms for practical and theoretical purposes in machine learning has been sufficiently highlighted by the previous discussion.
