{"id":4484,"date":"2020-05-04T12:23:16","date_gmt":"2020-05-04T10:23:16","guid":{"rendered":"https:\/\/www.codemotion.com\/magazine\/?p=4484"},"modified":"2022-01-05T20:04:57","modified_gmt":"2022-01-05T19:04:57","slug":"bert-how-google-changed-nlp-and-how-to-benefit-from-this","status":"publish","type":"post","link":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/","title":{"rendered":"BERT: how Google changed NLP (and how to benefit from this)"},"content":{"rendered":"\t\t\t\t<div class=\"wp-block-uagb-table-of-contents uagb-toc__align-left uagb-toc__columns-1  uagb-block-121574b9      \"\n\t\t\t\t\tdata-scroll= \"1\"\n\t\t\t\t\tdata-offset= \"30\"\n\t\t\t\t\tstyle=\"\"\n\t\t\t\t>\n\t\t\t\t<div class=\"uagb-toc__wrap\">\n\t\t\t\t\t\t<div class=\"uagb-toc__title\">\n\t\t\t\t\t\t\tTable Of Contents\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"uagb-toc__list-wrap \">\n\t\t\t\t\t\t<ol class=\"uagb-toc__list\"><li class=\"uagb-toc__list\"><a href=\"#introduction\" class=\"uagb-toc-link__trigger\">Introduction<\/a><li class=\"uagb-toc__list\"><a href=\"#understanding-natural-language-from-linguistics-to-word-embedding\" class=\"uagb-toc-link__trigger\">Understanding natural language: from linguistics to word embedding<\/a><ul class=\"uagb-toc__list\"><li class=\"uagb-toc__list\"><a href=\"#linguistic-rules\" class=\"uagb-toc-link__trigger\">Linguistic rules<\/a><li class=\"uagb-toc__list\"><li class=\"uagb-toc__list\"><a href=\"#information-retrieval-and-neural-networks\" class=\"uagb-toc-link__trigger\">Information Retrieval and Neural Networks<\/a><li class=\"uagb-toc__list\"><li class=\"uagb-toc__list\"><a href=\"#unsupervised-algorithms-word2vec-and-bert\" class=\"uagb-toc-link__trigger\">Unsupervised Algorithms: word2vec and BERT<\/a><li class=\"uagb-toc__list\"><li class=\"uagb-toc__list\"><a href=\"#codemotion-online-tech-conferenceshaping-the-future-with-deep-learning\" class=\"uagb-toc-link__trigger\">Codemotion Online Tech ConferenceShaping the Future with Deep Learning<\/a><\/li><\/ul><\/li><li class=\"uagb-toc__list\"><a href=\"#bert-comes-into-play\" class=\"uagb-toc-link__trigger\">BERT comes into play<\/a><li class=\"uagb-toc__list\"><a href=\"#bert-lets-play-with-it\" class=\"uagb-toc-link__trigger\">BERT: let&#039;s play with it<\/a><li class=\"uagb-toc__list\"><a href=\"#conclusions\" class=\"uagb-toc-link__trigger\">Conclusions<\/a><\/ul><\/ol>\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\n\n\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p>One very broad and highly active field of <span id=\"urn:enhancement-4a5dbe86\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/research\">research<\/span> in <span id=\"urn:enhancement-6a33a218\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/artificial_intelligence\">AI<\/span> (artificial intelligence) is <strong><span id=\"urn:enhancement-ebd7d65b\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/natural_language_processing\">NLP<\/span><\/strong>: <em>Natural Language Processing<\/em>. Scientists have been trying to teach machines how to understand and even write natural languages (such as English or Chinese) since the very beginning of <span id=\"urn:enhancement-3bfba64e\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/computer_science\">computer science<\/span> and <span id=\"urn:enhancement-7f166613\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/artificial_intelligence\">artificial intelligence<\/span>. One of the founding fathers of <span id=\"urn:enhancement-95424e79\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/artificial_intelligence\">artificial intelligence<\/span>,  <strong>Alan <span id=\"urn:enhancement-4ed83b6b\" class=\"textannotation disambiguated wl-person\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/alan_turing\">Turing<\/span><\/strong>, suggested this as a possible <span id=\"urn:enhancement-ce1bafaf\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/application_software\">application<\/span> for the &#8220;learning machines&#8221; he imagined as early as the late 1940s (as <a href=\"https:\/\/www.codemotion.com\/magazine\/dev-hub\/machine-learning-dev\/artificial-intelligence-the-new-electricity\/\">discussed in a previous article<\/a>). Other pioneers, such as Claude <span id=\"urn:enhancement-e4999161\" class=\"textannotation disambiguated wl-person\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/claude_shannon\">Shannon<\/span>, who founded the mathematical theory of information and communication, have also suggested natural languages as a playground for the <span id=\"urn:enhancement-42a22175\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/application_software\">application<\/span> of <span id=\"urn:enhancement-ac92d808\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/information_technology\">information technology<\/span> and <span id=\"urn:enhancement-10e5f97f\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/computer_science\">computer science<\/span>.<\/p>\n\n\n\n<p>The world has moved on since the days of these early pioneers, and today we use <span id=\"urn:enhancement-915d68cf\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/natural_language_processing\">NLP<\/span> solutions without even realizing it. We live in the world <span id=\"urn:enhancement-153304a5\" class=\"textannotation disambiguated wl-person\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/alan_turing\">Turing<\/span> dreamt of, but are scarcely aware of doing so!<\/p>\n\n\n\n<p>The history of <span id=\"urn:enhancement-286bd469\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/natural_language_processing\">NLP<\/span> is long and complex, involving several <span id=\"urn:enhancement-610e0078\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/scientific_technique\">techniques<\/span> once considered state of the art that now are barely remembered. Certain turning points in this history changed the field forever, and focused the attention of thousands of <span id=\"urn:enhancement-5c5cd77f\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/research\">researchers<\/span> on a single path forward. In recent years, the <span id=\"urn:enhancement-a208e25b\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/resource\">resources<\/span> required to <span id=\"urn:enhancement-3ef209fc\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/experiment\">experiment<\/span> and forge new paths in <span id=\"urn:enhancement-3500d409\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/natural_language_processing\">NLP<\/span> have largely only been available outwith academia.  Such resources are most available to private hi-tech <span id=\"urn:enhancement-75746158\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/company\">companies<\/span>: hardware and large groups of <span id=\"urn:enhancement-e5929bd9\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/research\">researchers<\/span> are more easily allocated to a particular task by <span id=\"urn:enhancement-97bdeb4b\" class=\"textannotation disambiguated wl-organization\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/google\">Google<\/span>, <span id=\"urn:enhancement-237310bb\" class=\"textannotation disambiguated wl-creative-work\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/facebook\">Facebook<\/span> and <span id=\"urn:enhancement-620235a5\" class=\"textannotation disambiguated wl-organization\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/amazon-com\">Amazon<\/span> than by the average university, even in the United States. Consequently, more and more new <span id=\"urn:enhancement-3285a62c\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/idea\">ideas<\/span> arise out of big <span id=\"urn:enhancement-fe90ade3\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/company\">companies<\/span> rather than universities. In <span id=\"urn:enhancement-790d0f22\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/natural_language_processing\">NLP<\/span>, at least two such <span id=\"urn:enhancement-1ba324cd\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/idea\">ideas<\/span> followed this pattern: the <strong>word2vec<\/strong> and <strong>BERT<\/strong> algorithms.<\/p>\n\n\n\n<p>The former is a word embedding algorithm devised by Tomas Mikolov and others in 2013 (the original C++ code <a aria-label=\" (opens in a new tab)\" href=\"https:\/\/code.google.com\/archive\/p\/word2vec\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\" class=\"ek-link\">can be found here<\/a>). Another important <span id=\"urn:enhancement-cf4ff09d\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/class_computer_programming\">class<\/span> of algorithms &#8211; BERT &#8211; was published by <span id=\"urn:enhancement-b9cfaf8d\" class=\"textannotation disambiguated wl-organization\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/google\">Google<\/span> researchers in 2018. Within just a few months these algorithms <strong>replaced previous <span id=\"urn:enhancement-7b127b62\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/natural_language_processing\">NLP<\/span> algorithms in the <span id=\"urn:enhancement-74c1153b\" class=\"textannotation disambiguated wl-organization\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/google\">Google<\/span> <span id=\"urn:enhancement-eaa01a72\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/web_search_engine\">Search Engine<\/span><\/strong>. In both cases, the researchers released their solutions as <strong><span id=\"urn:enhancement-16dd4358\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/open_source\">open source<\/span><\/strong>, disclosing results, datasets and of course, the full code.<\/p>\n\n\n\n<p>Such rapid progress and impact on widely-used products is amazing and worthy of deeper analysis. This <span id=\"urn:enhancement-149a7681\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/article_publishing\">article<\/span> will offer hints for <span id=\"urn:enhancement-a6b27e27\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/software_developer\">developers<\/span> who wish to play with this new <span id=\"urn:enhancement-608f9a16\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/tool\">tool<\/span>.<\/p>\n\n\n\n<p>If you are interested in NLP, especially in conjunction with Deep Learning, don&#8217;t miss the opportunity to attend our <strong>Deep Learning Online Conference<\/strong>! Find out more information at <a href=\"https:\/\/bit.ly\/3aObhvV\" class=\"ek-link\">this link<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Understanding natural language: from linguistics to word embedding<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Linguistic rules<\/h3>\n\n\n\n<p>Before <span id=\"urn:enhancement-c10118db\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/machine_learning\">machine learning<\/span> methods became effective and popular in the <span id=\"urn:enhancement-2487bd3a\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/artificial_intelligence\">AI<\/span> community, i.e., before the 1980s, <span id=\"urn:enhancement-1763cc51\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/natural_language_processing\">natural language processing<\/span> usually involved taking advantage of <span id=\"urn:enhancement-96ad252c\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/linguistics\">linguistic<\/span> and logical theories (and even of the philosophy of language). The distinction between syntax and semantics, for example, is a consequence typical of that approach, in which the primary concern was trying to represent <span id=\"urn:enhancement-76b862b\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/grammar\">grammar<\/span> rules in an effective way. The software then used these rules to analyse and represent natural language <span id=\"urn:enhancement-8d62ce7\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/written_language\">texts<\/span>, in order to <span id=\"urn:enhancement-b44aabd1\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/statistical_classification\">classify<\/span> or summarize them, or to find answers to questions, translate from one language to another one, and so on.<\/p>\n\n\n\n<p>The <strong>results achieved were generally poor<\/strong> when compared with the effort required to set up such <span id=\"urn:enhancement-2345c054\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/implementation\">implementations<\/span>. The computational element was minimal, consisting of building logical representations of sentences (using languages like <span id=\"urn:enhancement-d7cd29fa\" class=\"textannotation disambiguated wl-creative-work\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/prolog\">Prolog<\/span> or <span id=\"urn:enhancement-5720ce3c\" class=\"textannotation disambiguated wl-creative-work\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/lisp_programming_language\">Lisp<\/span>), and then applying some <strong>expert systems<\/strong> to analyse them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Information Retrieval and Neural Networks<\/h3>\n\n\n\n<p>However, a different type of representation had already been devised for natural language in the <span id=\"urn:enhancement-165d33b5\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/information_retrieval\">IR<\/span> (<em><span id=\"urn:enhancement-7bdcd47f\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/information_retrieval\">Information Retrieval<\/span><\/em>) <span id=\"urn:enhancement-76e30c16\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/community\">community<\/span>. <span id=\"urn:enhancement-3dc10c04\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/information_retrieval\">IR<\/span> was a hot topic in the 1960s, when computers were used mainly to store large archives of <span id=\"urn:enhancement-d28c523c\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/data\">data<\/span> and <span id=\"urn:enhancement-a7c6d07\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/document\">documents<\/span>. Given that these computers were rather slow, clever techniques were needed to retrieve information from inside electronic <span id=\"urn:enhancement-26e406b\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/document\">documents<\/span>. In particular, <em><span id=\"urn:enhancement-1e0ec772\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/algebraic_number\">algebraic<\/span><\/em> <span id=\"urn:enhancement-68711d1b\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/scientific_modelling\">models<\/span> called <em>vector space <span id=\"urn:enhancement-385bc7af\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/scientific_modelling\">models<\/span><\/em> were developed, in which <span id=\"urn:enhancement-10e3a71e\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/document\">documents<\/span> were associated with vectors in an <em>N<\/em>-dimensional space. This provided not only a practical indexing of <span id=\"urn:enhancement-adae9406\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/document\">documents<\/span>, but also offered the chance to locate similar <span id=\"urn:enhancement-79839622\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/document\">documents<\/span> in the same region of the space. For example, imagine associating a <span id=\"urn:enhancement-763ff980\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/document\">document<\/span> with a point on a plane. Ideally, nearby points would correspond with similar <span id=\"urn:enhancement-18e42c17\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/document\">documents<\/span>, so that Euclidean distance can be used to discriminate between similar <span id=\"urn:enhancement-7de3029\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/document\">documents<\/span>, etc.<\/p>\n\n\n\n<p>These techniques allowed users to <strong>deal numerically with <span id=\"urn:enhancement-a9054176\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/document\">documents<\/span>, sentences and <span id=\"urn:enhancement-acb5e38b\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/word\">words<\/span><\/strong>. Once one <span id=\"urn:enhancement-17bec22a\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/linguistics\">linguistic<\/span> object is <span id=\"urn:enhancement-365bacf3\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/map_mathematics\">mapped<\/span> to a vector, the powerful weapons of numerical analysis and optimization may be applied to deal with them. <strong>Neural networks<\/strong> offer a good example of one of the more powerful of such weapons. Indeed, a neural network is simply an optimization algorithm that works on vectors, and thus on points in an <em>N<\/em>-dimensional space.<\/p>\n\n\n\n<p>Nonetheless, generally speaking neural network are <strong>supervised <span id=\"urn:enhancement-6b7a1250\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/learning\">learning<\/span> algorithms<\/strong>: a net, newly set up, is a blank space &#8211; to be properly used <a href=\"https:\/\/www.codemotion.com\/magazine\/dev-hub\/machine-learning-dev\/understanding-ai-training\/\" class=\"ek-link\">it needs to be trained on a set of data<\/a>, i.e., a <span id=\"urn:enhancement-9c8f75cf\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/data_set\">dataset<\/span> which contains both input records and the correct answers for that record.<\/p>\n\n\n\n<p>If we are interested in <span id=\"urn:enhancement-39c1a151\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/natural_language_processing\">NLP<\/span> tasks more sophisticated than <span id=\"urn:enhancement-c2dde160\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/statistical_classification\">classification<\/span>, e.g., summarization, translation, text generation, etc., the use of standard neural networks requires a lot of labelled <span id=\"urn:enhancement-947aa947\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/training\">training<\/span> <span id=\"urn:enhancement-2ccc57d8\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/data\">data<\/span>, which are difficult to produce. For example, to create translations, examples of texts in both the source and target language are needed &#8211; and a lot of them, since natural languages have tens of thousands of <span id=\"urn:enhancement-a49a7b68\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/word\">words<\/span>, all of which can be arranged in extremely variable sentence combinations, several orders of magnitude more in number. That&#8217;s even before one considers that <strong>deep <span id=\"urn:enhancement-7ccddea8\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/learning\">learning<\/span><\/strong> algorithms require millions of <span id=\"urn:enhancement-7a58e704\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/training\">training<\/span> cases to be properly trained.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Unsupervised Algorithms: word2vec and BERT<\/h3>\n\n\n\n<p>To deal with texts, it is better to devise <strong>unsupervised algorithms<\/strong>, which can be fed with raw <span id=\"urn:enhancement-ee6a6819\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/data\">data<\/span> taken from the <span id=\"urn:enhancement-c0616df9\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/internet\">Internet<\/span>, for example. <strong>BERT was initially fed with <span id=\"urn:enhancement-d2a0c9b7\" class=\"textannotation disambiguated wl-creative-work\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/wikipedia\">Wikipedia<\/span> pages<\/strong>.<\/p>\n\n\n\n<p><strong>How can we design an unsupervised neural network?<\/strong> Neural networks are intrinsically supervised, but if we make some assumptions, we can transform unsupervised <span id=\"urn:enhancement-c1994dac\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/data_set\">datasets<\/span> into supervised ones. This works for <span id=\"urn:enhancement-fb3b3f24\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/data\">data<\/span> arranged in series, such as texts, which may be considered series of <span id=\"urn:enhancement-8232107d\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/word\">words<\/span>.<\/p>\n\n\n\n<p>Consider a text; via some tokenizing process, imagine having the sequence of <span id=\"urn:enhancement-6fe90631\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/word\">words<\/span> which constitute it: <em>w<\/em><sub>1<\/sub>,&#8230;,<em>w<sub>N<\/sub><\/em>. If the text is long &#8211; an <span id=\"urn:enhancement-64abfe1a\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/article_publishing\">article<\/span>, a short story, a book &#8211; most <span id=\"urn:enhancement-56329ade\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/word\">words<\/span> will appear more than once, but in a certain order. The <span id=\"urn:enhancement-8cea0961\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/idea\">idea<\/span> behind <strong><span id=\"urn:enhancement-a66d0ef3\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/word\">word<\/span> <span id=\"urn:enhancement-fa57ac6f\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/embedding\">embedding<\/span><\/strong> associates each <span id=\"urn:enhancement-ef8b67e9\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/word\">word<\/span> with a certain vector (this is different from the vector space model of IR which associates a vector with an entire <span id=\"urn:enhancement-63b614a\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/document\">document<\/span>). In this way, the position of vectors within the <em>n<\/em>-dimensional space should reflect the contextual relationships between <span id=\"urn:enhancement-b3cdd51f\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/word\">words<\/span>.<\/p>\n\n\n\n<p>Another way to look at this <span id=\"urn:enhancement-e2943179\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/idea\">idea<\/span>: given a text and its series of consecutive <span id=\"urn:enhancement-d01496db\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/word\">words<\/span> <em>w<\/em><sub>1<\/sub>,&#8230;,<em>w<sub>N<\/sub><\/em>, imagine the set of pairs (<em>w<\/em><sub>1<\/sub>,<em>w<\/em><sub>2<\/sub>),(<em>w<\/em><sub>2<\/sub>,<em>w<\/em><sub>3<\/sub>),&#8230;,(<em>w<sub>N<\/sub><\/em><sub>\u22121<\/sub>,<em>w<sub>N<\/sub><\/em>) as a <span id=\"urn:enhancement-33af8a8b\" class=\"textannotation disambiguated wl-thing\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/training_set\">training set<\/span>, by means of which the network is able to learn the function <em>y <\/em>= <em>f<\/em>(<em>x<\/em>) such that <em>f<\/em>(<em>w<sub>i<\/sub><\/em>) = <em>w<sub>i<\/sub><\/em><sub>+1<\/sub>. Thus, the network learns to predict the word that will follow another given word. This is a non-linear generalization of certain simple Markov algorithms, such as those used to write bullshit generators.<\/p>\n\n\n\n<p>A generalization of this process consists of considering each word in a text as surrounded by a &#8216;window&#8217; in an effort to teach the network how to guess the missing word in a sentence.  This process is used in the <strong>word2vec<\/strong> algorithm to train a <strong>recurrent neural network<\/strong>. Such algorithms are consequently <em>context-free<\/em>, as they simply associate vectors with single words.<\/p>\n\n\n\n<p>More sophisticated algorithms involve the consideration of context, in which the vector associated to a word is varies according to context, rather than remaining the same regardless of the other words surrounding it in the sentence.<\/p>\n\n\n\n<p>Google&#8217;s BERT is such an algorithm.<\/p>\n\n\n\n<div class=\"uagb-cta__outer-wrap uagb-block-35162b6b\"><div class=\"uagb-cta__content-wrap uagb-cta__block uagb-cta__icon-position-right uagb-cta__content-right uagb-cta__content-stacked-tablet uagb-cta__button-valign-middle \"><div class=\"uagb-cta__left-right-wrap\"><div class=\"uagb-cta__content\"><div class=\"uagb-cta__title-wrap\"><h3 class=\"uagb-cta__title\">Codemotion Online Tech Conference<br>Shaping the Future with Deep Learning<\/h3><\/div><div class=\"uagb-cta-text-wrap\"><p class=\"uagb-cta__desc\">The impact of AI on our lives is already tangible, and is destined to grow exponentially. From Healthcare to Finance, from Marketing to Manufacturing, these and many more fields are experiencing a fast-paced revolution. <br><br>On May 27th, join our free online event to learn from some of the most prominent DL experts in the world how you too can be a part of this change.<\/p><\/div><\/div><div class=\"uagb-cta__link-wrapper uagb-cta__block-link-style\"><div class=\"uagb-cta__button-wrapper\"><a href=\"https:\/\/bit.ly\/3aObhvV\" class=\"uagb-cta__button-link-wrapper uagb-cta__block-link uagb-cta-typeof-button\" target=\"_blank\" rel=\"noopener noreferrer\"><span class=\"uagb-cta__link-content-inner\"><span>JOIN FOR FREE<\/span><\/span><\/a><\/div><\/div><\/div><\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">BERT comes into play<\/h2>\n\n\n\n<p><strong>BERT<\/strong> is an acronym of <em><span id=\"urn:local-annotation-129823\" class=\"textannotation disambiguated\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/bert\">Bidirectional Encoder Representations from Transformers<\/span><\/em>. The term <em>bidirectional<\/em> means that the context of a word is given by both the words that follow it and by the words preceding it. This technique makes this algorithm <strong>hard to train but very effective<\/strong>. Exploring the surrounding text around words is computationally expensive but allows a deeper understanding of words and sentences.<\/p>\n\n\n\n<p>Unidirectional context-oriented algorithm already exist. A neural network can be trained to predict which word will follow a sequence of given words, once trained on a huge dataset of sentences. However, predicting that word from both the previous and following words is not an easy task. The only way to do so effectively is to mask some words in a sentence and predict them too, e.g., the sentence &#8220;<em>the quick brown fox jumps over the lazy dog<\/em>&#8221; might be masked as &#8220;<em>the X brown fox jumps over the Y dog<\/em>&#8221; with label (<em>X <\/em>=<em> quick<\/em>, <em>Y <\/em>= <em>lazy<\/em>) to become a labelled record in a training set of sentences. One can easily derive a training set from a bundle of unsupervised texts by simply masking 15% of words (as BERT does), and training the neural network to deduce the missing words from the remaining ones.<\/p>\n\n\n\n<p>Notice that <strong>BERT is truly a <span id=\"urn:local-annotation-625489\" class=\"textannotation disambiguated\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/deep_learning\">deep learning<\/span> algorithm<\/strong>, while context-free algorithms such as <em>word2vec<\/em>, based on shallow recurrent networks, may not be. However, as such, BERT&#8217;s training is very expensive, due to its transformer aspect. Training on a huge body of text &#8211; for example, all English-language Wikipedia pages &#8211;  is an Herculean effort that requires decidedly nontrivial computational power.<\/p>\n\n\n\n<p>As a result, BERT&#8217;s creators <strong>disentangled the training phase from the tuning phase<\/strong> required to properly apply the algorithm to a specific task. The algorithm has to be trained once overall, and then fine tuned specifically for each context.<\/p>\n\n\n\n<p>Luckily, BERT comes with several pre-trained representations already computed. On the <a aria-label=\"GitHub page (opens in a new tab)\" href=\"https:\/\/github.com\/google-research\/bert\" target=\"_blank\" rel=\"noreferrer noopener nofollow\" class=\"ek-link\">GitHub page<\/a> of the project a number of pre-trained models can be found. However, such representations are not enough to solve a specific problem. The model must be <strong>fine tuned<\/strong> to the desired task. The following demonstrates how this is done, using a test example:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">BERT: let&#8217;s play with it<\/h2>\n\n\n\n<p>For our purely explanatory purposes, we will use <span id=\"urn:local-annotation-379440\" class=\"textannotation disambiguated\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/python_programming_language\">Python<\/span> to play with a standard text dataset, the <em>Deeply Moving<\/em> dataset maintained at Stanford University, which contains short movie reviews from the &#8216;Rotten Tomatoes&#8217; website. The dataset can be downloaded from <a aria-label=\" (opens in a new tab)\" href=\"https:\/\/nlp.stanford.edu\/sentiment\/code.html\" target=\"_blank\" rel=\"noreferrer noopener nofollow\" class=\"ek-link\">this page<\/a>.<\/p>\n\n\n\n<p>We assume that the dataset is stored inside a directory <code>stanfordSentimentTreebank<\/code>. The sentences are stored inside the file <code>datasetSentences.txt<\/code> as pairs (index, sentence) one per line. A splitting training\/testing set can also be found inside the file <code>datasetSplit.txt<\/code> its rows containing pairs (index, ds) being ds = 1 for training and ds = 2 for testing sentences. The file <code>sentiment_labels.txt<\/code> contains the labels attached to each sentence in the dataset and also to each phrase inside it as pairs (index, phrase), with 0 being the most negative, and 1 the most positive, sentiment. The list of all phrases along with their indexes is stored in <code>dictionary.txt<\/code>.<\/p>\n\n\n\n<p>To keep things simple, we&#8217;ll focus only on sentences, so that our datasets are built on filtering scores for sentences and leaving out any remaining phrases. The dataset will also need to be built  from those files &#8211; a simple process that is demonstrated below. BERT will then be applied to perform a test <strong>sentiment analysis<\/strong> on this dataset.<\/p>\n\n\n\n<p>Whatever the task, it is not necessary to pre-train the BERT model, but only to fine-tune a pre-trained model on the specific dataset that relates to the problem we want to use BERT to study. We will try to use such a pre-trained model to perform our simple classification task: more exciting use cases may be found on the  <span id=\"urn:local-annotation-968961\" class=\"textannotation disambiguated\" itemid=\"http:\/\/data.wordlift.io\/wl01770\/entity\/github\">GitHub<\/span> page of the project mentioned above, as well as elsewhere on the Web.<\/p>\n\n\n\n<p>First, we choose the pre-trained model: in the BERT GitHub repository there are several choices available, we will use the one known as &#8216;BERT-tiny&#8217;, aka <code>bert_en_uncased_L-2_H-128_A-2<\/code>.<\/p>\n\n\n\n<p>This pre-trained representation has been obtained by converting training texts into lowercase, with accent markers stripped out. Moreover, the model is set up as a network with 2 layers, 128 hidden, for a total of 4.4K parameters to train. This is the tiniest model, others include the &#8216;BERT-mini&#8217; (4 layers, 256 hidden), &#8216;BERT-small&#8217; (4 layers, 512 hidden), &#8216;BERT-Medium&#8217; (8 layers, 512 hidden) and &#8216;BERT-base&#8217; (12 layers, 768 hidden). Of course, the larger the network architecture, the more computational effort is needed to fine-tune these models. As the purpose of this article is purely explanatory, we&#8217;ll stick to the tiniest option to enable anyone to run the following code snippet, even if no GPUs are available.<\/p>\n\n\n\n<p>The pre-trained model can be downloaded from the repository and extracted into a local folder. This folder will contain the following files:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><code>bert_config.json<\/code><\/li><li><code>bert_model.ckpt.data-00000-of-00001<\/code><\/li><li><code>bert_model.ckpt.index<\/code><\/li><li><code>vocab.txt<\/code><\/li><\/ul>\n\n\n\n<p>The first file contains all the configuration necessary to build a network layer to use this BERT model, while the latter files are needed to properly tokenize our texts. The largest file contains the model, which may be loaded from the BERT library using the methods demonstrated below.<\/p>\n\n\n\n<p>To remain focused on the model, the assumption will be that our code is run inside a directory which also contains those files, and where the directory <code>stanfordSentimentTreebank<\/code> with our dataset is also stored. This is necessary before running the following programs:<\/p>\n\n\n\n<p>Before setting up the model, our dataset is <strong>tokenized<\/strong> according to the format expected by the BERT layers; this can be done via the <code>FullTokenizer<\/code> class from the BERT package. Next, the tokenizer is fed with each sentence in our datsaset. The tokenizer result, which is a list of strings, between &#8220;<code>[CLS]<\/code>&#8221; and &#8220;<code>[SEP]<\/code>&#8221; is enclosed, as required by the BERT algorithm implementation.<\/p>\n\n\n\n<p>The output of our model will be simply a number between 0 and 1.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-1\" data-shcb-language-name=\"PHP\" data-shcb-language-slug=\"php\"><span><code class=\"hljs language-php\">import bert\nimport numpy <span class=\"hljs-keyword\">as<\/span> np\nimport os\n \nBERT_PATH = <span class=\"hljs-string\">\"bert_layer\"<\/span>\nVOCAB_TXT = os.path.join(BERT_PATH, <span class=\"hljs-string\">\"vocab.txt\"<\/span>)\n \nDATASET_DIR = <span class=\"hljs-string\">\"stanfordSentimentTreebank\"<\/span>\nSENTENCES = os.path.join(DATASET_DIR, <span class=\"hljs-string\">\"datasetSentences.txt\"<\/span>)\nSCORES = os.path.join(DATASET_DIR, <span class=\"hljs-string\">\"sentiment_labels.txt\"<\/span>)\nSPLITTING = os.path.join(DATASET_DIR, <span class=\"hljs-string\">\"datasetSplit.txt\"<\/span>)\nDICTIONARY = os.path.join(DATASET_DIR, <span class=\"hljs-string\">\"dictionary.txt\"<\/span>)\n \nMAX_LENGTH = <span class=\"hljs-number\">64<\/span>    <span class=\"hljs-comment\"># Length of word vectors which the model accepts as input<\/span>\n \ntokenizer = bert.bert_tokenization.FullTokenizer(VOCAB_TXT, do_lower_case = <span class=\"hljs-keyword\">True<\/span>)\n \ntraining_set, training_scores = &#91;], &#91;]\ntesting_set, testing_scores = &#91;], &#91;]\n \n<span class=\"hljs-comment\"># For each sentence in the dataset, tokenize it and stores it inside either<\/span>\n<span class=\"hljs-comment\"># the training or the testing set, according to the value in `datasetSplit.txt`<\/span>\n \ndef read_file(filename):\n    with open(filename) <span class=\"hljs-keyword\">as<\/span> f:\n        f.readline()            <span class=\"hljs-comment\"># skips the heading line<\/span>\n        <span class=\"hljs-keyword\">return<\/span> f.readlines()\n \nsentences = read_file(SENTENCES)\nscores = read_file(SCORES)\nsplitting = read_file(SPLITTING)\ndictionary = read_file(DICTIONARY)\n \n<span class=\"hljs-comment\"># let scores&#91;i] = score being i an int index and score a float score.<\/span>\nscores = {int(s&#91;:s.index(<span class=\"hljs-string\">\"|\"<\/span>)]): float(s&#91;s.index(<span class=\"hljs-string\">\"|\"<\/span>)+<span class=\"hljs-number\">1<\/span>:]) <span class=\"hljs-keyword\">for<\/span> s in scores}\n \n<span class=\"hljs-comment\"># let splitting&#91;i] = int denoting the kind of dataset (1=training, 2=testing).<\/span>\nsplitting = {int(s&#91;:s.index(<span class=\"hljs-string\">\",\"<\/span>)]): int(s&#91;s.index(<span class=\"hljs-string\">\",\"<\/span>)+<span class=\"hljs-number\">1<\/span>:]) <span class=\"hljs-keyword\">for<\/span> s in splitting}\n \n<span class=\"hljs-comment\"># let dictionary&#91;s] = phrase index of the corresponding string<\/span>\ndictionary = {s&#91;:s.index(<span class=\"hljs-string\">\"|\"<\/span>)]: int(s&#91;s.index(<span class=\"hljs-string\">\"|\"<\/span>)+<span class=\"hljs-number\">1<\/span>:]) <span class=\"hljs-keyword\">for<\/span> s in dictionary}\n \n<span class=\"hljs-comment\"># Now looks for each sentence inside the dictionary, retrieves the index and looks for<\/span>\n<span class=\"hljs-comment\"># the index in the scores, creating a list of sentences and scores<\/span>\n<span class=\"hljs-keyword\">for<\/span> s in sentences:\n    i = int(s&#91;:s.index(<span class=\"hljs-string\">\"\\t\"<\/span>)])       <span class=\"hljs-comment\"># sentence index, to be matched in splitting<\/span>\n    s = s&#91;s.index(<span class=\"hljs-string\">\"\\t\"<\/span>) + <span class=\"hljs-number\">1<\/span>:]&#91;:<span class=\"hljs-number\">-1<\/span>]   <span class=\"hljs-comment\"># extract the sentence (strip the ending \"\\n\")<\/span>\n    <span class=\"hljs-keyword\">if<\/span> s not in dictionary:\n        <span class=\"hljs-keyword\">continue<\/span>\n    ph_i = dictionary&#91;s]             <span class=\"hljs-comment\"># associated phrase index<\/span>\n    <span class=\"hljs-comment\"># Now tokenizes the sentence and put it into the BERT format<\/span>\n    s = tokenizer.tokenize(s)\n    <span class=\"hljs-keyword\">if<\/span> len(s) &gt; MAX_LENGTH - <span class=\"hljs-number\">2<\/span>:\n        s = s&#91;:MAX_LENGTH - <span class=\"hljs-number\">2<\/span>]\n    s = tokenizer.convert_tokens_to_ids(&#91;<span class=\"hljs-string\">\"&#91;CLS]\"<\/span>] + s + &#91;<span class=\"hljs-string\">\"&#91;SEP]\"<\/span>])\n    <span class=\"hljs-keyword\">if<\/span> len(s) &lt; MAX_LENGTH:\n        s += &#91;<span class=\"hljs-number\">0<\/span>] * (MAX_LENGTH - len(s))\n    <span class=\"hljs-comment\"># Decides in which dataset to store the data<\/span>\n    <span class=\"hljs-keyword\">if<\/span> splitting&#91;i] == <span class=\"hljs-number\">1<\/span>:\n        training_set.append(s)\n        training_scores.append(scores&#91;ph_i])\n    <span class=\"hljs-keyword\">else<\/span>:\n        testing_set.append(s)\n        testing_scores.append(scores&#91;ph_i])\n \ntraining_set, training_scores = np.<span class=\"hljs-keyword\">array<\/span>(training_set), np.<span class=\"hljs-keyword\">array<\/span>(training_scores)\ntesting_set, testing_scores = np.<span class=\"hljs-keyword\">array<\/span>(testing_set), np.<span class=\"hljs-keyword\">array<\/span>(testing_scores)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-1\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">PHP<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">php<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Let&#8217;s use the BERT model that we downloaded from the GitHub repository. As usual in these kinds of models, fine tuning requires setting some hyper-parameters, i.e., parameters external to the model, such as the learning rate, the batch size, the number of epochs. Finding the right combination is the nightmare of every ML practitioner, but in BERT&#8217;s case, we have some suggestions from its inventors:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Batch size: 8, 16, 32, 64, 128 (in general, the larger the BERT model, the smaller the size)<\/li><li>Learning rate: 3e-4, 1e-4, 5e-5, 3e-5<\/li><li>Epochs: 4<\/li><\/ul>\n\n\n\n<p>As usual, we opt for the Adam optimizer, even if it is more expensive computationally. Nothing special is added to the BERT network layer provided by Google, but two dimensions of tensors representing the BERT output are pooled into one via the <code>GlobalAveragePooling1D<\/code> method &#8211; another trick that emerged from the Google research of the 2010s. Next, the BERT output is provided to a fully connected layer, the result of which is turned into the output of the network. The summary method of the Model Keras class allows us to show the shapes of the layers, and the verbose option in the training method is turned on to see the performances along epochs.<\/p>\n\n\n\n<p>Of course, this is a very simple architecture, that may be further complicated at will to fit more complex purposes, for example. But, if time and\/or resources permit, it is better to start with a larger pre-trained BERT model.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-2\" data-shcb-language-name=\"PHP\" data-shcb-language-slug=\"php\"><span><code class=\"hljs language-php\">from tensorflow.keras.models import Model\nfrom tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling1D, Input, Lambda\nfrom tensorflow.keras.optimizers import Adam\n \n<span class=\"hljs-comment\"># Loads the bert pre-trained layer to plug into our network<\/span>\nbert_params = bert.params_from_pretrained_ckpt(BERT_PATH)\nbert_layer = bert.BertModelLayer.from_params(bert_params, name = <span class=\"hljs-string\">\"BERT\"<\/span>)\nbert_layer.apply_adapter_freeze()\n \n<span class=\"hljs-comment\"># We arrange our layers by composing them as functions, with the input layer as inmost one<\/span>\ninput_layer = Input(shape=(MAX_LENGTH,), dtype = <span class=\"hljs-string\">'int32'<\/span>, name = <span class=\"hljs-string\">'input_ids'<\/span>)\noutput_layer = bert_layer(input_layer)\n<span class=\"hljs-comment\">#output_layer = Dense(128, activation = \"tanh\")(output_layer)<\/span>\n<span class=\"hljs-comment\">#output_layer = Dropout(0.5)(output_layer)<\/span>\n<span class=\"hljs-comment\">#output_layer = Lambda(lambda x: x&#91;:, :, 0])(output_layer)   # we drop the second dimension<\/span>\n<span class=\"hljs-comment\">#output_layer = Dropout(0.5)(output_layer)<\/span>\n<span class=\"hljs-comment\">#output_layer = Dense(1)(output_layer)<\/span>\n<span class=\"hljs-comment\">#output_layer = Dropout(0.5)(output_layer)<\/span>\n \noutput_layer = GlobalAveragePooling1D()(output_layer)\noutput_layer = Dense(<span class=\"hljs-number\">128<\/span>, activation = <span class=\"hljs-string\">\"relu\"<\/span>)(output_layer)\noutput_layer = Dense(<span class=\"hljs-number\">1<\/span>, activation = <span class=\"hljs-string\">\"relu\"<\/span>)(output_layer)\n \nneural_network = Model(inputs = input_layer, outputs = output_layer)\nneural_network.build(input_shape = (None, MAX_LENGTH))\nneural_network.compile(loss = <span class=\"hljs-string\">\"mse\"<\/span>, optimizer = Adam(learning_rate = <span class=\"hljs-number\">3e-5<\/span>))\n \nneural_network.summary()\n \nneural_network.fit(\n    training_set,\n    training_scores,\n    batch_size= <span class=\"hljs-number\">128<\/span>,\n    shuffle = <span class=\"hljs-keyword\">True<\/span>,\n    epochs = <span class=\"hljs-number\">4<\/span>,\n    validation_data = (testing_set, testing_scores),\n    verbose = <span class=\"hljs-number\">1<\/span>\n)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-2\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">PHP<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">php<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>And here is the output:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-3\" data-shcb-language-name=\"PHP\" data-shcb-language-slug=\"php\"><span><code class=\"hljs language-php\">Model: <span class=\"hljs-string\">\"model_1\"<\/span>\n_________________________________________________________________\nLayer (type)                 Output Shape              Param <span class=\"hljs-comment\">#   <\/span>\n=================================================================\ninput_ids (InputLayer)       &#91;(None, <span class=\"hljs-number\">64<\/span>)]              <span class=\"hljs-number\">0<\/span>         \n_________________________________________________________________\nBERT (BertModelLayer)        (None, <span class=\"hljs-number\">64<\/span>, <span class=\"hljs-number\">128<\/span>)           <span class=\"hljs-number\">4369152<\/span>   \n_________________________________________________________________\nglobal_average_pooling1d_2 ( (None, <span class=\"hljs-number\">128<\/span>)               <span class=\"hljs-number\">0<\/span>         \n_________________________________________________________________\ndense_4 (Dense)              (None, <span class=\"hljs-number\">128<\/span>)               <span class=\"hljs-number\">16512<\/span>     \n_________________________________________________________________\ndense_5 (Dense)              (None, <span class=\"hljs-number\">1<\/span>)                 <span class=\"hljs-number\">129<\/span>       \n=================================================================\nTotal params: <span class=\"hljs-number\">4<\/span>,<span class=\"hljs-number\">385<\/span>,<span class=\"hljs-number\">793<\/span>\nTrainable params: <span class=\"hljs-number\">4<\/span>,<span class=\"hljs-number\">385<\/span>,<span class=\"hljs-number\">793<\/span>\nNon-trainable params: <span class=\"hljs-number\">0<\/span>\n_________________________________________________________________\nTrain on <span class=\"hljs-number\">8117<\/span> samples, validate on <span class=\"hljs-number\">3169<\/span> samples\nEpoch <span class=\"hljs-number\">1<\/span>\/<span class=\"hljs-number\">4<\/span>\n<span class=\"hljs-number\">8117<\/span>\/<span class=\"hljs-number\">8117<\/span> &#91;==============================] - <span class=\"hljs-number\">100<\/span>s <span class=\"hljs-number\">12<\/span>ms\/sample - loss: <span class=\"hljs-number\">0.0766<\/span> - val_loss: <span class=\"hljs-number\">0.0661<\/span>\nEpoch <span class=\"hljs-number\">2<\/span>\/<span class=\"hljs-number\">4<\/span>\n<span class=\"hljs-number\">8117<\/span>\/<span class=\"hljs-number\">8117<\/span> &#91;==============================] - <span class=\"hljs-number\">99<\/span>s <span class=\"hljs-number\">12<\/span>ms\/sample - loss: <span class=\"hljs-number\">0.0629<\/span> - val_loss: <span class=\"hljs-number\">0.0635<\/span>\nEpoch <span class=\"hljs-number\">3<\/span>\/<span class=\"hljs-number\">4<\/span>\n<span class=\"hljs-number\">8117<\/span>\/<span class=\"hljs-number\">8117<\/span> &#91;==============================] - <span class=\"hljs-number\">101<\/span>s <span class=\"hljs-number\">12<\/span>ms\/sample - loss: <span class=\"hljs-number\">0.0592<\/span> - val_loss: <span class=\"hljs-number\">0.0594<\/span>\nEpoch <span class=\"hljs-number\">4<\/span>\/<span class=\"hljs-number\">4<\/span>\n<span class=\"hljs-number\">8117<\/span>\/<span class=\"hljs-number\">8117<\/span> &#91;==============================] - <span class=\"hljs-number\">94<\/span>s <span class=\"hljs-number\">12<\/span>ms\/sample - loss: <span class=\"hljs-number\">0.0550<\/span> - val_loss: <span class=\"hljs-number\">0.0571<\/span>\n&lt;tensorflow.python.keras.callbacks.History at <span class=\"hljs-number\">0x1ba0ef8cf48<\/span>&gt;<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-3\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">PHP<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">php<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Now we have a simple pre-trained BERT model fine-tuned and trained on our dataset. Let\u2019s use it to check some sentences about imaginary movies: the network will order them from the most negative to the most positive.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-4\" data-shcb-language-name=\"PHP\" data-shcb-language-slug=\"php\"><span><code class=\"hljs language-php\">some_sentences = &#91;\n    <span class=\"hljs-string\">\"The film is not bad but the actors should take acting lessons\"<\/span>,\n    <span class=\"hljs-string\">\"Another chiefwork by a master of western movies\"<\/span>,\n    <span class=\"hljs-string\">\"This film is just disappointing: do not waste time on it\"<\/span>,\n    <span class=\"hljs-string\">\"Well directed but poorly acted\"<\/span>,\n    <span class=\"hljs-string\">\"The movie is well directed and greatly acted\"<\/span>,\n    <span class=\"hljs-string\">\"A honest zombie movie with actually no new ideas\"<\/span>,\n]\n<span class=\"hljs-keyword\">print<\/span>(<span class=\"hljs-string\">\"The following sentences will be sorted from the most negative upward\"<\/span>)\n<span class=\"hljs-keyword\">for<\/span> s in some_sentences:\n    <span class=\"hljs-keyword\">print<\/span>(<span class=\"hljs-string\">\"\\t\"<\/span>, s)\nranked = &#91;]   <span class=\"hljs-comment\"># pairs (rank, sentence)<\/span>\n<span class=\"hljs-keyword\">for<\/span> s in some_sentences:\n    t = tokenizer.tokenize(s)\n    <span class=\"hljs-keyword\">if<\/span> len(t) &gt; MAX_LENGTH - <span class=\"hljs-number\">2<\/span>:\n        t = t&#91;:MAX_LENGTH - <span class=\"hljs-number\">2<\/span>]\n    t = tokenizer.convert_tokens_to_ids(&#91;<span class=\"hljs-string\">\"&#91;CLS]\"<\/span>] + t + &#91;<span class=\"hljs-string\">\"&#91;SEP]\"<\/span>])\n    <span class=\"hljs-keyword\">if<\/span> len(t) &lt; MAX_LENGTH:\n        t += &#91;<span class=\"hljs-number\">0<\/span>] * (MAX_LENGTH - len(t))\n    p = neural_network.predict(np.<span class=\"hljs-keyword\">array<\/span>(&#91;t]))&#91;<span class=\"hljs-number\">0<\/span>]&#91;<span class=\"hljs-number\">0<\/span>]\n    ranked.append((p, s))\n<span class=\"hljs-keyword\">print<\/span>(<span class=\"hljs-string\">\"Network ranking from negative to positive\"<\/span>)\n<span class=\"hljs-keyword\">for<\/span> r in sorted(ranked):\n    <span class=\"hljs-keyword\">print<\/span>(<span class=\"hljs-string\">\"\\t\"<\/span>, r&#91;<span class=\"hljs-number\">1<\/span>])<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-4\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">PHP<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">php<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Here is the output:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-5\" data-shcb-language-name=\"JavaScript\" data-shcb-language-slug=\"javascript\"><span><code class=\"hljs language-javascript\">The following sentences will be sorted <span class=\"hljs-keyword\">from<\/span> the most negative upward\n     The film is not bad but the actors should take acting lessons\n     Another chiefwork by a master <span class=\"hljs-keyword\">of<\/span> western movies\n     This film is just disappointing: <span class=\"hljs-keyword\">do<\/span> not waste time on it\n     Well directed but poorly acted\n     The movie is well directed and greatly acted\n     A honest zombie movie <span class=\"hljs-keyword\">with<\/span> actually no <span class=\"hljs-keyword\">new<\/span> ideas\nNetwork ranking <span class=\"hljs-keyword\">from<\/span> negative to positive\n     This film is just disappointing: <span class=\"hljs-keyword\">do<\/span> not waste time on it\n     A honest zombie movie <span class=\"hljs-keyword\">with<\/span> actually no <span class=\"hljs-keyword\">new<\/span> ideas\n     The film is not bad but the actors should take acting lessons\n     Another chiefwork by a master <span class=\"hljs-keyword\">of<\/span> western movies\n     Well directed but poorly acted\n     The movie is well directed and greatly acted<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-5\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">JavaScript<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">javascript<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Not bad for the tiniest model with the minimum Keras wrapping&#8230;<\/p>\n\n\n\n<p>Of course this test exercise may be improved on in several respects.  A categorical multiclass classification could substitute for our numerical guessing, e.g. by mapping the positivity probability <em>p<\/em> resulting in the dataset using the following cut-offs:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>very negative if 0 \u2264 <em>x <\/em>\u2264 0.2<\/li><li>negative if 0.2 &lt; <em>x <\/em>\u2264 0.4<\/li><li>neutral if 0.4 &lt; <em>x <\/em>\u2264 0.6<\/li><li>positive if 0.6 &lt; <em>x <\/em>\u2264 0.8<\/li><li>very positive if 0.8 &lt; <em>x <\/em>\u2264 1<\/li><\/ul>\n\n\n\n<p>In this case, the output layer should be a softmax, and the compilation of the net should use categorical cross entropy, etc. One would also want to include more layers and dropouts, and so on.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusions<\/h2>\n\n\n\n<p>Despite its simplicity of use, BERT models outperform previous NLP tools in several respects: they may be used not only to classify but also to predict, to translate, to summarize, and to improve the automatic understanding of natural languages. The availability of multiple pre-trained models, even in languages other than English, keeps improving, and, while fine tuning tasks are much harder than one might image after reading this description, I hope that the importance of this class of algorithms for practical and theoretical purposes in machine learning has been sufficiently highlighted by the previous discussion.<\/p>\n\n\n","protected":false},"excerpt":{"rendered":"<p>Introduction One very broad and highly active field of research in AI (artificial intelligence) is NLP: Natural Language Processing. Scientists have been trying to teach machines how to understand and even write natural languages (such as English or Chinese) since the very beginning of computer science and artificial intelligence. One of the founding fathers of&#8230; <a class=\"more-link\" href=\"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/\">Read more<\/a><\/p>\n","protected":false},"author":5,"featured_media":4487,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_editorskit_title_hidden":false,"_editorskit_reading_time":13,"_editorskit_is_block_options_detached":false,"_editorskit_block_options_position":"{}","_uag_custom_page_level_css":"","_genesis_hide_title":false,"_genesis_hide_breadcrumbs":false,"_genesis_hide_singular_image":false,"_genesis_hide_footer_widgets":false,"_genesis_custom_body_class":"","_genesis_custom_post_class":"","_genesis_layout":"","footnotes":""},"categories":[46],"tags":[70,68],"collections":[],"class_list":{"0":"post-4484","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-ai-ml","8":"tag-google","9":"tag-python","10":"entry"},"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.9 (Yoast SEO v27.5) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>BERT: how Google changed NLP - Codemotion Magazine<\/title>\n<meta name=\"description\" content=\"A brief overview of the history behind NLP, arriving at today&#039;s state-of-the-art algorithm BERT, and learning how to use it in Python.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"BERT: how Google changed NLP (and how to benefit from this)\" \/>\n<meta property=\"og:description\" content=\"A brief overview of the history behind NLP, arriving at today&#039;s state-of-the-art algorithm BERT, and learning how to use it in Python.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/\" \/>\n<meta property=\"og:site_name\" content=\"Codemotion Magazine\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Codemotion.Italy\/\" \/>\n<meta property=\"article:published_time\" content=\"2020-05-04T10:23:16+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-01-05T19:04:57+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1226\" \/>\n\t<meta property=\"og:image:height\" content=\"675\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Paolo Caressa\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@www_caressa_it\" \/>\n<meta name=\"twitter:site\" content=\"@CodemotionIT\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Paolo Caressa\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"19 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/\"},\"author\":{\"name\":\"Paolo Caressa\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#\\\/schema\\\/person\\\/11b502309bc50a6923aafd79c6259f85\"},\"headline\":\"BERT: how Google changed NLP (and how to benefit from this)\",\"datePublished\":\"2020-05-04T10:23:16+00:00\",\"dateModified\":\"2022-01-05T19:04:57+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/\"},\"wordCount\":2868,\"publisher\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2020\\\/05\\\/bert-google.png\",\"keywords\":[\"Google\",\"Python\"],\"articleSection\":[\"AI\\\/ML\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/\",\"name\":\"BERT: how Google changed NLP - Codemotion Magazine\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2020\\\/05\\\/bert-google.png\",\"datePublished\":\"2020-05-04T10:23:16+00:00\",\"dateModified\":\"2022-01-05T19:04:57+00:00\",\"description\":\"A brief overview of the history behind NLP, arriving at today's state-of-the-art algorithm BERT, and learning how to use it in Python.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2020\\\/05\\\/bert-google.png\",\"contentUrl\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2020\\\/05\\\/bert-google.png\",\"width\":1226,\"height\":675,\"caption\":\"bert google\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"AI\\\/ML\",\"item\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Machine Learning\",\"item\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/\"},{\"@type\":\"ListItem\",\"position\":4,\"name\":\"BERT: how Google changed NLP (and how to benefit from this)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#website\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/\",\"name\":\"Codemotion Magazine\",\"description\":\"We code the future. Together\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#organization\",\"name\":\"Codemotion\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2019\\\/11\\\/codemotionlogo.png\",\"contentUrl\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2019\\\/11\\\/codemotionlogo.png\",\"width\":225,\"height\":225,\"caption\":\"Codemotion\"},\"image\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/Codemotion.Italy\\\/\",\"https:\\\/\\\/x.com\\\/CodemotionIT\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#\\\/schema\\\/person\\\/11b502309bc50a6923aafd79c6259f85\",\"name\":\"Paolo Caressa\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/b55795d60b54b1cc8b605f9967dbbe68b3fcf826249490cecafd797ee4f18d4c?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/b55795d60b54b1cc8b605f9967dbbe68b3fcf826249490cecafd797ee4f18d4c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/b55795d60b54b1cc8b605f9967dbbe68b3fcf826249490cecafd797ee4f18d4c?s=96&d=mm&r=g\",\"caption\":\"Paolo Caressa\"},\"description\":\"I spent the first part of my life enjoying studies up to a math BS+MS and PhD. Next I worked both as math researcher (differential geometry) and as IT consultant (R&amp;D, feasibility studies, business analysis). Eventually I left academia and worked some years in finance (maths &amp; implementation of derivative pricings and risk management models), then again in IT (as technical consultant and writer, project manager, program manager). In the meanwhile I write books and articles on maths and computer science and I give lectures on workshops and conferences (applied maths, AI, etc.). I also serve as adjunct professor in the Engineering Department of \\\"Sapienza\\\" University of Rome (calculus and CS classes).\",\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/in\\\/paolocaressa\\\/\",\"https:\\\/\\\/x.com\\\/www_caressa_it\"],\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/author\\\/paolo-caressa\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"BERT: how Google changed NLP - Codemotion Magazine","description":"A brief overview of the history behind NLP, arriving at today's state-of-the-art algorithm BERT, and learning how to use it in Python.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/","og_locale":"en_US","og_type":"article","og_title":"BERT: how Google changed NLP (and how to benefit from this)","og_description":"A brief overview of the history behind NLP, arriving at today's state-of-the-art algorithm BERT, and learning how to use it in Python.","og_url":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/","og_site_name":"Codemotion Magazine","article_publisher":"https:\/\/www.facebook.com\/Codemotion.Italy\/","article_published_time":"2020-05-04T10:23:16+00:00","article_modified_time":"2022-01-05T19:04:57+00:00","og_image":[{"width":1226,"height":675,"url":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google.png","type":"image\/png"}],"author":"Paolo Caressa","twitter_card":"summary_large_image","twitter_creator":"@www_caressa_it","twitter_site":"@CodemotionIT","twitter_misc":{"Written by":"Paolo Caressa","Est. reading time":"19 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/#article","isPartOf":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/"},"author":{"name":"Paolo Caressa","@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/person\/11b502309bc50a6923aafd79c6259f85"},"headline":"BERT: how Google changed NLP (and how to benefit from this)","datePublished":"2020-05-04T10:23:16+00:00","dateModified":"2022-01-05T19:04:57+00:00","mainEntityOfPage":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/"},"wordCount":2868,"publisher":{"@id":"https:\/\/www.codemotion.com\/magazine\/#organization"},"image":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/#primaryimage"},"thumbnailUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google.png","keywords":["Google","Python"],"articleSection":["AI\/ML"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/","url":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/","name":"BERT: how Google changed NLP - Codemotion Magazine","isPartOf":{"@id":"https:\/\/www.codemotion.com\/magazine\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/#primaryimage"},"image":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/#primaryimage"},"thumbnailUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google.png","datePublished":"2020-05-04T10:23:16+00:00","dateModified":"2022-01-05T19:04:57+00:00","description":"A brief overview of the history behind NLP, arriving at today's state-of-the-art algorithm BERT, and learning how to use it in Python.","breadcrumb":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/#primaryimage","url":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google.png","contentUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google.png","width":1226,"height":675,"caption":"bert google"},{"@type":"BreadcrumbList","@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/bert-how-google-changed-nlp-and-how-to-benefit-from-this\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.codemotion.com\/magazine\/"},{"@type":"ListItem","position":2,"name":"AI\/ML","item":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/"},{"@type":"ListItem","position":3,"name":"Machine Learning","item":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/"},{"@type":"ListItem","position":4,"name":"BERT: how Google changed NLP (and how to benefit from this)"}]},{"@type":"WebSite","@id":"https:\/\/www.codemotion.com\/magazine\/#website","url":"https:\/\/www.codemotion.com\/magazine\/","name":"Codemotion Magazine","description":"We code the future. Together","publisher":{"@id":"https:\/\/www.codemotion.com\/magazine\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.codemotion.com\/magazine\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.codemotion.com\/magazine\/#organization","name":"Codemotion","url":"https:\/\/www.codemotion.com\/magazine\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/logo\/image\/","url":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2019\/11\/codemotionlogo.png","contentUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2019\/11\/codemotionlogo.png","width":225,"height":225,"caption":"Codemotion"},"image":{"@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Codemotion.Italy\/","https:\/\/x.com\/CodemotionIT"]},{"@type":"Person","@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/person\/11b502309bc50a6923aafd79c6259f85","name":"Paolo Caressa","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/b55795d60b54b1cc8b605f9967dbbe68b3fcf826249490cecafd797ee4f18d4c?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/b55795d60b54b1cc8b605f9967dbbe68b3fcf826249490cecafd797ee4f18d4c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/b55795d60b54b1cc8b605f9967dbbe68b3fcf826249490cecafd797ee4f18d4c?s=96&d=mm&r=g","caption":"Paolo Caressa"},"description":"I spent the first part of my life enjoying studies up to a math BS+MS and PhD. Next I worked both as math researcher (differential geometry) and as IT consultant (R&amp;D, feasibility studies, business analysis). Eventually I left academia and worked some years in finance (maths &amp; implementation of derivative pricings and risk management models), then again in IT (as technical consultant and writer, project manager, program manager). In the meanwhile I write books and articles on maths and computer science and I give lectures on workshops and conferences (applied maths, AI, etc.). I also serve as adjunct professor in the Engineering Department of \"Sapienza\" University of Rome (calculus and CS classes).","sameAs":["https:\/\/www.linkedin.com\/in\/paolocaressa\/","https:\/\/x.com\/www_caressa_it"],"url":"https:\/\/www.codemotion.com\/magazine\/author\/paolo-caressa\/"}]}},"featured_image_src":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-600x400.png","featured_image_src_square":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-600x600.png","author_info":{"display_name":"Paolo Caressa","author_link":"https:\/\/www.codemotion.com\/magazine\/author\/paolo-caressa\/"},"uagb_featured_image_src":{"full":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google.png",1226,675,false],"thumbnail":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-150x150.png",150,150,true],"medium":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-300x165.png",300,165,true],"medium_large":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-768x423.png",768,423,true],"large":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-1024x564.png",1024,564,true],"1536x1536":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google.png",1226,675,false],"2048x2048":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google.png",1226,675,false],"small-home-featured":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google.png",100,55,false],"sidebar-featured":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-180x128.png",180,128,true],"genesis-singular-images":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-896x504.png",896,504,true],"archive-featured":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-400x225.png",400,225,true],"gb-block-post-grid-landscape":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-600x400.png",600,400,true],"gb-block-post-grid-square":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/05\/bert-google-600x600.png",600,600,true]},"uagb_author_info":{"display_name":"Paolo Caressa","author_link":"https:\/\/www.codemotion.com\/magazine\/author\/paolo-caressa\/"},"uagb_comment_info":0,"uagb_excerpt":"Introduction One very broad and highly active field of research in AI (artificial intelligence) is NLP: Natural Language Processing. Scientists have been trying to teach machines how to understand and even write natural languages (such as English or Chinese) since the very beginning of computer science and artificial intelligence. One of the founding fathers of&#8230;&hellip;","lang":"en","_links":{"self":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts\/4484","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/comments?post=4484"}],"version-history":[{"count":10,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts\/4484\/revisions"}],"predecessor-version":[{"id":8208,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts\/4484\/revisions\/8208"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/media\/4487"}],"wp:attachment":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/media?parent=4484"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/categories?post=4484"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/tags?post=4484"},{"taxonomy":"collections","embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/collections?post=4484"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}