{"id":19097,"date":"2022-10-21T09:09:48","date_gmt":"2022-10-21T07:09:48","guid":{"rendered":"https:\/\/www.codemotion.com\/magazine\/?p=19097"},"modified":"2023-06-19T12:30:43","modified_gmt":"2023-06-19T10:30:43","slug":"fast-document-similarity-in-python-minhashlsh","status":"publish","type":"post","link":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/","title":{"rendered":"Fast Document Similarity in Python (MinHashLSH)"},"content":{"rendered":"\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>Why is document similarity more important than ever? In the big data era, it is always more frequent that companies need to detect similar items in their database. Imagine platforms like <a href=\"https:\/\/www.kijiji.it\/\">Kijiji<\/a> or <a href=\"https:\/\/www.subito.it\/\">Subito<\/a>, trying to detect people that constantly duplicate announcements on the platform to boost their visibility without paying for sponsorship offered by the platform.<\/p>\n\n\n\n<p>Assume for example the platform has 1000000 announcements and it wants to detect duplicates. If we try to check one by one all the pairs we need to check 499999500000 (half a trillion) pairs. If it takes a microsecond (0.000001 s) to check a pair, it would take 499999 seconds, which is approximately six days of computation just to check document similarity.<\/p>\n\n\n\n<p>In this article, we are going to explore data mining techniques to speed up the whole process of finding and checking document similarity. The whole algorithm is inspired by <a href=\"http:\/\/www.mmds.org\/#ver21\">chapter 3 of the book &#8220;Mining of Massive Datasets&#8221;<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-dataset\">Dataset<\/h2>\n\n\n\n<p>In this document similarity guide, we will use some apartment announcements scraped from <a href=\"https:\/\/www.kijiji.it\/\">Kijiji<\/a> a few years ago. You can also download the whole complete example <a href=\"https:\/\/github.com\/nicoDs96\/Document-Similarity-using-Python-and-PySpark\/blob\/master\/LSH\/dataset_rent_rome_kijiji.tsv\" class=\"ek-link\">here<\/a>.<br>The file is a tsv file with header:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Title<\/th><th>Short Description<\/th><th>Location<\/th><th>Price (Euro)<\/th><th>Timestamp<\/th><th>Url Adv<\/th><\/tr><\/thead><tbody><tr><td>Studio accessoria\u2026<\/td><td>Affitto studio a \u2026<\/td><td>Roma<\/td><td>450<\/td><td>12 ottobre, 11:32<\/td><td>https:\/\/www.kijij\u2026<\/td><\/tr><tr><td>Negozio 169Mq per\u2026<\/td><td>Privato affitta n\u2026<\/td><td>Prenestino \/ Casi\u2026<\/td><td>1.700<\/td><td>12 ottobre, 08:45<\/td><td>https:\/\/www.kijij\u2026<\/td><\/tr><tr><td>Negozio in tiburt\u2026<\/td><td>Negozio c\/1 roma \u2026<\/td><td>Tiburtino \/ Colla\u2026<\/td><td>6.000<\/td><td>17 October, 21:20<\/td><td>https:\/\/www.kijij\u2026<\/td><\/tr><tr><td>Studio medico via\u2026<\/td><td>Studio medico avv\u2026<\/td><td>Trieste \/ Somalia\u2026<\/td><td>200<\/td><td>17 October, 20:22<\/td><td>https:\/\/www.kijij\u2026<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>To read the <a href=\"https:\/\/github.com\/nicoDs96\/Document-Similarity-using-Python-and-PySpark\/blob\/master\/LSH\/dataset_rent_rome_kijiji.tsv\">dataset<\/a> we use the <a href=\"https:\/\/pandas.pydata.org\/\">pandas<\/a> library:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-1\" data-shcb-language-name=\"JavaScript\" data-shcb-language-slug=\"javascript\"><span><code class=\"hljs language-javascript\"><span class=\"hljs-keyword\">import<\/span> pandas <span class=\"hljs-keyword\">as<\/span> pd\ndataset=pd.read_csv(<span class=\"hljs-string\">\"dataset_rent_rome_kijiji.tsv\"<\/span>, sep=<span class=\"hljs-string\">\"\\t\"<\/span>)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-1\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">JavaScript<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">javascript<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"h-fast-document-similarity-step-by-step\">Fast document similarity: step by step<\/h2>\n\n\n\n<p>In the following section we are going to see:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>how to transform a document into a set using k-shingling<\/li>\n\n\n\n<li>how to represent a set in a compressed way computing its signature in such a way that set similarity is preserved, using MinHashing.<\/li>\n\n\n\n<li>how to implement a hashing function that hashes similar items in the same bucket, using LSH (locality-sensitive hashing).<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-shingling-of-document\">Shingling of Document<\/h2>\n\n\n\n<p>In literature, representing documents as sets is considered an effective way to identify lexically similar documents. The <a href=\"https:\/\/en.wikipedia.org\/wiki\/W-shingling\">k-shingles<\/a> method represents a document as a set of the substrings of length k.<br>For example, if your document is &#8216;I love pizza Margherita, a 6-shingle representation of the document based on characters, including spaces, can be <code>{'I love', ' love ', 'love p', 'ove pi', ...}<\/code>. According to the use case, you can compose shingles of words instead of characters or eliminate whitespaces and others.<br>In <a href=\"https:\/\/www.codemotion.com\/magazine\/ai-ml\/big-data\/logging-in-python-a-broad-gentle-introduction\/\" target=\"_blank\" aria-label=\"Python (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">Python<\/a>, we can implement a class that, given the parameter k and a document read into a string, returns the k-shingle set of a document:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-2\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\"><span class=\"hljs-class\"><span class=\"hljs-keyword\">class<\/span> <span class=\"hljs-title\">shingler<\/span>:<\/span>\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">__init__<\/span><span class=\"hljs-params\">(self, k)<\/span>:<\/span>\n        <span class=\"hljs-keyword\">if<\/span> k &gt; <span class=\"hljs-number\">0<\/span>:\n            self.k = int(k)\n        <span class=\"hljs-keyword\">else<\/span>:\n            self.k = <span class=\"hljs-number\">10<\/span>   \n    <span class=\"hljs-comment\">#inner class utility<\/span>\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">process_doc<\/span><span class=\"hljs-params\">(self, document)<\/span>:<\/span>\n        <span class=\"hljs-keyword\">return<\/span> re.sub(<span class=\"hljs-string\">\"( )+|(\\n)+\"<\/span>,<span class=\"hljs-string\">\" \"<\/span>,document).lower()\n\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">get_shingles<\/span><span class=\"hljs-params\">(self, document)<\/span>:<\/span>\n        shingles = set()\n        document= self.process_doc(document)\n        <span class=\"hljs-keyword\">for<\/span> i <span class=\"hljs-keyword\">in<\/span> range(<span class=\"hljs-number\">0<\/span>, len(document)-self.k+<span class=\"hljs-number\">1<\/span> ):\n            shingles.add(document&#91;i:i+self.k])\n        <span class=\"hljs-keyword\">return<\/span> shingles<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-2\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>And use it as follows:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-3\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>doc = <span class=\"hljs-string\">\"I love pizza Margherita!                     xd 1111@\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\"<\/span>\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shingler_instance = shingler(<span class=\"hljs-number\">10<\/span>)\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shingles_set = shingler_instance.get_shingles(doc)\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shingles_set\n{<span class=\"hljs-string\">' xd 1111@ '<\/span>, <span class=\"hljs-string\">'herita! xd'<\/span>, <span class=\"hljs-string\">'izza margh'<\/span>, <span class=\"hljs-string\">'e pizza ma'<\/span>, <span class=\"hljs-string\">'rgherita! '<\/span>, <span class=\"hljs-string\">'ta! xd 111'<\/span>, <span class=\"hljs-string\">'ita! xd 11'<\/span>, <span class=\"hljs-string\">' pizza mar'<\/span>, <span class=\"hljs-string\">'argherita!'<\/span>, <span class=\"hljs-string\">'rita! xd 1'<\/span>, <span class=\"hljs-string\">'zza marghe'<\/span>, <span class=\"hljs-string\">'a! xd 1111'<\/span>, <span class=\"hljs-string\">'a margheri'<\/span>, <span class=\"hljs-string\">' love pizz'<\/span>, <span class=\"hljs-string\">'erita! xd '<\/span>, <span class=\"hljs-string\">'love pizza'<\/span>, <span class=\"hljs-string\">'za margher'<\/span>, <span class=\"hljs-string\">'margherita'<\/span>, <span class=\"hljs-string\">'gherita! x'<\/span>, <span class=\"hljs-string\">' margherit'<\/span>, <span class=\"hljs-string\">'i love piz'<\/span>, <span class=\"hljs-string\">'pizza marg'<\/span>, <span class=\"hljs-string\">'ve pizza m'<\/span>, <span class=\"hljs-string\">'ove pizza '<\/span>, <span class=\"hljs-string\">'! xd 1111@'<\/span>}<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-3\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Now we can check if two documents are similar using the Jaccard Similarity, a popular set similarity indicator:<br>$$ J(s1, s2) = \\frac{|s1 \\cap s2|}{|s1 \\cup s2|} $$<\/p>\n\n\n\n<p>In general, we might want to hash shingles to reduce their size, but even by hashing, they can take in a mean 4 times the size of the original document. So &#8220;what&#8217;s the point?&#8221; you might think. Let me introduce you to MinHashing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-minhash-the-signature-of-a-set\">Minhash &#8211; The signature of a set<\/h2>\n\n\n\n<p>The minash of a set can be seen as its signature, which is a unique and short representation of the set with a fixed length. The magic of MinHashing for a set is that it preserves Jaccard similarity (more or less).<\/p>\n\n\n\n<p>We can represent a set with its characteristic matrix: a matrix whose columns are sets and rows are elements. The matrix contains a 1 in all the cells that correspond to an element contained in a set.<br>For example:<br>S1 = {a, b}, S2 = {a, c} S3 = {a,b,c,d}<br>| |S1|S2|S3|<br>|-| -| -| -|<br>|a| 1| 1| 1|<br>|b| 1| 0| 1|<br>|c| 0| 1| 1|<br>|d| 0| 0| 1|<\/p>\n\n\n\n<p>Ideally, the minhash function takes a random permutation of the rows and for each set return the first element with a 1 in the characteristic matrix with permuted rows:<br>if the permutation is <code>badc<\/code>, the characteristic matrix is<br>| |S1|S2|S3|<br>|-| -| -| -|<br>|b| 1| 0| 1|<br>|a| 1| 1| 1|<br>|d| 0| 0| 1|<br>|c| 0| 1| 1|<\/p>\n\n\n\n<p>and the minhash values are<br>Minhash(S1) = b<br>Minhash(S2) = a<br>Minhash(S3) = b<\/p>\n\n\n\n<p>We obtain a signature of size n for the set if we compute minhash for n random permutations of the rows of the characteristic matrix.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-minhash-in-practice\">Minhash in practice<\/h3>\n\n\n\n<p>The permutation approach is not feasible in practice, hence once again we use hash functions to approximate. We define a family of hash functions as<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-4\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\"><span class=\"hljs-keyword\">import<\/span> hashlib\n<span class=\"hljs-comment\"># the family of hash functions, in this case, is the same function (sha1) applied with a different salt.<\/span>\n<span class=\"hljs-class\"><span class=\"hljs-keyword\">class<\/span> <span class=\"hljs-title\">hashFamily<\/span>:<\/span>\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">__init__<\/span><span class=\"hljs-params\">(self, i)<\/span>:<\/span>\n        self.resultSize = <span class=\"hljs-number\">8<\/span> <span class=\"hljs-comment\"># how many bytes we want back<\/span>\n        self.maxLen = <span class=\"hljs-number\">20<\/span> <span class=\"hljs-comment\"># how long can our salt be (in decimal)<\/span>\n        self.salt = str(i).zfill(self.maxLen)&#91;-self.maxLen:]\n\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">get_hash_value<\/span><span class=\"hljs-params\">(self, el_to_hash)<\/span>:<\/span>\n        <span class=\"hljs-keyword\">return<\/span> int(hashlib.sha1(str(el_to_hash).encode(<span class=\"hljs-string\">'utf-8'<\/span>) + self.salt.encode(<span class=\"hljs-string\">'utf-8'<\/span>)).hexdigest()&#91;-self.resultSize:], <span class=\"hljs-number\">16<\/span>)\n\n<span class=\"hljs-comment\"># <span class=\"hljs-doctag\">NOTE:<\/span> we use sha1 to avoid installing and importing an external library, sacrificing performances. No crypto-hash is required for this use case.<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-4\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Then we compute a minhash signature for a set with the following algorithm:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Take the first hash function, and apply it to all of the shingle values in a document. Find the minimum hash value produced and use it as the first component of the MinHash signature.<\/li>\n\n\n\n<li>Now take the second hash function, and again find the minimum resulting hash value, and use this as the second component.<\/li>\n\n\n\n<li>And so on\u2026<\/li>\n<\/ol>\n\n\n\n<p>So if we have 10 random hash functions, we\u2019ll get a MinHash signature with 10 values for each set. We\u2019ll use the same 10 hash functions for every document in the dataset and generate their signatures as well.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-5\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\"><span class=\"hljs-keyword\">from<\/span> random <span class=\"hljs-keyword\">import<\/span> randint, seed\n<span class=\"hljs-class\"><span class=\"hljs-keyword\">class<\/span> <span class=\"hljs-title\">minhashSigner<\/span>:<\/span>\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">__init__<\/span><span class=\"hljs-params\">(self, sig_size)<\/span>:<\/span>\n        self.sig_size=sig_size\n        <span class=\"hljs-comment\"># Init the hash family of functions using a random salt<\/span>\n        self.hash_functions = &#91;hashFamily(randint(<span class=\"hljs-number\">0<\/span>,<span class=\"hljs-number\">10000000000<\/span>)) <span class=\"hljs-keyword\">for<\/span> i <span class=\"hljs-keyword\">in<\/span> range(<span class=\"hljs-number\">0<\/span>,sig_size)]\n\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">compute_set_signature<\/span><span class=\"hljs-params\">(self, set_)<\/span>:<\/span>\n        set_sig = &#91;]\n        <span class=\"hljs-keyword\">for<\/span> h_funct <span class=\"hljs-keyword\">in<\/span> self.hash_functions:\n            min_hash = math.inf\n            <span class=\"hljs-keyword\">for<\/span> el <span class=\"hljs-keyword\">in<\/span> set_:\n                h = h_funct.get_hash_value(el)\n                <span class=\"hljs-keyword\">if<\/span> h &lt; min_hash:\n                    min_hash = h\n\n            set_sig.append(min_hash)\n\n        <span class=\"hljs-keyword\">return<\/span> set_sig\n    <span class=\"hljs-comment\">#return a list of lists that can be seen as the signature matrix<\/span>\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">compute_signature_matrix<\/span><span class=\"hljs-params\">(self, set_list)<\/span>:<\/span>\n        signatures = &#91;]\n        <span class=\"hljs-keyword\">for<\/span> s <span class=\"hljs-keyword\">in<\/span> set_list:\n            signatures.append( self.compute_set_signature(s) )\n        <span class=\"hljs-keyword\">return<\/span> signatures<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-5\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Example usage:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-6\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>doc = <span class=\"hljs-string\">\"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\"<\/span>\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>doc2 = <span class=\"hljs-string\">\"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident bla bla bla.\"<\/span>\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shingler_inst = shingler(<span class=\"hljs-number\">10<\/span>)\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shinglings = shingler_inst.get_shingles(doc)\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shinglings2 = shingler_inst.get_shingles(doc2)\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>minhash_sig = minhashSigner(<span class=\"hljs-number\">20<\/span>).compute_signature_matrix(&#91;shinglings,shinglings2])\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>minhash_sig\n&#91;&#91;<span class=\"hljs-number\">1783860<\/span>, <span class=\"hljs-number\">5303198<\/span>, <span class=\"hljs-number\">4845898<\/span>, <span class=\"hljs-number\">3935831<\/span>, <span class=\"hljs-number\">198103<\/span>, <span class=\"hljs-number\">5887007<\/span>, <span class=\"hljs-number\">8061377<\/span>, <span class=\"hljs-number\">4526224<\/span>, <span class=\"hljs-number\">8467452<\/span>, <span class=\"hljs-number\">7766250<\/span>, <span class=\"hljs-number\">15505788<\/span>, <span class=\"hljs-number\">825368<\/span>, <span class=\"hljs-number\">12316643<\/span>, <span class=\"hljs-number\">4502308<\/span>, <span class=\"hljs-number\">6403903<\/span>, <span class=\"hljs-number\">9259449<\/span>, <span class=\"hljs-number\">8533874<\/span>, <span class=\"hljs-number\">31889076<\/span>, <span class=\"hljs-number\">940623<\/span>, <span class=\"hljs-number\">278359<\/span>], &#91;<span class=\"hljs-number\">1783860<\/span>, <span class=\"hljs-number\">20203690<\/span>, <span class=\"hljs-number\">4845898<\/span>, <span class=\"hljs-number\">3935831<\/span>, <span class=\"hljs-number\">198103<\/span>, <span class=\"hljs-number\">12350289<\/span>, <span class=\"hljs-number\">8061377<\/span>, <span class=\"hljs-number\">4526224<\/span>, <span class=\"hljs-number\">8467452<\/span>, <span class=\"hljs-number\">7766250<\/span>, <span class=\"hljs-number\">15505788<\/span>, <span class=\"hljs-number\">825368<\/span>, <span class=\"hljs-number\">12316643<\/span>, <span class=\"hljs-number\">6122847<\/span>, <span class=\"hljs-number\">6403903<\/span>, <span class=\"hljs-number\">9259449<\/span>, <span class=\"hljs-number\">8533874<\/span>, <span class=\"hljs-number\">31889076<\/span>, <span class=\"hljs-number\">940623<\/span>, <span class=\"hljs-number\">278359<\/span>]]\n&gt;&gt;&gt;\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>set_sig_1 = set(minhash_sig&#91;<span class=\"hljs-number\">0<\/span>])\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>set_sig_2 = set(minhash_sig&#91;<span class=\"hljs-number\">1<\/span>])\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>jaccard_similarity_sig = len(set_sig_1.intersection(set_sig_2))\/len(set_sig_1.union(set_sig_2))\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>jaccard_similarity_sig\n<span class=\"hljs-number\">0.7391304347826086<\/span>\n&gt;&gt;&gt;\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>jaccard_similarity_shingle_set = len(set(shinglings).intersection(shinglings2))\/len(set(shinglings).union(shinglings2))\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>jaccard_similarity_shingle_set\n<span class=\"hljs-number\">0.8285077951002228<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-6\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>From the example, we can see that the Jaccard similarity between set signatures is near to the similarity between shingle sets. Of course, we can achieve a lower approximation error by tuning the shingle size and signature size.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-locality-sensitive-hashing\">Locality Sensitive Hashing<\/h2>\n\n\n\n<p>Quoting <a href=\"http:\/\/www.mmds.org\/#ver21\">&#8220;Mining of Massive Datasets&#8221;<\/a><\/p>\n\n\n\n<p>&#8216;<em>Locality-sensitive hashing (also known as near-neighbor search) is a general theory focused on how to approximatively find similar pairs without investigating all of them. The principle is that we are going to hash items several times in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are.&#8217;<\/em><br><em><br><\/em>When you have a minhashed set a popular choice for LSH is the &#8220;banding technique&#8221;, that is:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>get the signature matrix and divide it into bands<\/li>\n\n\n\n<li>hash each column of the band. The elements that hash to the same bucket are likely to be similar.<\/li>\n\n\n\n<li>repeat for all the bands. You can use the same hash function of the previous band but you must use different buckets.<\/li>\n<\/ol>\n\n\n\n<p>In python:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-7\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\"><span class=\"hljs-class\"><span class=\"hljs-keyword\">class<\/span> <span class=\"hljs-title\">lsh<\/span>:<\/span>\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">__init__<\/span><span class=\"hljs-params\">(self, threshold=<span class=\"hljs-number\">0.8<\/span>)<\/span>:<\/span>\n        self.threshold = threshold\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">get_signature_matrix_bands<\/span><span class=\"hljs-params\">(self, sig_matrix, bands_nr, sign_len)<\/span>:<\/span> \n        <span class=\"hljs-comment\">#bands_nr = b<\/span>\n        <span class=\"hljs-comment\">#sign_len = n<\/span>\n        r = int(sign_len\/bands_nr) <span class=\"hljs-comment\">#number of rows in each band<\/span>\n        bands = {} <span class=\"hljs-comment\"># {band_nr: &#91;col_1,col_2,...]} where col_1 is all the values of Sig(S_i) for band b.<\/span>\n        <span class=\"hljs-keyword\">for<\/span> i <span class=\"hljs-keyword\">in<\/span> range(<span class=\"hljs-number\">0<\/span>,bands_nr):\n            bands&#91;i] = &#91;]\n        <span class=\"hljs-comment\"># put Subsets of the columns of signature matrix into the appropriate bucket and cosider a column <\/span>\n        <span class=\"hljs-comment\"># as a unique block so that we can hash the entire column.<\/span>\n        <span class=\"hljs-comment\"># Basically a band is a list of element, where each element is a subset of a signature of a given set.<\/span>\n        <span class=\"hljs-keyword\">for<\/span> signature <span class=\"hljs-keyword\">in<\/span> sig_matrix: \n            <span class=\"hljs-keyword\">for<\/span> i <span class=\"hljs-keyword\">in<\/span> range(<span class=\"hljs-number\">0<\/span>, bands_nr):\n                idx = i*r    \n                bands&#91;i].append(<span class=\"hljs-string\">' '<\/span>.join(str(x) <span class=\"hljs-keyword\">for<\/span> x <span class=\"hljs-keyword\">in<\/span> signature&#91;idx:idx+r]) )          \n        <span class=\"hljs-keyword\">return<\/span> bands\n    <span class=\"hljs-comment\">#band is a list <\/span>\n    <span class=\"hljs-comment\"># construct a dictionary {hash(band_column): doc_id that produced this hash}<\/span>\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">get_band_buckets<\/span><span class=\"hljs-params\">(self, band, hash_funct)<\/span>:<\/span>\n        buckets = {}\n        <span class=\"hljs-keyword\">for<\/span> doc_id <span class=\"hljs-keyword\">in<\/span> range(<span class=\"hljs-number\">0<\/span>,len(band)):\n            value = hash_funct.get_hash_value( band&#91;doc_id] )\n            <span class=\"hljs-keyword\">if<\/span> value <span class=\"hljs-keyword\">not<\/span> <span class=\"hljs-keyword\">in<\/span> buckets:\n                buckets&#91;value] = &#91;doc_id]\n            <span class=\"hljs-keyword\">else<\/span>:\n                buckets&#91;value].append(doc_id)      \n        <span class=\"hljs-keyword\">return<\/span> buckets\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">get_candidates_list<\/span><span class=\"hljs-params\">(self, buckets)<\/span>:<\/span>\n        candidates = set()\n        <span class=\"hljs-comment\"># buckets is a dictionary containing key=bucket, value= list of doc_ids that hashed to bucket<\/span>\n        <span class=\"hljs-keyword\">for<\/span> bucket,candidate_list <span class=\"hljs-keyword\">in<\/span> buckets.items():\n            <span class=\"hljs-keyword\">if<\/span> len(candidate_list) &gt; <span class=\"hljs-number\">1<\/span>:\n                <span class=\"hljs-keyword\">for<\/span> i <span class=\"hljs-keyword\">in<\/span> range(<span class=\"hljs-number\">0<\/span>,len(candidate_list)<span class=\"hljs-number\">-1<\/span>):\n                    <span class=\"hljs-keyword\">for<\/span> j <span class=\"hljs-keyword\">in<\/span> range(i+<span class=\"hljs-number\">1<\/span>,len(candidate_list)):  \n                        pair = tuple(sorted( (candidate_list&#91;i],candidate_list&#91;j]) ))\n                        candidates.add(pair)\n\n        <span class=\"hljs-keyword\">return<\/span> candidates <span class=\"hljs-comment\">#ie a set of couples, each couple is a candidate pair<\/span>\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">check_candidates<\/span><span class=\"hljs-params\">(self, candidates_list, threshold, sigs)<\/span>:<\/span>\n        similar_docs = set() <span class=\"hljs-comment\">#set of tuples<\/span>\n        <span class=\"hljs-comment\"># similar_pair is a couple containing doc_ids of documents that hashed to same bucket<\/span>\n        <span class=\"hljs-keyword\">for<\/span>  similar_pair <span class=\"hljs-keyword\">in<\/span> candidates_list:\n            <span class=\"hljs-comment\">#for all the pairs of document in the list check similarity of their signatures<\/span>\n            doc_id_1 = similar_pair&#91;<span class=\"hljs-number\">0<\/span>]\n            doc_id_2 = similar_pair&#91;<span class=\"hljs-number\">1<\/span>]\n            signature_1 = set(sigs&#91;doc_id_1]) <span class=\"hljs-comment\">#get the i-th column from signature matrix where i is doc_id in the collision list<\/span>\n            signature_2 = set(sigs&#91;doc_id_2])\n            js = len(signature_1.intersection(signature_2)) \/len(signature_1.union(signature_2))\n            <span class=\"hljs-keyword\">if<\/span> js &gt;= threshold:\n                similar_docs.add( tuple(sorted((doc_id_1,doc_id_2) )) )          \n        <span class=\"hljs-keyword\">return<\/span> similar_docs\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">get_similar_items<\/span><span class=\"hljs-params\">(self, sig_matrix, bands_nr, sign_len)<\/span>:<\/span>\n        similar_docs = set()\n        <span class=\"hljs-comment\">#divide signature matrix into bands<\/span>\n        bands = lsh_instance.get_signature_matrix_bands(sig_matrix,bands_nr,sign_len)\n        <span class=\"hljs-comment\">#for all the bands<\/span>\n        <span class=\"hljs-keyword\">for<\/span> band_id, elements <span class=\"hljs-keyword\">in<\/span> bands.items():\n            <span class=\"hljs-comment\">#produce the buckets for the given band (band_id) with a random hash function<\/span>\n            buckets = lsh_instance.get_band_buckets(elements, hash_funct=hashFamily(randint(<span class=\"hljs-number\">0<\/span>,<span class=\"hljs-number\">10000000000<\/span>)))\n            <span class=\"hljs-comment\">#Get all the candidate pairs<\/span>\n            candidates = lsh_instance.get_candidates_list(buckets)\n            <span class=\"hljs-comment\">#Check all candidate pairs' signatures<\/span>\n            <span class=\"hljs-keyword\">for<\/span> sim_tuple <span class=\"hljs-keyword\">in<\/span> lsh_instance.check_candidates(candidates, self.threshold, sig_matrix):\n                similar_docs.add( sim_tuple)\n        <span class=\"hljs-keyword\">return<\/span> similar_docs <span class=\"hljs-comment\">#return all the similar signatures that respect the threshold<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-7\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>The class performs the following tasks:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Divide signature matrix into bands with <code>get_signature_matrix_bands(sig_matrix, bands_nr, sign_len)<\/code>. The method will return a list of strings where each string represents a column of the signature matrix for the given band. It is done to hash the entire column next when producing buckets.<\/li>\n\n\n\n<li>Compute buckets for each band with <code>get_band_buckets(band, hash_funct)<\/code>, which will take as input a band <code>b<\/code> and a hash function <code>h<\/code> and will return a dict containing as key the bucket ids and as value a list of document ids that hashed to that bucket for the given band <code>b<\/code>: ${bucket_{id}:[doc_i, doc_k, \u2026] } $.<\/li>\n\n\n\n<li>With <code>get_candidates_list(buckets)<\/code> we are going to take as input a list of buckets (whose union is the signature matrix) and will produce as output a set of tuples: those tuples represent documents that are hashed in the same bucket.<\/li>\n\n\n\n<li>With <code>check_candidates(candidates_list, threshold, sigs)<\/code> we take all the candidates from all the bands and check if effectively their signatures produce a match with approximate Jaccard similarity greater than the threshold we give as a parameter.<\/li>\n\n\n\n<li>With <code>get_similar_items(sig_matrix, bands_nr, sign_len)<\/code> we put together all the methods listed above and return the ids of similar documents that respect the threshold value.<\/li>\n<\/ol>\n\n\n\n<p>The similarity threshold is passed as a value to the constructor of the class, default is 0.8.<\/p>\n\n\n\n<p>Example usage:<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-8\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\"><span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shingler_inst = shingler(<span class=\"hljs-number\">10<\/span>)\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>doc = <span class=\"hljs-string\">\"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\"<\/span>\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>doc2 = <span class=\"hljs-string\">\"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident bla bla bla.\"<\/span>\n&gt;&gt;&gt;\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shinglings = shingler_inst.get_shingles(doc)\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shinglings = shingler_inst.get_hashed_shingles(shinglings)\n&gt;&gt;&gt;\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shingle2 = shingler_inst.get_shingles(doc2)\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>shingle2 = shingler_inst.get_hashed_shingles(shingle2)\n&gt;&gt;&gt;\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>signer = minhashSigner(<span class=\"hljs-number\">100<\/span>)\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>signature_matrix = signer.compute_signature_matrix(&#91;shinglings,shingle2])\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>lsh_instance = lsh(threshold=<span class=\"hljs-number\">0.7<\/span>)\n<span class=\"hljs-meta\">&gt;&gt;&gt; <\/span>lsh_instance.get_similar_items(signature_matrix,<span class=\"hljs-number\">10<\/span>,<span class=\"hljs-number\">100<\/span>)\n{(<span class=\"hljs-number\">0<\/span>, <span class=\"hljs-number\">1<\/span>)}<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-8\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"h-benchmark-time\">Benchmark Time<\/h2>\n\n\n\n<p>Now that we have implemented the whole code for fast approximate duplicate document detection, let&#8217;s test it out against brute force detection and see what happens.<br>As the first step, we load the <a href=\"https:\/\/github.com\/nicoDs96\/Document-Similarity-using-Python-and-PySpark\/blob\/master\/LSH\/dataset_rent_rome_kijiji.tsv\">dataset<\/a> as a pandas data frame.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-9\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\">print(<span class=\"hljs-string\">\"Loading dataset...\"<\/span>)\ndataset=pd.read_csv(<span class=\"hljs-string\">\"dataset_rent_rome_kijiji.tsv\"<\/span>, sep=<span class=\"hljs-string\">\"\\t\"<\/span>)\ndataset&#91;<span class=\"hljs-string\">'doc_id'<\/span>]=dataset.index\ndoc_nr = dataset&#91;<span class=\"hljs-string\">'doc_id'<\/span>].max()\nprint(<span class=\"hljs-string\">\"Dataset loaded correctly.\"<\/span>)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-9\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Then transform documents into k-shingles.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-10\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\">print(<span class=\"hljs-string\">\"Producing Shingles...\"<\/span>)\nstart_time = time.time()\n<span class=\"hljs-comment\">#an array where the index i represent the document_id and the element shingling_list&#91;i] the hashed shingles for document document_id<\/span>\nshingling_list = &#91;<span class=\"hljs-literal\">None<\/span>] * (doc_nr +<span class=\"hljs-number\">1<\/span>) \nshingling_size = <span class=\"hljs-number\">10<\/span>\nsignature_size = <span class=\"hljs-number\">50<\/span>\nbands_nr = <span class=\"hljs-number\">10<\/span>\n\nshingler_inst = shingler(shingling_size)\n\n<span class=\"hljs-comment\">#produce hashed shinglings for all documents<\/span>\n<span class=\"hljs-keyword\">for<\/span> index, row <span class=\"hljs-keyword\">in<\/span> dataset.iterrows():\n    doc = row&#91;<span class=\"hljs-string\">'Title'<\/span>]+<span class=\"hljs-string\">\" \"<\/span>+row&#91;<span class=\"hljs-string\">'Short Description'<\/span>]\n    i = row&#91;<span class=\"hljs-string\">'doc_id'<\/span>]\n\n    shinglings = shingler_inst.get_hashed_shingles( shingler_inst.get_shingles(doc) )\n    shingling_list&#91;i] = shinglings\n\nend_time = time.time()\nprint(<span class=\"hljs-string\">\"Shingles produced in:\\t %.2f seconds.\"<\/span>%(end_time - start_time))<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-10\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>And we sign all the shingles and get the signature matrix.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-11\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\">signer = minhashSigner(signature_size)\nstart_time = time.time()\nprint(<span class=\"hljs-string\">\"Computing signature matrix...\"<\/span>)\n<span class=\"hljs-comment\">#produce a signature for each shingle set<\/span>\nsignature_matrix = signer.compute_signature_matrix( shingling_list )\nend_time = time.time()\nprint(<span class=\"hljs-string\">\"Signature Matrix computed in:\\t %.2f seconds.\"<\/span> %(end_time - start_time))<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-11\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>In the end, we compute similar pairs with LSH.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-12\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\">lsh_instance = lsh(threshold=<span class=\"hljs-number\">0.8<\/span>)\nstart_time = time.time()\nprint(<span class=\"hljs-string\">\"Computing LSH similarity...\"<\/span>)\nlsh_similar_itemset = lsh_instance.get_similar_items(signature_matrix, bands_nr, signature_size)\nend_time = time.time()\nlsh_computation_time = end_time - start_time\nprint(<span class=\"hljs-string\">\"LSH Similarity computed in:\\t %.2f seconds.\\nSimilar Elements Found: %d\"<\/span> %(lsh_computation_time,len(lsh_similar_itemset)))<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-12\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>We compute similar pairs using brute force<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-13\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\"><span class=\"hljs-class\"><span class=\"hljs-keyword\">class<\/span> <span class=\"hljs-title\">bfsc<\/span><span class=\"hljs-params\">()<\/span>:<\/span>\n    <span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">compare_shingles_set_js<\/span><span class=\"hljs-params\">(self, set1, set2)<\/span>:<\/span>\n        <span class=\"hljs-keyword\">return<\/span> len(set1.intersection(set2))\/len(set1.union(set2))\n\nbrute_force_comparer = bfsc()\nbrute_force_similar_items = set()\nstart_time = time.time()\n\n<span class=\"hljs-keyword\">for<\/span> i <span class=\"hljs-keyword\">in<\/span> range(<span class=\"hljs-number\">0<\/span>,doc_nr<span class=\"hljs-number\">-1<\/span>):\n    <span class=\"hljs-keyword\">for<\/span> j <span class=\"hljs-keyword\">in<\/span> range(i+<span class=\"hljs-number\">1<\/span>,doc_nr):\n        similarity = brute_force_comparer.compare_shingles_set_js(set(shingling_list&#91;i]),set(shingling_list&#91;j]))\n        <span class=\"hljs-keyword\">if<\/span> similarity &gt;= <span class=\"hljs-number\">0.8<\/span>:\n            brute_force_similar_items.add( (i,j) ) \n\nend_time = time.time()\nbf_computation_time = end_time - start_time      <\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-13\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Check for similar pairs<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>LSH<\/li>\n<\/ol>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-14\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\">docs = lsh_similar_itemset.pop()\nprint(<span class=\"hljs-string\">\"Doc 1:\"<\/span>)\nprint(dataset.iloc&#91;docs&#91;<span class=\"hljs-number\">0<\/span>]] )\nprint(<span class=\"hljs-string\">\"Is similar to\\nDoc2:\"<\/span>)\nprint(dataset.iloc&#91;docs&#91;<span class=\"hljs-number\">1<\/span>]] )<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-14\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li>Brute Force<\/li>\n<\/ol>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-15\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\">docs = brute_force_similar_items.pop()\nprint(<span class=\"hljs-string\">\"Doc 1:\"<\/span>)\nprint(dataset.iloc&#91;docs&#91;<span class=\"hljs-number\">0<\/span>]] )\nprint(<span class=\"hljs-string\">\"Is similar to\\nDoc2:\"<\/span>)\nprint(dataset.iloc&#91;docs&#91;<span class=\"hljs-number\">1<\/span>]] )<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-15\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Print stats<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-16\" data-shcb-language-name=\"Python\" data-shcb-language-slug=\"python\"><span><code class=\"hljs language-python\">report = (<span class=\"hljs-string\">'LSH\\n%.2f seconds to execute\\n'<\/span>\n    <span class=\"hljs-string\">'%d similar documents found\\n\\n'<\/span>\n    <span class=\"hljs-string\">'Bruteforce search\\n%.2f seconds to execute\\n'<\/span>\n    <span class=\"hljs-string\">'%d similar documents found.\\n\\n'<\/span>\n    <span class=\"hljs-string\">'They find %d common similarities.'<\/span>)\n\nprint(report %(lsh_computation_time, len(lsh_similar_itemset),bf_computation_time,len(brute_force_similar_items),len(brute_force_similar_items.intersection(lsh_similar_itemset)) ))<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-16\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Python<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">python<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\">Conclusion<\/h2>\n\n\n\n<p>In this article about checking document similarity we have seen:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>how to transform a document into a set using k-shingling<\/li>\n\n\n\n<li>how to represent a set in a compressed way computing its signature in such a way that set similarity is preserved, using MinHashing.<\/li>\n\n\n\n<li>how to implement a hashing function that hashes similar items in the same bucket, using LSH (locality-sensitive hashing).<\/li>\n\n\n\n<li>how to compare the approximate method with the brute force method.<\/li>\n<\/ol>\n\n\n\n<p>All the theory behind the code here can be found in <a href=\"http:\/\/www.mmds.org\/#ver21\">chapter 3 of the book &#8220;Mining of Massive Datasets&#8221;<\/a><\/p>\n\n\n\n<p>You can find the full example on <a href=\"https:\/\/github.com\/nicoDs96\/Document-Similarity-using-Python-and-PySpark\/blob\/master\/LSH\/DM_HW2_Ex2.ipynb\">GitHub<\/a>. I wrote also a notebook with an Apache Spark implementation, available <a href=\"https:\/\/github.com\/nicoDs96\/Document-Similarity-using-Python-and-PySpark\/blob\/master\/LSH\/DM_HW2_Ex2.ipynb\">on GitHub<\/a>.<\/p>\n\n\n\n<p>Despite it is interesting to understand how that algorithm works, it is always a good idea to check for robust libraries to put this algorithm in production. A lot of how implementations exist depends on the use case, here are a few:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/api\/pyspark.ml.feature.MinHashLSH.html#pyspark.ml.feature.MinHashLSH\">Apache Spark Implementation<\/a><\/li>\n\n\n\n<li><a href=\"http:\/\/ekzhu.com\/datasketch\/\">datasketch<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/proxy-nyc.hidemyass-freeproxy.com\/proxy\/it-it\/aHR0cHM6Ly93d3cudWJlci5jb20vYmxvZy9sc2gv\">LSH usage @ Uber<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/idealista\/tlsh-js\">Idealista implementation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/spotify\/annoy\">Spotify implementation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/engineering.fb.com\/2017\/03\/29\/data-infrastructure\/faiss-a-library-for-efficient-similarity-search\/\">Facebook implementation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/erikbern\/ann-benchmarks\">Benchmarks of some implementations<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Why is document similarity more important than ever? In the big data era, it is always more frequent that companies need to detect similar items in their database. Imagine platforms like Kijiji or Subito, trying to detect people that constantly duplicate announcements on the platform to boost their visibility without paying for sponsorship offered by&#8230; <a class=\"more-link\" href=\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/\">Read more<\/a><\/p>\n","protected":false},"author":149,"featured_media":11283,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_editorskit_title_hidden":false,"_editorskit_reading_time":10,"_editorskit_is_block_options_detached":false,"_editorskit_block_options_position":"{}","_uag_custom_page_level_css":"","_genesis_hide_title":false,"_genesis_hide_breadcrumbs":false,"_genesis_hide_singular_image":false,"_genesis_hide_footer_widgets":false,"_genesis_custom_body_class":"","_genesis_custom_post_class":"","_genesis_layout":"","footnotes":""},"categories":[36],"tags":[9952,68],"collections":[],"class_list":{"0":"post-19097","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-backend","8":"tag-languages","9":"tag-python","10":"entry"},"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.9 (Yoast SEO v26.9) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Fast Document Similarity in Python (MinHashLSH) - Codemotion Magazine<\/title>\n<meta name=\"description\" content=\"Learn how to detect similar documents in a database using Python with Minhsash Locality Sensitive Hashing. Code examples included!\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Fast Document Similarity in Python (MinHashLSH)\" \/>\n<meta property=\"og:description\" content=\"Learn how to detect similar documents in a database using Python with Minhsash Locality Sensitive Hashing. Code examples included!\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/\" \/>\n<meta property=\"og:site_name\" content=\"Codemotion Magazine\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Codemotion.Italy\/\" \/>\n<meta property=\"article:published_time\" content=\"2022-10-21T07:09:48+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-06-19T10:30:43+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1011\" \/>\n\t<meta property=\"og:image:height\" content=\"568\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Nicola Di Santo\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@CodemotionIT\" \/>\n<meta name=\"twitter:site\" content=\"@CodemotionIT\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Nicola Di Santo\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/\"},\"author\":{\"name\":\"Nicola Di Santo\",\"@id\":\"https:\/\/www.codemotion.com\/magazine\/#\/schema\/person\/5f1b31a32d6dc2927a3b7b0afd9035e0\"},\"headline\":\"Fast Document Similarity in Python (MinHashLSH)\",\"datePublished\":\"2022-10-21T07:09:48+00:00\",\"dateModified\":\"2023-06-19T10:30:43+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/\"},\"wordCount\":1463,\"publisher\":{\"@id\":\"https:\/\/www.codemotion.com\/magazine\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg\",\"keywords\":[\"Languages\",\"Python\"],\"articleSection\":[\"Backend\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/\",\"url\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/\",\"name\":\"Fast Document Similarity in Python (MinHashLSH) - Codemotion Magazine\",\"isPartOf\":{\"@id\":\"https:\/\/www.codemotion.com\/magazine\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg\",\"datePublished\":\"2022-10-21T07:09:48+00:00\",\"dateModified\":\"2023-06-19T10:30:43+00:00\",\"description\":\"Learn how to detect similar documents in a database using Python with Minhsash Locality Sensitive Hashing. Code examples included!\",\"breadcrumb\":{\"@id\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#primaryimage\",\"url\":\"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg\",\"contentUrl\":\"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg\",\"width\":1011,\"height\":568,\"caption\":\"similar documents, python, document similarity\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.codemotion.com\/magazine\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Backend\",\"item\":\"https:\/\/www.codemotion.com\/magazine\/backend\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Fast Document Similarity in Python (MinHashLSH)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.codemotion.com\/magazine\/#website\",\"url\":\"https:\/\/www.codemotion.com\/magazine\/\",\"name\":\"Codemotion Magazine\",\"description\":\"We code the future. Together\",\"publisher\":{\"@id\":\"https:\/\/www.codemotion.com\/magazine\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.codemotion.com\/magazine\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.codemotion.com\/magazine\/#organization\",\"name\":\"Codemotion\",\"url\":\"https:\/\/www.codemotion.com\/magazine\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.codemotion.com\/magazine\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2019\/11\/codemotionlogo.png\",\"contentUrl\":\"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2019\/11\/codemotionlogo.png\",\"width\":225,\"height\":225,\"caption\":\"Codemotion\"},\"image\":{\"@id\":\"https:\/\/www.codemotion.com\/magazine\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/Codemotion.Italy\/\",\"https:\/\/x.com\/CodemotionIT\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.codemotion.com\/magazine\/#\/schema\/person\/5f1b31a32d6dc2927a3b7b0afd9035e0\",\"name\":\"Nicola Di Santo\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.codemotion.com\/magazine\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/bf7c4c51a5b2debad1d8ef4b1fdf58512204bfc3591af08a94a1e0052a85da54?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/bf7c4c51a5b2debad1d8ef4b1fdf58512204bfc3591af08a94a1e0052a85da54?s=96&d=mm&r=g\",\"caption\":\"Nicola Di Santo\"},\"url\":\"https:\/\/www.codemotion.com\/magazine\/author\/nicola-di-santo\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Fast Document Similarity in Python (MinHashLSH) - Codemotion Magazine","description":"Learn how to detect similar documents in a database using Python with Minhsash Locality Sensitive Hashing. Code examples included!","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/","og_locale":"en_US","og_type":"article","og_title":"Fast Document Similarity in Python (MinHashLSH)","og_description":"Learn how to detect similar documents in a database using Python with Minhsash Locality Sensitive Hashing. Code examples included!","og_url":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/","og_site_name":"Codemotion Magazine","article_publisher":"https:\/\/www.facebook.com\/Codemotion.Italy\/","article_published_time":"2022-10-21T07:09:48+00:00","article_modified_time":"2023-06-19T10:30:43+00:00","og_image":[{"width":1011,"height":568,"url":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg","type":"image\/jpeg"}],"author":"Nicola Di Santo","twitter_card":"summary_large_image","twitter_creator":"@CodemotionIT","twitter_site":"@CodemotionIT","twitter_misc":{"Written by":"Nicola Di Santo","Est. reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#article","isPartOf":{"@id":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/"},"author":{"name":"Nicola Di Santo","@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/person\/5f1b31a32d6dc2927a3b7b0afd9035e0"},"headline":"Fast Document Similarity in Python (MinHashLSH)","datePublished":"2022-10-21T07:09:48+00:00","dateModified":"2023-06-19T10:30:43+00:00","mainEntityOfPage":{"@id":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/"},"wordCount":1463,"publisher":{"@id":"https:\/\/www.codemotion.com\/magazine\/#organization"},"image":{"@id":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#primaryimage"},"thumbnailUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg","keywords":["Languages","Python"],"articleSection":["Backend"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/","url":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/","name":"Fast Document Similarity in Python (MinHashLSH) - Codemotion Magazine","isPartOf":{"@id":"https:\/\/www.codemotion.com\/magazine\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#primaryimage"},"image":{"@id":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#primaryimage"},"thumbnailUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg","datePublished":"2022-10-21T07:09:48+00:00","dateModified":"2023-06-19T10:30:43+00:00","description":"Learn how to detect similar documents in a database using Python with Minhsash Locality Sensitive Hashing. Code examples included!","breadcrumb":{"@id":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#primaryimage","url":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg","contentUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg","width":1011,"height":568,"caption":"similar documents, python, document similarity"},{"@type":"BreadcrumbList","@id":"https:\/\/www.codemotion.com\/magazine\/backend\/fast-document-similarity-in-python-minhashlsh\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.codemotion.com\/magazine\/"},{"@type":"ListItem","position":2,"name":"Backend","item":"https:\/\/www.codemotion.com\/magazine\/backend\/"},{"@type":"ListItem","position":3,"name":"Fast Document Similarity in Python (MinHashLSH)"}]},{"@type":"WebSite","@id":"https:\/\/www.codemotion.com\/magazine\/#website","url":"https:\/\/www.codemotion.com\/magazine\/","name":"Codemotion Magazine","description":"We code the future. Together","publisher":{"@id":"https:\/\/www.codemotion.com\/magazine\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.codemotion.com\/magazine\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.codemotion.com\/magazine\/#organization","name":"Codemotion","url":"https:\/\/www.codemotion.com\/magazine\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/logo\/image\/","url":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2019\/11\/codemotionlogo.png","contentUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2019\/11\/codemotionlogo.png","width":225,"height":225,"caption":"Codemotion"},"image":{"@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Codemotion.Italy\/","https:\/\/x.com\/CodemotionIT"]},{"@type":"Person","@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/person\/5f1b31a32d6dc2927a3b7b0afd9035e0","name":"Nicola Di Santo","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/bf7c4c51a5b2debad1d8ef4b1fdf58512204bfc3591af08a94a1e0052a85da54?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/bf7c4c51a5b2debad1d8ef4b1fdf58512204bfc3591af08a94a1e0052a85da54?s=96&d=mm&r=g","caption":"Nicola Di Santo"},"url":"https:\/\/www.codemotion.com\/magazine\/author\/nicola-di-santo\/"}]}},"featured_image_src":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438-600x400.jpg","featured_image_src_square":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438-600x568.jpg","author_info":{"display_name":"Nicola Di Santo","author_link":"https:\/\/www.codemotion.com\/magazine\/author\/nicola-di-santo\/"},"uagb_featured_image_src":{"full":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg",1011,568,false],"thumbnail":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438-150x150.jpg",150,150,true],"medium":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438-300x169.jpg",300,169,true],"medium_large":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438-768x431.jpg",768,431,true],"large":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg",1011,568,false],"1536x1536":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg",1011,568,false],"2048x2048":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg",1011,568,false],"small-home-featured":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438.jpg",100,56,false],"sidebar-featured":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438-180x128.jpg",180,128,true],"genesis-singular-images":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438-896x504.jpg",896,504,true],"archive-featured":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438-400x225.jpg",400,225,true],"gb-block-post-grid-landscape":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438-600x400.jpg",600,400,true],"gb-block-post-grid-square":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2020\/10\/python-developer-community-e1659350197438-600x568.jpg",600,568,true]},"uagb_author_info":{"display_name":"Nicola Di Santo","author_link":"https:\/\/www.codemotion.com\/magazine\/author\/nicola-di-santo\/"},"uagb_comment_info":0,"uagb_excerpt":"Why is document similarity more important than ever? In the big data era, it is always more frequent that companies need to detect similar items in their database. Imagine platforms like Kijiji or Subito, trying to detect people that constantly duplicate announcements on the platform to boost their visibility without paying for sponsorship offered by&#8230;&hellip;","lang":"en","_links":{"self":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts\/19097","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/users\/149"}],"replies":[{"embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/comments?post=19097"}],"version-history":[{"count":27,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts\/19097\/revisions"}],"predecessor-version":[{"id":21449,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts\/19097\/revisions\/21449"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/media\/11283"}],"wp:attachment":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/media?parent=19097"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/categories?post=19097"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/tags?post=19097"},{"taxonomy":"collections","embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/collections?post=19097"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}