{"id":13860,"date":"2021-02-26T09:00:00","date_gmt":"2021-02-26T08:00:00","guid":{"rendered":"https:\/\/www.codemotion.com\/magazine\/?p=13860"},"modified":"2022-01-05T20:03:21","modified_gmt":"2022-01-05T19:03:21","slug":"data-version-control-ml-outcomes","status":"publish","type":"post","link":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/","title":{"rendered":"How to Implement Data Version Control and Improve Machine Learning Outcomes"},"content":{"rendered":"\n<p><a href=\"https:\/\/pixabay.com\/photos\/machine-learning-typewriter-5290464\/\"><\/a>It may surprise you to hear this, but machine learning is in something of a crisis.&nbsp;<\/p>\n\n\n\n<p>In recent years, machine learning researchers have found it increasingly difficult to reproduce the findings made by algorithms. A key problem that has been endemic to the reproducibility crisis is the way in which these researchers work together.&nbsp;<\/p>\n\n\n\n<p>When handling the large data sets necessary in machine learning, minor alterations to the data can substantially reduce the reproducibility of results if these changes go unnoticed by the rest of the team.&nbsp;<\/p>\n\n\n\n<p>Data version control looks to fix that by offering researchers an easily accessible platform that operates as the single point of truth.&nbsp;<\/p>\n\n\n\t\t\t\t<div class=\"wp-block-uagb-table-of-contents uagb-toc__align-left uagb-toc__columns-1  uagb-block-95434f26      \"\n\t\t\t\t\tdata-scroll= \"1\"\n\t\t\t\t\tdata-offset= \"30\"\n\t\t\t\t\tstyle=\"\"\n\t\t\t\t>\n\t\t\t\t<div class=\"uagb-toc__wrap\">\n\t\t\t\t\t\t<div class=\"uagb-toc__title\">\n\t\t\t\t\t\t\tTable Of Contents\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"uagb-toc__list-wrap \">\n\t\t\t\t\t\t<ol class=\"uagb-toc__list\"><li class=\"uagb-toc__list\"><a href=\"#but-before-we-go-any-further\" class=\"uagb-toc-link__trigger\">But before we go any further&#8230;<\/a><li 
class=\"uagb-toc__list\"><a href=\"#machine-learning-explained\" class=\"uagb-toc-link__trigger\">Machine learning explained<\/a><li class=\"uagb-toc__list\"><a href=\"#a-quick-example\" class=\"uagb-toc-link__trigger\">A quick example<\/a><li class=\"uagb-toc__list\"><a href=\"#dvc-puts-you-and-your-team-on-the-same-page\" class=\"uagb-toc-link__trigger\">DVC puts you and your team on the same page\u00a0<\/a><li class=\"uagb-toc__list\"><a href=\"#conclusion\" class=\"uagb-toc-link__trigger\">Conclusion<\/a><\/ol>\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\n\n\n<h2 class=\"wp-block-heading\" id=\"h-but-before-we-go-any-further\">But before we go any further&#8230;<\/h2>\n\n\n\n<p>We understand that, for most people, there\u2019s a common initial response when they hear the words \u201cmachine learning\u201d &#8211; one of instant panic, followed by an urgent need for the bathroom and a hasty exit through the window if needs be.&nbsp;<\/p>\n\n\n\n<p>We wouldn\u2019t blame you for it either.&nbsp;&nbsp;<\/p>\n\n\n\n<p>On the one hand, you may look socially awkward for a second. On the other, you could face the all-too-real prospect of getting ensnared in the argument. And somehow, your go-to response of \u201cwell, both sides make a very good point\u201d doesn\u2019t seem likely to cut it.&nbsp;<\/p>\n\n\n\n<p>The good news, if you could stop inching towards the window for just a moment, is that you\u2019re likely familiar with the problem of reproducibility, even if you don\u2019t realize it. 
It\u2019s the same issue that\u2019s plagued academia for years.&nbsp;<\/p>\n\n\n\n<p>You may remember headlines like: \u201c<a href=\"https:\/\/www.forbes.com\/sites\/quora\/2017\/02\/09\/how-the-reproducibility-crisis-in-academia-is-affecting-scientific-research\/?sh=3cf6973e3dad\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">90% of academic research<\/a> is not reproducible.\u201d That is, when one university tries to replicate the findings of another, their experiments frequently fail to yield the same results.<\/p>\n\n\n\n<p>Granted, it sounds bad. But not, like, really, really bad. It\u2019s probably just two English professors arguing over a <a href=\"https:\/\/www.ringcentral.co.uk\/video-call.html\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">video call<\/a> about whether Shakespeare actually was a woman. It\u2019s not like cancer research is plagued by the same issue.&nbsp;<\/p>\n\n\n\n<p>Actually and unfortunately, it is.&nbsp;<\/p>\n\n\n\n<p>In 2011, the oncology division of Amgen tried to replicate the findings of <a href=\"https:\/\/www.forbes.com\/sites\/quora\/2017\/02\/09\/how-the-reproducibility-crisis-in-academia-is-affecting-scientific-research\/?sh=3cf6973e3dad\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">53 research papers<\/a>. They could only reproduce six studies (or just 11%).&nbsp;<\/p>\n\n\n\n<p>Now consider machine learning. 
This 70-year-old science will soon be responsible for everything from self-driving vehicles to <a href=\"https:\/\/www.cleverfiles.com\/howto\/cybersecurity-tips-to-protect-data.html\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">cybersecurity<\/a> to improving <a href=\"https:\/\/www.bamboohr.com\/hr-101-guide\/chapter-2-culture\/\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">company culture<\/a> to beating <a href=\"https:\/\/www.codemotion.com\/magazine\/articles\/stories\/developers-can-help-fight-the-coronavirus\/\" class=\"ek-link\">coronavirus<\/a>.&nbsp;<\/p>\n\n\n\n<p>And the incredible thing about machine learning is that it\u2019s already used widely today.&nbsp;<\/p>\n\n\n\n<p>Heck, machine learning will have helped determine the search ranking of this very article &#8211; a phenomenon thousands of <a href=\"https:\/\/www.digitalsilk.com\/what-is-digital-agency\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">digital agencies<\/a> across the world have sought to understand and exploit through things like <a href=\"https:\/\/www.bigcommerce.com\/ecommerce-answers\/what-is-domain-authority\/\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">domain authority<\/a>.&nbsp;<\/p>\n\n\n\n<p>Today, machine learning is deeply embedded in pretty much anything you could imagine that\u2019s data-related. 
But should the reproducibility crisis go unsolved, it\u2019s the <a href=\"https:\/\/www.codemotion.com\/magazine\/dev-hub\/machine-learning-dev\/future-machine-learning-edge\/\" target=\"_blank\" rel=\"noopener\" class=\"ek-link\">future of machine learning<\/a> that\u2019s at significant risk.&nbsp;<\/p>\n\n\n\n<p>But let\u2019s take a second here to pump the brakes on the doom and gloom train.&nbsp;<\/p>\n\n\n\n<p>There are many brilliant people working on the problem of reproducibility right now, and we can reasonably trust them to fix it. Their work is already yielding results. Data version control (DVC) is one of those handy solutions.&nbsp;&nbsp;<\/p>\n\n\n\n<p>But what exactly is it, how does it work, how do you implement it, and (for the others in the room who may need a refresher) what is machine learning?&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-machine-learning-explained\">Machine learning explained<\/h2>\n\n\n\n<p>First things first, if you\u2019re already well-acquainted with machine learning and looking for an in-depth explainer from a bona fide authority on the topics covered, you can find one <a href=\"https:\/\/towardsdatascience.com\/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8\">here<\/a>.&nbsp;<\/p>\n\n\n\n<p>In this article, we\u2019ll be sticking to the top-level stuff. We\u2019re more than happy to have you, but consider yourself warned! There\u2019ll be no wincing at our beginner-level explanations here.&nbsp;<\/p>\n\n\n\n<p>Now, where were we?&nbsp;<\/p>\n\n\n\n<p>Right. Machine learning. 
When it comes to this rather nebulous topic, we favor the definition offered by MIT:<\/p>\n\n\n\n<p><em>\u201c<\/em><a href=\"https:\/\/www.technologyreview.com\/2018\/11\/17\/103781\/what-is-machine-learning-we-drew-you-another-flowchart\/\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\"><em>Machine-learning algorithms<\/em><\/a><em> use statistics to find patterns in massive amounts of data. And data, here, encompasses a lot of things &#8211; numbers, words, images, clicks, what have you. If it\u2019s digital and digitally stored, it can be fed into a machine-learning algorithm\u2026 Frankly, this process is quite basic: find the pattern, apply the pattern.\u201d<\/em><\/p>\n\n\n\n<p>It can be a bit bizarre to think that this complex subject can be boiled down to such a basic idea: \u201cfind the pattern, apply the pattern.\u201d But it\u2019s true. Sure, machine learning can\u2019t help you find the <a href=\"https:\/\/project-management.com\/top-3-software-alternatives-to-trello\/\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">top 3 alternatives to Trello<\/a> or <a href=\"https:\/\/www.ringcentral.com\/us\/en\/blog\/vonage-alternatives\/\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">Vonage business alternatives<\/a>, but as we\u2019ve discussed, its role in society is only set to increase.<\/p>\n\n\n\n<p>The only thing we\u2019d like to add to MIT\u2019s definition is that, generally speaking, there are two broad categories of machine learning algorithms: supervised and unsupervised.<\/p>\n\n\n\n<p>The difference between the two comes down to how you \u201ctrain\u201d your ML algorithm. 
Training, in this case, refers to the process of feeding your algorithm data to help it learn how to identify certain patterns.&nbsp;<\/p>\n\n\n\n<p>In supervised learning, you label the data you feed your algorithm. The majority of machine learning falls into this group, and it\u2019s where most new machine learning engineers will begin.&nbsp;<\/p>\n\n\n\n<p>The name of this sub-branch is derived from the idea that, during training, you\u2019re teaching the algorithm by asking it to identify patterns you already know. You also assign certain labels to different pieces of data, rather than letting the algorithm group the data.<\/p>\n\n\n\n<p>Once the training wheels come off, you can start feeding the algorithm unseen pieces of data. The really cool thing here is that the algorithm will decide for itself how to group and label this new information.&nbsp;<\/p>\n\n\n\n<p>Unsupervised learning is exactly what you\u2019d imagine. It\u2019s the same process, only the data hasn\u2019t been labeled and there are no training wheels. From day one, the algorithm deals with unseen data. The idea is that unsupervised <a href=\"https:\/\/www.codemotion.com\/magazine\/dev-hub\/machine-learning-dev\/epidemic-intelligence-data-models-and-machine-learning-in-the-age-of-coronavirus\/\" target=\"_blank\" rel=\"noopener\" class=\"ek-link\">machine learning can uncover previously unknown data<\/a> patterns.&nbsp;<\/p>\n\n\n\n<p>The problem then becomes how to make sure the outputs are actually correct. 
It\u2019s for this reason that unsupervised learning is generally left by the wayside, and why we\u2019ll be primarily referring to supervised learning in this article.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-a-quick-example\">A quick example<\/h2>\n\n\n\n<p>With that being said, if you want a down-to-earth, real-world example of what machine learning can do, look no further than the four walls of your own home.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Let\u2019s say that for the past three years you\u2019ve loved your white walls. Now, however, you want to trade them in for orange ones. So you start scouting for a new home. Being the clever machine learning engineer you are, you don\u2019t want to rely on the prices provided by some seedy estate agent.&nbsp;<\/p>\n\n\n\n<p>So what do you do?&nbsp;<\/p>\n\n\n\n<p>Well, you could get a machine learning algorithm to offer a pretty darn good estimation.&nbsp;<\/p>\n\n\n\n<p>To get as close to an estimate as possible, you\u2019d need to feed it all sorts of information. This data could be related to the prices of other houses in the area, the number of transport links, and nearby shops and parks.&nbsp;<\/p>\n\n\n\n<p>When it comes to machine learning, there\u2019s no such thing as too much data.&nbsp;<\/p>\n\n\n\n<p>Your machine learning algorithm would then dutifully munch through the data and spit out a pretty exact valuation of the house you\u2019re interested in.&nbsp;<\/p>\n\n\n\n<p>At least, that\u2019s the theory.&nbsp;<\/p>\n\n\n\n<p>Remember when we said there\u2019s no such thing as too much data? That was a lie. There most certainly is. And it\u2019s a pretty common problem in machine learning.&nbsp;<\/p>\n\n\n\n<p>Let\u2019s return to our example, only this time, it\u2019s not just you working with your algorithm. Now you\u2019ve got a whole 30-person team to help you. 
(Congratulations on the promotion!)&nbsp;<\/p>\n\n\n\n<p>Up until this point, you and your team have been feeding your algorithm historical house prices from 2000 to 2019. However, it just so happens that the figures for 2020 have suddenly been released. That\u2019s great, right? More data is good data.&nbsp;<\/p>\n\n\n\n<p>Well, some other members of the team may feel like 2020 was such a bizarre year, all things considered, that it should be left out.<\/p>\n\n\n\n<p>So you call a meeting. Everyone attends. Everyone agrees: the data should be excluded. Except the office intern has already added the data.&nbsp;<\/p>\n\n\n\n<p>That\u2019s where data version control comes in. With this <a href=\"https:\/\/www.codemotion.com\/magazine\/dev-hub\/backend-dev\/contributing-to-open-source-projects\/\" class=\"ek-link\">open-source<\/a> tool, you can revert to the previous version of your data set.&nbsp;<\/p>\n\n\n\n<p>It doesn\u2019t sound groundbreaking, we know &#8211; it\u2019s not like we\u2019re talking about <a href=\"https:\/\/www.ringcentral.com\/how-does-virtual-phone-number-work.html\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">virtual phone numbers<\/a> here. But consider this: in a team of thirty people, all working on the exact same data, how many times a day do you think people will make changes to it?&nbsp;<\/p>\n\n\n\n<p>What if one person spends the day editing an already outdated data file? That\u2019s a whole day\u2019s worth of work down the drain. Now multiply that 30 times over. Without DVC, your team would descend into anarchy.&nbsp;<\/p>\n\n\n\n<p>It\u2019s why teams, for a long time now, have built their own in-house version of DVC. 
And whilst it\u2019s certainly worked, it\u2019s meant new team members must quickly adjust to whatever in-house tool is being used.&nbsp;<\/p>\n\n\n\n<p>That is, until the good folks over at <a href=\"https:\/\/dvc.org\/\">DVC.org<\/a> came along and standardized it into the easily accessible tool we know today. The tool is easy to set up as well. DVC is distributed as a Python package, so all you need to do is install it using a package manager like pip or conda.&nbsp;<\/p>\n\n\n\n<p>Once installed and connected to shared storage, DVC establishes a single point of truth that the whole team can work from. They can download select pieces of data from the wider whole, work on them, then upload them to the same place once their changes have been approved.&nbsp;<\/p>\n\n\n\n<p>That all sounds well and good, but how does DVC then improve machine learning outcomes?&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-dvc-puts-you-and-your-team-on-the-same-page\">DVC puts you and your team on the same page&nbsp;<\/h2>\n\n\n\n<p>To really answer that question, we first have to dive a little further into how DVC was designed.&nbsp;<\/p>\n\n\n\n<p>DVC is built around a pre-existing system called Git.&nbsp;<\/p>\n\n\n\n<p>What is Git?&nbsp;<\/p>\n\n\n\n<p>Well, it\u2019s DVC but for code.&nbsp;<\/p>\n\n\n\n<p>To elaborate a bit, let\u2019s return to our 30-person team example again. Except this time, you have 30 unsupervised software engineers working with code. Like, a lot of code, and generating more every second of every day.<\/p>\n\n\n\n<p>As you would rightly expect, code is integral to machine learning. Think of it as the skeleton of the algorithm.&nbsp;<\/p>\n\n\n\n<p>As your team begins knitting the bones of your algorithm together, you suddenly realize that you\u2019re faced with the prospect of unbridled chaos. How, you wonder, do you make sure that every programmer is working from the agreed-upon single point of truth?&nbsp;<\/p>\n\n\n\n<p>In this case, Git is your friend. 
It records every committed version of the code your team is working on and makes it crystal clear who changed what, and when.&nbsp;<\/p>\n\n\n\n<p>The reason why Git and DVC work so well together is that DVC applies the same idea to data.&nbsp;<\/p>\n\n\n\n<p>Together, they keep the code and the data aligned. DVC stores small metadata files in Git that point to a specific version of the data, so rather than combining an outdated set of data with your most recent batch of code, every commit pairs the code with the data it was written for.&nbsp;&nbsp;<\/p>\n\n\n\n<p>With that being said, you might (rightly) ask why machine learning engineers don\u2019t simply store both the code and data on Git. And it\u2019s a fair point.&nbsp;<\/p>\n\n\n\n<p>Unfortunately, Git wasn\u2019t designed for large files. Repositories that contain them become slow to clone and work with, and hosting services such as GitHub block files larger than 100 MB. If you ask anyone involved in data management, <a href=\"https:\/\/www.toolbox.com\/security\/data-breaches\/guest-article\/5-tips-to-safeguard-customer-data-and-avoid-a-breach\/\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">data security<\/a>, or data science, they\u2019ll happily tell you that data files routinely exceed those limitations.&nbsp;<\/p>\n\n\n\n<p>By implementing DVC, you and your team can be confident that the code and data you\u2019ve based your ML algorithm on represent your most accurate and recent work. Ultimately, your <a href=\"https:\/\/www.codemotion.com\/magazine\/dev-hub\/machine-learning-dev\/why-do-some-machine-learning-models-fail\/\" class=\"ek-link\">ML algorithm is less likely to fail<\/a>. DVC makes mistakes made along the way easier to catch and undo before they reach the end product, which makes your results that much more reliable.&nbsp;&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\">Conclusion<\/h2>\n\n\n\n<p>We all have questions. Questions like: why are we here? Is there a god? 
<a href=\"https:\/\/www.ringcentral.com\/multi-line-phone-system.html\" target=\"_blank\" aria-label=\" (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"ek-link\">What is a multi-line phone system<\/a>\u2026 You know, the kind of questions that keep you up late at night racking your brain for answers.&nbsp;<\/p>\n\n\n\n<p>But when it comes to machine learning, it can be all too easy to dismiss any hope of finding the kind of answers you\u2019d like. Hopefully, we\u2019ve provided a few that satisfy your craving.<\/p>\n\n\n\n<p>Machine learning is much like any other project. Team members need to be on the same page and singing from the same hymn sheet. Data version control offers that desired level of order, one that is vital for consistent and reproducible machine learning outcomes.&nbsp;<\/p>\n\n\n\n<p>If machine learning is going to solve the kinds of problems we need it to, DVC and other tools like it will become a necessity to establish <a href=\"https:\/\/www.codemotion.com\/magazine\/dev-hub\/machine-learning-dev\/machine-learning-as-a-service-serving-reusable-ml-models\/\" class=\"ek-link\">machine learning as a service<\/a> that delivers real value.<\/p>\n\n\n\n<p>So the next time you find yourself trapped in conversation, and the boffins in the room turn to you, take a moment to advocate for DVC. We promise you won\u2019t regret it.&nbsp;<\/p>\n\n\n\n<p>If you\u2019re looking for more insights into the world of machine learning, check out our other articles at <a href=\"https:\/\/www.codemotion.com\/\" class=\"ek-link\">codemotion.com<\/a>.&nbsp; <\/p>\n","protected":false},"excerpt":{"rendered":"<p>It may surprise you to hear this, but machine learning is in something of a crisis.&nbsp; In recent years, machine learning researchers have found it increasingly difficult to reproduce the findings made by algorithms. 
A key problem that has been endemic to the reproducibility crisis is the way in which these researchers work together.&nbsp; When&#8230; <a class=\"more-link\" href=\"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/\">Read more<\/a><\/p>\n","protected":false},"author":117,"featured_media":13861,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_editorskit_title_hidden":false,"_editorskit_reading_time":9,"_editorskit_is_block_options_detached":false,"_editorskit_block_options_position":"{}","_uag_custom_page_level_css":"","_genesis_hide_title":false,"_genesis_hide_breadcrumbs":false,"_genesis_hide_singular_image":false,"_genesis_hide_footer_widgets":false,"_genesis_custom_body_class":"","_genesis_custom_post_class":"","_genesis_layout":"","footnotes":""},"categories":[35],"tags":[6257],"collections":[],"class_list":{"0":"post-13860","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-machine-learning","8":"tag-dataops","9":"entry"},"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.9 (Yoast SEO v27.5) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>How to Implement Data Version Control and Improve Machine Learning Outcomes - Codemotion Magazine<\/title>\n<meta name=\"description\" content=\"Machine learning findings are difficult to be reproduced unless you adopt data version control as a single point of truth. 
Here&#039;s how.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to Implement Data Version Control and Improve Machine Learning Outcomes\" \/>\n<meta property=\"og:description\" content=\"Machine learning findings are difficult to be reproduced unless you adopt data version control as a single point of truth. Here&#039;s how.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/\" \/>\n<meta property=\"og:site_name\" content=\"Codemotion Magazine\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Codemotion.Italy\/\" \/>\n<meta property=\"article:published_time\" content=\"2021-02-26T08:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-01-05T19:03:21+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Victorio Duran III\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@CodemotionIT\" \/>\n<meta name=\"twitter:site\" content=\"@CodemotionIT\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Victorio Duran III\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/\"},\"author\":{\"name\":\"Victorio Duran III\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#\\\/schema\\\/person\\\/a3acf25992fdf0adb9e6211c5585dd98\"},\"headline\":\"How to Implement Data Version Control and Improve Machine Learning Outcomes\",\"datePublished\":\"2021-02-26T08:00:00+00:00\",\"dateModified\":\"2022-01-05T19:03:21+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/\"},\"wordCount\":2326,\"publisher\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2021\\\/02\\\/CO_magazine-template-2-1.jpg\",\"keywords\":[\"DataOps\"],\"articleSection\":[\"Machine Learning\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/\",\"name\":\"How to Implement Data Version Control and Improve Machine Learning Outcomes - Codemotion 
Magazine\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2021\\\/02\\\/CO_magazine-template-2-1.jpg\",\"datePublished\":\"2021-02-26T08:00:00+00:00\",\"dateModified\":\"2022-01-05T19:03:21+00:00\",\"description\":\"Machine learning findings are difficult to be reproduced unless you adopt data version control as a single point of truth. Here's how.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2021\\\/02\\\/CO_magazine-template-2-1.jpg\",\"contentUrl\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2021\\\/02\\\/CO_magazine-template-2-1.jpg\",\"width\":1200,\"height\":628,\"caption\":\"How to Implement Data Version Control and Improve Machine Learning 
Outcomes\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/data-version-control-ml-outcomes\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"AI\\\/ML\",\"item\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Machine Learning\",\"item\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/ai-ml\\\/machine-learning\\\/\"},{\"@type\":\"ListItem\",\"position\":4,\"name\":\"How to Implement Data Version Control and Improve Machine Learning Outcomes\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#website\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/\",\"name\":\"Codemotion Magazine\",\"description\":\"We code the future. Together\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#organization\",\"name\":\"Codemotion\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2019\\\/11\\\/codemotionlogo.png\",\"contentUrl\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/wp-content\\\/uploads\\\/2019\\\/11\\\/codemotionlogo.png\",\"width\":225,\"heig
ht\":225,\"caption\":\"Codemotion\"},\"image\":{\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/Codemotion.Italy\\\/\",\"https:\\\/\\\/x.com\\\/CodemotionIT\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/#\\\/schema\\\/person\\\/a3acf25992fdf0adb9e6211c5585dd98\",\"name\":\"Victorio Duran III\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e49b85662b7ff39a1d1e0dcbe46d0084186db2854b0d0085ced53d27d82556c8?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e49b85662b7ff39a1d1e0dcbe46d0084186db2854b0d0085ced53d27d82556c8?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/e49b85662b7ff39a1d1e0dcbe46d0084186db2854b0d0085ced53d27d82556c8?s=96&d=mm&r=g\",\"caption\":\"Victorio Duran III\"},\"description\":\"Victorio is the Associate SEO Director at RingCentral, a global leader in unified communications and CTI software provider. He has over 13 years of extensive involvement on web and digital operations with diverse experience as web engineer, product manager, and digital marketing strategist.\",\"url\":\"https:\\\/\\\/www.codemotion.com\\\/magazine\\\/author\\\/victorio-duran\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"How to Implement Data Version Control and Improve Machine Learning Outcomes - Codemotion Magazine","description":"Machine learning findings are difficult to be reproduced unless you adopt data version control as a single point of truth. 
Here's how.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/","og_locale":"en_US","og_type":"article","og_title":"How to Implement Data Version Control and Improve Machine Learning Outcomes","og_description":"Machine learning findings are difficult to be reproduced unless you adopt data version control as a single point of truth. Here's how.","og_url":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/","og_site_name":"Codemotion Magazine","article_publisher":"https:\/\/www.facebook.com\/Codemotion.Italy\/","article_published_time":"2021-02-26T08:00:00+00:00","article_modified_time":"2022-01-05T19:03:21+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1.jpg","type":"image\/jpeg"}],"author":"Victorio Duran III","twitter_card":"summary_large_image","twitter_creator":"@CodemotionIT","twitter_site":"@CodemotionIT","twitter_misc":{"Written by":"Victorio Duran III","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/#article","isPartOf":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/"},"author":{"name":"Victorio Duran III","@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/person\/a3acf25992fdf0adb9e6211c5585dd98"},"headline":"How to Implement Data Version Control and Improve Machine Learning Outcomes","datePublished":"2021-02-26T08:00:00+00:00","dateModified":"2022-01-05T19:03:21+00:00","mainEntityOfPage":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/"},"wordCount":2326,"publisher":{"@id":"https:\/\/www.codemotion.com\/magazine\/#organization"},"image":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/#primaryimage"},"thumbnailUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1.jpg","keywords":["DataOps"],"articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/","url":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/","name":"How to Implement Data Version Control and Improve Machine Learning Outcomes - Codemotion 
Magazine","isPartOf":{"@id":"https:\/\/www.codemotion.com\/magazine\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/#primaryimage"},"image":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/#primaryimage"},"thumbnailUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1.jpg","datePublished":"2021-02-26T08:00:00+00:00","dateModified":"2022-01-05T19:03:21+00:00","description":"Machine learning findings are difficult to be reproduced unless you adopt data version control as a single point of truth. Here's how.","breadcrumb":{"@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/#primaryimage","url":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1.jpg","contentUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1.jpg","width":1200,"height":628,"caption":"How to Implement Data Version Control and Improve Machine Learning Outcomes"},{"@type":"BreadcrumbList","@id":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/data-version-control-ml-outcomes\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.codemotion.com\/magazine\/"},{"@type":"ListItem","position":2,"name":"AI\/ML","item":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/"},{"@type":"ListItem","position":3,"name":"Machine 
Learning","item":"https:\/\/www.codemotion.com\/magazine\/ai-ml\/machine-learning\/"},{"@type":"ListItem","position":4,"name":"How to Implement Data Version Control and Improve Machine Learning Outcomes"}]},{"@type":"WebSite","@id":"https:\/\/www.codemotion.com\/magazine\/#website","url":"https:\/\/www.codemotion.com\/magazine\/","name":"Codemotion Magazine","description":"We code the future. Together","publisher":{"@id":"https:\/\/www.codemotion.com\/magazine\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.codemotion.com\/magazine\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.codemotion.com\/magazine\/#organization","name":"Codemotion","url":"https:\/\/www.codemotion.com\/magazine\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/logo\/image\/","url":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2019\/11\/codemotionlogo.png","contentUrl":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2019\/11\/codemotionlogo.png","width":225,"height":225,"caption":"Codemotion"},"image":{"@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Codemotion.Italy\/","https:\/\/x.com\/CodemotionIT"]},{"@type":"Person","@id":"https:\/\/www.codemotion.com\/magazine\/#\/schema\/person\/a3acf25992fdf0adb9e6211c5585dd98","name":"Victorio Duran 
III","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/e49b85662b7ff39a1d1e0dcbe46d0084186db2854b0d0085ced53d27d82556c8?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/e49b85662b7ff39a1d1e0dcbe46d0084186db2854b0d0085ced53d27d82556c8?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/e49b85662b7ff39a1d1e0dcbe46d0084186db2854b0d0085ced53d27d82556c8?s=96&d=mm&r=g","caption":"Victorio Duran III"},"description":"Victorio is the Associate SEO Director at RingCentral, a global leader in unified communications and CTI software provider. He has over 13 years of extensive involvement on web and digital operations with diverse experience as web engineer, product manager, and digital marketing strategist.","url":"https:\/\/www.codemotion.com\/magazine\/author\/victorio-duran\/"}]}},"featured_image_src":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-600x400.jpg","featured_image_src_square":"https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-600x600.jpg","author_info":{"display_name":"Victorio Duran 
III","author_link":"https:\/\/www.codemotion.com\/magazine\/author\/victorio-duran\/"},"uagb_featured_image_src":{"full":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1.jpg",1200,628,false],"thumbnail":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-150x150.jpg",150,150,true],"medium":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-300x157.jpg",300,157,true],"medium_large":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-768x402.jpg",768,402,true],"large":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-1024x536.jpg",1024,536,true],"1536x1536":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1.jpg",1200,628,false],"2048x2048":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1.jpg",1200,628,false],"small-home-featured":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1.jpg",100,52,false],"sidebar-featured":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-180x128.jpg",180,128,true],"genesis-singular-images":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-896x504.jpg",896,504,true],"archive-featured":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-400x225.jpg",400,225,true],"gb-block-post-grid-landscape":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-600x400.jpg",600,400,true],"gb-block-post-grid-square":["https:\/\/www.codemotion.com\/magazine\/wp-content\/uploads\/2021\/02\/CO_magazine-template-2-1-600x600.jpg",600,600,true]},"uagb_author_info":{"display_name":"Victorio Duran 
III","author_link":"https:\/\/www.codemotion.com\/magazine\/author\/victorio-duran\/"},"uagb_comment_info":0,"uagb_excerpt":"It may surprise you to hear this, but machine learning is in something of a crisis.&nbsp; In recent years, machine learning researchers have found it increasingly difficult to reproduce the findings made by algorithms. A key problem that has been endemic to the reproducibility crisis is the way in which these researchers work together.&nbsp; When&#8230;&hellip;","lang":"en","_links":{"self":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts\/13860","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/users\/117"}],"replies":[{"embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/comments?post=13860"}],"version-history":[{"count":7,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts\/13860\/revisions"}],"predecessor-version":[{"id":13928,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/posts\/13860\/revisions\/13928"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/media\/13861"}],"wp:attachment":[{"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/media?parent=13860"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/categories?post=13860"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/tags?post=13860"},{"taxonomy":"collections","embeddable":true,"href":"https:\/\/www.codemotion.com\/magazine\/wp-json\/wp\/v2\/collections?post=13860"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}