In recent years, machine learning researchers have found it increasingly difficult to reproduce one another's results. A key contributor to this reproducibility crisis is the way these researchers work together.
When handling the large data sets necessary in machine learning, minor alterations to the data can substantially reduce the reproducibility of results if these changes go unnoticed by the rest of the team.
Data version control looks to fix that by offering researchers an easily accessible platform that operates as the single point of truth.
But before we go any further…
We understand that, for most people, there’s a common initial response when they hear the words “machine learning” – one of instant panic, followed by an urgent need for the bathroom and a hasty exit through the window if needs be.
We wouldn’t blame you for it either.
Stay put, though, and you face a dilemma. On the one hand, you may look socially awkward for a second. On the other, you could get ensnared in the argument itself. And somehow, your go-to response of "well, both sides make a very good point" doesn't seem likely to cut it.
The good news, if you could stop inching towards the window for just a moment, is that you’re likely familiar with the problem of reproducibility, even if you don’t realize it. It’s the same issue that’s plagued academia for years.
You may remember headlines like: “90% of academic research is not reproducible.” That is, when one university tries to replicate the findings of another, their experiments frequently fail to yield the same results.
Granted, it sounds bad. But not, like, really, really bad. It’s probably just two English professors arguing over a video call about whether Shakespeare actually was a woman. It’s not like cancer research is plagued by the same issue.
Actually and unfortunately, it is.
In 2011, the oncology division of Amgen tried to replicate the findings of 53 research papers. They could only reproduce six studies (or just 11%).
And the worrying thing about machine learning is that it's already widely used today.
Heck, machine learning will have helped determine the SEO of this very article – a phenomenon thousands of digital agencies across the world have sought to understand and exploit through things like domain authority.
Today, machine learning is deeply embedded in pretty much anything you could imagine that’s data-related. But should the reproducibility crisis continue unsolved, it’s the future of machine learning that’s at significant risk.
But let’s take a second here to pump the brakes on the doom and gloom train.
There are many brilliant people working on the problem of reproducibility right now, and their work is already yielding results. Data version control (DVC) is one of those handy solutions.
But what exactly is it, how does it work, how do you implement it, and (for the others in the room who may need a refresher) what is machine learning?
Machine learning explained
First things first, if you’re already well-acquainted with machine learning and are looking for a bona fide, in-depth explainer of the topics covered, you can find one here.
In this article, we’ll be sticking to the top-level stuff. We’re more than happy to have you, but consider yourself warned! There’ll be no wincing at our beginner-level explanations here.
Now, where were we?
Right. Machine learning. When it comes to this rather nebulous topic, we favor the definition offered by MIT:
“Machine-learning algorithms use statistics to find patterns in massive amounts of data. And data, here, encompasses a lot of things – numbers, words, images, clicks, what have you. If it’s digital and digitally stored, it can be fed into a machine-learning algorithm… Frankly, this process is quite basic: find the pattern, apply the pattern.”
It can be a bit bizarre to think that this complex subject can be boiled down to such a basic idea: “find the pattern, apply the pattern.” But it’s true. And as we’ve discussed, machine learning’s role in society is only set to increase.
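To make "find the pattern, apply the pattern" concrete, here's a toy sketch of that idea: a one-nearest-neighbour classifier written from scratch. The data points and labels below are invented purely for illustration.

```python
# A minimal sketch of "find the pattern, apply the pattern":
# a one-nearest-neighbour classifier, no libraries required.

def nearest_neighbour(train, query):
    """Return the label of the training point closest to `query`."""
    best_label, best_dist = None, float("inf")
    for features, label in train:
        # Squared Euclidean distance between the two feature vectors.
        dist = sum((a - b) ** 2 for a, b in zip(features, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# "Finding the pattern": (features, label) pairs we already know.
train = [((1.0, 1.0), "cat"), ((1.2, 0.9), "cat"),
         ((5.0, 5.0), "dog"), ((5.3, 4.8), "dog")]

# "Applying the pattern" to unseen data.
print(nearest_neighbour(train, (1.1, 1.0)))  # prints: cat
print(nearest_neighbour(train, (5.1, 5.1)))  # prints: dog
```

Real machine learning models are vastly more sophisticated, but the loop above really is the core idea: learn from known examples, then apply what was learned to new ones.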
The only thing we’d like to add to MIT’s definition is that, generally speaking, there are two broad categories of machine learning algorithms: supervised and unsupervised.
The difference between the two comes down to how you “train” your ML algorithm. Training, in this case, refers to the process of feeding your algorithm data to help it learn how to identify certain patterns.
In supervised learning, you label the data you feed your algorithm. The majority of machine learning falls into this group, and it’s where most new machine learning engineers will begin.
The name of this sub-branch is derived from the idea that, during training, you’re teaching the algorithm by asking it to identify patterns you already know. You also assign certain labels to different pieces of data, rather than letting the algorithm group the data.
Once the training wheels come off, you can start feeding the algorithm unseen pieces of data. The really cool thing here is that the algorithm will decide for itself how to group and label this new information.
Unsupervised learning is exactly what you imagine. It’s the same process, only the data hasn’t been labeled and there are no training wheels. From day one, the algorithm deals with unseen data, the idea being that unsupervised machine learning can uncover previously unknown patterns in that data.
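As a toy illustration of that idea, here's a hand-rolled two-cluster version of the classic k-means loop. Nobody labels the numbers; the algorithm groups them by itself. The data and starting centroids are invented for illustration.

```python
# Toy sketch of unsupervised learning: grouping unlabeled numbers
# into two clusters with a simple 2-means loop.

def two_means(points, c1, c2, iters=10):
    """Split `points` into two groups around centroids c1 and c2."""
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        # Move each centroid to the mean of its group.
        c1 = sum(g1) / len(g1)
        c2 = sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
low, high = two_means(points, c1=0.0, c2=20.0)
print(low)   # prints: [1.0, 1.5, 2.0]
print(high)  # prints: [10.0, 10.5, 11.0]
```

Notice that the algorithm found the two groups without ever being told what they were. The catch, as discussed next, is that nothing here tells you whether the groups it found are the ones you actually care about.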
The problem then becomes how to make sure the outputs are actually correct. It’s for this reason that unsupervised learning is generally left by the wayside, and why we’ll be primarily referring to supervised learning in this article.
A quick example
With that being said, if you want a down-to-earth, real-world example of what machine learning can do, look no further than the four walls of your own home.
Let’s say that for the past three years you’ve loved your white walls. Now, however, you want to trade them in for orange ones. So you start scouting for a new home. Being the clever machine learning engineer you are, you don’t want to rely on the prices provided by some seedy estate agent.
So what do you do?
Well, you could get a machine learning algorithm to offer a pretty darn good estimation.
To get as accurate an estimate as possible, you’d need to feed it all sorts of information. This data could be related to the prices of other houses in the area, the number of transport links, and nearby shops and parks.
When it comes to machine learning, there’s no such thing as too much data.
Your machine learning algorithm would then dutifully munch through the data and spit out a pretty accurate valuation of the house you’re interested in.
At least, that’s the theory.
Remember when we said there’s no such thing as too much data? That was a lie. There most certainly is. And it’s a pretty common problem in machine learning.
Let’s return to our example, only this time, it’s not just you working with your algorithm. Now you’ve got a whole 30-person team to help you. (Congratulations on the promotion!)
Up until this point, you and your team have been feeding your algorithm historical house prices between the years 2000 and 2019. However, it just so happens that the figures for 2020 have suddenly been released. That’s great, right? More data is good data.
Well, some other members of the team may feel like 2020 was such a bizarre year, all things considered, that it should be left out.
So you call a meeting. Everyone attends. Everyone agrees: the data should be excluded. Except the office intern has already added the data.
That’s where data version control comes in. With this open-source tool, you can revert to your previous collection of data.
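To make that concrete, here's a hypothetical revert, assuming the house-price figures live in a DVC-tracked file called `data.csv` (DVC pairs with Git, a tool we'll come to shortly, to remember old versions):

```shell
# Hypothetical file names; "data.csv" is assumed to be tracked by DVC.
git checkout HEAD~1 -- data.csv.dvc   # restore the previous pointer file
dvc checkout data.csv.dvc             # swap the actual data back to match
```

Two commands, and the intern's 2020 figures are gone, with the old data restored exactly as it was.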
It doesn’t sound groundbreaking, we know. But consider this: in a team of thirty people, all working on the exact same data, how many times a day do you think someone will make changes to it?
If just one person changes an already outdated data file, that’s a whole day’s worth of work down the drain. Now multiply that thirty times over. Without DVC, your team would descend into anarchy.
That’s why, for a long time now, teams have developed their own in-house versions of data version control. And whilst that has certainly worked, it has meant new team members must quickly adjust to whatever in-house tool is being used.
That is, until the good folks over at DVC.org came along and standardized it into the easily accessible tool we know today. The tool is easily implemented as well. DVC is distributed as a Python library, so all you need to do is install it using a package manager like pip or conda.
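A typical setup takes only a few commands. This is a sketch, assuming your project is already a Git repository and that `data.csv` is a (hypothetical) data file you want to track:

```shell
pip install dvc          # or: conda install -c conda-forge dvc
dvc init                 # set up DVC inside the existing Git repo
dvc add data.csv         # start tracking the data file with DVC
git add data.csv.dvc .gitignore
git commit -m "Track house-price data with DVC"
```

`dvc add` moves the heavy file out of Git's way and leaves behind a small `.dvc` pointer file that Git can version like any other text file.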
DVC, once installed, establishes a singular point of truth that the whole team can work from. They can download select pieces of data from the wider whole, work on them, then upload them to the same place once their changes have been approved.
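That shared workflow looks roughly like this sketch. The S3 bucket name is invented, and an S3 remote needs the extra dependency `pip install "dvc[s3]"`; DVC also supports other storage backends.

```shell
# Point the whole team at one shared remote (the bucket name is made up).
dvc remote add -d storage s3://my-team-bucket/dvc-store
git add .dvc/config
git commit -m "Configure shared DVC remote"

dvc push      # upload your tracked data versions to the remote
dvc pull      # teammates fetch the exact same versions
```

The remote becomes the single point of truth: everyone pushes approved changes to it and pulls the same data back down.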
That all sounds well and good, but how does DVC then improve machine learning outcomes?
DVC puts you and your team on the same page
To really answer that question, we first have to dive a little further into how DVC was designed.
DVC is built around a pre-existing system called Git.
What is Git?
Well, put simply, it’s DVC for code. (In fact, Git came first, and DVC was modeled on it.)
To elaborate a bit, let’s return to our 30-person team example again. Except this time, you have 30 unsupervised software engineers working with code. Like, a lot of code, and generating more every second of every day.
As you would rightly expect, code is integral to machine learning. Think of it as the skeleton of the algorithm.
As your team begins knitting the bones of your algorithm together, you suddenly realize that you’re faced with the prospect of unbridled chaos. How, you wonder, do you make sure that every programmer is working from the agreed-upon singular point of truth?
In this case, Git is your friend. It saves every version of the code your team is working on and makes it crystal clear when each version was created.
The reason Git and DVC work so well together is that, combined, they take this to the next level. Rather than pairing an outdated set of data with your most recent batch of code, Git and DVC make sure that each version of your code stays aligned with the exact version of the data it was built on.
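Under the hood, Git never stores the data itself, only the small pointer file that DVC generates. For a hypothetical tracked file `data.csv`, the pointer `data.csv.dvc` looks roughly like this (the md5 value below is invented):

```yaml
# data.csv.dvc: the tiny pointer file that lives in Git.
# DVC uses the md5 hash to fetch the matching version of the data.
outs:
- md5: 3863d0e317d4dd7b5d8d9a3b6d61d3a6
  path: data.csv
```

Because the pointer is versioned alongside your code, checking out any old commit tells DVC exactly which data snapshot belongs with it.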
With that being said, you might (rightly) ask why machine learning engineers don’t simply store both the code and data on Git. And it’s a fair point.
Unfortunately, Git was never designed for large files, and popular hosting services impose hard limits. GitHub, for instance, blocks individual files larger than 100 MB. Ask anyone involved in data management, data security, or data science, and they’ll happily tell you that machine learning data sets regularly blow past those limits.
By implementing DVC, you and your team can be confident that the code and data you’ve based your ML algorithm on represent your most accurate and recent work. Ultimately, your ML algorithm is less likely to fail. DVC helps stop mistakes made along the way from creeping into the end product, making your algorithm that much more accurate for it.
We all have questions. Questions like: why are we here? Is there a god? You know, the kind of questions that keep you up late at night racking your brain for answers.
But when it comes to machine learning, it can be all too easy to dismiss any hope of finding the kind of answers you’d like. Hopefully, we’ve provided a few that satisfy your craving.
Machine learning is much like any other project: team members need to be on the same page, singing from the same hymn sheet. Data version control offers that desired level of order, which is vital for consistent and reproducible machine learning outcomes.
If machine learning is going to solve the kinds of problems we need it to, DVC and other tools like it will become a necessity to establish machine learning as a service that delivers real value.
So the next time you find yourself trapped in conversation, and the boffins in the room turn to you, take a moment to advocate for DVC. We promise you won’t regret it.
If you’re looking for more insights into the world of machine learning, check out our other insights at codemotion.com.