Most data scientist job descriptions require that the candidate has at least three years of experience. After graduating, I wondered what the difference was between me, a guy with hardly any hands-on industry experience, and a data scientist with three years of it. Now that I have finally reached this small milestone, I want to wrap up the four most important lessons I’ve learned.
I worked for a consultancy company (one year, eight months) and a big corporation (one year, four months). The two approaches were quite different. In the former, the keyword was delivery; in the latter, we aimed for long-term maintainability. However, I noticed common best practices that a good data scientist should follow in both settings.
1) Create a baseline model
Understanding business needs is the critical part of each data science project. Once you properly understand it, you have to translate it into a data science problem (i.e. gathering data, picking the right model approach, and, most importantly, selecting the right evaluation metrics).
To check if your “problem translation” is correct, you need to iterate on your solution with the business as soon as possible.
This means that, as a data scientist, you should get a minimum viable product (i.e. a baseline model) as soon as possible.
This baseline model could be a simple machine learning model or a rule-based inference pipeline. Here are a couple of examples:
- You want to predict how much money you will spend on a day. Your model can be the average of the money you spent on that day in the previous weeks.
- You have to implement a spam filter. Pick a simple naive Bayes approach.
- You want to score the correctness of a text: count how many out-of-vocabulary words you find in the text.
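As a rough illustration of the first example, the sketch below predicts the spend for a given day as the average of what was spent on the same weekday in previous weeks. The function name `weekday_average_baseline` and the spending figures are made up for this example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekday_average_baseline(history, target_day):
    """Predict spend on target_day as the mean spend on the same weekday.

    history: list of (date, amount) pairs from previous weeks (hypothetical data).
    """
    by_weekday = defaultdict(list)
    for day, amount in history:
        by_weekday[day.weekday()].append(amount)

    same_weekday = by_weekday.get(target_day.weekday())
    if not same_weekday:
        # No history for this weekday: fall back to the overall average.
        return mean(amount for _, amount in history)
    return mean(same_weekday)


# Made-up spending history: two Mondays and one Tuesday.
history = [
    (date(2024, 1, 1), 20.0),  # Monday
    (date(2024, 1, 2), 35.0),  # Tuesday
    (date(2024, 1, 8), 30.0),  # Monday
]

prediction = weekday_average_baseline(history, date(2024, 1, 15))  # a Monday
print(prediction)
```

A baseline like this takes minutes to build, and any later model has to beat it to justify its extra complexity.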
2) Keep your model as simple as possible
In academia, data scientists spend years tweaking models to increase the accuracy on a benchmark dataset by 0.1%. In industry, a good data scientist should focus on the “good enough” concept.
Investing two weeks in developing a complex neural network to increase the accuracy by 2% is not worth it. Moreover, it may backfire: you end up with a model that is challenging to maintain (more memory required, complex debugging) and impossible to explain (and trust me, stakeholders will ask for explainability).
Accuracy/RMSE is not your only goal. You also have to keep in mind code complexity and maintainability costs.
I know; it feels good being the guy with the biggest neural network in the room. However, the guy with the biggest neural network will have the biggest headache when problems pop up!
If you want to increase the accuracy of your model, I would take a step back and improve the data fed into the model.
I highly recommend Andrew Ng’s talk about data-centric AI on this topic.
The bottom line: a robust, simple, and maintainable model with okayish performance is better than a Frankenstein with 2% more accuracy!
3) You are paid to develop software
When I landed in the industry, I was utterly clueless about the complexity of computer science. I saw coding merely as an instrument for applying mathematics to real data.
After three years, I learned that you can’t deliver a data science project to production without the right tools and without adopting “good code guidelines”.
In this section, I will go through the tools and the principles that I found most helpful:
Key Tools for a Data Scientist
1) Git
Git does not need an introduction. It allows you to collaborate with other developers by keeping track of the changes in your code. To reduce code conflicts and simplify the production delivery workflow, you should adopt a branching strategy. I think the most suitable for data science projects is the Git Flow branching strategy. Here you can find an excellent article about branching strategies.
P.S. Do not store secrets or data in your Git repo (a common newbie mistake).
2) pipenv+pyenv or conda
Package management is a nightmare in data science.
Have you ever handed over your code to a colleague and found out that the project did not work on their machine anymore?
The first step to solving the “it worked on my laptop” problem is to know which packages, and which version of each, you are using in your project.
Pyenv+pipenv and conda are tools that allow you to control the packages and versions of the packages per project. You can create different Python environments. When you start a new environment, you do not have any library installed. Then, you can install only the ones you need (with the version you need).
Remember to create a new environment every time you start a new project. In this way, the packages used in the previous projects will not pollute your new environment.
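To see why a fresh environment helps, you can list what is installed in the currently active one. The snippet below is a minimal sketch that uses only the standard library (`importlib.metadata`); the helper name `installed_packages` is made up for this example, and it does not replace the lock files that pipenv or conda manage for you.

```python
from importlib import metadata


def installed_packages():
    """Return {package name: version} for the active Python environment."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip broken installs with missing metadata
            versions[name] = version
    return versions


# Print the packages in the familiar "name==version" form.
packages = installed_packages()
for name in sorted(packages, key=str.lower):
    print(f"{name}=={packages[name]}")
```

In a brand-new environment this list is short; in an environment shared across projects it quickly fills up with packages your current project never asked for.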
If you want to learn more about pyenv and pipenv, I recommend the pyenv tutorial and the pipenv tutorial.
3) Docker
Docker is the key tool to solve the “it works on my laptop” problem. Docker makes the process of building, running, managing and distributing applications simple. First, you build an image: an isolated, lightweight environment, similar to a stripped-down copy of an operating system, where you can install Python and the packages you need for your project (pipenv/pyenv or conda will help with this step). Then, you just need to pass this image to the machine where you want to execute it, and it should run there the same way it ran on yours. How cool is that!
I recommend this course to get familiar with Docker. I found it complete and easy to follow.
1. Write small functions with only one responsibility
Before landing in the data science industry, I used to write huge Jupyter notebooks that performed almost all the steps, from data reading to model training. After three years, I’ve completely shifted my mentality.
Debugging or rewriting a function with 100 lines of code and meaningless variable names is a nightmare. If you do not believe me, look at these two pieces of code. Which one do you understand better? If the logic of merging the two dataframes has to be changed, which script is easier to modify?
Code Example 1
Code Example 2
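As a hypothetical illustration of the small-function style (this is not the original code from the examples above), the sketch below splits a tiny pipeline into functions with one responsibility each. The column names, function names, and data are all made up. If the merge logic has to change, only `merge_customers_with_spend` needs to be touched.

```python
import pandas as pd


def clean_customers(customers):
    """Drop rows without an id and remove duplicated customers."""
    cleaned = customers.dropna(subset=["customer_id"])
    return cleaned.drop_duplicates(subset=["customer_id"])


def total_spend_per_customer(orders):
    """Sum the order amounts for each customer."""
    return orders.groupby("customer_id", as_index=False)["amount"].sum()


def merge_customers_with_spend(customers, spend):
    """Attach total spend to each customer; the merge logic lives only here."""
    merged = customers.merge(spend, on="customer_id", how="left")
    # Customers without any order get a spend of zero instead of NaN.
    merged["amount"] = merged["amount"].fillna(0.0)
    return merged


# Made-up data for illustration.
customers = pd.DataFrame(
    {"customer_id": [1, 2, 2, 3], "name": ["Ann", "Bob", "Bob", "Cy"]}
)
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

report = merge_customers_with_spend(
    clean_customers(customers), total_spend_per_customer(orders)
)
print(report)
```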
2. Test your functions
Same idea as above. When you have to change a part of the code, you can be confident that your changes are not breaking anything only if you have solid test coverage. Good code coverage requires time and effort, but you will not regret it.
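Here is a minimal sketch of what such tests can look like. The helper `normalise_label` is hypothetical, and the tests are written as plain `test_*` functions with bare `assert` statements, which is the style pytest discovers automatically.

```python
def normalise_label(raw):
    """Map free-text labels to 'spam' or 'ham' (hypothetical helper)."""
    cleaned = raw.strip().lower()
    if cleaned in {"spam", "junk"}:
        return "spam"
    if cleaned in {"ham", "ok", "not spam"}:
        return "ham"
    raise ValueError(f"Unknown label: {raw!r}")


def test_normalise_label_handles_case_and_whitespace():
    assert normalise_label("  SPAM ") == "spam"
    assert normalise_label("Not Spam") == "ham"


def test_normalise_label_rejects_unknown_values():
    try:
        normalise_label("maybe")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an unknown label")


# pytest would discover the test_* functions; calling them directly also works.
test_normalise_label_handles_case_and_whitespace()
test_normalise_label_rejects_unknown_values()
```

Small functions with one responsibility are what make tests like these short: each test checks one behaviour and fails with a clear message.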
4) Monitor your applications
Knowing what happens to your application is the key to solving problems when they pop up in production. Software developers leverage event logs to monitor their applications. They record in a text file (the log file) all the events in the application while it is running. Unfortunately, standard logs are not enough when dealing with data science projects.
The system can go wrong not only because of an edge-case input or a strange system state. You should also be able to spot model-related problems such as data shifts.
Check out this article if you want to learn how to do logging in Python.
I suggest logging:
- Data frames’ shapes before and after every merge. Are you losing some rows due to a merge? If so, maybe there is something new in the data you should start to consider.
- Summary statistics of the data fed into your model. Have your input features’ mean, median, and variance changed?
- Summary statistics of the output produced by your model. Is your model behaving as in your test set when dealing with real data? Or does it tend to predict only one label?
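The three suggestions above can be sketched with the standard `logging` module and pandas. The helper names (`logged_merge`, `log_summary`, `log_label_share`), the logger name, and the data are assumptions made for this example, not a prescribed setup.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("pipeline")


def logged_merge(left, right, **merge_kwargs):
    """Merge two data frames and log the shapes before and after."""
    merged = left.merge(right, **merge_kwargs)
    logger.info("merge: left=%s right=%s merged=%s", left.shape, right.shape, merged.shape)
    if len(merged) < len(left):
        logger.warning(
            "merged frame has %d fewer rows than the left frame", len(left) - len(merged)
        )
    return merged


def log_summary(name, series):
    """Log mean, median, and variance of a numeric column."""
    logger.info(
        "%s: mean=%.3f median=%.3f var=%.3f",
        name, series.mean(), series.median(), series.var(),
    )


def log_label_share(predictions):
    """Log the share of each predicted label to spot a one-label model."""
    shares = predictions.value_counts(normalize=True).to_dict()
    logger.info("prediction shares: %s", shares)
    return shares


# Made-up data for illustration: user 3 has no label and is lost in the merge.
features = pd.DataFrame({"user_id": [1, 2, 3], "age": [25.0, 40.0, 31.0]})
labels = pd.DataFrame({"user_id": [1, 2], "label": ["spam", "ham"]})

merged = logged_merge(features, labels, on="user_id", how="inner")
log_summary("age", merged["age"])
shares = log_label_share(merged["label"])
```

Comparing these logged values over time, or against the same statistics computed on the training set, is one simple way to notice a data shift.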
Data science is a multidisciplinary and complicated topic. At university, you play with nice and clean data, and your job is to build a fancy model. In industry, the modelling part is just a tiny part of a data science application. Understanding the business problem and translating it correctly into a data science one is crucial. A baseline model will help you check that your “translation” is correct. Remember that the output of your job as a data scientist is software. It must be as easy as possible to maintain. Thus, keep your model simple, learn good code practices and monitor your applications.