One of the most significant parts of any data-driven application is data quality assessment. Before you start using your data, you must understand how good – or bad – it is.
This is why data analysis and data cleaning activities are performed – to obtain a dataset that is ready to be used within the training stage of building a machine learning model, for instance.
However, limiting data quality assessment to the initial steps of data analysis can be a hazardous choice. Indeed, depending on the data source you rely on, data trends may vary over time; consequently, models trained with historical data may perform worse.
To address this kind of issue, the ING WB Advanced Analytics (WBAA) team developed popmon, an open source Python module that allows data analysts and scientists, as well as machine learning engineers and developers, to carry out population shift monitoring.
In this article we present the key features of popmon, providing an overview of how it works and presenting some use cases in which such a tool can produce significant benefits for data-driven systems and applications.
- Population shift analysis and data quality assessment: why?
- popmon: population shift analysis in Python
- Benefits of popmon
- Use cases
Population shift analysis and data quality assessment: why?
Before going into details about popmon’s features, it is important to explain why population shift monitoring and data quality assessment are fundamental steps in any data-driven application.
If an individual or organization needs to develop a machine learning model, there are some standard steps that need to be followed. Firstly, the business problem that requires a solution needs to be defined, which in turn allows identification of the kind of data required. Then, the dataset needs to be analyzed and cleaned.
The next step is to choose or define a model that suits the dataset well. Finally, training the model on the dataset can begin. At the end of this process, the model will be complete and can be used to make the prediction(s) needed to solve the original business problem.
What we have just described is often referred to as the machine learning development cycle. However, calling it a ‘cycle’ seems at odds with the linear sequence of steps described above.
What is lacking is some sort of feedback, gathered after running the model, which allows optimization, and enables the whole solution to adapt to unexpected situations.
Part of this optimization can occur in the design process, by testing the model and carefully tuning hyperparameters until it performs more stably and generalizes well, but even this would be insufficient.
What is missing here is a process that continuously monitors new data. In any production environment, such an input data stream is far from stable. A plethora of possible issues and anomalies may affect the input, and consequently, the model’s performance. Let’s consider a simple example that reflects this issue.
Imagine a business process that needs to analyze the receipts from bank transfers. We can assume that such documents are all formatted in the same way, with a standardized layout. For some unexpected reason, the issuing institution might decide to change their receipt format.
Such a choice would imply a change within the input data stream, with a concrete risk that the previously trained model might perform worse than before. The whole business process would suddenly become ineffective, with all the consequences that follow on from this.
Such changes occur more frequently than you might imagine, especially in production environments, and result in what is known as population shift. Monitoring a model’s performance is the only way to identify such an issue, so it’s easy to understand how important a tool that helps to detect it could be.
This is where popmon steps in to help.
popmon: population shift analysis in Python
As already mentioned, popmon is an open-source Python module that allows users to check the stability of a dataset. In particular, it detects the occurrence of population shift by analyzing data frames in which each record carries a timestamp.
Going into a little more detail, popmon creates histograms of features binned in time-slices, and compares the stability of the profiles and distributions of those histograms, over time and with respect to a reference, by using a set of statistical tests.
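The idea of time-sliced histograms compared against a reference can be illustrated with a minimal, standard-library-only sketch. This is not popmon's implementation: the function names, the weekly slicing, the fixed bin width and the sample data are all assumptions made for illustration, and the total variation distance stands in for the statistical tests that popmon actually applies.

```python
from collections import Counter, defaultdict
from datetime import date


def time_sliced_histograms(records, bin_width):
    """Group (day, value) records by ISO week and histogram the values."""
    slices = defaultdict(Counter)
    for day, value in records:
        year, week, _ = day.isocalendar()
        bin_index = int(value // bin_width)  # fixed-width bins (illustrative choice)
        slices[(year, week)][bin_index] += 1
    return dict(slices)


def normalize(hist):
    """Turn bin counts into fractions that sum to one."""
    total = sum(hist.values())
    return {b: count / total for b, count in hist.items()}


def total_variation(hist_a, hist_b):
    """Half the L1 distance between two normalized histograms (0 = identical, 1 = disjoint)."""
    p, q = normalize(hist_a), normalize(hist_b)
    bins = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in bins)


# Hypothetical data: two stable weeks followed by a week with shifted values.
records = [
    (date(2024, 1, 1), 21), (date(2024, 1, 2), 25), (date(2024, 1, 3), 34), (date(2024, 1, 4), 38),
    (date(2024, 1, 8), 22), (date(2024, 1, 9), 27), (date(2024, 1, 10), 33), (date(2024, 1, 11), 36),
    (date(2024, 1, 15), 61), (date(2024, 1, 16), 65), (date(2024, 1, 17), 72), (date(2024, 1, 18), 78),
]

histograms = time_sliced_histograms(records, bin_width=10)
reference = histograms[(2024, 1)]  # the first week serves as the reference
distances = {week: total_variation(hist, reference) for week, hist in sorted(histograms.items())}
```

In this toy dataset the second week has the same profile as the reference, while the third week falls entirely into different bins, which is the kind of change a population shift monitor is meant to surface.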
popmon works with numerical, ordinal and categorical features, and the histograms it builds can be multi-dimensional, including higher-dimensional combinations of features.
popmon can use this histogram-based representation to generate reports, or to automatically flag and alert users to changes observed over time, including shifts, peaks, outliers, anomalies and so on, using different business rules for monitoring.
This tool was developed by the ING WBAA team, and its source code is available on GitHub. popmon supports Python 3.6 and later, and can be installed from PyPI by running pip install popmon.
Alternatively, you can install it directly from source by cloning the GitHub repository and running pip install on the local checkout.
Once installed, popmon can be imported like any other Python module, and used as needed. The following code snippet demonstrates how to load a CSV file into a Pandas data frame, then use its content to generate a report:
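A minimal sketch of such a snippet is given here. It assumes the pm_stability_report method that popmon adds to Pandas data frames on import, and the report's to_file method, as described in the project's README. The function name build_stability_report, its parameters and the default output file name are hypothetical.

```python
def build_stability_report(csv_path, time_column, output_path="report.html"):
    """Load a CSV file and write a popmon stability report to output_path."""
    # Imported lazily so that this helper can be defined even where the
    # optional dependencies (pandas, popmon) are not installed.
    import pandas as pd
    import popmon  # noqa: F401  (importing popmon adds pm_stability_report to data frames)

    # Parse the time column so that popmon can bin the data into time slices.
    df = pd.read_csv(csv_path, parse_dates=[time_column])

    # Build the report over the columns of the data frame,
    # using time_column as the time axis.
    report = df.pm_stability_report(time_axis=time_column)

    # Save the report as a standalone HTML file.
    report.to_file(output_path)
    return report
```

Calling build_stability_report with the path to a CSV file and the name of its date column (both placeholders for your own data) would write the HTML report next to the script.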
The generated report includes the details of several statistical analyses computed on the dataset, from the histogram-based aggregation to the many statistical tests used.
The statistics provide insights into how the data changes over time, but some are more interesting to look at than others. The report also provides a traffic light visualization that gives users a very quick overview of the variables most likely to be affected by instability.
The following image (taken from popmon’s GitHub repository) shows an example of visualization within a popmon report:
Benefits of popmon
There are several features that make popmon a great tool for data monitoring. The first, and most important, thing to say is that this tool is probably one of a kind; to the best of our knowledge there is no single alternative tool that implements all the statistical tests used by popmon, along with its histogram-based internal representation.
As Max Baak, data science lead at ING WBAA, former CERN researcher, and one of the most active popmon contributors on GitHub, explains, before popmon there were “no good open-source packages that allowed us to monitor input data and predictions for such shifts in a straightforward, automated way.”
For this reason, the features offered by popmon can be extremely useful to anyone who needs to monitor data over time – and everyone working on ML should do this!
Apart from this, there are several other benefits of popmon to consider.
The first of these benefits is the previously-mentioned histogram-based internal data representation. This representation is memory-efficient, as histograms are typically much smaller than the original data. As explained above, it is also what allows the various statistical tests to be run on the data.
Moreover, this design decision also addresses a big issue when dealing with data: privacy. Using histograms is a form of data aggregation, which removes identifiable information from entries, therefore easing the process of data storage. This is a major plus when dealing with (and storing) sensitive information.
Support for Pandas and Apache Spark
It is also worth mentioning that Pandas is not the only module supported by popmon, even though, in the previous snippet, you will see that CSV data are loaded into a Pandas data frame.
Although Pandas is probably the most widely supported technology for dealing with data frames, popmon is not limited to this and also supports Apache Spark data frames.
Apart from imports and data loading, the rest of the popmon API remains almost identical, even if the user opts for Apache Spark. It is worth noting that, via these backends, any prominent data source format is supported, from CSV, JSON and Excel to Hive, Parquet, HDF5 and Apache Avro.
Dynamic Data Quality Boundaries
A final important feature of popmon is its data quality boundaries. popmon uses a traffic light system to indicate where large deviations from a certain reference have occurred.
The module allows the user to define a set of static thresholds that can automatically identify whether the degree of deviation is high (red light) or low (yellow light), or if there are no significant deviations (green light).
Setting static thresholds automatically is hard, and doing so manually for many datasets is often not feasible.
This concept can also be extended by defining dynamic thresholds that are computed automatically, based on mean and variance, through a normalized deviation known as the ‘pull’ or Z-score. This reduces the need for additional parameters. This idea is depicted in the following image:
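The pull-based idea can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than popmon's own code: the pull is taken here as the deviation of a new value from the mean of a reference period, divided by the reference standard deviation, and the yellow and red boundaries, function names and sample numbers are hypothetical.

```python
from statistics import mean, pstdev


def pull(value, reference_values):
    """Normalized deviation (Z-score) of value from the reference mean."""
    mu = mean(reference_values)
    sigma = pstdev(reference_values)
    if sigma == 0:
        raise ValueError("reference has zero variance; pull is undefined")
    return (value - mu) / sigma


def traffic_light(z, yellow=4.0, red=7.0):
    """Map a pull to a colour; the boundaries are illustrative assumptions."""
    if abs(z) >= red:
        return "red"
    if abs(z) >= yellow:
        return "yellow"
    return "green"


# Hypothetical weekly means of one feature during a stable reference period.
reference_means = [50.0, 52.0, 48.0, 51.0, 49.0]

# Hypothetical weekly means observed afterwards.
new_means = {"week 6": 50.5, "week 7": 57.0, "week 8": 65.0}

flags = {week: traffic_light(pull(value, reference_means)) for week, value in new_means.items()}
```

Because the boundaries are expressed in units of the reference's own spread, the same two numbers can be reused across features with very different scales, which is what removes the need to tune a static threshold per dataset.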
Use cases
How popmon works and the benefits it offers should now be clear. We now turn to two typical use cases, drawn from the real-life scenarios that pushed the ING WBAA team to develop popmon and to use it in their everyday activities, including within production environments.
Using popmon to monitor machine learning performance
The first class of users that might benefit from using popmon includes data scientists and machine learning engineers. For both groups, the use of popmon is primarily focused on the monitoring stage of the machine learning development lifecycle.
When a model is developed, it is essential to continuously evaluate how predictions on the same input data stream vary over time. Given the same model, slight variations might be due to prediction noise; however, more significant trend shifts or unexpected peaks might actually represent a warning signal.
In these cases, it is a good practice to investigate what’s going on, and what exactly is affecting the predictions.
popmon is used by ING to alert developers and data scientists when a population shift is detected. Every time new data comes in, popmon can be used to evaluate changes in data quality and, where necessary, to trigger real-time alerts.
When population shifts are detected, the same models that have already been trained with historical data can be enhanced with fresh training data, or completely retrained to adapt their behaviour to new data patterns.
Using popmon for data exploration
Another use of popmon is particularly interesting for data analysts – specifically in relation to the initial stage of dataset analysis, even before data preparation and data cleaning activities take place.
Thanks to the histogram-based internal representation used, as well as the many statistical tests that it implements, popmon is a great tool for data exploration, allowing in-depth investigation of possible data patterns and trends, as well as outliers and seasonality.
From this point of view, popmon might beneficially be incorporated into data ingestion pipelines to monitor incoming data in order to prevent drops in performance due to poor data quality.
Tomas Šostak, data scientist with ING WBAA and another principal contributor to popmon, summarized the situation, saying: “popmon could be very useful to anyone doing serious monitoring of ML models running in production, as well as simple exploratory data analysis.”
An additional point that should not be underestimated: popmon is easy to use and to integrate into other data-driven applications. With a few lines of code, popmon can be configured to meet user needs – not just to generate reports as shown in the snippet above, but also to create alerts for data scientists to speed up retraining activities, rendering the whole system more reactive to new data.
If you are interested in learning more about popmon, you can find more information on GitHub, including code examples, tutorials and videos.