Building a machine learning model is simple: you collect data, split it into training and test sets, train a model with an open-source framework, refine it, retest it, and hey presto!, magic in the making – right?
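On a laptop, that notebook-level version really can be just a few lines. Here is a minimal sketch, assuming scikit-learn and one of its built-in toy datasets (everything in it is illustrative):

```python
# The "easy" notebook workflow: collect data, split it, train, and retest.
# A minimal sketch using scikit-learn and a built-in toy dataset.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)                      # collect data
X_train, X_test, y_train, y_test = train_test_split(   # split into datasets
    X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100)       # open-source framework
model.fit(X_train, y_train)                            # train / refine
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))  # retest
```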
Well, if that’s your idea of ML magic, you may need to find a different jumping-off point – nearly 90% of AI/ML projects never make it to production.
In 2020, Chief Information Officers and IT teams rushed to build or expand their organizations’ digital capabilities to meet the surge in online demand caused by the pandemic. In 2021, established industry leaders continue to feel an urgent push to transition to digital as consumers demand improved online experiences.
Artificial intelligence (AI) and machine learning (ML) can massively speed up the time required to understand and adapt to customer needs, but only if the data is available to build and test models to start with.
Accelerating the pace of digital innovation and AI/ML demands a lot of data from core business systems and customer-facing applications. While most established companies have large volumes of data, slow data delivery often diminishes overall speed, as a recent survey shows.
A new survey by Pulse suggests that data management issues are a roadblock to AI/ML projects when the right data management and automation tools are not in place. While those initiatives remain in the early stages of development in many respects, more companies are looking to apply AI/ML to optimize operations, increase performance, and provide differentiated customer experiences.
AI/ML is a priority for the majority of IT executives today. Still, the sprawl of application data is a huge challenge, with data residing in many different places, including customer-facing applications (48%), ERP systems (19%), and financial applications (19%).
Survey findings show that four of the top five barriers to implementing AI/ML initiatives are ensuring data accuracy, securing data access, protecting personal and sensitive data, and the time it takes to refresh data in models. The two most frequently cited barriers are:
- Data accuracy, selected by 54% of respondents;
- Data access, selected by 44% of respondents.
It is reasonable to ask why most people have only the slightest knowledge of this kind of problem. Judging from what is presented at public and online events or in community records, in many cases there is no awareness of it at all.
This happens because real problems are not always discussed publicly, and that seems to be the case with go-to-production issues in AI/ML as a whole.
Production and data in the real world
The wall between operations staff and production staff still stands today. Most presentations given by experts are based on theoretical solutions and rely on university-level tools and theories.
Real-life problems are normally left out of such presentations: that level of detail hasn’t yet reached the everyday arena. In other words, these theoretical solutions don’t work in a real production environment.
This is what often happens with software implementations, and there is another problem: data. Many large companies currently have a poor data governance model for their data propagation supply chain.
Data has to be extracted from these systems every time it is needed, which is expensive. Even more problematic, the data is extracted using outdated methods, so it usually requires cleansing.
Data cleansing is crucial to many companies. Most insurance companies have collected huge amounts of annual data, the characteristics of which are changing so rapidly that two-year-old data no longer represent today’s business needs.
Programmable data infrastructure: the Delphix case study
Delphix provides a programmable data infrastructure to control, speed up and automate data distribution. The technology used allows this system to impact the business directly, without requiring a complex pilot project to start with.
To fully understand all the steps in Delphix’s value chain, you must first understand the use of each technical term. Data-related dictionaries are not yet standardized, so each company uses the same terms – and many similar phrases – for different purposes.
Sometimes the difference in meaning is small, while in other cases, the same word can mean something completely different. Misunderstanding is always just around the corner.
Delphix helps manage data propagation in its immutable form correctly at all levels: within the company, across all stakeholders, and externally with third parties. In a nutshell, Delphix takes care of the data element of a digital transformation project.
When you have to reinvent the wheel
Let’s look at a real case study, focusing on a system that is in production already: a worldwide manufacturing company in need of optimized distribution for their products, with a focus on last-mile distribution in particular.
As we will see, to find a workable solution sometimes you need to reinvent the wheel. This is the case for the company being discussed here.
The goods to be loaded onto the company’s trucks are perishable to some degree, and the buyer is kept updated with relevant commercial information.
The seller normally holds no stock, so once an order is placed, the package has to be delivered within a standard two-day window.
The company has a network of many warehouses. Every day a goods loading plan is required for all warehouses in the network. To make an effective load plan, a lot of accurate information needs to be acquired – and this may not always be the case.
The starting point is the customer order and its related information points: the type of item (SKU), the number of pieces, operational data (addresses, etc.), all taken from the company’s CRM software. Other data concerns the vehicles: which trucks are to be used this time? The in-use fleet changes daily.
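To make this concrete, here is a hypothetical sketch of the two daily input feeds the planner depends on; the field names are illustrative, not the company’s actual schema:

```python
# Hypothetical shape of the planner's two daily inputs (all fields illustrative).
from dataclasses import dataclass

@dataclass
class OrderLine:
    order_id: str
    sku: str              # type of item
    pieces: int           # number of pieces ordered
    address: str          # operational data taken from the CRM
    shelf_life_days: int  # the goods are perishable to some degree

@dataclass
class Truck:
    plate: str
    capacity_m3: float    # usable load volume
    in_use_today: bool    # the in-use fleet changes daily

orders = [OrderLine("A-1001", "SKU-42", 12, "Via Roma 1, Milano", 2)]
fleet = [Truck("AB123CD", 38.5, True)]
```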
From a still picture to animated data representation
The day starts early in the morning: all data should arrive before 6 AM so that everything is ready by 9 AM. Everything grinds to a halt if all the daily load plans are not in place before 9 AM.
So, data is loaded into the management application. This has two basic requirements:
- A predictive ability, accessed through AI
- The capacity to compute load plans that are operative by 9 AM
The TMS (Transportation Management System) manages all the parameters used to write the load plan. Each truck’s plan needs a list of destinations that takes route profitability into account: distances, routes to all delivery points, and many other parameters determine the list of items to be loaded and the order and position of each item within the load.
This process uses a classic LIFO procedure (Last In, First Out), where the last item loaded is the first to be delivered, and so on.
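Conceptually, the LIFO constraint simply reverses the delivery sequence when loading: items for the last stop go in first, and items for the first stop go in last. A toy sketch of that ordering (not the company’s TMS logic):

```python
# LIFO loading: given the planned delivery sequence, load items in reverse stop
# order so that the first delivery's items are the last ones loaded.
route = ["Stop A", "Stop B", "Stop C"]      # planned delivery sequence
items_by_stop = {
    "Stop A": ["pallet-7"],
    "Stop B": ["pallet-3", "pallet-4"],
    "Stop C": ["pallet-1"],
}

loading_order = []
for stop in reversed(route):                # last delivery is loaded first
    loading_order.extend(items_by_stop[stop])

print(loading_order)   # ['pallet-1', 'pallet-3', 'pallet-4', 'pallet-7']
```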
If the plan is wrong, inefficient, or does not adhere strictly to LIFO, the damage can be considerable. The driver wastes time unloading and reloading items and may be forced to come back later to serve some distributors if items due for delivery early on the route are loaded in the rear of the transport space.
The company needs to reduce an annual loss of US$15 million caused by poor load data. Poor data also affects the number of hours each driver works: GPS, a fatigue-monitoring camera pointed at the driver’s face, and other monitoring devices ensure that a driver’s time is strictly controlled.
Unfortunately, there are many data-related problems. And this is where Delphix plays an important role.
Delphix first addressed the data-related issues: consolidating the whole pool of collected data and feeding the algorithms consistently throughout the data life-cycle.
The company’s algorithms take the data and make predictions. The predictions are compared with historical data, including load correctness and any re-routing of individual trucks.
Load plans are computed in two phases: the Run phase and the Live phase. All data are processed in the Run phase, which outputs all the load plans. Traditionally, two problems have arisen:
- Every computing mistake required the software to be run again from scratch, shifting the whole time-frame and causing a huge financial loss;
- There was no way to detect which part of the data was incorrect, so responsibility for the inaccurate collection could not be established, and there was no chance of avoiding the problem going forward.
As you can see from point one, there is a great deal of work to be done in the Run phase (6 AM to 9 AM).
Delphix solved this problem by introducing data consistency points. Each consistency point is a reliable snapshot of a well-defined stage in the Run-phase processing.
If errors are found, the computation can be restarted from the latest consistency point, saving most of the work already done. This also allows more consistency points to be taken between 6 AM and 9 AM, so the average time wasted has been reduced enormously.
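Conceptually, a consistency point behaves like a checkpoint in a long-running pipeline: if a stage fails on bad data, the run resumes from the last completed stage instead of starting over. A rough sketch, using simple file-based checkpoints as a stand-in for the real mechanism:

```python
# Checkpointed pipeline: each completed stage persists its result, so a failed
# run can resume from the latest "consistency point" instead of from scratch.
import pickle
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")
CHECKPOINT_DIR.mkdir(exist_ok=True)

def run_with_checkpoints(stages, data):
    """stages: list of (name, function) applied in order to `data`."""
    for i, (name, stage) in enumerate(stages):
        ckpt = CHECKPOINT_DIR / f"{i:02d}_{name}.pkl"
        if ckpt.exists():                      # already computed in an earlier run
            data = pickle.loads(ckpt.read_bytes())
            continue
        data = stage(data)                     # may fail on incorrect input data
        ckpt.write_bytes(pickle.dumps(data))   # consistency point reached
    return data

stages = [("clean", lambda d: [x for x in d if x is not None]),
          ("plan", lambda d: sorted(d))]
print(run_with_checkpoints(stages, [3, None, 1]))   # -> [1, 3]
```

If the "plan" stage raised an error here, fixing the data and rerunning would skip the already-checkpointed "clean" stage.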
Data automation removes data roadblocks
Let’s look at the process in more detail. Automation is key to removing the manual data delivery, refresh, and security processes that block innovation. A programmable data infrastructure enables data to be automated and managed via APIs (a generic sketch follows the list below). The characteristics of a programmable data infrastructure include:
- API data access and refresh
- Automated discovery and masking of sensitive data for compliance risk mitigation
- Immutable data time machine for a continuous record of source data changes that delivers near real-time data, plus historical data
- Versioning of source and training data for concept drift analysis
- API-first approach to integrating data operations with AI/ML tools
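As a rough illustration of the API-first approach, a pipeline or CI job might drive data operations with calls along these lines. The endpoint, payload, and host are hypothetical placeholders, not the documented API of Delphix or any other product:

```python
# Hypothetical sketch of driving data operations through a REST API from a
# pipeline or CI job. Endpoints and payloads are placeholders, not a real API.
import requests

BASE_URL = "https://data-platform.example.com/api/v1"   # placeholder host
HEADERS = {"Authorization": "Bearer <token>"}           # placeholder credentials

def refresh_environment(env_name: str) -> None:
    """Request a fresh, masked copy of production data for `env_name`."""
    resp = requests.post(
        f"{BASE_URL}/environments/{env_name}/refresh",
        headers=HEADERS,
        json={"source": "production", "mask_sensitive": True},
        timeout=30,
    )
    resp.raise_for_status()

refresh_environment("ml-training")   # e.g. before a nightly model retraining run
```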
If the data is wrong, only part of the truck can be loaded, and there is a loss. With correct processing of the incoming transactions, as achieved with the Delphix approach, data versioning is fully recorded, so the incorrect source can easily be identified and the load optimized.
This case study shows how strongly ML uses are linked to very traditional business segments.
Many providers deliver data in inconvenient formats (CSV files exported from Excel, for example), which traditional batch processing systems cannot analyze.
Delphix’s approach allows all input data to be incorporated. An analysis can then go back in time through each snapshot to identify incorrect data and the relevant responsibility chains.
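In illustrative terms, finding the source of bad data becomes a walk through the ordered snapshots until a value first fails validation. The example below is a conceptual sketch, not the platform’s actual interface:

```python
# Walk ordered snapshots of a record to find when a bad value first appeared,
# which points to the responsibility chain for the incorrect data collection.
def first_bad_snapshot(snapshots, key, is_valid):
    """snapshots: list of (timestamp, dict) ordered from oldest to newest."""
    for ts, snap in snapshots:
        if key in snap and not is_valid(snap[key]):
            return ts                     # first snapshot with an invalid value
    return None

snapshots = [
    ("06:00", {"order-A-1001": 12}),
    ("07:00", {"order-A-1001": 12}),
    ("08:00", {"order-A-1001": -3}),      # a negative piece count slipped in
]
print(first_bad_snapshot(snapshots, "order-A-1001", lambda n: n > 0))   # 08:00
```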
Poor management results in a roadblock for data
Let’s look at a second example in which data quality looks perfect, but is not. One of the world’s top engineering firms in the oil and gas industry uses programmable data infrastructure (PDI) to deliver AI-driven insights and solutions across its global plant facilities.
The company’s goal is to boost risk management, operational efficiency and support real-time decision-making and execution.
With programmable data infrastructure, this billion-dollar business efficiently sources data spread across disparate systems and locations, spanning the globe from North and South America to Africa, the Middle East, and Asia.
Teams can effectively import training data and deploy machine learning models in the Cloud. PDI also allows the firm to continuously and efficiently deliver fresh data to a virtual database on a near real-time basis, creating a flexible approach to data and giving users access to primary data for AI that has been sourced from their most critical business systems.
Shadow Data Management
Enterprise data needs are increasingly urgent and varied: data for cloud migrations, to enable CI/CD, to train AI models, for analytics, to meet regulatory reporting requirements, and for forensics and production recovery.
However, data is often trapped in departmental and application silos and filled with privacy, compliance, and security risks.
Most companies struggle to manage thousands of repeated data operations through manual, cross-functional processes. Developers, data scientists, and SREs repeatedly request data provisioning, refresh, rollback, integration, masking, and replication.
On average, companies maintain seven downstream copies of a database for every production copy.
The cost, risk, and complexity of manual data operations is an incredible drag on a company’s transformation velocity.
The opportunity cost of data can be the difference between winning and losing market share in today’s competitive economy.
Programmable Data Infrastructure: Data Privacy, Compliance, and Security
We chose Delphix as a case study because they are industry leaders in programmable data infrastructure.
Delphix provides an API-first data platform that spans the multi-Cloud and supports all apps, from Cloud-native to legacy mainframes. It automates a range of critical, complex data operations, including compliance with privacy regulations.
One task that is straightforward to achieve with Delphix data technology is handling sensitive data values: finding and masking them for GDPR, CCPA, and HIPAA compliance, among others.
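The idea behind masking can be illustrated with a simple deterministic scheme: every sensitive value is replaced by a stable pseudonym, so masked copies remain internally consistent. This is a conceptual sketch of the technique, not Delphix’s masking implementation:

```python
# Deterministic masking sketch: the same input always maps to the same pseudonym,
# so relationships between masked records are preserved. Conceptual only.
import hashlib

def mask(value: str, salt: str = "per-project-secret") -> str:
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:10]
    return f"MASKED-{digest}"

record = {"name": "Mario Rossi", "email": "mario.rossi@example.com", "order_total": 148.20}
masked = {k: mask(v) if k in {"name", "email"} else v for k, v in record.items()}
print(masked)
```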
Data masking has never been more relevant. With breaches continuing to make headlines and the rise of challenging new data privacy regulations, businesses across all industries must manage their data with greater caution.
The GDPR mandates privacy by design. Nothing could better capture the current importance of design and architecture decisions for organizations that seek a holistic approach to managing and securing data across the entire enterprise.
Delphix can also integrate with DevOps tools such as Jenkins and Ansible Tower for automated data provisioning.
Conclusions
Data-related issues are widely misunderstood and undervalued in today’s models for effective digital transformation. The advantage delivered by AI-based analysis is often lost without a data cleansing strategy.
Basing data analysis on sequential snapshots of the data is a great way to align all data with a process of continuous improvement and refinement.
If your company needs a real Programmable Data Infrastructure that leverages data correctly to deliver the competitive advantage it deserves, consider making it part of the data management process in your ongoing digital transformation journey. The Delphix website is a valuable source of further information.