This article by Codemotion and Deloitte shares insights about the characteristics and benefits of Data Lakehouses, a combination of Data Lakes and Data Warehouses.
An introduction to Data Lakes
[note: although “data” is technically a plural noun, it is used in this article, as is standard practice in the field, as a collective, singular noun]
Data Lakes are popular because of their ability to store large volumes of data cheaply and easily. A data lake is a large storage repository for data of all types, where the data is stored in its natural form, without any pre-processing or transformation. The main advantages of data lakes are that they enable organizations to store and access a much wider range of data types than traditional data warehouses, and can be used for data discovery and analytics.
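To make this concrete, here is a minimal sketch of what landing raw data in a lake can look like, using Python and boto3 against an object store. The bucket name, key layout, and event fields are hypothetical; the point is that the payload is stored exactly as it arrives.

```python
# Minimal sketch: landing a raw event in an object-store data lake "as is".
# Assumes boto3 is installed and AWS credentials are configured; the bucket
# name, key layout, and event fields are hypothetical.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

raw_event = {
    "user_id": 42,
    "action": "page_view",
    "ts": datetime.now(timezone.utc).isoformat(),
}

# No schema enforcement and no transformation: the payload is stored exactly
# as it arrived, organized only by an ingestion-date prefix in the key.
key = f"raw/events/dt={datetime.now(timezone.utc):%Y-%m-%d}/event-0001.json"
s3.put_object(
    Bucket="my-data-lake",  # hypothetical bucket
    Key=key,
    Body=json.dumps(raw_event).encode("utf-8"),
)
```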
However, data lakes also have some disadvantages:
- Data quality. Because data is aggregated from multiple sources and stored in its raw form, quality is often poor, which can lead to inaccurate analysis and decision-making.
- Real-time operations. Data lakes are designed for storing and analyzing data over long periods of time, so they are not well-suited to real-time workloads. This slows down business intelligence processes precisely where data-driven decisions need to be made as quickly as possible.
- Performance. Data lakes can be slow and cumbersome to use, leading to frustration and decreased productivity.
- Costs and lock-in. Data lakes can be expensive to set up and maintain, and the data can be difficult to export or share with other systems, which leads to lock-in and decreased flexibility.
Data Lake vs. Data Lakehouse: what’s the difference?
To overcome these limitations, many companies are turning to the Data Lakehouse model. A Data Lakehouse is a more structured and actively managed environment built on top of a data lake, with features that make the data easier to use and to get value from. The Lakehouse model extends the Data Lake concept and addresses several of its traditional limitations.
A Data Lakehouse is a specific type of data lake designed for real-time analysis and operations. Data Lakehouses are typically faster and more scalable than traditional data lakes, and have built-in features that support real-time ingestion and analysis, such as support for streaming and time-series data.
Lakehouses are built on a foundation of low-cost big data storage, which enables companies to create effective Data Pipelines with state-of-the-art performance. Lakehouses also include features that are essential for managing data at scale, such as:
- ACID transactions for reliable data processing
- A global namespace for managing data across multiple data stores
- A data catalog to help find and understand data
- Data quality and governance features to ensure that data is cleansed and standardized before use
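As a small illustration of the catalog idea from the list above, the following sketch uses Spark’s built-in catalog API to register and discover a lake table. It assumes a Spark session backed by a Hive-compatible metastore; the database, table, and path names are hypothetical.

```python
# Sketch: using the Spark catalog as a lightweight "find and understand" layer
# over data that already sits in the lake. Assumes a Spark session backed by a
# Hive-compatible metastore; database, table, and path names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("catalog-demo")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS sales")

# Register an existing Parquet dataset so users can discover it by name
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders_raw (
        order_id STRING,
        amount   DOUBLE,
        order_ts TIMESTAMP
    )
    USING parquet
    LOCATION 's3a://my-data-lake/raw/orders/'
""")

# Discovery: list the tables and inspect a schema without reading any data
for table in spark.catalog.listTables("sales"):
    print(table.name, table.tableType)

spark.sql("DESCRIBE TABLE EXTENDED sales.orders_raw").show(truncate=False)
```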
ACID transactions
To understand why ACID transactions are necessary in Data Lakehouses, it’s important to understand what ACID transactions are: a set of properties that guarantee that transactions are Atomic, Consistent, Isolated, and Durable. This means that when a transaction is executed, it is completed as a single unit and the data is left in a consistent state. Any inconsistency that may occur during the transaction is isolated from other transactions. The data is also durable, meaning that it is preserved even in the event of a system failure. In particular:
- Atomicity: a transaction either completes in full or has no effect at all; no one ever sees a partially applied change, and the result becomes visible to all interested parties only once the transaction is complete.
- Consistency: the data in a transaction always satisfies the business rules that define it.
- Isolation: transactions are kept separate from each other, so that one transaction can’t interfere with another.
- Durability: once committed, the results of a transaction survive even if the power goes out or the system crashes.
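To see what these guarantees mean in practice on a lakehouse table, here is a minimal sketch using Delta Lake (one of several ACID table formats, discussed later in this article) from PySpark. It assumes a Spark session configured with the delta-spark package; the path and data are hypothetical.

```python
# Sketch: what ACID guarantees look like on a lakehouse table, using Delta
# Lake as one example of an ACID table format. Assumes a Spark session with
# the delta-spark package available; the path and data are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("acid-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://my-data-lake/lakehouse/customers"  # hypothetical location

# Atomic and durable: the overwrite below is committed as a single
# transaction. Concurrent readers see either the previous table version or
# the new one, never a half-written mix of files.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save(path)

# Because every committed version is durable, older versions stay readable
# ("time travel"); an aborted write simply never becomes a version.
previous = spark.read.format("delta").option("versionAsOf", 0).load(path)
previous.show()
```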
So, how can Lakehouses help companies build data pipelines that support business decisions? In several ways:
- By providing a foundation of low-cost big data storage, Lakehouses make it possible to build data pipelines that are both high-performance and low-cost.
- The global namespace feature helps manage data across multiple data stores, making it easy to keep data in sync.
- The data catalog provides a single source of truth for understanding data, making it easy to find and use.
- The data quality and governance features help ensure that data is cleansed and standardized before use, so that users can be sure it meets their business requirements.
How to Build Enhanced Data Pipelines
Data lakes provide a single repository for all data, which is essential for data-driven organizations. Ingestion pipelines are a key part of data lake infrastructure, and must be designed for scale, throughput, and reliability. In particular, this article focuses on “enhanced data pipelines”: pipelines that have been extended with features such as real-time ingestion.
The purpose of an enhanced data pipeline is to improve the performance and efficiency of the overall data flow. In particular, an enhanced data pipeline can help to improve the following (a short sketch follows the list):
1. Performance: An enhanced data pipeline offers improved performance by reducing the time it takes to extract, cleanse, and transform the data.
2. Efficiency: An enhanced data pipeline improves efficiency by reducing the amount of storage required to store the data.
3. Scalability: An enhanced data pipeline improves scalability by allowing a pipeline to handle more data.
4. Flexibility: An enhanced data pipeline improves flexibility by allowing a pipeline to handle a variety of data formats and ingestion approaches.
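The following sketch shows the general shape of such a pipeline in PySpark: extract only the new records, cleanse and standardize them, then load them into a partitioned columnar table. The paths, column names, and watermark value are hypothetical; a real pipeline would read the last processed timestamp from a checkpoint or metadata store.

```python
# Sketch of an "enhanced" pipeline step in PySpark: incremental extract,
# cleanse/standardize, then load into a partitioned columnar table. Paths,
# column names, and the watermark value are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enhanced-pipeline").getOrCreate()

last_processed_ts = "2024-01-01T00:00:00"  # hypothetical watermark

# 1. Performance: extract only new records instead of re-reading everything
raw = (
    spark.read.json("s3a://my-data-lake/raw/orders/")
    .where(F.col("ingested_at") > F.lit(last_processed_ts))
)

# 2. Data quality: cleanse and standardize before anything downstream sees it
clean = (
    raw.dropDuplicates(["order_id"])
    .withColumn("amount", F.col("amount").cast("double"))
    .where(F.col("amount").isNotNull())
)

# 3. Efficiency and scalability: columnar format, partitioned for pruning
(
    clean.withColumn("order_date", F.to_date("order_ts"))
    .write.mode("append")
    .partitionBy("order_date")
    .parquet("s3a://my-data-lake/curated/orders/")
)
```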
How can companies build these ingestion pipelines?
Apache Hudi
One solution is Apache Hudi, an open-source framework originally developed at Uber in 2016 that helps with managing large datasets on distributed file systems and object stores. The framework provides native support for Atomicity, Consistency, Isolation, and Durability (ACID) transactions on your Data Lake. Designed for high throughput and reliability, Apache Hudi can handle large volumes of data, and it can ingest data from a variety of sources, including Apache Kafka, Amazon Kinesis, and Amazon S3. Hudi integrates with widely used processing engines, most notably Apache Spark and Apache Flink.
Under the hood, Hudi leverages the widely used Spark framework and supports two table types, “Copy on Write” and “Merge on Read”, summarized below and illustrated in a short sketch after the two lists.
Copy on Write
- Data is stored in columnar file format (Parquet)
- Each Write action creates a new version of files
- Most suitable for Read-heavy batch workloads as the latest version of the dataset is always available
Merge on Read
- Data is stored as a combination of columnar (Parquet) and row-based (Avro) storage files
- Row-based delta files are compacted and merged on a regular basis to build new versions of the target columnar files
- This storage type is better suited for Write-heavy streaming workloads
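As a rough illustration, the sketch below writes an upsert to a Hudi table with Spark and selects the table type through a write option. It assumes the hudi-spark bundle is on the Spark classpath; the table name, key fields, and paths are hypothetical.

```python
# Sketch: an upsert into a Hudi table from Spark, choosing the table type
# through a write option. Assumes the hudi-spark bundle is on the Spark
# classpath; the table name, key fields, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-write")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # COPY_ON_WRITE suits read-heavy batch workloads;
    # MERGE_ON_READ suits write-heavy streaming workloads.
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

updates = spark.read.parquet("s3a://my-data-lake/staging/orders/")

(
    updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://my-data-lake/lakehouse/orders/")
)
```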
One of Hudi’s most useful features is the choice of query modes available when reading a table: “Snapshot” (the latest committed view), “Incremental”, and “Point-in-time” queries are all possible.
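A sketch of what those query modes look like from Spark follows, reading the table written in the previous sketch; the option names are Hudi’s, while the paths and commit timestamps are placeholders.

```python
# Sketch: Hudi's main query modes from Spark, reading the table written in the
# previous sketch. Paths and commit timestamps are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-read").getOrCreate()
base_path = "s3a://my-data-lake/lakehouse/orders/"

# Snapshot query (the default): the latest committed view of the table
latest = spark.read.format("hudi").load(base_path)

# Incremental query: only records changed after a given commit time
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load(base_path)
)

# Point-in-time query: changes within a bounded range of commits
point_in_time = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .option("hoodie.datasource.read.end.instanttime", "20240131000000")
    .load(base_path)
)
```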
Alternative solutions
There are a number of alternative options for building similar real-time data pipelines. One popular option is Google Cloud Dataflow and its SDK, whose programming model was later donated to the Apache Software Foundation and lives on as Apache Beam. Dataflow is a managed data processing service that makes it easy to process data in parallel, and it can be used to process data from a variety of sources, including Apache Kafka, HDFS, and MongoDB.
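As an indicative example, the sketch below builds a small batch pipeline with the Apache Beam Python SDK. It runs locally on the DirectRunner; switching to the DataflowRunner with the usual Google Cloud options would run the same code on Dataflow. The file names and fields are hypothetical.

```python
# Sketch: a small batch pipeline with the Apache Beam Python SDK (the model
# behind the Dataflow SDK). Runs locally on the DirectRunner; file names and
# fields are hypothetical.
import json

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadRawEvents" >> beam.io.ReadFromText("raw_events.jsonl")
        | "ParseJson" >> beam.Map(json.loads)
        | "KeepValid" >> beam.Filter(lambda e: e.get("amount") is not None)
        | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], float(e["amount"])))
        | "SumPerUser" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("user_totals")
    )
```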
Another option is Delta Lake, another open-source project that provides features and performance similar to those available in Hudi.
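For comparison, here is a minimal sketch of an upsert (MERGE) on a Delta Lake table, roughly the counterpart of a Hudi upsert write. It assumes a Spark session already configured for Delta (delta-spark package); the paths and join key are hypothetical.

```python
# Sketch: an upsert (MERGE) on a Delta Lake table, roughly the counterpart of
# a Hudi upsert write. Assumes a Spark session already configured for Delta
# (delta-spark package); paths and the join key are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-merge").getOrCreate()

target = DeltaTable.forPath(spark, "s3a://my-data-lake/lakehouse/customers")
updates = spark.read.parquet("s3a://my-data-lake/staging/customers/")

# The whole merge commits atomically: matched rows are updated and new rows
# inserted in a single transaction, or nothing changes at all.
(
    target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```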
Best Practices for building Data Lakehouses
There are many best practices when developing Data Lakehouses. Here are a few of the most important ones:
1. Develop a data governance plan. Essential for any data lakehouse, this plan should define who has access to which data, who is responsible for maintaining the data, and how the data will be cleansed and standardized.
2. Create a data catalog. A key part of any data lakehouse, the catalog helps people find and understand the data stored in the data lake. The catalog should include information about the data, such as the source, format, and schema.
3. Choose the best Data Platform. A good Data Platform is essential for managing an effective Data Lakehouse. Many options exist, mostly cloud-native or hybrid, but it is also possible to build a Data Platform in an on-premises data center. The choice of solution must take into account specific concerns about Availability, Costs, Security, and Interoperability.
4. Cleanse and standardize the data. The quality of the data is one of the most important aspects of a Data Lakehouse. Data should be cleansed and standardized to ensure that it is accurate and trustworthy.
5. Use big data analytics tools. Tools are essential for analyzing the data in a data lakehouse. These should include features for data exploration, visualization, and machine learning.