This article by Codemotion and Deloitte shares insights about the characteristics and benefits of Data Lakehouses, a combination of Data Lakes and Data Warehouses.
An introduction to Data Lakes
[note: although “data” is technically a plural noun, it is used in this article, as is standard practice in the field, as a collective, singular noun]
Data Lakes are popular because of their ability to store large volumes of data cheaply and easily. A data lake is a large storage repository for data of all types, where the data is stored in its natural form, without any pre-processing or transformation. The main advantages of data lakes are that they enable organizations to store and access a much wider range of data types than traditional data warehouses, and can be used for data discovery and analytics.
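To make this concrete, here is a minimal sketch of what landing raw data in a lake can look like, using Python and boto3 against an object store. The bucket name, key layout, and event fields are hypothetical; the point is that the payload is stored exactly as it arrives.

```python
# Minimal sketch: landing a raw event in an object-store data lake "as is".
# Assumes boto3 is installed and AWS credentials are configured; the bucket
# name, key layout, and event fields are hypothetical.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

raw_event = {
    "user_id": 42,
    "action": "page_view",
    "ts": datetime.now(timezone.utc).isoformat(),
}

# No schema enforcement and no transformation: the payload is stored exactly
# as it arrived, organized only by an ingestion-date prefix in the key.
key = f"raw/events/dt={datetime.now(timezone.utc):%Y-%m-%d}/event-0001.json"
s3.put_object(
    Bucket="my-data-lake",  # hypothetical bucket
    Key=key,
    Body=json.dumps(raw_event).encode("utf-8"),
)
```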
However, data lakes also have some disadvantages:
- Data quality. Because data is aggregated from multiple sources and stored in its raw form, quality is often poor, which can lead to inaccurate analysis and decision-making.
- Real-time operations. Data lakes are designed for storing and analyzing data over long periods of time, so they are not well-suited to real-time workloads. This slows down business intelligence processes precisely where data-driven decisions need to be made as quickly as possible.
- Performance. Data lakes can be slow and cumbersome to use, leading to frustration and decreased productivity.
- Costs and lock-in. Data lakes can be expensive to set up and maintain, and the data can be difficult to export or share with other systems, which leads to lock-in and decreased flexibility.
Data Lake vs. Data Lakehouse: what’s the difference?
To overcome these limitations, many companies are turning to the Data Lakehouse model. A Data Lakehouse is a more structured and actively managed environment built on top of a data lake, with features that make the data easier to use and to get value from. The Lakehouse model extends the Data Lake concept and addresses several of its traditional limitations.
A Data Lakehouse is a specific type of data lake designed for real-time analysis and operations. Data Lakehouses are typically faster and more scalable than traditional data lakes, and have built-in features that support real-time ingestion and analysis, such as support for streaming and time-series data.
Lakehouses are built on a foundation of low-cost big data storage, which enables companies to create effective Data Pipelines with state-of-the-art performance. Lakehouses also include features that are essential for managing data at scale, such as:
- ACID transactions for reliable data processing
- A global namespace for managing data across multiple data stores
- A data catalog to help find and understand data
- Data quality and governance features to ensure that data is cleansed and standardized before use
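As a small illustration of the catalog idea from the list above, the following sketch uses Spark’s built-in catalog API to register and discover a lake table. It assumes a Spark session backed by a Hive-compatible metastore; the database, table, and path names are hypothetical.

```python
# Sketch: using the Spark catalog as a lightweight "find and understand" layer
# over data that already sits in the lake. Assumes a Spark session backed by a
# Hive-compatible metastore; database, table, and path names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("catalog-demo")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS sales")

# Register an existing Parquet dataset so users can discover it by name
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders_raw (
        order_id STRING,
        amount   DOUBLE,
        order_ts TIMESTAMP
    )
    USING parquet
    LOCATION 's3a://my-data-lake/raw/orders/'
""")

# Discovery: list the tables and inspect a schema without reading any data
for table in spark.catalog.listTables("sales"):
    print(table.name, table.tableType)

spark.sql("DESCRIBE TABLE EXTENDED sales.orders_raw").show(truncate=False)
```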
ACID transactions
To understand why ACID transactions are necessary in Data Lakehouses, it’s important to understand what ACID transactions are: a set of properties that guarantee that transactions are Atomic, Consistent, Isolated, and Durable. This means that when a transaction is executed, it is completed as a single unit and the data is left in a consistent state. Any inconsistency that may occur during the transaction is isolated from other transactions. The data is also durable, meaning that it is preserved even in the event of a system failure. In particular:
- Atomicity: a transaction either completes in full or has no effect at all; no one ever sees a partially applied change, and the result becomes visible to all interested parties only once the transaction is complete.
- Consistency: the data in a transaction always satisfies the business rules that define it.
- Isolation: transactions are kept separate from each other, so that one transaction can’t interfere with another.
- Durability: once committed, the results of a transaction survive even if the power goes out or the system crashes.
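To see what these guarantees mean in practice on a lakehouse table, here is a minimal sketch using Delta Lake (one of several ACID table formats, discussed later in this article) from PySpark. It assumes a Spark session configured with the delta-spark package; the path and data are hypothetical.

```python
# Sketch: what ACID guarantees look like on a lakehouse table, using Delta
# Lake as one example of an ACID table format. Assumes a Spark session with
# the delta-spark package available; the path and data are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("acid-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://my-data-lake/lakehouse/customers"  # hypothetical location

# Atomic and durable: the overwrite below is committed as a single
# transaction. Concurrent readers see either the previous table version or
# the new one, never a half-written mix of files.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save(path)

# Because every committed version is durable, older versions stay readable
# ("time travel"); an aborted write simply never becomes a version.
previous = spark.read.format("delta").option("versionAsOf", 0).load(path)
previous.show()
```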
So, how can Lakehouses help companies build data pipelines that support business decisions? In several ways:
- By providing a foundation of low-cost big data storage, Lakehouses make it possible to build data pipelines that are both high-performance and low-cost.
- The global namespace feature helps manage data across multiple data stores, making it easy to keep data in sync.
- The data catalog provides a single source of truth for understanding data, making it easy to find and use.
- The data quality and governance features help ensure that data is cleansed and standardized before use, so that users can be sure it meets their business requirements.
How to Build Enhanced Data Pipelines
Data lakes provide a single repository for all data, which is essential for data-driven organizations. Ingestion pipelines are a key part of data lake infrastructure, and must be designed for scale, throughput, and reliability. In particular, this article focuses on “enhanced data pipelines”: pipelines that have been extended with features such as real-time ingestion.
The purpose of an enhanced data pipeline is to improve the performance and efficiency of the overall data flow. In particular, an enhanced data pipeline can help to improve the following (a short sketch follows the list):
1. Performance: An enhanced data pipeline offers improved performance by reducing the time it takes to extract, cleanse, and transform the data.
2. Efficiency: An enhanced data pipeline improves efficiency by reducing the amount of storage required to store the data.
3. Scalability: An enhanced data pipeline improves scalability by allowing a pipeline to handle more data.
4. Flexibility: An enhanced data pipeline improves flexibility by allowing a pipeline to handle a variety of data formats and ingestion approaches.
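The following sketch shows the general shape of such a pipeline in PySpark: extract only the new records, cleanse and standardize them, then load them into a partitioned columnar table. The paths, column names, and watermark value are hypothetical; a real pipeline would read the last processed timestamp from a checkpoint or metadata store.

```python
# Sketch of an "enhanced" pipeline step in PySpark: incremental extract,
# cleanse/standardize, then load into a partitioned columnar table. Paths,
# column names, and the watermark value are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enhanced-pipeline").getOrCreate()

last_processed_ts = "2024-01-01T00:00:00"  # hypothetical watermark

# 1. Performance: extract only new records instead of re-reading everything
raw = (
    spark.read.json("s3a://my-data-lake/raw/orders/")
    .where(F.col("ingested_at") > F.lit(last_processed_ts))
)

# 2. Data quality: cleanse and standardize before anything downstream sees it
clean = (
    raw.dropDuplicates(["order_id"])
    .withColumn("amount", F.col("amount").cast("double"))
    .where(F.col("amount").isNotNull())
)

# 3. Efficiency and scalability: columnar format, partitioned for pruning
(
    clean.withColumn("order_date", F.to_date("order_ts"))
    .write.mode("append")
    .partitionBy("order_date")
    .parquet("s3a://my-data-lake/curated/orders/")
)
```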
How can companies build these ingestion pipelines?
Apache Hudi
One solution is Apache Hudi, an open-source framework originally developed at Uber in 2016 that helps with managing large datasets on distributed file systems and object stores. The framework provides native support for Atomicity, Consistency, Isolation, and Durability (ACID) transactions on your Data Lake. Designed for high throughput and reliability, Apache Hudi can handle large volumes of data, and it can ingest data from a variety of sources, including Apache Kafka, Amazon Kinesis, and Amazon S3. Hudi integrates with widely used processing engines, most notably Apache Spark and Apache Flink.
Under the hood, Hudi leverages the widely used Spark framework and supports two table types, “Copy on Write” and “Merge on Read”, summarized below and illustrated in a short sketch after the two lists.
Copy on Write
- Data is stored in columnar file format (Parquet)
- Each Write action creates a new version of files
- Most suitable for Read-heavy batch workloads as the latest version of the dataset is always available
Merge on Read
- Data is stored as a combination of columnar (Parquet) and row-based (Avro) storage files
- Row-based delta files are compacted and merged on a regular basis to build new versions of the target columnar files
- This storage type is better suited for Write-heavy streaming workloads
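As a rough illustration, the sketch below writes an upsert to a Hudi table with Spark and selects the table type through a write option. It assumes the hudi-spark bundle is on the Spark classpath; the table name, key fields, and paths are hypothetical.

```python
# Sketch: an upsert into a Hudi table from Spark, choosing the table type
# through a write option. Assumes the hudi-spark bundle is on the Spark
# classpath; the table name, key fields, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-write")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # COPY_ON_WRITE suits read-heavy batch workloads;
    # MERGE_ON_READ suits write-heavy streaming workloads.
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

updates = spark.read.parquet("s3a://my-data-lake/staging/orders/")

(
    updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://my-data-lake/lakehouse/orders/")
)
```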
One of Hudi’s most useful features is the choice of query modes available when reading a table: “Snapshot” (the latest committed view), “Incremental”, and “Point-in-time” queries are all possible.
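A sketch of what those query modes look like from Spark follows, reading the table written in the previous sketch; the option names are Hudi’s, while the paths and commit timestamps are placeholders.

```python
# Sketch: Hudi's main query modes from Spark, reading the table written in the
# previous sketch. Paths and commit timestamps are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-read").getOrCreate()
base_path = "s3a://my-data-lake/lakehouse/orders/"

# Snapshot query (the default): the latest committed view of the table
latest = spark.read.format("hudi").load(base_path)

# Incremental query: only records changed after a given commit time
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load(base_path)
)

# Point-in-time query: changes within a bounded range of commits
point_in_time = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .option("hoodie.datasource.read.end.instanttime", "20240131000000")
    .load(base_path)
)
```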
Alternative solutions
There are a number of alternative options for building similar real-time data pipelines. One popular option is Google Cloud Dataflow and its SDK, whose programming model was later donated to the Apache Software Foundation and lives on as Apache Beam. Dataflow is a managed data processing service that makes it easy to process data in parallel, and it can be used to process data from a variety of sources, including Apache Kafka, HDFS, and MongoDB.
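As an indicative example, the sketch below builds a small batch pipeline with the Apache Beam Python SDK. It runs locally on the DirectRunner; switching to the DataflowRunner with the usual Google Cloud options would run the same code on Dataflow. The file names and fields are hypothetical.

```python
# Sketch: a small batch pipeline with the Apache Beam Python SDK (the model
# behind the Dataflow SDK). Runs locally on the DirectRunner; file names and
# fields are hypothetical.
import json

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadRawEvents" >> beam.io.ReadFromText("raw_events.jsonl")
        | "ParseJson" >> beam.Map(json.loads)
        | "KeepValid" >> beam.Filter(lambda e: e.get("amount") is not None)
        | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], float(e["amount"])))
        | "SumPerUser" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("user_totals")
    )
```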
Another option is Delta Lake, another open-source project that provides features and performance similar to those available in Hudi.
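For comparison, here is a minimal sketch of an upsert (MERGE) on a Delta Lake table, roughly the counterpart of a Hudi upsert write. It assumes a Spark session already configured for Delta (delta-spark package); the paths and join key are hypothetical.

```python
# Sketch: an upsert (MERGE) on a Delta Lake table, roughly the counterpart of
# a Hudi upsert write. Assumes a Spark session already configured for Delta
# (delta-spark package); paths and the join key are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-merge").getOrCreate()

target = DeltaTable.forPath(spark, "s3a://my-data-lake/lakehouse/customers")
updates = spark.read.parquet("s3a://my-data-lake/staging/customers/")

# The whole merge commits atomically: matched rows are updated and new rows
# inserted in a single transaction, or nothing changes at all.
(
    target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```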
Best Practices for building Data Lakehouses
There are many best practices when developing Data Lakehouses. Here are a few of the most important ones:
1. Develop a data governance plan. Essential for any data lakehouse, this plan should define who has access to which data, who is responsible for maintaining the data, and how the data will be cleansed and standardized.
2. Create a data catalog. A key part of any data lakehouse, the catalog helps people find and understand the data stored in the data lake. The catalog should include information about the data, such as the source, format, and schema.
3. Choose the best Data Platform. A good Data Platform is essential for managing an effective Data Lakehouse. Many options exist, mostly cloud-native or hybrid, but it is also possible to build a Data Platform in an on-premises data center. The choice of solution must take into account specific concerns about Availability, Costs, Security, and Interoperability.
4. Cleanse and standardize the data. The quality of the data is one of the most important aspects of a Data Lakehouse. Data should be cleansed and standardized to ensure that it is accurate and trustworthy.
5. Use big data analytics tools. Tools are essential for analyzing the data in a data lakehouse. These should include features for data exploration, visualization, and machine learning.