Working with Date Intervals in Data Warehouses and Data Lakes

Code examples and tips for working with date intervals in data warehouses and data lakes. Get it done right with this complete guide!

September 6, 2022 by Nicola Di Santo

Working as a Data Engineer means working with dates and time data a lot, especially these days when companies want to be data-driven, software is event-driven, your coffee machine is data-driven, and AI and ML require tons of data to work. In this article, I will share my experience working with date intervals in data warehouses and data lakes, hoping to make your queries a little bit easier.

Motivation

Very often, with modern software architectures like event-driven systems, or with more traditional data warehouse and data lake modeling techniques, we end up with tables where each record has a validity period (i.e. a start date and an end date defining the interval in which the record's information is true in the system).

In a data warehouse and data lake context, modeling data using some type of slowly changing dimension (SCD) is quite a popular choice to track your data history. For example, suppose you have client registry data: each time a client changes their shipping address, you set the end_timestamp of the previous record and the start_timestamp of the new record to the timestamp of the modification, inside address_tbl:

client_id | shipping_address                          | start_timestamp     | end_timestamp
client_1  | 383 Trusel Drive, North Andover, MA 01845 | 2015-01-14 00:00:00 | 2017-05-11 10:00:00
client_1  | 91 Gates St. Valrico, FL 33594            | 2017-05-11 10:00:00 | 2999-01-01 23:59:59
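
If you want to follow along, here is a minimal sketch of address_tbl with the two rows above. The table and column names come from this article; the DDL itself is illustrative and assumes a PostgreSQL-like dialect.

-- Illustrative schema and sample data, assuming a PostgreSQL-like dialect
CREATE TABLE address_tbl (
    client_id        VARCHAR(50),
    shipping_address VARCHAR(200),
    start_timestamp  TIMESTAMP,
    end_timestamp    TIMESTAMP
);

INSERT INTO address_tbl VALUES
    ('client_1', '383 Trusel Drive, North Andover, MA 01845', '2015-01-14 00:00:00', '2017-05-11 10:00:00'),
    ('client_1', '91 Gates St. Valrico, FL 33594',            '2017-05-11 10:00:00', '2999-01-01 23:59:59');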

Then you know that client_1's current shipping_address is the one satisfying the condition:

SELECT *
FROM address_tbl
WHERE client_id = 'client_1'
  AND address_tbl.start_timestamp <= current_timestamp()
  AND current_timestamp() < address_tbl.end_timestamp

In the case of an event-driven system, you have the same pieces of information represented as events in address_evt_tbl:

client_id | shipping_address                          | event_timestamp
client_1  | 383 Trusel Drive, North Andover, MA 01845 | 2015-01-14 00:00:00
client_1  | 91 Gates St. Valrico, FL 33594            | 2017-05-11 10:00:00

Then, when you use that data for analytics or modeling in general, you end up transforming the event table into a table with validity intervals, with a query like:

WITH address_tbl AS (
    SELECT client_id,
           shipping_address,
           event_timestamp AS start_timestamp,
           COALESCE(LEAD(event_timestamp) OVER (PARTITION BY client_id ORDER BY event_timestamp),
                    '2999-01-01 23:59:59') AS end_timestamp
    FROM address_evt_tbl
)
-- As before, client_1's current shipping_address is the one satisfying the condition
SELECT *
FROM address_tbl
WHERE client_id = 'client_1'
  AND address_tbl.start_timestamp <= current_timestamp()
  AND current_timestamp() < address_tbl.end_timestamp

Now suppose you have another table called fav_payment_method that stores each client's favorite payment method:

client_id | start_timestamp     | end_timestamp       | payment_method
client_1  | 2015-01-01 00:00:00 | 2016-01-01 00:00:00 | at_delivery_time
client_1  | 2016-01-01 00:00:00 | 2018-01-01 10:00:00 | paypal
client_1  | 2018-01-01 10:00:00 | 2019-01-01 23:00:00 | credit card
client_1  | 2019-01-01 23:00:00 | 2999-01-01 23:59:59 | paypal
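
If you are following along, here is a matching sketch for fav_payment_method, again illustrative and under the same assumed PostgreSQL-like dialect as before.

-- Illustrative schema and sample data for fav_payment_method
CREATE TABLE fav_payment_method (
    client_id       VARCHAR(50),
    start_timestamp TIMESTAMP,
    end_timestamp   TIMESTAMP,
    payment_method  VARCHAR(50)
);

INSERT INTO fav_payment_method VALUES
    ('client_1', '2015-01-01 00:00:00', '2016-01-01 00:00:00', 'at_delivery_time'),
    ('client_1', '2016-01-01 00:00:00', '2018-01-01 10:00:00', 'paypal'),
    ('client_1', '2018-01-01 10:00:00', '2019-01-01 23:00:00', 'credit card'),
    ('client_1', '2019-01-01 23:00:00', '2999-01-01 23:59:59', 'paypal');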

And, don’t ask your boss why, but you have to associate each client’s shipping address with their favorite payment method, and track how both evolve over time. What is the best way to perform this operation in SQL, or in frameworks with a SQL-like interface?

Naive solution

The first idea we have is to enumerate all the possibilities. For example, we might write:

SELECT *
FROM address_tbl at
INNER JOIN fav_payment_method fpm
    ON at.client_id = fpm.client_id
WHERE (at.start_timestamp >= fpm.start_timestamp AND at.start_timestamp < fpm.end_timestamp) OR  -- (1)
      (at.end_timestamp >= fpm.start_timestamp AND at.end_timestamp < fpm.end_timestamp) OR      -- (2)
      (fpm.start_timestamp >= at.start_timestamp AND fpm.start_timestamp < at.end_timestamp) OR  -- (3)
      (fpm.end_timestamp >= at.start_timestamp AND fpm.end_timestamp < at.end_timestamp)         -- (4)

That can be represented graphically with the following segments (address_tbl in red and fav_payment_method in blue):

[Figure: the validity intervals of address_tbl (red) and fav_payment_method (blue) drawn as segments on a timeline]

We can see that the query has a lot of conditions to match the records while taking the validity periods into account, and we are not even sure it covers all the cases. Not to mention the case where you inherit the code from a colleague who has left the company and you have to interpret it…

Second try

But we can indeed be less naive than this. It turns out the problem is a 1D segment intersection problem: the segments are your date ranges, and you only need to take the periods that intersect. But how can we write that? And can we do it simply, maybe without enumerating all the possibilities? Well, we can say that two segments intersect if and only if it is not true that they do not intersect, i.e. not(not intersect). Ah-ah, simple, right?

[Figure: two date segments on a timeline, showing the cases in which they do not intersect]

In the image we can see there isn’t an intersection when:

at.start_timestamp > fpm.end_timestamp or at.end_timestamp < fpm.start_timestamp

This also means that by negating this condition we obtain not(not intersect), i.e. the condition under which the two segments do intersect:

at.start_timestamp <= fpm.end_timestamp and at.end_timestamp >= fpm.start_timestamp

And we can finally write a simple, clean and concise query that does exactly what we wanted:

SELECT *
FROM address_tbl at
INNER JOIN fav_payment_method fpm
    ON at.client_id = fpm.client_id
WHERE at.start_timestamp <= fpm.end_timestamp
  AND at.end_timestamp >= fpm.start_timestamp

fpm.client_id | fpm.start_timestamp | fpm.end_timestamp   | fpm.payment_method | at.client_id | at.shipping_address                       | at.start_timestamp  | at.end_timestamp
client_1      | 2015-01-01 00:00:00 | 2016-01-01 00:00:00 | at_delivery_time   | client_1     | 383 Trusel Drive, North Andover, MA 01845 | 2015-01-14 00:00:00 | 2017-05-11 10:00:00
client_1      | 2016-01-01 00:00:00 | 2018-01-01 10:00:00 | paypal             | client_1     | 383 Trusel Drive, North Andover, MA 01845 | 2015-01-14 00:00:00 | 2017-05-11 10:00:00
client_1      | 2016-01-01 00:00:00 | 2018-01-01 10:00:00 | paypal             | client_1     | 91 Gates St. Valrico, FL 33594            | 2017-05-11 10:00:00 | 2999-01-01 23:59:59
client_1      | 2018-01-01 10:00:00 | 2019-01-01 23:00:00 | credit card        | client_1     | 91 Gates St. Valrico, FL 33594            | 2017-05-11 10:00:00 | 2999-01-01 23:59:59
client_1      | 2019-01-01 23:00:00 | 2999-01-01 23:59:59 | paypal             | client_1     | 91 Gates St. Valrico, FL 33594            | 2017-05-11 10:00:00 | 2999-01-01 23:59:59

Display them nicely

So far so good, but the final table is a mess: there are four timestamps. What should we do next? We should go one step further and produce unique, continuous, and coherent validity intervals. But how?

[Figure: the address and payment-method segments over time, with the intersection of each overlapping pair]

Looking at the picture, we can compute the intersection of the date segments to get the new validity period of each record. The intersection of two 1D segments is:

intersection_start_timestamp = max(at.start_timestamp, fpm.start_timestamp)
intersection_end_timestamp   = min(at.end_timestamp, fpm.end_timestamp)

And the final result is:

SELECT fpm.client_id,
       GREATEST(at.start_timestamp, fpm.start_timestamp) AS start_timestamp,
       LEAST(at.end_timestamp, fpm.end_timestamp)        AS end_timestamp,
       fpm.payment_method,
       at.shipping_address
FROM address_tbl at
INNER JOIN fav_payment_method fpm
    ON at.client_id = fpm.client_id
WHERE at.start_timestamp <= fpm.end_timestamp
  AND at.end_timestamp >= fpm.start_timestamp

client_id | start_timestamp     | end_timestamp       | payment_method   | shipping_address
client_1  | 2015-01-14 00:00:00 | 2016-01-01 00:00:00 | at_delivery_time | 383 Trusel Drive, North Andover, MA 01845
client_1  | 2016-01-01 00:00:00 | 2017-05-11 10:00:00 | paypal           | 383 Trusel Drive, North Andover, MA 01845
client_1  | 2017-05-11 10:00:00 | 2018-01-01 10:00:00 | paypal           | 91 Gates St. Valrico, FL 33594
client_1  | 2018-01-01 10:00:00 | 2019-01-01 23:00:00 | credit card      | 91 Gates St. Valrico, FL 33594
client_1  | 2019-01-01 23:00:00 | 2999-01-01 23:59:59 | paypal           | 91 Gates St. Valrico, FL 33594
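
A small portability note: GREATEST and LEAST are available in most engines (PostgreSQL, MySQL, Oracle, Spark SQL, BigQuery), but not in every dialect (SQL Server only added them recently). If yours lacks them, plain CASE expressions give the same intersection; here is a sketch equivalent to the query above:

-- Same intersection logic as above, with CASE instead of GREATEST/LEAST
SELECT fpm.client_id,
       CASE WHEN at.start_timestamp > fpm.start_timestamp THEN at.start_timestamp ELSE fpm.start_timestamp END AS start_timestamp,
       CASE WHEN at.end_timestamp   < fpm.end_timestamp   THEN at.end_timestamp   ELSE fpm.end_timestamp   END AS end_timestamp,
       fpm.payment_method,
       at.shipping_address
FROM address_tbl at
INNER JOIN fav_payment_method fpm
    ON at.client_id = fpm.client_id
WHERE at.start_timestamp <= fpm.end_timestamp
  AND at.end_timestamp >= fpm.start_timestamp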

Note that, since at the end you obtain a table with only one start_timestamp and one end_timestamp, you can iterate this approach indefinitely and obtain a snapshot for each version of your data (see the sketch below).
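
For example, if you also had a third table with the same start_timestamp/end_timestamp columns, say a (hypothetical) loyalty_tier_tbl, and you saved the result above as a (hypothetical) address_payment_tbl, you could chain exactly the same join pattern:

-- loyalty_tier_tbl and address_payment_tbl are hypothetical names, used only to illustrate chaining the pattern
SELECT apt.client_id,
       GREATEST(apt.start_timestamp, lt.start_timestamp) AS start_timestamp,
       LEAST(apt.end_timestamp, lt.end_timestamp)        AS end_timestamp,
       apt.payment_method,
       apt.shipping_address,
       lt.loyalty_tier
FROM address_payment_tbl apt
INNER JOIN loyalty_tier_tbl lt
    ON apt.client_id = lt.client_id
WHERE apt.start_timestamp <= lt.end_timestamp
  AND apt.end_timestamp >= lt.start_timestamp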
Now you won’t suffer that much when working with date intervals in data warehouses and data lakes. Have a good query!
