Codemotion Magazine

Nicola Di Santo, September 6, 2022

Working with Date Intervals in Data Warehouses and Data Lakes

Data Science

Working as a data engineer means working with dates and times a lot, especially these days, when companies want to be data-driven, software is event-driven, even your coffee machine is data-driven, and AI and ML require tons of data to work. In this article, I will share my experience working with date intervals in data warehouses and data lakes, hoping to make your queries a little bit easier.

Motivation

With modern software architectures like event-driven systems, as well as with more traditional data warehouse or data lake modeling techniques, we very often end up with tables where each record has a validity period (i.e. a start date and an end date defining the interval in which the record's information is true in the system).

In a data warehouse or data lake context, modeling data with a slowly changing dimension (SCD) is quite a popular choice to track your data history. For example, suppose you have a client registry, and each time a client changes their shipping address you set the end_timestamp of the previous record and the start_timestamp of the new record to the timestamp of the modification inside address_tbl:

client_id | shipping_address | start_timestamp | end_timestamp
client_1 | 383 Trusel Drive, North Andover, MA 01845 | 2015-01-14 00:00:00 | 2017-05-11 10:00:00
client_1 | 91 Gates St., Valrico, FL 33594 | 2017-05-11 10:00:00 | 2999-01-01 23:59:59

Then you know that client_1's current shipping_address is the one satisfying the condition:

SELECT *
FROM address_tbl
WHERE
 client_id = 'client_1' AND
 start_timestamp <= current_timestamp() AND
 current_timestamp() < end_timestamp
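To make the point-in-time lookup concrete, here is a minimal sketch using Python's built-in sqlite3 module, with the table populated from the example above (the table and column names follow the article; the `as_of` date is an arbitrary illustration):

```python
import sqlite3

# In-memory reconstruction of address_tbl from the article's example.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE address_tbl (
        client_id TEXT,
        shipping_address TEXT,
        start_timestamp TEXT,
        end_timestamp TEXT
    )
""")
con.executemany(
    "INSERT INTO address_tbl VALUES (?, ?, ?, ?)",
    [
        ("client_1", "383 Trusel Drive, North Andover, MA 01845",
         "2015-01-14 00:00:00", "2017-05-11 10:00:00"),
        ("client_1", "91 Gates St., Valrico, FL 33594",
         "2017-05-11 10:00:00", "2999-01-01 23:59:59"),
    ],
)

# "As of" lookup: the record whose validity interval contains the instant.
# ISO-formatted timestamps compare correctly as strings.
as_of = "2016-06-01 12:00:00"
row = con.execute(
    """
    SELECT shipping_address
    FROM address_tbl
    WHERE client_id = ?
      AND start_timestamp <= ?
      AND ? < end_timestamp
    """,
    ("client_1", as_of, as_of),
).fetchone()
print(row[0])  # the address that was valid on 2016-06-01
```

Note the half-open convention (start inclusive, end exclusive): it guarantees that exactly one record matches any instant, even at the boundary where one address ends and the next begins.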

In the case of an event-driven system, you have the same pieces of information represented as events in address_evt_tbl:

client_id | shipping_address | event_timestamp
client_1 | 383 Trusel Drive, North Andover, MA 01845 | 2015-01-14 00:00:00
client_1 | 91 Gates St., Valrico, FL 33594 | 2017-05-11 10:00:00

Then, when you use that data for analytics or modeling in general, you end up transforming the event table into a table with validity intervals with a query like:

WITH address_tbl AS (
 SELECT client_id,
  shipping_address,
  event_timestamp AS start_timestamp,
  COALESCE(
    LEAD(event_timestamp) OVER (PARTITION BY client_id ORDER BY event_timestamp),
    '2999-01-01 23:59:59'
  ) AS end_timestamp
 FROM address_evt_tbl
)
-- As before, client_1's current shipping_address is the one satisfying the condition
SELECT *
FROM address_tbl
WHERE
 client_id = 'client_1' AND
 start_timestamp <= current_timestamp() AND
 current_timestamp() < end_timestamp
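The LEAD-plus-COALESCE transformation can be tried out locally as well; here is a sketch using sqlite3 (window functions require SQLite 3.25+, bundled with any recent Python), with the event data from the example above:

```python
import sqlite3

# Event table from the article's example.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE address_evt_tbl (
        client_id TEXT, shipping_address TEXT, event_timestamp TEXT
    )
""")
con.executemany(
    "INSERT INTO address_evt_tbl VALUES (?, ?, ?)",
    [
        ("client_1", "383 Trusel Drive, North Andover, MA 01845",
         "2015-01-14 00:00:00"),
        ("client_1", "91 Gates St., Valrico, FL 33594",
         "2017-05-11 10:00:00"),
    ],
)

# Each event's end_timestamp is the next event's timestamp for the same
# client; the latest event gets the '2999-01-01' sentinel.
rows = con.execute("""
    SELECT client_id,
           shipping_address,
           event_timestamp AS start_timestamp,
           COALESCE(
               LEAD(event_timestamp) OVER (
                   PARTITION BY client_id ORDER BY event_timestamp
               ),
               '2999-01-01 23:59:59'
           ) AS end_timestamp
    FROM address_evt_tbl
    ORDER BY start_timestamp
""").fetchall()

for r in rows:
    print(r)
```

The sentinel far-future end date is a common convention for "still valid"; some teams prefer NULL instead, at the cost of extra IS NULL handling in every downstream query.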

Now suppose you have another table, fav_payment_method, recording each client's favorite payment method over time:

client_id | start_timestamp | end_timestamp | payment_method
client_1 | 2015-01-01 00:00:00 | 2016-01-01 00:00:00 | at_delivery_time
client_1 | 2016-01-01 00:00:00 | 2018-01-01 10:00:00 | paypal
client_1 | 2018-01-01 10:00:00 | 2019-01-01 23:00:00 | credit card
client_1 | 2019-01-01 23:00:00 | 2999-01-01 23:59:59 | paypal

And, don't ask your boss why, but you have to associate each client's shipping address with their favorite payment method and track how both evolve over time. What is the best way to perform this operation in SQL, or in frameworks with a SQL-like interface?

Naive solution

The first idea is to enumerate all the possible overlap cases. For example, we might write:

SELECT *
FROM address_tbl at INNER JOIN fav_payment_method fpm
 ON at.client_id = fpm.client_id
WHERE (at.start_timestamp >= fpm.start_timestamp AND
       at.start_timestamp < fpm.end_timestamp) OR   -- (1)
      (at.end_timestamp >= fpm.start_timestamp AND
       at.end_timestamp < fpm.end_timestamp) OR     -- (2)
      (fpm.start_timestamp >= at.start_timestamp AND
       fpm.start_timestamp < at.end_timestamp) OR   -- (3)
      (fpm.end_timestamp >= at.start_timestamp AND
       fpm.end_timestamp < at.end_timestamp)        -- (4)

This can be represented graphically with the following segments (address_tbl in red and fav_payment_method in blue):

[Figure: timeline segments illustrating the four overlap cases between two date intervals]

We can see that the query needs a lot of conditions to match records by validity period, and we are not even sure it covers all the cases. Not to mention the case where you inherit the code from a colleague who has left the company and you have to interpret it…

Second try

But indeed we can be less naive than this. It turns out that the problem is a 1D segment intersection problem: the segments are your date ranges, and you only need to take the periods that intersect. But how can we write that, and can we do it simply, without enumerating all the possibilities? Well, we can say that two segments intersect if and only if they do not (not intersect). Ah-ah, simple, right?

[Figure: two disjoint segments, one ending entirely before the other begins]

In the image we can see that there is no intersection when:

at.start_timestamp > fpm.end_timestamp OR at.end_timestamp < fpm.start_timestamp

This also means that negating this condition gives us not(not intersect):

at.start_timestamp <= fpm.end_timestamp AND at.end_timestamp >= fpm.start_timestamp
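The negated predicate is easy to sanity-check in plain Python; here is a minimal sketch (the function name is mine, and ISO-formatted timestamps compare correctly as strings):

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    # De Morgan negation of "a starts after b ends, or a ends before b starts".
    return a_start <= b_end and a_end >= b_start

# client_1's first address vs. the 'paypal' period: they overlap.
print(intervals_overlap("2015-01-14 00:00:00", "2017-05-11 10:00:00",
                        "2016-01-01 00:00:00", "2018-01-01 10:00:00"))  # True

# The first address vs. the 'credit card' period: disjoint.
print(intervals_overlap("2015-01-14 00:00:00", "2017-05-11 10:00:00",
                        "2018-01-01 10:00:00", "2019-01-01 23:00:00"))  # False
```

Two comparisons instead of eight, and every overlap case from the earlier figure falls out of the same condition.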

And we can finally write a simple, clean and concise query that does exactly what we wanted:

SELECT *
FROM address_tbl at INNER JOIN fav_payment_method fpm
 ON at.client_id = fpm.client_id
WHERE at.start_timestamp <= fpm.end_timestamp AND
      at.end_timestamp >= fpm.start_timestamp
fpm.client_id | fpm.start_timestamp | fpm.end_timestamp | fpm.payment_method | at.client_id | at.shipping_address | at.start_timestamp | at.end_timestamp
client_1 | 2015-01-01 00:00:00 | 2016-01-01 00:00:00 | at_delivery_time | client_1 | 383 Trusel Drive, North Andover, MA 01845 | 2015-01-14 00:00:00 | 2017-05-11 10:00:00
client_1 | 2016-01-01 00:00:00 | 2018-01-01 10:00:00 | paypal | client_1 | 383 Trusel Drive, North Andover, MA 01845 | 2015-01-14 00:00:00 | 2017-05-11 10:00:00
client_1 | 2016-01-01 00:00:00 | 2018-01-01 10:00:00 | paypal | client_1 | 91 Gates St., Valrico, FL 33594 | 2017-05-11 10:00:00 | 2999-01-01 23:59:59
client_1 | 2018-01-01 10:00:00 | 2019-01-01 23:00:00 | credit card | client_1 | 91 Gates St., Valrico, FL 33594 | 2017-05-11 10:00:00 | 2999-01-01 23:59:59
client_1 | 2019-01-01 23:00:00 | 2999-01-01 23:59:59 | paypal | client_1 | 91 Gates St., Valrico, FL 33594 | 2017-05-11 10:00:00 | 2999-01-01 23:59:59

Display them nicely

So far so good, but the resulting table is a mess: there are four timestamps per row. What should we do next? We should go one step further and produce unique, continuous, and coherent validity intervals. But how?

[Figure: overlapping segments and their intersections, which become the new validity periods]

Looking at the picture, we can compute the intersection of the date segments to get the new validity period for each record. The intersection of two 1D segments is

$$ intersection\_start\_timestamp = \max(at.start\_timestamp, fpm.start\_timestamp) $$
$$ intersection\_end\_timestamp = \min(at.end\_timestamp, fpm.end\_timestamp) $$

And the final result is:

SELECT fpm.client_id,
 GREATEST(at.start_timestamp, fpm.start_timestamp) AS start_timestamp,
 LEAST(at.end_timestamp, fpm.end_timestamp) AS end_timestamp,
 fpm.payment_method,
 at.shipping_address
FROM address_tbl at INNER JOIN fav_payment_method fpm
 ON at.client_id = fpm.client_id
WHERE at.start_timestamp <= fpm.end_timestamp AND
      at.end_timestamp >= fpm.start_timestamp
client_id | start_timestamp | end_timestamp | payment_method | shipping_address
client_1 | 2015-01-14 00:00:00 | 2016-01-01 00:00:00 | at_delivery_time | 383 Trusel Drive, North Andover, MA 01845
client_1 | 2016-01-01 00:00:00 | 2017-05-11 10:00:00 | paypal | 383 Trusel Drive, North Andover, MA 01845
client_1 | 2017-05-11 10:00:00 | 2018-01-01 10:00:00 | paypal | 91 Gates St., Valrico, FL 33594
client_1 | 2018-01-01 10:00:00 | 2019-01-01 23:00:00 | credit card | 91 Gates St., Valrico, FL 33594
client_1 | 2019-01-01 23:00:00 | 2999-01-01 23:59:59 | paypal | 91 Gates St., Valrico, FL 33594
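The whole pipeline, overlap join plus max/min clipping, can be replayed in a few lines of plain Python to verify the result; this is a sketch mirroring the final SQL query, with the data copied from the article's two input tables:

```python
# Input tables from the article: (client_id, address, start, end)
address_tbl = [
    ("client_1", "383 Trusel Drive, North Andover, MA 01845",
     "2015-01-14 00:00:00", "2017-05-11 10:00:00"),
    ("client_1", "91 Gates St., Valrico, FL 33594",
     "2017-05-11 10:00:00", "2999-01-01 23:59:59"),
]
# (client_id, start, end, payment_method)
fav_payment_method = [
    ("client_1", "2015-01-01 00:00:00", "2016-01-01 00:00:00", "at_delivery_time"),
    ("client_1", "2016-01-01 00:00:00", "2018-01-01 10:00:00", "paypal"),
    ("client_1", "2018-01-01 10:00:00", "2019-01-01 23:00:00", "credit card"),
    ("client_1", "2019-01-01 23:00:00", "2999-01-01 23:59:59", "paypal"),
]

result = []
for cid_a, addr, a_start, a_end in address_tbl:
    for cid_f, f_start, f_end, method in fav_payment_method:
        # Join on client_id, keep only overlapping validity periods,
        # then clip: GREATEST of the starts, LEAST of the ends.
        if cid_a == cid_f and a_start <= f_end and a_end >= f_start:
            result.append((cid_a, max(a_start, f_start),
                           min(a_end, f_end), method, addr))

for row in sorted(result):
    print(row)  # five rows, one per distinct validity interval
```

Each output row carries exactly one start and one end timestamp, which is what allows the approach to be iterated over further interval tables.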

Note that, since in the end you obtain a table with a single start_timestamp and a single end_timestamp again, you can iterate this approach over as many interval tables as you need and obtain a snapshot for each version of your data.

Now you won't suffer as much when working with date intervals in data warehouses and data lakes. Have a good query!

