DATA ENGINEERING

How to DeploySlowly Changing Dimensions?

February 27, 2025

By Express Analytics

In order to track the changes in a dimension, the SCD Slowly Changing Dimensions Type 2 technique is used. In this type when values for current dimension record change, the current record is marked as closed and the new record gets inserted.

How to Deploy Slowly Changing Dimensions?

In the previous blog, I had explained the different types of Slowly Changing Dimensions (SCD) that can be used in dimensional modeling for tracking changes. In this blog, I shall be focusing on the Type 2 form of SCD (Slowly Changing Dimensions) and the issues associated with its implementation.

Type 2 Slowly Changing Dimensions (SCD) Implementation Recap

Before that, using an example, I would like to do a quick recap on Type 2 Slowly Changing Dimensions (SCD) implementation. While building star schema's in a data warehouse, the dimensions tables are joined with the fact tables. To track the changes in a dimension, the Type 2 technique of Slowly Changing Dimensions (SCD) is used. In this type when values for the current dimension record change, the current record is marked as "closed", and the new record gets inserted.

So, for example, if Mike moves from Los Angeles to New York, then there will be 2 records associated with Mike in the updated dimension table. The latest record will be marked as open/active while the old record will be marked as inactive/closed.

Original Record

Customer ID	Master Customer ID	Name	City	IsActive
1111	1111	Mike	Los Angeles	1
2222	2222	Jane	San Francisco	1

Updated Record

Customer ID	Master Customer ID	Name	City	IsActive
1111	3333	Mike	Los Angeles	0
2222	2222	Jane	San Francisco	1
3333	3333	Mike	New York	1

Implementation Challenges

There are a few challenges one is likely to face while fetching the latest information related to a customer. Let's take the example of Mike and Jane which I had talked of in the first part while explaining a Type 2 SCD (slowly changing dimensions type 2), further.

Let's suppose both Mike and Jane have one order containing one product when they were living in Los Angeles and San Francisco, respectively. After moving to New York, Mike updated his city information on the e-commerce website but did not purchase any product.

Sales Data

Customer ID	Order ID	Product ID	Order Amount
1111	ABCD	123	599
2222	EFGH	1234	499
1111	IJKL	12	201

In the above scenario, if all we care about is Mike and Jane's Master Customer ID and associated sale, then we can simply join the customer table with the sales table using customer id and roll the information by Master Customer ID. The query should look like below:

Select cu.Master Customer ID,
sum(sl.Order Amount) as Total Amount
From Customer cu
Join Sales sl on cu.Customer ID = sl.Customer ID
Group by cu.Master Customer ID

The result would look like below:

Master Customer ID	Order Amount
3333	800
2222	400

However, if we need some more recent information about the customer then the above method will not work. There are a few ways by which you can achieve this.

Achieve business success with our data analytics consulting services

Solution 1: Joining the Customer Table Twice

In this method, there is another join needed with the customer table in the above query so that the customers' latest information is selected and displayed in the result. Modified query is copied below:

Select cu.Master Customer ID,
Mast.Name,
Mast.City,
sum(Order Amount) as Total Amount
From Customer cu
Join Sales sl on cu.Customer ID = sl.Customer ID
Join Customer Mast on cu.Master Customer ID = Mast.Customer ID and Mast.IsActive = 1
Group by cu.Master Customer ID,
Mast.Name,
Mast.City

Although the above method is straightforward, it increases the cost of the query, especially if the number of columns in the customer table is more. This results in slow query performance.

The above query still gives faster results for columnar databases such as Redshift, Actian, MonetDB. But row-based databases such as Oracle or Teradata really struggle in producing the output quickly. To mitigate this, 2 views can be created:

One view will have only 2 columns containing Master Customer ID and related customer IDs
Another view will have a snapshot of the latest customer information frequently needed

Two customer joins mentioned in the above query will be replaced by these 2 views.

Solution 2: Using Analytic Functions

We can get similar results by using Analytic functions such as First_value. Modified query is copied below:

Select cu.Master Customer ID,
cu.Name,
FIRST_VALUE(cu.City) OVER (PARTITION BY Master Customer ID ORDER BY Customer ID DESC ROWS UNBOUNDED PRECEDING) AS LatestCity,
sum(NVL(Order Amount,0)) as Total Amount
From Customer cu
Left Join Sales sl on cu.Customer ID = sl.Customer ID
Group by cu.Master Customer ID,
cu.Name,
LatestCity
;

The result is copied below:

Master Customer ID	Name	City	Order Amount
3333	Mike	New York	800
2222	Jane	San Francisco	400

The above query should produce similar results.

Drawbacks of Analytic Functions Approach

There are some drawbacks to the above method:

Left Join Requirement: We have to create a left join between Customer & Sales. If we create an inner join, then the results will be incorrect.
Additional Rows: Apart from that, the left join will create an additional row for the records which do not have associated sales information.

Example: In the case of Mike, a record with null order information will be created for Customer ID 3333 since that ID does not have any corresponding sales information. This can be tackled by adding the NVL function to replace the null order amount with 0.
Null Record Handling: If we do not want to aggregate then we can write a nested query where we can add a not null condition in where clause to remove the null record associated with blank order id.

Best Practices for Type 2 SCD Implementation

Performance Optimization: Use views for frequently accessed customer information
Query Optimization: Consider database type (columnar vs row-based) when writing queries
Data Integrity: Always use proper joins to maintain referential integrity
Monitoring: Track query performance and optimize as needed
Documentation: Maintain clear documentation of SCD implementation strategy

Conclusion

It is clear that although SCDs type 2 (slowly changing dimensions type 2) are used to track historical changes one needs to be careful while writing queries & fetching the correct information. The choice between different approaches depends on:

Database type (columnar vs row-based)
Query performance requirements
Data volume and complexity
Business requirements for historical tracking

Proper implementation of Type 2 SCD requires careful consideration of these factors to ensure both data accuracy and query performance.

Share this article

Ready to Transform Your Data Strategy?

Get expert guidance on data cleaning, analytics, and business intelligence solutions tailored to your needs.

Related Blogs

How AI is Changing Business Intelligence: Expert Insights

The ever-illustrious pace of the contemporary business scenario embraces immense volumes of data that come from sources and engagements of organizations. Increasingly, such organizations are seeking to understand these complexities and convert them into actionable insights to transform BI with an improved capability through the incorporation of artificial intelligence (AI).

Oct 11, 2024Read More

7 Easy Ways to Use AI in Content Marketing Right Now

Creating relevant and engaging content by keeping marketing goals and the audience's needs in mind is not an easy job. Artificial intelligence steps in this case as it uses natural language processing to boost content marketing campaigns in smart ways.

Oct 20, 2024Read More