In the previous blog, I had explained the different types of Slowly Changing Dimensions (SCD) that can be used in dimensional modeling for tracking changes. In this blog, I shall be focusing on the Type 2 form of SCD (Slowly Changing Dimensions) and the issues associated with its implementation.
Type 2 Slowly Changing Dimensions (SCD) Implementation Recap
Before that, using an example, I would like to do a quick recap on Type 2 Slowly Changing Dimensions (SCD) implementation. While building star schema's in a data warehouse, the dimensions tables are joined with the fact tables. To track the changes in a dimension, the Type 2 technique of Slowly Changing Dimensions (SCD) is used. In this type when values for the current dimension record change, the current record is marked as "closed", and the new record gets inserted.
So, for example, if Mike moves from Los Angeles to New York, then there will be 2 records associated with Mike in the updated dimension table. The latest record will be marked as open/active while the old record will be marked as inactive/closed.
Original Record
Customer ID | Master Customer ID | Name | City | IsActive |
---|---|---|---|---|
1111 | 1111 | Mike | Los Angeles | 1 |
2222 | 2222 | Jane | San Francisco | 1 |
Updated Record
Customer ID | Master Customer ID | Name | City | IsActive |
---|---|---|---|---|
1111 | 3333 | Mike | Los Angeles | 0 |
2222 | 2222 | Jane | San Francisco | 1 |
3333 | 3333 | Mike | New York | 1 |
Implementation Challenges
There are a few challenges one is likely to face while fetching the latest information related to a customer. Let's take the example of Mike and Jane which I had talked of in the first part while explaining a Type 2 SCD (slowly changing dimensions type 2), further.
Let's suppose both Mike and Jane have one order containing one product when they were living in Los Angeles and San Francisco, respectively. After moving to New York, Mike updated his city information on the e-commerce website but did not purchase any product.
Sales Data
Customer ID | Order ID | Product ID | Order Amount |
---|---|---|---|
1111 | ABCD | 123 | 599 |
2222 | EFGH | 1234 | 499 |
1111 | IJKL | 12 | 201 |
In the above scenario, if all we care about is Mike and Jane's Master Customer ID and associated sale, then we can simply join the customer table with the sales table using customer id and roll the information by Master Customer ID. The query should look like below:
Select cu.Master Customer ID,
sum(sl.Order Amount) as Total Amount
From Customer cu
Join Sales sl on cu.Customer ID = sl.Customer ID
Group by cu.Master Customer ID
The result would look like below:
Master Customer ID | Order Amount |
---|---|
3333 | 800 |
2222 | 400 |
However, if we need some more recent information about the customer then the above method will not work. There are a few ways by which you can achieve this.
Achieve business success with our data analytics consulting servicesSolution 1: Joining the Customer Table Twice
In this method, there is another join needed with the customer table in the above query so that the customers' latest information is selected and displayed in the result. Modified query is copied below:
Select cu.Master Customer ID,
Mast.Name,
Mast.City,
sum(Order Amount) as Total Amount
From Customer cu
Join Sales sl on cu.Customer ID = sl.Customer ID
Join Customer Mast on cu.Master Customer ID = Mast.Customer ID and Mast.IsActive = 1
Group by cu.Master Customer ID,
Mast.Name,
Mast.City
Although the above method is straightforward, it increases the cost of the query, especially if the number of columns in the customer table is more. This results in slow query performance.
The above query still gives faster results for columnar databases such as Redshift, Actian, MonetDB. But row-based databases such as Oracle or Teradata really struggle in producing the output quickly. To mitigate this, 2 views can be created:
- One view will have only 2 columns containing Master Customer ID and related customer IDs
- Another view will have a snapshot of the latest customer information frequently needed
Two customer joins mentioned in the above query will be replaced by these 2 views.
Solution 2: Using Analytic Functions
We can get similar results by using Analytic functions such as First_value. Modified query is copied below:
Select cu.Master Customer ID,
cu.Name,
FIRST_VALUE(cu.City) OVER (PARTITION BY Master Customer ID ORDER BY Customer ID DESC ROWS UNBOUNDED PRECEDING) AS LatestCity,
sum(NVL(Order Amount,0)) as Total Amount
From Customer cu
Left Join Sales sl on cu.Customer ID = sl.Customer ID
Group by cu.Master Customer ID,
cu.Name,
LatestCity
;
The result is copied below:
Master Customer ID | Name | City | Order Amount |
---|---|---|---|
3333 | Mike | New York | 800 |
2222 | Jane | San Francisco | 400 |
The above query should produce similar results.
Drawbacks of Analytic Functions Approach
There are some drawbacks to the above method:
-
Left Join Requirement: We have to create a left join between Customer & Sales. If we create an inner join, then the results will be incorrect.
-
Additional Rows: Apart from that, the left join will create an additional row for the records which do not have associated sales information.
Example: In the case of Mike, a record with null order information will be created for Customer ID 3333 since that ID does not have any corresponding sales information. This can be tackled by adding the NVL function to replace the null order amount with 0.
-
Null Record Handling: If we do not want to aggregate then we can write a nested query where we can add a not null condition in where clause to remove the null record associated with blank order id.
Best Practices for Type 2 SCD Implementation
- Performance Optimization: Use views for frequently accessed customer information
- Query Optimization: Consider database type (columnar vs row-based) when writing queries
- Data Integrity: Always use proper joins to maintain referential integrity
- Monitoring: Track query performance and optimize as needed
- Documentation: Maintain clear documentation of SCD implementation strategy
Conclusion
It is clear that although SCDs type 2 (slowly changing dimensions type 2) are used to track historical changes one needs to be careful while writing queries & fetching the correct information. The choice between different approaches depends on:
- Database type (columnar vs row-based)
- Query performance requirements
- Data volume and complexity
- Business requirements for historical tracking
Proper implementation of Type 2 SCD requires careful consideration of these factors to ensure both data accuracy and query performance.