Sunday, May 29, 2016

Why use a Date Dimension Table in a Data Warehouse

In the Data Mart, or the Data Warehouse world, there is a date dimension table in all schemas if you are using the Kimball Dimensional Modeling method. In the beginning of Dimensional Modeling, it was called a Time dimension. Since then, the Time dimension has actually been separate from the date values. After listing all these terms: time, date, dimensional modeling, data mart, data warehouse, etc., you might be more confused than ever about what we are talking about.

So, let’s start with some definitions. A Data Mart is actually a small part of a Data Warehouse. The Data Warehouse is considered the entire set of tables in a database. The Data Mart is defined as a business process that is represented by the data contained in the process. The structure of a Data Mart is to enable a simplistic way to do querying or reporting. The data population stage has a component that translates the data complexity into an actual column for a description rather than a code.

Dimensional Modeling is the process to take the data from various sources and produce this Data Mart structure with usually one Fact table and multiple Dimension tables. The Dimension tables are related to the Fact table by a surrogate keys. The Fact contains the measures of the data the business process wants to consume. There can be multiple Data Marts in a Data Warehouse, so do not get hung up by the single Fact table in a Data Mart. Eventually, you will see the Dimension tables related to many Fact tables in the overall schema. These dimension are termed Conformed Dimensions.

The Date dimension is one of these dimension tables related to the Fact. Here is a simple Data Diagram for a Data Mart of Internet Sales information for the Adventure Works DW database which can be obtained for free from CodePlex or other online sources.

image_thumb[1]

As you can see in this diagram, the Date table (DimDate) is related to the fact table (FactInternetSales) by 3 different dates in the fact: Order, Ship and Due Date. This is called a Role Playing dimension in the Dimensional Modeling world. The relationship is created by the surrogate keys columns (integer data type) rather than the date data type. The DateKey column in the DimDate table is related to the OrderDateKey column in the FactInternetSales table as well as ShipDateKey and DueDateKey.

The DimDate table has columns that normally would have to be extracted from a Date column with a function. These include CalendarYear, CalendarQuarter or DayNumberOfMonth. The data in this table has a row for every day possible in your Data Mart.

 

DateKey

FullDateAlternateKey

DayNumberOfMonth

CalendarQuarter

CalendarYear

20050101

2005-01-01

1

1

2005

20050102

2005-01-02

2

1

2005

20050103

2005-01-03

3

1

2005

20050104

2005-01-04

4

1

2005

20050105

2005-01-05

5

1

2005

20050106

2005-01-06

6

1

2005

20050107

2005-01-07

7

1

2005

20050108

2005-01-08

8

1

2005

20050109

2005-01-09

9

1

2005

20050110

2005-01-10

10

1

2005

Think about the reason for every day to be in this table. If there are no Internet Sales on 12/28/2010, then we would have to do some fancy LEFT JOIN with a sub-query to create this row in an analysis of the data. But, with a Date Dimension table, we LEFT JOIN from the DimDate table to the Internet Sales fact table and we can extract $0 with a IsNull(SalesAmount) from this query.

 

SELECT d.CalendarYear, d.FullDateAlternateKey, Sum([SalesAmount])

  FROM dbo.DimDate d

    LEFT JOIN [dbo].[FactInternetSales] fs ON fs.OrderDateKey = d.DateKey

  WHERE d.CalendarYear = 2010

    AND d.MonthNumberOfYear = 12

  GROUP BY d.CalendarYear, d.FullDateAlternateKey

  ORDER BY d.CalendarYear, d.FullDateAlternateKey

image_thumb[4]

The query users have to write against a Data Mart are much simpler than against a transaction database. There are less joins because of the one to many relationships between the fact dimension table(s). The dimension tables are confusing to someone who has been normalizing databases as a career. The dimension is a flattened or de-normalized table. This creates cases of duplicate data, but the simplistic query overrides the duplicate data in a dimensional model.

With this table, we can stop using functions on date columns like YEAR (), MONTH (), etc. So, a query from an OLTP might look like this:

 

SELECT DATEPART(Year, SOH.OrderDate), SUM(DET.LineTotal) AS Sales

  FROM Sales.SalesOrderHeader SOH

    INNER JOIN Sales.SalesOrderDetail DET ON SOH.SalesOrderID = DET.SalesOrderID

  GROUP BY DATEPART(Year, SOH.OrderDate)

Whereas the dimensional model query would look like this:

 

SELECT d.CalendarYear, Sum(SalesAmount)

  FROM dbo.FactInternetSales fi

    INNER JOIN dbo.DimDate d ON d.DateKey = fi.OrderDateKey

  GROUP BY d.CalendarYear

The end user might have a better time understanding the structure of the Dimensional Model than the transactional system. Especially, if data is obtained from different databases maybe on different servers. Is it no fun trying to explain the LinkedServer and how to use it? Consolidating the needed reporting data into a Data Mart, then Data Marts together into a Data Warehouse makes life much easier for report writers.

The date dimension can also contain columns for Weekend versus Weekday, Holiday and month markers like 2014-10 or by quarter like 2014-Q1. All these can be computed once in the dimension table and used at will by query writers. They now do not have to know how to use T-SQL functions or concatenate substrings of “CASTed” date columns.

Then, when the DimDate is related to various Fact tables and processed into an OLAP cube, the measures and aggregations are displayable side by side through the DimDate dimension which is now considered a Conformed Dimension. The slicing and dicing of data has just been made a whole lot easier.