Data Warehouse

The data warehouse is often referred to as a strategic or business decision-making tool because the stored data offers a historical view which allows trend and patterns to be analysed to predict future needs and events. For example, estimating how many laptops the company will need to have instock\ available for a given calendar month, based on sales over a number of years. The typical view of a data warehouse is data been transferred from the company’s relational database into a non-volatile historical record. There are set characteristics of a data warehouse:

  • Data is subject oriented (sales, HR, customers etc…)
  • Data is integrated (data is transformed into correct data\ matching data)
  • Data is non-volatile (cannot be changed)
  • Data is time variant (not current but has a time indicator (timestamp))
  • Data is a high quality (accurate e.g. spelling – British English vs US English)
  • Data is (maybe) aggregated (Atomic (operational), Grouped (tactical), Aggregated (strategic))
  • Data is (typically) denormalised (increase data redundancy to increase performance)
  • Data is not (typically) current (the warehouse is batched updated daily, weekly, monthly or yearly)

 

Visualising the Data Warehouse

I use this very simplest explain. The university collects operational (atomic) data about the students (e.g. age, gender etc…). To be able to use this operational data in a meaningful way we need to group the data e.g. by male and female. This grouped data gives the university tactical information about the customer base, such as the gender bias for the student population. This will help the university during the open day(s) to provide activities and facilities (toilets) which reflect the gender bias needs. To help with the plan for next years open days the university needs to aggregate (summarise) the grouped (tactical) data which will for strategic information. For example, roughly knowing how many students will attend each open day will allow the university to allocate staffing, resources, catering etc… Typically a data warehouse will store the aggregated data as this is the data which is used in strategic or business decision-making.

 

Figure 1: two-dimensional gender and age (aggregated values)

Figure 2: three-dimensional gender, age, course (aggregated values)

 

 

Enterprise Data Warehouse (EDW)

 

The Data Mart (DM)

 

 

 

Data Warehouses and Schemas

The process of moving data from its original datastore(s) into the data warehouse can be very complex and time-consuming to get right. This process is often referred to as having four stages data extraction (getting the data from it original datastore), data cleaning (make sure the data is what you want), data transformation (convert it to the new format) and data loading (add the data to the data warehouse). 

Star, Snowflake

 

Online Analytic Processing and Data Mining

OLAP (manual processing)

  • Drill down (moving from yearly to daily data)
  • Slice (narrow focus on one dimension e.g. BSc Computing)
  • Pivot or Rotation (interchanging data on a two-dimensional axis)

Data Mining (automated processing)

  • Using software to find new knowledge (Genetic algorithms, Neural networks etc..)
  • See CRISP-DM