Key Concepts of a Data Warehouse

This article covers key concepts of the data warehouse, comparing its capabilities with that of relational databases, data marts, and data lakes — all common solutions to the challenges of consuming large, varied types of data.

A data warehouse (DW) pulls together data from different sources into a single target for business intelligence (BI) analysis and support for strategic decisions. It is sometimes referred to as an enterprise data warehouse (EDW).

Introduction to data warehouses

One of the primary challenges to running a business, or any other data-centric operation, is making good decisions based on data that may be scattered far and wide. A follow-on challenge is capturing institutional knowledge into discoverable data. The general field that covers these challenges is called business intelligence.

Data warehouses were created to address this critical business need. DWs contain the aggregation of multiple sources (possibly scattered across the internet) into a single central data store. This aggregated consistent data lends itself to mining for truths in a straightforward way that the scattered data obscures. This mining may take the form of human data analysis, machine data mining (for significant trends), artificial intelligence (AI), or machine learning (ML).

In addition to the benefits provided by a single source of truth, data warehouses have been engineered to store quantities of data that challenge traditional database implementations. The current record holder for largest data warehouse is SAP, holding 12.1 petabytes.

Databases, data marts, data lakes

Colloquially, the terms databases, data warehouses, data lakes, and data marts are used interchangeably. While all deal with aggregated data, important differences exist between these concepts. After all, a bus, car, and motorcycle are all examples of a vehicle but have very different features and uses.

Databases

Database software has traditionally been a monolithic solution for acquiring and storing data from a well-defined source application, like online order systems. Databases are designed for speedy transaction storage and processing, coupled with a query language optimized for simple reporting, like the status of an order or the transaction history of a user, as opposed to analytics which deals with aggregations, groupings, and segmentation of users and their activities. Databases generally deal with updating real-time data.

Data warehouses, in contrast, are designed to store data in varying formats from many different sources. They capture current and historical data to use for predictive analytics, as inputs for machine learning, and for other model generation and analysis. Databases are often used to store the state of things such as order status or subscription type. Data warehouses are typically used to store both event stream data (e.g. Product View, Add to Cart, Purchase) that lead up to a change in state in the database and record-type data. Record data is mostly the result of ETL processes where existing rows get updated when the source record changes instead of new rows being added. An example of this would be syncing records from a CRM system into your data warehouse.

Data marts

Whereas data warehouses contain the aggregation of multiple data sources and are ready for varied uses across a business enterprise, data marts contain data specific to a particular business department or function. This smaller subset of data enables faster discovery of focused insights.

Data lakes

Data warehouses gather data from multiple sources into a central access point. That data is then transformed into structures specified in predefined schemas designed for data analytics. (More about these transformations is covered in “The architecture of a modern data warehouse,” below.) Schemas used in data warehouses will be covered in a separate article.

Data lakes on the other hand store data without the use of predefined schemas in a cost-effective and scalable way. As such, data lakes are often used as a staging ground to prepare data for a data warehouse, or make data available for consumption by big data platforms such as Apache Hadoop or Google BigQuery and Google Cloud Dataproc.

Benefits of cloud-based data warehouses

In recent times the option to store data warehouses in the cloud has made these endeavors more affordable, more reliable, and easier to deploy than purchasing and maintaining on-premises hardware. Cloud-based data warehouses are:

  • Speedy: With the chores of purchasing, installing, and maintaining hardware taken on by cloud computing companies, the process of instantiating a cloud-based data warehouse is reduced to a short online session. Specifying initial configurations for the warehouse and sandboxes and watching them become available is short work.
  • Inexpensive: Compared with the cost of acquiring, operating, and expanding on-premises hardware, data warehouse-as-a-service (DWaaS) schemes charge only for resources as needed.
  • Flexible: Some cloud data warehouses offer pricing options based primarily on compute resources used (Google BigQuery), while others base their pricing on a traditional $/hour approach (AWS Redshift), or a mix of licensing and cloud costs (Snowflake).
  • Elastic: Upgrading capacity to handle heavier workloads is trivial compared to the arduous and expensive process of improving local hardware. Budgeting separately for storage and computation makes it easier to demonstrate the value of ongoing operations. The use of virtualized computing appliances in a highly distributed environment makes it much easier to spin up instances of known configurations, boosting reliability. Many cloud warehouses can add additional storage or compute capacity without any downtime whatsoever.
  • Up to date: Cloud providers handle the acquisition, deployment, and maintenance of software. This makes the timely integration of new technologies and software packages uncomplicated.
  • Secure: Regulatory and public-relations needs for data security are more easily satisfied with competent management of challenging technologies like cryptographic key management services (KMSs).
  • Robust: Disaster mitigation and recovery are provided for by off-premises implementation of distributed duplication of data and periodic backups.

The architecture of a modern data warehouse

Generally, the process of populating a data warehouse consists of:

  • extracting data from other sources
  • cleansing and preparing the data into a well-defined format
  • loading the data into the warehouse
  • maintaining and validating the data integrity on an ongoing basis

Before we delve into the details of these processes we should explore the architecture of a modern data warehouse.

Architecture, in this context, refers to the functional structure of the primary computing roles in running a data warehouse. These roles can reside on one piece of hardware and/or software or divided among several.

Data warehouses generally have a three-tiered structure, sometimes referred to as data, semantics, and analysis layers.

Collecting and cleaning data—the bottom tier

The bottom tier, the entry point for data into the warehouse, retrieves data from external sources, cleanses and sanity-checks the data, and transforms the data into a well-defined format.

This process is commonly either one of Extract, Transform, and Load (ETL) or Extract, Load, and Transform (ELT). It is important to note here that the functionality of this tier is often handled by tools external to the data warehouse itself such as ETL providers or Customer Data Platforms (CDPs).

Searching and modifying data—the middle tier

Once the data has been converted into the desired format, it’s ready to be searched and modified. These actions, grouped into analysis and transaction-processing, depend upon the nature of the data and the ultimate needs of the data owner or data consumer / end user.

OLAP (online analytical processing) performs multidimensional analysis of the aggregated data for typical business intelligence needs, data mining, and predictive scenarios. The results support financial analysis, budgeting, and forecasting.

OLTP (online transaction processing) performs large-scale real-time execution of transaction-oriented processing such as is needed by e-commerce platforms, including online booking and reservation systems, credit card payments and ATM transactions, and other  record-keeping.

Digging through the aggregated data—the top tier

The top tier implements interactions with the aggregated data: either canned business intelligence reporting or live ad-hoc queries. “Inside” the warehouse, user can use a workbench or sandbox area for interactive data exploration and the development of new analysis models. “Outside” the warehouse, users can create dashboards and reports, monitor key performance indicators (KPIs) in specialized tools that connect to the warehouse. These access tools allow users to interact with the data as well as aggregate it further or drill down deeper.

Types of data warehouses

Knowing the use cases, benefits, and architecture of data warehouses will help you decide which type(s) of warehouse to use.

On-premises data warehouse software

Purchasing data warehouse software for deployment on on-premises infrastructure is a good choice if you need to comply with strict security protocols or mandated data privacy regulatory standards. This enhanced control over data is typically seen in financial institutions, government entities, and for-profit research organizations.

Cloud-based data warehouse

The advantage of this managed data warehouse as a service (DWaaS) is that off-premises entities deal with the complexities of acquiring, deploying, and managing computing resources. You don’t need to make an upfront investment in fixed infrastructure, you have the flexibility of changing to handle different capacity needs, and the service includes canned backup and disaster recovery systems. This lower barrier to entry makes DWaaS a very attractive option, especially if you want to reduce your on-premises data center footprint.

Data warehouse appliance

Bundling together hardware and software into a plug-and-play business solution, data warehouse appliances are an intermediate option between on-premises and cloud-based data warehouses. The decisions about computing hardware, operating system software, and data warehouse software and configuration are made by a third-party provider. They dictate the upfront costs, types of administrative control and data access tools, and scalability options.

Data warehouse best practices

Whether you’re starting from scratch or enhancing an existing data warehouse, we’ve highlighted some industry best practices below.

  • Define what success looks like. Understand what types of reporting and analyses would satisfy your needs. A clear expectation of the end result will help to inform the following steps; not having a good vision of success will result in wasted time, effort, and cost.
  • Determine the information required for success. Identify the data sources needed to support the required analyses, whether they live within your organization or are provided externally by authoritative entities.
  • Determine what (if any) extra tooling you may need to extract this data from its sources into your warehouse.
  • Document the structure, quality, location, and availability of data. Knowing the freshness and quality of what’s available sets realistic expectations for timely insights. Knowing the location and availability of data informs the operations needed to retrieve it. Identifying providers of data and establishing contracts for continued data availability are key to making promises to downstream customers.
  • Establish data gaps and transformation rules. Bridge the gap between what data you have and the data you’ll need to achieve your goals.

Conclusions and next steps

Data warehouses are increasingly becoming essential to business intelligence, allowing key personnel access to fresh, relevant data derived from a wide variety of sources. Combining home-grown data from internal business systems and authoritative information from external sources, data warehouses are the foundation for informed,data-driven decision making.

Data warehouses offer fast, complex data mining and analytics that complement existing business systems. They support the needs of staff, management, and executives for flexible models, reports, KPIs, dashboards, and situational alerts.

If you’re looking to get started with a data warehouse, check out these articles:

The Data Maturity Guide

Learn how to build on your existing tools and take the next step on your journey.

Build a data pipeline in less than 5 minutes

Create an account

See RudderStack in action

Get a personalized demo

Collaborate with our community of data engineers

Join Slack Community