Data Warehouses versus Data Lakes
The ever-increasing amount of data needed to evaluate past performance and existing processes and to model future scenarios presents business intelligence challenges that may seem daunting. Several data storage methodologies tame this wild landscape, including data warehouses and data lakes. While these terms are sometimes used interchangeably, significant differences and use cases separate the two data systems.
What is a data warehouse?
Data warehouses contain structured data that's been processed and sanitized. Their data sources are usually coordinated along a historical time series, preparing sets for modeling and inference. Data warehouses are designed to be highly responsive to the demands of analysts, enabling flexible, real-time iteration on data insights. For more details, read our articles Key Concepts of a Data Warehouse and What are the Benefits of a Data Warehouse?
What is a data lake?
In contrast to the well-ordered data warehouse, a data lake usually contains a vast pool of unstructured, raw data, although it can also store the same structured data one would find in a warehouse. Whereas the filtering, cleaning, and transformation processes used to populate a data warehouse are informed by the needs of the intended audience and the purpose(s) to which they'll put the data, lakes contain data aggregated without a specific intention; the query scenarios for lake data are undefined. A data lake is an archive of data, safely kept to satisfy future needs.
An emerging architecture — the data lakehouse — combines the archival flexibility of a data lake with the optimized, query-ready structure of a warehouse.
Key differences between data warehouses and data lakes
The following table summarizes the key differences between the data warehouse and data lake architecture methodologies. Details about each difference follow.
| | Data Warehouse | Data Lake |
| --- | --- | --- |
| Intended purpose | Known before insertion | Not yet known |
| Audience | Business intelligence | Data science |
| Data structure | Transformed for use cases | Raw, native format |
| Access and update cost | Accessed primarily via tailored queries per use case; updated after an extensive transformation process | Easy to access; frequently updated |
| Access model | Schema-on-write | Schema-on-read |
| Storage and computing | Tightly coupled | Loosely coupled |
Intended purpose
The data stored in a data warehouse has been deliberately gathered from external data sources with a particular purpose in mind. Upon retrieval, the source data is processed, sanitized, and otherwise readied for storage in a manner that's optimal for the planned query needs. This processing is expensive in both time and infrastructure (electricity, computing resources, and interim storage).
In a data lake, data is aggregated without specific purposes, use cases, or even audiences in mind. It is purely for archival needs, to preserve every bit of information collected over a span of time, ready for the future demands of business analysis.
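To make the contrast concrete, here's a minimal Python sketch of the two ingestion styles. The file layout, field names, and transformations are illustrative assumptions, not any particular product's pipeline:

```python
import json
import os
from datetime import datetime

# A raw event as it might arrive from a hypothetical source system.
raw_event = '{"user": "u-42", "action": "checkout", "ts": "2024-03-01T12:30:00Z", "cart": {"total": "19.99"}}'

# Lake-style ingestion: archive the record exactly as received, making no
# assumptions about how it will be queried later.
os.makedirs("lake", exist_ok=True)
with open("lake/events-2024-03-01.jsonl", "a") as f:
    f.write(raw_event + "\n")

# Warehouse-style ingestion: transform and sanitize up front so the record
# fits a schema planned around known reporting queries.
event = json.loads(raw_event)
warehouse_row = {
    "user_id": event["user"],
    "action": event["action"],
    "event_time": datetime.fromisoformat(event["ts"].replace("Z", "+00:00")),
    "cart_total_cents": round(float(event["cart"]["total"]) * 100),  # normalize currency
}
print(warehouse_row)
```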
Audience
Business intelligence efforts require the warehouse's aggregated, cleaned, well-structured data. This consistent data lends itself to straightforward mining for insights that scattered source data would obscure. The data mining process may take the form of human data analysis, algorithmic pattern matching, or more sophisticated AI models.
Data scientists, in contrast, are the primary consumers of the unstructured raw data stored in the lake. Having all the primary source data at hand enables the investigative process that determines what relationships and supporting data are available to the business intelligence community. Additionally, the purely unstructured nature of a data lake is unsuitable for most machine-automated pattern seeking.
Data structure
Data warehouses contain data that’s already been tidied into a planned format, ready for extraction for specific purposes. This provides maximum speed of consumption (at the cost of some flexibility). Data warehouses also typically prepare data with a shared timestamp schema that lends itself to machine learning training and time-bracketed data lookup.
Data lakes aggregate information from multiple sources in their native formats, ready for whatever purpose lies in store. This provides maximum flexibility (at some possible future computational cost). This aggregated data can be loaded into the lake faster (and accessed more quickly) because the pre-loading processing is side-stepped.
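For example, the shared timestamp schema mentioned above makes time-bracketed lookups trivial. Here's a small sketch using SQLite as a stand-in for a warehouse; the table and data are hypothetical:

```python
import sqlite3

# A minimal warehouse-style table: every fact shares a timestamp column,
# making time-bracketed lookups and training-window extraction cheap.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        event_time TEXT NOT NULL,   -- shared timestamp schema (ISO 8601)
        region     TEXT NOT NULL,
        amount     REAL NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_sales_time ON sales(event_time)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2024-03-01T09:00:00", "EU", 120.0),
     ("2024-03-02T10:30:00", "US", 80.0),
     ("2024-03-15T14:00:00", "EU", 200.0)],
)

# Time-bracketed query: pull the first week of March, e.g. as a training window.
rows = conn.execute(
    "SELECT * FROM sales WHERE event_time >= ? AND event_time < ?",
    ("2024-03-01", "2024-03-08"),
).fetchall()
print(rows)  # only the two early-March rows
```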
Access and update cost
Information from data warehouses is accessed primarily via tailored queries constructed to satisfy specific use cases such as customer 360 modeling, data visualization, or scorecard reporting. Updates to the warehouse occur only after the computationally intensive transformation process is completed for the new data. Choosing a data warehouse balances the immediacy of getting updated data against the need to format new data before storing it.
In contrast, updated data inserted into a data lake may be accessed more quickly, because the sanitizing process doesn’t happen (saving the costs of computation). Having access to the raw data may provide more ad hoc opportunities for exploring the data, executing proof-of-concepts, and self-service, free-form prediction modeling.
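As a sketch of that ad hoc access, the snippet below scans raw newline-delimited JSON files straight out of a hypothetical lake directory, with no load or transformation step first:

```python
import glob
import json
from collections import Counter

# Ad hoc exploration over raw lake files: no upfront schema, no load step.
# The lake/*.jsonl layout is an illustrative assumption.
actions = Counter()
for path in glob.glob("lake/*.jsonl"):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Fields may vary from file to file; tolerate missing keys.
            actions[record.get("action", "unknown")] += 1

print(actions.most_common(5))  # quick proof-of-concept summary
```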
Access model
Data warehouses require a computationally intensive process of cleansing and transforming information prior to storing it. Technically, this scheme of organizing data is called schema-on-write, since you can think of data warehouses as highly scalable databases with fixed database and table schemas.
In the planning process for deploying the data warehouse, a series of exercises must be undertaken to determine the ultimate schema (organization and relationships) for storing the data, one that provides the most usable data to the most audiences. Changes in requirements demand updates to the schema and to the insertion process, typically a time- and labor-intensive effort.
Data lakes, in contrast, store data in their native formats. Information is extracted and transformed on the fly for each need. This is known as schema-on-read, since data is stored in raw files, each of which could contain a different data schema. It is therefore impossible to know the full schema for all data before the query. As with the overhead benefits of updates and storage, schema-on-read offloads the work of designing a schema to later users of the data, while sidestepping the potential debt of a data warehouse schema migration.
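The two access models boil down to where the schema check happens. Here's a minimal sketch, assuming a hypothetical three-column schema:

```python
import json

# Hypothetical warehouse schema for illustration.
REQUIRED_COLUMNS = {"user_id", "event_time", "action"}

def write_with_schema(table: list, record: dict) -> None:
    """Schema-on-write: validate against the fixed schema before storing."""
    if set(record) != REQUIRED_COLUMNS:
        raise ValueError(f"record does not match schema: {sorted(record)}")
    table.append(record)

def read_with_schema(path: str, wanted: list):
    """Schema-on-read: interpret raw records at query time; files may differ."""
    with open(path) as f:
        for line in f:
            raw = json.loads(line)
            # Project each raw record onto the fields this one query cares
            # about, tolerating records that lack them.
            yield {field: raw.get(field) for field in wanted}
```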
Storage and computing
Just as the early web deliberately separated content and structure (HTML) from formatting (CSS) to make stylistic changes easier, data lakes and data warehouses differ in how tightly they couple data storage to the associated computing needs.
Data warehouses, with their pre-storage computation needs, have a “tightly-coupled storage and computation” relationship. Increasing storage capacity demands a proportional increase in processing needs.
Incidentally, data warehouses can be engineered to hold enormous quantities of data. One of the largest data warehouses on record, built by SAP, holds over 12 petabytes of information.
Separating the data storage from the transformation computation — technically called “decoupled storage and computation” — optimizes the business’ costs by matching storage requirements against the frequency of data access. Archiving business data in less expensive tiers — the data lake — saves money and provides more immediate availability of the information. As mentioned before, this immediacy enables staff to more quickly begin experiments in data exploration, analysis, and modeling.
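A rough back-of-the-envelope comparison shows why tiering matters. The per-gigabyte prices below are hypothetical placeholders, not any vendor's actual rates:

```python
# Hypothetical monthly storage prices (illustrative only).
HOT_TIER_PER_GB = 0.023   # query-optimized, warehouse-adjacent storage
COLD_TIER_PER_GB = 0.004  # lake/archive tier

total_gb = 50_000  # 50 TB of historical data
hot_gb = 2_000     # only the actively queried slice stays hot

tiered_cost = hot_gb * HOT_TIER_PER_GB + (total_gb - hot_gb) * COLD_TIER_PER_GB
all_hot_cost = total_gb * HOT_TIER_PER_GB
print(f"tiered: ${tiered_cost:,.2f}/mo vs all-hot: ${all_hot_cost:,.2f}/mo")
```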
Data warehouse or data lake: which is better?
Deciding which data storage solution to implement depends upon a delicate balance between needs, value extracted from data analysis, and the costs of infrastructure, storage, and computation.
Organizations that require great agility and work with smaller quantities of information may choose the way of the data lake. For example, if you can load somewhat structured event or ETL data into a data lake, its query capabilities will often suffice for many small to mid-sized organizations' reporting and analysis needs.
However, those in industries where far greater quantities of data exist, and where the data must be massaged in order to be most useful to the greatest audience, may choose the data warehouse.
Some may choose both to provide maximal flexibility. For example, you can use a cost-effective data lake for raw storage of event-stream and ETL data, then combine it with a data warehouse that selectively loads data from the lake to support more advanced business needs like data modeling or connections to business intelligence tools.
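One way to picture that hybrid is a small batch job that sweeps the lake and promotes only the records a known use case needs. The sketch below assumes the same hypothetical lake layout and event fields as the earlier examples:

```python
import glob
import json
import sqlite3

# Hybrid sketch: the lake keeps everything raw and cheap; the warehouse
# selectively loads only what the BI tooling needs. Paths, fields, and the
# selection rule are illustrative assumptions.
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS purchases (user_id TEXT, event_time TEXT, total REAL)"
)

for path in glob.glob("lake/*.jsonl"):
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("action") != "checkout":
                continue  # this reporting use case only needs purchase events
            warehouse.execute(
                "INSERT INTO purchases VALUES (?, ?, ?)",
                (event["user"], event["ts"], float(event["cart"]["total"])),
            )
warehouse.commit()
```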
In conclusion, a data lake is a cost-effective way to store vast amounts of raw data, at the cost of more complicated querying and data modeling down the road. A data warehouse is a larger investment but offers higher performance and extra features. Many organizations use a combination of both to balance cost with functionality.