Data Modeling in the Warehouse for Data Engineers
Companies are struggling to manage and report on all of their data. Even answering basic questions like "How many customers do we have in a specific region?" or "Which products do our customers aged 20-30 purchase the most?" can be challenging. The data warehouse was supposed to solve these challenges.
The concept of the data warehouse has evolved dramatically since it originated in the mid-1980s. It morphed into a distinct discipline to meet the increasing challenges and complexities of the business world. This drove better technology and tighter business practices.
Originally, data warehouses were created so companies could maintain an analytical data source they could use to answer questions. That goal still matters, but today companies require easier access to information, at a larger scale, for a more diverse set of end-users.
The typical user has massively expanded from specialized developers to just about anyone who can drag and drop in Tableau.
Understanding the end-users of data warehouses is important if you plan to build one. With modern tooling it can be easy to pull data into Snowflake or BigQuery without prioritizing the end-user, but the goal should be to create a core layer of data that is easy for anyone to understand. At the end of the day, data is the product of the data team, and it needs to be understandable, reliable, and easy to work with, just like any other feature or product.
Data Is A Product
Data is just as much a product as it is a utility. Saying data is the new oil still has some weight, but these days data is simply expected to work. We don't want crude oil; we want high-octane gasoline we can put straight into the car without any problems. Since people get much more up close and personal with this product, it needs to be usable. This means it should be:
- Easy to understand
- Easy to work with
- Robust
- Trustworthy
- Timely
Standardizing a company's data practices can dramatically improve an end-user's experience with said data. Overall, part of treating data like a product is following best practices that help take data from crude oil to high-octane gasoline.
Best Practices For Data Modeling
When building your data warehouse, it's important to put some best practices into play. However, it's also important not to be overly dogmatic. Many data warehouse solutions don't easily support (or are not optimized for) some of the more standard data modeling techniques. But this doesn't mean you can haphazardly load data into your data warehouse without any standards or modeling. You don't need to be dogmatic, but do be consistent. If you're going to be wrong, be wrong in the same direction. This means you will need to set standards that make it clear to developers what is expected of them.
Basic best practices, like having standardized names, can make a huge difference in an end-user's experience of the data.
Standardize names - To ensure analysts can quickly tell what each column contains, standard naming conventions are a must. Using consistent markers for data types, like "_ts", "_date", "is_", and so on, ensures everyone knows what they are looking at without digging into the data documentation. This is like the classic design principle of the signifier: the name itself delineates the affordances of the column.
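As a rough sketch, using hypothetical table and column names, such a convention might look like this (BigQuery-style types; the specific suffixes are just one possible standard):

```sql
-- Hypothetical core-layer table illustrating one naming convention:
-- "_ts" for timestamps, "_date" for calendar dates, "is_" for booleans.
CREATE TABLE core.users (
    user_id       STRING,     -- primary entity ID
    signup_date   DATE,       -- "_date" -> calendar date, no time component
    last_login_ts TIMESTAMP,  -- "_ts"   -> exact point in time
    is_active     BOOLEAN,    -- "is_"   -> true/false flag
    home_region   STRING
);
```

With names like these, an analyst never has to wonder whether last_login_ts is a date or a timestamp before filtering on it.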
Standardize data structures - Overall, avoiding complex data structures like arrays and dictionaries in the core and higher layers is beneficial because it reduces the confusion analysts can run into.
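For example, in BigQuery-style SQL, a staging model might flatten a hypothetical raw.orders table whose items column is an array of structs, so downstream layers only ever see flat rows:

```sql
-- Flatten nested order items in staging (hypothetical schema) so the
-- core layer exposes one plain row per item instead of an array column.
SELECT
    o.order_id,
    o.ordered_ts,
    item.product_id,
    item.quantity
FROM raw.orders AS o,
     UNNEST(o.items) AS item;
```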
Standardize IDs as much as possible - IDs allow analysts to merge data across multiple systems. Looking back through my career, this one best practice has made a massive impact. When IDs weren't standardized, I was completely unable to join data sets, no matter how creative I got. In comparison, when working at companies with processes to ensure system IDs were traceable, I was able to smoothly join very disparate data sets.
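When IDs are traceable across systems, the payoff is joins as simple as this (hypothetical CRM and billing tables sharing a standardized user_id):

```sql
-- Because both source systems carry the same standardized user_id,
-- CRM and billing data join cleanly with no fuzzy matching required.
SELECT
    c.user_id,
    c.home_region,
    SUM(b.amount) AS lifetime_revenue
FROM core.crm_users AS c
JOIN core.billing_charges AS b
    ON b.user_id = c.user_id
GROUP BY c.user_id, c.home_region;
```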
Improve processes with software teams - Less of a best practice and more of a problem you will deal with is ensuring your data doesn't change too much. Sure, you can store your data as JSON or in less structured data sets in the raw layer. But as a data engineer, the more you understand about the changes to data fields and entities occurring upstream, the faster you can head off failed pipelines.
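Even without a formal contract with the software teams, a lightweight guardrail can surface upstream schema drift early. A minimal sketch, assuming a hypothetical raw.orders table and a BigQuery-style INFORMATION_SCHEMA (the exact view path varies by warehouse):

```sql
-- Flag any column in raw.orders that is not in the expected list,
-- so upstream schema changes surface before pipelines start failing.
SELECT column_name, data_type
FROM raw.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'orders'
  AND column_name NOT IN ('order_id', 'user_id', 'ordered_ts', 'items');
```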
Data Modeling High Level Concepts
There are many different schools of thought when it comes to developing a data warehouse. Recently, Galen B published "Learn from Google's Data Engineers: Dimensional Data Modeling is Dead". It got mixed reviews, as there are still many people who strongly support traditional data modeling. But whichever camp you sit in, it is important to realize that there is no fast way to a solid core data model.
Your data engineering team will need to take time to understand how the data is being used, what it represents, and what it looks like. This will ensure that you create data sets your cross-functional partners will want to use and can use efficiently, all of which starts with the same stages of data processing.
Here are the key stages in most companies' data modeling patterns:
Raw - This layer is often stored in S3 buckets, or perhaps in raw tables used as an initial landing layer for data. Teams can then run quick data checks to ensure all the data remains sane, and it can be reprocessed in case of accidental deletions.
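A raw-layer sanity check can be as simple as counting rows and null keys per load (hypothetical raw.orders table with a load_date column):

```sql
-- Quick check that each batch landed intact: row counts and null key
-- counts per load date on the raw layer.
SELECT
    load_date,
    COUNT(*) AS row_count,
    SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_ids
FROM raw.orders
GROUP BY load_date
ORDER BY load_date DESC;
```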
Staging - Some form of preprocessing of data is generally unavoidable. In turn, data teams rely on staging layers to perform a first pass on their data. To one extent or another, there tends to be duplicate data, highly nested data, and inconsistently named data, all of which get standardized in the staging layer. Once the data is processed, there is generally another layer of QA before the data is loaded into the core data layer.
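A typical staging pass, sketched against a hypothetical raw.orders source with camelCase names, might rename columns to the convention and keep only the latest record per order:

```sql
-- Staging pass: standardize names and deduplicate to the most recent
-- record per order before anything reaches the core layer.
WITH ranked AS (
    SELECT
        orderId        AS order_id,   -- rename to the naming convention
        orderTimestamp AS ordered_ts,
        ROW_NUMBER() OVER (
            PARTITION BY orderId
            ORDER BY orderTimestamp DESC
        ) AS row_num
    FROM raw.orders
)
SELECT order_id, ordered_ts
FROM ranked
WHERE row_num = 1;
```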
Core - This layer is where you will find the baseline for a company's data. It’s where you could trace back every transaction or occurrence in a business to the most granular level. You can think about this layer as where all the various entities and relationships are stored. It’s the layer that everything else is built off of.
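As a sketch, a core-layer table (hypothetical names again) might hold one row per order line, the most granular record of the business event:

```sql
-- Core layer: the most granular, canonical record that everything
-- downstream builds on.
CREATE TABLE core.fact_order_items AS
SELECT
    s.order_id,
    s.ordered_ts,
    s.user_id,
    s.product_id,
    s.quantity,
    s.unit_price
FROM staging.order_items AS s;
```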
Analytics - The analytics layer generally consists of wider tables that have been pre-joined to reduce the errors and improper logic application that could occur as analysts develop on the core data layer.
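Continuing the hypothetical example, the join logic gets applied once here so analysts don't re-derive it inconsistently:

```sql
-- Analytics layer: a pre-joined, wide table built from core entities.
CREATE TABLE analytics.order_items_wide AS
SELECT
    f.order_id,
    f.ordered_ts,
    f.quantity,
    f.unit_price,
    u.home_region,
    p.product_name,
    p.product_category
FROM core.fact_order_items AS f
JOIN core.users    AS u ON u.user_id    = f.user_id
JOIN core.products AS p ON p.product_id = f.product_id;
```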
Aggregates - On top of the analytics layer, there tends to be a combination of metrics, KPIs, and aggregate-style data sets that need to be created. These metrics and KPIs are used for reporting, such as dashboards that go to directors, the C-suite, and operational managers who need to make decisions based on how the KPIs change over time.
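A daily revenue KPI built on the hypothetical wide table above is a representative example of this layer:

```sql
-- Aggregates layer: the rollups that feed dashboards and reporting.
CREATE TABLE aggregates.daily_revenue AS
SELECT
    CAST(ordered_ts AS DATE)   AS order_date,
    home_region,
    SUM(quantity * unit_price) AS revenue
FROM analytics.order_items_wide
GROUP BY CAST(ordered_ts AS DATE), home_region;
```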
Why Invest In Best Practices
Building a solid data storage system, whether a data warehouse or a data lakehouse, gives you a firm foundation to build on top of. Whether you're building a data product or doing research, a well-defined core data layer allows everyone in the company to build their data products with full confidence in the accuracy of the data.
In addition, doing little things like standardizing IDs makes it easier to join data across multiple systems' data sets. This means end-users such as analysts and data scientists can build analyses off more disparate data sets.
And we're not even taking into account all the benefits people get when column names follow an expected convention, such as no longer having to spend time inspecting columns before using them because they aren't sure what type of data to expect. I know, riveting stuff. While all these best practices take time, they ensure that, in the long run, a company can make decisions with a high level of trust in its data.
Conclusion
Companies invest in centralized data storage systems because they provide easy access to huge volumes of data without the need to manually pull data from every data source, put it into Excel, and then munge it all there. This is why companies invest so much in data warehouses, data lakes, and data lakehouses. Building any form of data storage system always requires some level of standardization and data modeling, one way or another.
Following the best practices discussed in this article can help take your team's data to the next level. It will provide your analysts and data scientists with more usable data because they can clearly see how the data should be joined or processed without constantly reaching out to engineering. In the end, this helps ensure that everyone, from the data engineer to the C-suite looking at dashboards, has a better experience.
Benjamin Rogojan
Seattle Data Guy, Data Science and Data Engineering Consultant