Data quality best practices: Bridging the dev data divide
Data quality is an obsession at RudderStack. We’re building a tool to help companies turn their customer data into a competitive advantage, and we recognize the foundational role data quality plays in driving this outcome. We’ve even written about how AI/ML success starts with data quality.
If you work in data, you know this intuitively, but implementing data quality is easier said than done. The crux of the problem is the storied divide between data producers and data consumers. For our purposes here, we’ll look specifically at developers as the data producers and data teams as the data consumers.
The data team owns the data warehouse – the source of truth for reporting, analytical, and AI/ML use cases. Business and data people alike have an implicit desire to trust that the data in the warehouse is fresh, accurate, and consistent. But while the data team owns the warehouse (and is held responsible for the outcomes of initiatives built on top), they don’t create the data stored in the warehouse or control the processes that create it. In many cases, they don’t even have visibility into these processes. They’re at the mercy of the data producers. This typical scenario makes it difficult to diagnose and fix data quality issues, and it makes it nearly impossible to address them proactively.
Achieving data quality comes down to bridging this gap with continuous alignment, collaboration, and transparency while implementing enforcement measures as early as possible in the data lifecycle. Below, we’ll examine why data quality issues are so pervasive and how they manifest. Then, we’ll cover the principles you can use to overcome bad data and its consequences.
Fix it in post
With dev teams and data teams operating in isolation from one another, you have no choice but to fix issues in post. You’re stuck writing ETL and ELT pipelines that do everything from reformatting data for analytics to fixing NULL or incorrect values. It’s messy, reactive work that introduces fragility into your data infrastructure. Anytime the producers implement a change upstream, your pipelines will break, triggering recursive patching work for the data team.
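To make the reactive pattern concrete, here’s a minimal sketch (in Python with pandas) of the kind of patch-up step data teams end up maintaining. The table, column names, and cleanup rules are hypothetical; the point is that every rule exists only to compensate for an upstream change that was never communicated.

```python
import pandas as pd

# Hypothetical "fix it in post" transformation: patching raw order events
# after they've already landed, because upstream changes weren't communicated.
def patch_raw_orders(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # A producer renamed order_total -> total_amount without warning.
    if "total_amount" in df.columns and "order_total" not in df.columns:
        df = df.rename(columns={"total_amount": "order_total"})

    # NULL customer IDs started appearing after a service refactor.
    df = df.dropna(subset=["customer_id"])

    # A calculation change introduced negative totals; filter them out.
    df = df[df["order_total"] >= 0]

    return df
```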
Often, these pipeline breaks don’t surface until a downstream stakeholder’s report, analysis, or ML model fails. When that happens, it’s the data team that has to run a fire drill to figure out where the data broke down and implement a fix.
Data quality problems wreak havoc on any data org, but they become increasingly consequential as companies invest in more sophisticated applications of data, especially AI/ML use cases. At that stage, even small breaks can have outsized consequences, so data quality is imperative.
The dev/data divide
Just like the data team, development teams are under pressure to work quickly and efficiently to accomplish their goals. It’s not like development teams purposely make life difficult for their data team counterparts. They’re just doing their jobs, and their incentives are, by nature, different from yours.
In the development world, raw data is produced for operational use cases in operational systems. These use cases typically center on small, fast queries against normalized or semi-structured data. They’re driven by constantly evolving business requirements that can necessitate immediate, not gradual, changes. The result might be a column that’s dropped because it’s no longer needed, a column that’s suddenly renamed, or a field value that changes when a calculation is updated.
These sudden changes can make it feel like developers have a devil-may-care attitude, taking for granted that you’ll just deal with whatever changes they make. However, it’s important to keep in mind that they’re under pressure and operating with different incentives. One of our core values at RudderStack is to assume good intent, and it has a great application here. The first step to better data quality is recognizing that data and dev are on the same team and deciding to address the issue together.
Bridging the divide
Most companies attempt to solve their data quality problems by fixating on a need for better communication or collaboration. It’s generally true that better communication and collaboration will help, but generalizing the problem is neither helpful nor practical. Acting on data quality requires tackling specific issues and starting with the root cause: lack of visibility.
In most situations, development teams have no visibility into how the data they create is used downstream. It’s not realistic for them to communicate every potential change to every potential stakeholder just in case the stakeholder could be affected. Because dev teams aren’t aware of downstream use, and they have no direct incentive to do the legwork required to understand it, they don’t make an effort to communicate. They keep doing their jobs and rely on the data team to reactively handle any issues related to source changes.
Getting away from this reactive model requires continuous alignment between data producers and consumers, and visibility is a prerequisite for alignment. If you want to improve data quality, it’s helpful to consider a few fundamental principles, starting with alignment.
- Alignment – Alignment is the foundation for all of your data quality practices and includes explicit agreement on expectations and use cases.
- Collaboration – Data quality requires collaboration between development and data teams, but these teams operate with different incentives, so it’s best to build a low-friction process that automates as much ‘collaboration’ as possible. That becomes achievable once you’ve done your alignment work.
- Early enforcement – The further downstream bad data gets, the worse the consequences and the more inefficient the fix. Data producers should own enforcement at the source.
- Transparency – To maintain alignment over time, transparency must be established and kept accessible to all stakeholders. Transparency also encourages ongoing collaboration.
To make these principles concrete, we’ll next look at one of the most effective tools for managing data quality: the data contract.
Data contracts
Data contracts are taken from the world of software engineering and modeled after API documentation. A data contract is an agreement between data producers and data consumers to generate high-quality, trusted, well-modeled data. It typically takes the form of a code document (usually JSON or YAML) that defines the data available to consumers. A contract typically includes the following (a sketch of what one might look like follows the list):
- Schema – defines what the data looks like, including names, data types, formats, and the range of acceptable values.
- Semantics – defines the business understanding of the data including business descriptions, relationships to other data, business rules that the data must follow (e.g. purchase price > 0), and any other business-related information.
- SLA – defines the details of data delivery, such as the data update schedule.
- Data governance – defines access permissions (who can see the data), identifies any sensitive (PII) data, and specifies how it’s handled.
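As a rough illustration, here’s what a contract for a hypothetical order_completed event could look like in YAML. The field names, SLA values, and governance rules are invented for the example; a real contract should reflect your own schemas and policies.

```yaml
# Hypothetical data contract for an order_completed event (illustrative only)
contract: order_completed
version: 1.0.0
owner: checkout-service-team

schema:
  order_id:     {type: string, required: true}
  customer_id:  {type: string, required: true}
  order_total:  {type: number, required: true, minimum: 0}
  currency:     {type: string, required: true, enum: [USD, EUR, GBP]}
  completed_at: {type: timestamp, required: true}

semantics:
  description: Emitted once when an order has been paid in full.
  rules:
    - order_total must be greater than 0
    - exactly one event per order_id

sla:
  delivery: streaming
  max_latency_minutes: 15

governance:
  pii_fields: [customer_id]
  access: [analytics, data-science]
```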
While it can be tempting to send a template to your software developers and treat the data contract as a checklist item, this approach undermines the data contract's effectiveness. Remember, data quality requires alignment, collaboration, early enforcement, and transparency. Data contracts are a tool to help you achieve each of these results. Use them successfully, and you can break away from the reactive model where you’re painfully fixing issues in post. To ensure your data contracts are effective, they should be:
- An abstraction over source data – data contracts are a way for data consumers to standardize source data, but data producers need to be able to iterate and change as the business needs dictate. So, practical data contracts create an abstraction on top of the source data by modeling business entities (nouns) and events (verbs). For example, a data contract can be created for the entity (customer) instead of for the customer table. The entity has a definition and data attributes mapped from one or several tables. With this abstraction, data producers don’t have to worry about changing the names of columns or which table a column lives in. When they make changes, they can map the new name or column to the same attribute in the data contract without disrupting anything for the consumer, so they don’t need to inform them of the change. This method also makes it easier for data consumers to use data properly instead of assuming or guessing which field represents the data they need.
- Enforced at the data producer level – enforcing a contract downstream from the data producer is too late. Data must be controlled before it gets into the warehouse (see the enforcement sketch after this list). Early enforcement also incentivizes data producers to pay attention to downstream requirements and gives them explicit ownership over their data and data contracts.
- Defined by data producers and consumers – while data producers should be responsible for enforcement, data consumers must be a part of the contract definition. Only the data consumers know what is necessary for the specific downstream use cases. Giving data consumers a role in the initial contract definition and ongoing contract updates fosters alignment and ongoing collaboration. It generates a healthy push and pull as data consumers can express their needs and data producers can check to make sure consumers are using the data correctly.
- Public, living documents – since data contracts are written in code they can (and should) be handled like all other pieces of code and placed into version control with a change management process. This supports transparency, enabling anyone in the business to read the contract and understand the evolving nature of data requirements. Data contracts should be kept up to date as business needs change, AI/ML models are added or removed, or other use cases change.
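To show what enforcement at the producer level might look like, here’s a minimal sketch in Python using the jsonschema library. The schema mirrors the hypothetical order_completed contract above, and the quarantine and delivery functions are stand-ins for whatever dead-letter queue and pipeline you actually use; the pattern to take away is that a malformed event never reaches the warehouse.

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# JSON Schema derived from the hypothetical order_completed contract above.
ORDER_COMPLETED_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "order_total": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "completed_at": {"type": "string"},
    },
    "required": ["order_id", "customer_id", "order_total", "currency", "completed_at"],
    "additionalProperties": False,
}

def quarantine(event: dict, reason: str) -> None:
    # Stand-in for a real dead-letter queue or alerting hook.
    print(f"quarantined: {json.dumps(event)} ({reason})")

def send_to_pipeline(event: dict) -> None:
    # Stand-in for delivery to your event pipeline or warehouse loader.
    print(f"sent: {json.dumps(event)}")

def emit_order_completed(event: dict) -> None:
    """Enforce the contract before the event ever leaves the producing service."""
    try:
        validate(instance=event, schema=ORDER_COMPLETED_SCHEMA)
    except ValidationError as err:
        quarantine(event, reason=err.message)
        return
    send_to_pipeline(event)

# Example: a negative order_total violates the contract and is caught at the source.
emit_order_completed({
    "order_id": "ord_123",
    "customer_id": "cus_456",
    "order_total": -10.0,
    "currency": "USD",
    "completed_at": "2024-01-01T12:00:00Z",
})
```

In practice, a check like this would live in shared producer tooling (an SDK, a CI step, or a streaming gateway) so individual teams don’t have to reimplement it.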
Data contract tools
We’ve purposely focused on principles in this piece. When it comes to tooling for data contracts, there are a number of options. When picking a solution for your business, focus on tools that are already in your stack or that integrate easily with it. You should also write your data contracts to be tool agnostic, so you keep the flexibility to adopt whatever best fits your needs as the tooling market grows and matures. You can find more details on the technical implementation of data contracts in this piece from Chad Sanderson.
As a final note, it’s important to remember that data quality is an ongoing process. You don’t need to do too much too soon. Start with your most business-critical pipelines and use cases. When you create and implement data contracts for them, you’ll increase alignment and buy-in on all sides, and you’ll come away with valuable lessons and feedback on where to focus next.
Drive data quality at the source
Data quality is the foundation of every data initiative, but it’s impossible to achieve while data producers and consumers operate in isolation. As long as this is the case, you’ll be in reactive mode fixing problems in post with brittle solutions.
You can bridge the divide, and go from reactive to proactive, through alignment, collaboration, early enforcement, and transparency. Data contracts are an excellent tool to help you get there. When you start driving quality at the source, you can trust that the data in your warehouse is fresh, accurate, and consistent. You’ll be able to deliver on use cases for every team and fuel AI/ML with powerful input.
To find out how RudderStack can fit into your data quality workflow, explore our data quality toolkit and reach out to our team to request a demo.