
Reverse ETL is Just Another Data Pipeline

Written by
Soumyadeb Mitra
Founder and CEO of RudderStack

Astasia from RedPoint Ventures wrote a great post on new technologies supporting “reverse ETL” functionality in the customer data stack.

We’re excited to be innovating in the area of reverse ETL tech (via our Reverse ETL feature), and our product and engineering teams discuss these topics and industry trends often, so we thought it would be helpful to provide a bit more technical depth on a few of Astasia’s points.

1) Data Movement Differs Between Event Streams and Tabular Data, Which is an Important Consideration for Reverse ETL

Differences in Moving Data

ETL/ELT solutions all accomplish a similar function, moving data, but there are several foundational differences to keep in mind when it comes to the data. One of those is the difference between event stream data and tabular data.

Everyone is familiar with this distinction on the ingestion side of the stack. Ingesting event streams is different from ingesting tabular data from SaaS applications (like Salesforce), not just in how the data is generated, i.e., pulled via API vs. generated by an SDK, but also in the structure of the data itself.

This distinction has significant impacts, from real-time requirements to downstream reporting implications. It has led to different vendors doing well in each category (i.e., Segment for event streaming and Fivetran for tabular data), but modern companies have to leverage both.

At RudderStack, we believe there is an opportunity to do both well, together. In fact, a unified stack to do both can achieve some interesting things, like cross-pipeline identity stitching (i.e., joining Salesforce record IDs with anonymousIDs) and unified data governance. We’re building these solutions at RudderStack, but that’s a topic for a different post.
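To make the idea concrete, here is a minimal sketch of cross-pipeline identity stitching, with hypothetical record shapes: anonymousIds from the event stream are joined to Salesforce records on a shared key (email here).

```python
# A minimal sketch of cross-pipeline identity stitching: joining
# event-stream anonymousIds to Salesforce records on a shared email key.
# All record shapes and values are illustrative assumptions.

salesforce_contacts = [
    {"sf_id": "003XYZ", "email": "ada@example.com"},
]
identify_events = [
    {"anonymousId": "anon-123", "traits": {"email": "ada@example.com"}},
]

# Build an email -> anonymousId map from the event stream...
anon_by_email = {e["traits"]["email"]: e["anonymousId"] for e in identify_events}

# ...then stitch each Salesforce record to its anonymous activity.
stitched = [
    {**contact, "anonymousId": anon_by_email.get(contact["email"])}
    for contact in salesforce_contacts
]
print(stitched)
```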

Event Stream vs. Tabular Data for Reverse ETL

When you hear the term reverse ETL, it’s easy to think only of tabular data. The distinction still applies on the way out of the warehouse, though: you can (and should) distinguish between event stream data and tabular data.

Astasia touched on a few use cases for tabular data, but reverse ETL as an event stream is equally important. One use case we see quite often among our customers is sending events stored in logs (e.g., generated by your backend application and dumped into S3) into destinations like Google Analytics and Amplitude for analytics or platforms like Braze for marketing.

Many of our customers also perform more advanced processing, such as data mining or ML modeling, on those logs before sending them as events, then use RudderStack to pipe the data.
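As a concrete illustration, here is a minimal sketch of that log-replay pattern. The S3 bucket, key prefix, and log record shape are hypothetical, and the endpoint assumes RudderStack's v1 HTTP track API (write key as the basic-auth username); adjust both to your own setup.

```python
# A minimal sketch of replaying backend log events stored in S3 as track
# events. Bucket, prefix, and record fields are illustrative assumptions.
import json

import boto3
import requests

S3_BUCKET = "my-app-logs"        # hypothetical bucket name
LOG_PREFIX = "backend/events/"   # hypothetical key prefix
DATA_PLANE = "https://my-dataplane.example.com"  # your RudderStack data plane URL
WRITE_KEY = "YOUR_WRITE_KEY"

s3 = boto3.client("s3")

def replay_logs():
    # List log objects and forward each JSON line as a track call.
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=S3_BUCKET, Prefix=LOG_PREFIX
    )
    for page in pages:
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=S3_BUCKET, Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                record = json.loads(line)
                requests.post(
                    f"{DATA_PLANE}/v1/track",
                    auth=(WRITE_KEY, ""),
                    json={
                        "userId": record["user_id"],
                        "event": record["event_name"],
                        "properties": record.get("properties", {}),
                    },
                    timeout=10,
                )

if __name__ == "__main__":
    replay_logs()
```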

2) Table Sync can be Modeled as an Event Stream

An important point related to tabular vs. event stream data is that tabular data can be modeled as events, but not necessarily the other way around. On the ETL/ELT side, CDC technologies (and incremental pulls) have generated quite a bit of interest because there are advantages to representing that data as events versus doing a batch pull via API.

Some of those advantages include incremental syncs for real-time updates, maintaining a consistent point-in-time state, and routing the data to streaming technologies (we will cover this topic in more depth in a future post).

This is possible because tabular data is actually a subset of event data, just as batch processing is a subset of stream processing. This means a table can be derived from a stream of events, and its final state can be recreated at any point in time (see event sourcing architecture). The inverse is not true, though.
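A tiny event-sourcing sketch makes this concrete: the table is just a fold over its change events, so any point-in-time state can be rebuilt by replaying the stream up to that point. The event shape here (op/id/row/ts) is an illustrative assumption.

```python
# A minimal sketch of event sourcing: replay insert/update/delete events
# to rebuild a {row_id: row} table as of any point in time.

def rebuild_table(events, as_of=None):
    """Fold change events into table state, stopping at timestamp as_of."""
    table = {}
    for event in events:
        if as_of is not None and event["ts"] > as_of:
            break  # stop at the requested point in time
        if event["op"] in ("insert", "update"):
            table[event["id"]] = event["row"]
        elif event["op"] == "delete":
            table.pop(event["id"], None)
    return table

events = [
    {"op": "insert", "id": 1, "row": {"plan": "free"}, "ts": 1},
    {"op": "update", "id": 1, "row": {"plan": "pro"}, "ts": 2},
    {"op": "delete", "id": 1, "row": None, "ts": 3},
]

print(rebuild_table(events, as_of=2))  # {1: {'plan': 'pro'}}
print(rebuild_table(events))           # {} (the row was later deleted)
```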

In fact, the reason the industry adopted the tabular/batch model was primarily technical: it is much more difficult to build and manage streaming data pipelines. However, this is changing with technologies like RudderStack.

At RudderStack, we have modeled table sync as an event stream in our reverse ETL solution.

At a high level, syncing a row from your warehouse to a ‘row’ in a cloud application is an “event” that specifies which data points are being mapped. Tools like Segment and RudderStack already accomplish that functionality with .identify calls in the event stream, so there is no inherent limitation of the data model for this use case.
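For illustration, here is a minimal sketch of that mapping, with hypothetical column names: each warehouse row becomes an identify-style event payload.

```python
# A minimal sketch of modeling table sync as an event stream: each
# warehouse row becomes an identify event keyed on a user identifier.
# Column names and trait fields are illustrative assumptions.

def row_to_identify(row):
    """Map one warehouse row to an identify-style event payload."""
    return {
        "type": "identify",
        "userId": row["user_id"],
        "traits": {
            "email": row["email"],
            "plan": row["plan"],
            "lifetime_value": row["ltv"],
        },
    }

row = {"user_id": "u_42", "email": "ada@example.com", "plan": "pro", "ltv": 1280}
print(row_to_identify(row))
```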

So, while there are certainly different user experiences for streaming-based solutions like RudderStack vs. table sync solutions like Census, they are primarily variations in UI/UX.

3) The Limitations of Segment’s Personas Product for Reverse ETL

Astasia made a great point about the distinction between reverse ETL and Segment’s Personas product. Personas is a powerful product, but it isn’t a reverse ETL solution. The reason is simple: Personas treats the user profile as a first-class object in data sync, so all synced data must conform to a contact/account structure. As the recent increase in reverse ETL startups has shown, though, companies need sync functionality that serves a much wider range of use cases.

Reverse ETL as an event stream directly from the warehouse is unhindered by those limitations. In fact, with our Reverse ETL feature, our customers can turn warehouse tables into a flexible, configurable event stream. That includes updating contacts, accounts, and audiences, but it can also support sending cleansed internal events, derived proxy events (events derived from the absence of behavior), and use cases where the data needs to be delivered to other infrastructure via tools like Kafka, Redis, or HTTP endpoints.
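As one example, here is a minimal sketch of a derived proxy event: a warehouse query over inactive users emits synthetic ‘became_inactive’ events. The table, columns, and emit() destination are illustrative assumptions, and sqlite3 stands in for a real warehouse connection.

```python
# A minimal sketch of a derived "proxy" event: rows from a warehouse
# query over inactive users become synthetic track events.
import sqlite3  # stand-in for a real warehouse connection

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, last_seen_at TEXT)")
conn.execute("INSERT INTO users VALUES ('u_42', '2021-01-01')")

INACTIVITY_QUERY = """
SELECT user_id, last_seen_at
FROM users
WHERE last_seen_at < date('now', '-14 days')
"""

def emit(event):
    # In practice this would go to Kafka, Redis, or an HTTP endpoint.
    print(event)

# Each matching row becomes a synthetic event, derived from the
# absence of recent behavior rather than from a tracked action.
for user_id, last_seen_at in conn.execute(INACTIVITY_QUERY):
    emit({
        "type": "track",
        "userId": user_id,
        "event": "became_inactive",
        "properties": {"last_seen_at": last_seen_at},
    })
```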

4) Reverse ETL is Still Just Data Movement, and a Single Pipe Simplifies Your Stack, Security, and Data Governance

Our mission at RudderStack is to help data engineers become the heroes of their companies by providing every team with rich data. We want to make their jobs easier, and part of that mission is simplifying data management into one pipeline.

In the modern stack, the warehouse is king, and many destinations are also becoming sources (and vice versa!).

For example, sources of data often include:

  • Events coming from your client or server-side apps
  • Data from your SaaS tools
  • Data from your internal databases
  • Data from your warehouses and data lakes (and...lakehouses)
  • Data from internal event streams (like Kafka)

When all of those sources are also destinations, almost every combination of source and destination is a use case, which creates some important categories of tooling in the customer data stack:

  • App to warehouse/SaaS (event streaming)
  • SaaS to the warehouse (‘traditional’ ETL/ELT)
  • Warehouse to SaaS (new ‘reverse ETL’ category)
  • SaaS to SaaS (API to API category)

Increasingly, those categories within the stack need to support important use cases that are becoming standard but are still challenging for many companies to implement from a technical standpoint:

  • Enabling customer-facing ML use cases by sending live events to a key-value store (like Redis) for real-time personalization (see the sketch after this list)
  • Enabling internal ML use cases by pulling data from the warehouse, enriching it with ML, then sending it to tools for internal teams (i.e., Salesforce, Marketo, etc.)
  • Streaming internal events from Kafka to SaaS applications
  • Feeding transformed data (features) to feature stores (like Tecton)
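Here is a minimal sketch of the first use case in that list, assuming a local Redis instance and an illustrative key scheme: live events update a per-user feature record, and the app reads it back at request time to personalize.

```python
# A minimal sketch of real-time personalization: live event-derived
# features are cached in Redis and read back at request time.
# The key scheme and feature fields are illustrative assumptions.
import json

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def on_event(event):
    """Update the user's feature record as each live event arrives."""
    key = f"user_features:{event['userId']}"
    features = json.loads(r.get(key) or "{}")
    features["last_event"] = event["event"]
    features["event_count"] = features.get("event_count", 0) + 1
    r.set(key, json.dumps(features))

def personalize(user_id):
    """Read the cached features at request time to pick an experience."""
    features = json.loads(r.get(f"user_features:{user_id}") or "{}")
    return "power_user_home" if features.get("event_count", 0) > 100 else "default_home"

on_event({"userId": "u_42", "event": "page_viewed"})
print(personalize("u_42"))
```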

When you step back and look at those categories and the use cases they must support, it becomes clear that a business could easily have to build or buy a significant number of technologies to enable all of the functionality. In fact, we hear from companies all of the time about the pain of managing multiple technologies and vendors (which means contracts) across data pipelines.

One commenter on Astasia’s LinkedIn post said it this way:

“I think we have failed as technologists if we need to build different tools to load data into a warehouse and get data out of the warehouse. This is "focus on doing one thing well" and "you must create a new category" gone too far.”

We agree. Customers tell us all of the time that when it comes to managing pipelines in the context of the modern data stack, “best of breed” is becoming problematic to manage. After all, it’s the same customer data.

At RudderStack, we’re building the complete customer data stack for simplified pipeline management, including the reverse ETL component.

Sign up for Free and Start Sending Data

Test out our event stream, ELT, and reverse ETL pipelines. Use our HTTP source to send data in less than 5 minutes, or install one of our 12 SDKs in your website or app. Get started.

February 25, 2021