Event Deduplication in RudderStack

Learn how RudderStack deduplicates events during processing.


This guide details how RudderStack deduplicates events during processing.

What is deduplication?

RudderStack’s SDKs ensure at-least-once delivery of events. As a result, an event can be sent multiple times when it is retried (because of network issues, for example).

With deduplication, RudderStack ensures that an event is not processed multiple times, with the goal of exactly-once delivery.

How deduplication works

RudderStack SDKs assign an immutable UUID called messageId to every event and include it in the event. The messageId does not change during retries: a retried event carries the same value it was assigned initially.

When RudderStack processes an event, it checks whether the event’s messageId has already been processed. If it has, RudderStack drops the event. If not, RudderStack processes the event and stores the messageId for future lookups.
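
The check-then-store flow described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not RudderStack’s actual implementation: the Deduplicator class, its process method, and the in-memory set used as the messageId store are all hypothetical names chosen for this example.

```python
import uuid


class Deduplicator:
    """Hypothetical in-memory dedup store keyed by messageId (illustration only)."""

    def __init__(self):
        self._seen = set()  # messageIds of events that were already processed

    def process(self, event):
        """Return True if the event was processed, False if it was dropped."""
        message_id = event["messageId"]
        if message_id in self._seen:
            return False  # messageId already processed: drop the event
        # Real event processing would happen here.
        self._seen.add(message_id)  # store the messageId for future lookups
        return True


dedup = Deduplicator()
event = {"messageId": str(uuid.uuid4()), "event": "Order Completed"}

first_attempt = dedup.process(event)  # first delivery: processed
retry_attempt = dedup.process(event)  # retry with the same messageId: dropped
```

Because the retry carries the same messageId as the first attempt, the second call finds it in the store and drops the event.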

Does deduplication promise exactly once delivery?

While our system is designed with the goal of exactly-once delivery, there are certain conditions under which it cannot be guaranteed. RudderStack cannot guarantee event deduplication in the following cases:

The event is received outside the deduplication window of 7 days.
Technical issues (a network interruption, for example) prevent a messageId from being stored after an event is processed. This is a rare exception and is very unlikely to happen.
A customer’s server nodes are scaled up or down as per their requirement (to handle peak traffic, for example). In such cases, the data in the existing nodes isn’t copied over to the new nodes, so event duplication is possible.

Deduplication time window

RudderStack’s deduplication time window is 7 days, which means that it stores a messageId for 7 days. If an event that RudderStack processed 8 days ago reappears, it is processed again.
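
To make the effect of the window concrete, here is a minimal Python sketch of a time-bounded lookup. It is a simplified model, not RudderStack’s implementation: the WindowedDeduplicator class and its process method are hypothetical, and two behaviors are assumptions made for the example, namely that a dropped duplicate does not refresh the stored timestamp and that an entry counts as expired once exactly 7 days have passed.

```python
from datetime import datetime, timedelta

DEDUP_WINDOW = timedelta(days=7)  # window length stated in this guide


class WindowedDeduplicator:
    """Hypothetical dedup store that forgets a messageId after the window."""

    def __init__(self, window=DEDUP_WINDOW):
        self._window = window
        self._stored_at = {}  # messageId -> time the messageId was stored

    def process(self, message_id, now):
        """Return True if the event is processed, False if it is dropped."""
        stored_at = self._stored_at.get(message_id)
        if stored_at is not None and now - stored_at < self._window:
            return False  # still inside the window: duplicate, drop it
        # Not seen before, or the stored entry has expired (assumed behavior).
        self._stored_at[message_id] = now
        return True


dedup = WindowedDeduplicator()
t0 = datetime(2024, 1, 1)

initial = dedup.process("msg-1", t0)                           # processed
after_3_days = dedup.process("msg-1", t0 + timedelta(days=3))  # dropped
after_8_days = dedup.process("msg-1", t0 + timedelta(days=8))  # processed again
```

The third call mirrors the case above: the messageId stored on day 0 is no longer retained on day 8, so the event is processed a second time.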

Identify duplicates in your data store

A sample SQL query to get duplicate IDs (messageId) in Snowflake is shown below:

SELECT id, COUNT(*)
FROM your_table
GROUP BY id
HAVING COUNT(*) > 1;

A sample SQL query to get the duplicate IDs along with their columns in Snowflake is shown below:

WITH duplicate_ids AS (
    SELECT id
    FROM your_table
    GROUP BY id
    HAVING COUNT(*) > 1
)
SELECT yt.*
FROM your_table yt
JOIN duplicate_ids d ON yt.id = d.id
ORDER BY yt.id ASC;

