Event Deduplication in RudderStack
Learn how RudderStack deduplicates events during processing.
This guide details how RudderStack deduplicates events during processing.
What is deduplication?
Rudderstack’s SDKs ensure at least once delivery of events. However, there can be cases where an event is sent multiple times due to retries (as a result of network issues, etc.)
With deduplication, RudderStack ensures an event is not processed multiple times, with a goal of exactly once delivery.
How deduplication works
RudderStack SDKs assign a immutable UUID for every event called messageId
and it is included in every event. This messageId
is unchanged even during event retries and has the same value as assigned initially.
When RudderStack processes an event, it checks if a particular messageId
is already processed and if yes, drops it. If it’s not processed already, RudderStack processes the event and stores the messageId
for future lookups.
Does deduplication promise exactly once delivery?
While our system is designed with the goal of exactly once delivery, there are certain conditions under which exactly once delivery cannot be guaranteed. RudderStack cannot guarantee event deduplication in the following cases:
- If the event is received outside the dedup window of 7 days.
- Technical issues (like network interruption, etc.) that prohibit a a
messageId
from being stored after processing an event. This is a rare exception and is very unlikely to happen. - When a customer’s server nodes are scaled up or down as per their requirement (to handle peak traffic, for example). In such cases, the data in the existing nodes isn’t copied over to the new nodes and event duplication is possible.
Deduplication time window
RudderStack’s deduplication time window is 7 days - this means that it stores a messageId
for 7 days. If an event that RudderStack processed 8 days back reappears, it is processed again.
Identify duplicates in your data store
A sample SQL query to get duplicate IDs (messageId
) in Snowflake is shown below:
SELECT id, COUNT(*)
FROM your_table
GROUP BY id
HAVING COUNT(*) > 1
A sample SQL query to get the duplicate IDs along with their columns in Snowflake is shown below:
WITH duplicate_ids AS (
SELECT id
FROM your_table
GROUP BY id
HAVING COUNT(*) > 1
)
SELECT yt.*
FROM your_table
JOIN duplicate_ids d ON yt.id = d.id
ORDER BY yt.id ASC;
Questions? Contact us by email or on
Slack