Customer Session Analysis Using dbt and RudderStack
One of the most important and foundational aspects of any Customer Data Platform is the ability to trace the steps of individual users as they navigate throughout your website, apps, marketing ecosystem, social platforms, etc. RudderStack makes it easy to both collect this data from every platform and send it to your cloud data warehouse for advanced modeling and analysis.
In this exercise we will walk through a sample dbt that is useful for two key analytics use cases in your warehouse that are enabled by RudderStack behavioral data (the code is available via our Git repo or at hub.getdbt.com/rudderlabs).
- Linking anonymous activity (pre first identify call) to now-known users as well as joining different anonymous ids for the same user either between subdomains or across devices
- Defining sessions based on any activity (in our case track calls) associated with a user (per #1) and some amount of time lapsing between events
Having the raw data to gain visibility in both cases is incredibly helpful both for enabling analytics, but also problem-solving around issues of identity resolution and session definitions.
It’s also worth pointing out that in this example, we define sessions as some increment of time. In other use cases, you might want to replace the term “session” with “funnel stage” which can be defined as a user having performed some particular behavior (i.e., track call) or some other datapoint (returned package for example) in your warehouse. For example, if you wanted to build an e-commerce sales funnel based on stages, you could easily replace the session logic with an events map that defines checkout stages like whether a lead had converted on a landing page and/or whether they placed an item in their cart.
Overview
This dbt project builds on top of the source table tracks which is created by default in all the RudderStack warehouse destinations. You could also include page calls or other activity in your warehouse, such as customer service tickets, inbound calls, returned packages, etc. as long as that activity has a timestamp.
The data from the tracks table is used to first create a session abstraction and then prepare the sequence of 5 events triggered by the customer during the course of a session. This sequence represents the customer's journey. You can increase the number of events in the journey by adding the necessary code to dbt_tracks_flow.sql.
How to use this repository
This project was created on the dbt Cloud. Hence there is no profiles.yml file with the connection information. You can import this IMPORT SCREENSHOT
Developers who want to execute the models on the Command Line Interface (CLI) mode will need to create additional configuration files by following the directions provided here.
Note: While this code has been tested for Google BigQuery, it should also be usable for other RudderStack-supported data warehouses like Amazon Redshift and Snowflake with only minor tweaks required for the platform-specific analytics formulas.
Project components
The following files are included in the dbt model and will be referenced below. You can replicate this repo and load it directly into your own cloud dbt project or follow the links to inspect the code offline.
- dbt_project.yml - Every dbt project has a dbt_project.yml file. These are written in YAML and define common conventions and properties. For our project, the highlights from this page include the name, version and that we want our models to be materialized as views.
- dbt_aliases_mapping.sql - Creates a map or “crosswalk” of anonymous id’s to known user id’s so that activity performed by once anonymous users can be included in the journey of known users. This is important for identifying users across devices as well as users performing activity across different subdomains.
- dbt_mapped_tracks.sql - Leverages the alias mapping model above and to link all of the events to the now known user id. We also set the stage to define “session” as any number of minutes by measuring the time between events and store this in an idle time field.
- dbt_session_tracks.sql - We define the sessions for each user based on the mapped tracks model. We define a session as 30 minutes of idle time, identifying a start, sequence number and next session start time if there is one.
- dbt_track_facts.sql - We assign the events with from mapped tracks model to their respective session base on the user and timestamp falling between the start and next session start window.
- dbt_tracks_flow.sql - Here we leverage analytics functions to display more useful information about each session, such as the first 5 events. The sky’s the limit for what you might want to add in here, including identifying specific track calls that constitute conversions, or aggregating the number of events, total dollars for checking transactions, etc.
- tracks.yml - YAML file to define the schema and tracks tables we want to use for the dbt_alias_mapping model.
Note: Each of these models is materialized as a table. Since these sessions are time based, we could also use incremental tables based on the latest event timestamp in our table.
Sequence of commands
The sequence in which the dbt models should be executed for a fresh run is as follows:
dbt_aliases_mapping
This model/table has two attributes/columns - alias and dbt_visitor_id. This table captures the linkages between one or more anonymous_id values (alias) and a user_id (dbt_visitor_id). For a deeper dive into identity mapping and graphing, check out this blog post from our founder on Identity Graphing & ID Resolution.
dbt_mapped_tracks
This table has the columns event_id, anonymous_id, dbt_visitor_id, timestamp, event, and idle_time_minutes.
The event field represents the actual event name. Timestamp corresponds to the instant when the event was actually generated. idle_time_minutes captures the time gap between the current event and the immediately preceding one. You can add whatever fields from your track events or other non-rudder activity in your warehouse.
dbt_session_tracks
This table defines the sessions for each individual user. In the next step we will assign the events to these sessions. This session contains the columns session_id, dbt_visitor_id, session_start_at, session_sequence_number, and next_session_start_at.
The data in the dbt_mapped_tracks table is partitioned first by dbt_visitor_id. It is then partitioned further into groups of events where the time gap within one group i.e. idle_time_minutes is not more than 30 minutes. In other words - if idle_time_minutes for an event is more than 30, a new group is created.
Some important points to remember:
- These groups of sequential events are the sessions. The value of idle_time_minutes can be modified in the model definition.
- The session_sequence_number represents the order of the session for a particular user.
- The session_id is of the form session_sequence_number - dbt_visitor_id.
dbt_track_facts
This table has the columns anonymous_id, timestamp, event_id, event, session_id, dbt_visitor_id, and track_sequence_number.
In this table, the information from dbt_session_tracks is tied back to the records in the dbt_mapped_tracks table. Each event is now tied to a session_id and within the session, the event is assigned a track_sequence_number.
dbt_tracks_flow
The columns in this table are event_id, session_id, track_sequence_number, event, dbt_visitor_id, timestamp, event_2, event_3, event_4, and event_5. This is essentially a table where each event and 4 subsequent events are represented in each record.
Note: Please remember to change schema in tracks.yml and dbt_aliases_mapping.sql to your database schema.
Next steps
Once you unleash the power of dbt on the RudderStack data in your data warehouse, you can send this information back into your ecosystem using RudderStack Reverse ETL. Check out our blog RudderStack's Data Stack: Deep Dive to see how we do this internally, and join us on Slack to see how other teams are leveraging the tool.