How to build web analytics on your data warehouse with RudderStack, dbt, Snowflake and a visualization layer
Many businesses use web analytics data to understand how their websites are used and how to make them more effective. Monitoring the number of visitors, how long they stay on the site, and which pages they visit can be hugely beneficial for teams that want to make informed decisions about website design, content and marketing tactics.
Google Analytics is the most popular web analytics tool in the world, but in light of Google's deadline requiring migration to its new platform, GA4, many data teams are forgoing Google Analytics altogether and building web analytics with first-party data on their own data warehouses.
Our team at RudderStack has worked with our customers to build robust, open-source dbt models that make it easy for analysts and analytics engineers to build rich web analytics visualizations on their own data store.
In this guide, we will walk through the steps a data and analytics team needs to take to build web analytics on their warehouse, from first-party data collection all the way to visualization.
Step 1. Define the metrics
Google Analytics uses its own set of metric definitions, so the first step in building analytics on your warehouse is clearly defining the metrics that are important to the consumers of the data.
In this guide, we use the following metrics definitions:
- Unique users: “Users” refers to “unique users” or the number of distinct individuals who have visited the site (i.e., generated at least one page() call) within a given timeframe.
- Note: as we will show below, RudderStack’s pageview schema makes it very easy to calculate unique users by querying distinct anonymousIds in the page/screen tables in the data warehouse (see the sketch after this list).
- Sessions: A session is a collection of user interactions that happen on a website over the course of a specific time period. A single session, for instance, may include a number of website views, events, and online purchases.
Note: You can collect event tracking metrics and combine them with session metadata using RudderStack's session tracking feature, which gives you the raw session data with each event as well as the ability to customize your session definitions.
- Bounce rate: bounce rate measures the percentage of users who leave the site after viewing only one page. We’ve used our session logic to calculate the bounce rate by dividing the number of sessions with a duration of 0 (sessions where session_start time and session_end time are the same) by the total number of sessions (see the sketch after this list).
- Traffic by channel: Channels refer to groups of referring sites that go together logically. For example, traffic referred from Facebook and Twitter can be grouped together into a “Social” channel that represents traffic from all social media sites. So, traffic by channel breaks visits and sessions down into channel groupings. Google Analytics applies this logic under the hood, so in our dbt model we’ve built specific logic to create those groupings (which you can customize to meet your specific needs or definitions).
- Traffic by referrer: traffic by referrer counts referring URLs by unique visit for a given time period. For example, how many visits came from the referring domain of github.com?
- Note: RudderStack automatically captures referrer and referring domain through the pages object.
- Traffic by source/medium: marketers often use UTM parameters to tag URLs with information on the source of traffic (e.g., reddit) and the medium (e.g., paid social). Traffic by source/medium allows them to break traffic down to see how those sources, and the types of traffic within them, are performing. The logic in our dbt model is based on UTM values captured in page calls, with the referrer and referring domain as a fallback in case UTM source and medium aren’t present.
- Note: RudderStack automatically captures UTM parameters within the campaign object.
- Traffic by device: traffic by device breaks down traffic by the type of device used to access the site. Device is categorized as desktop, mobile, or tablet, and our dbt model uses logic based on screen-size data points to extract this metric. You can also leverage user agent information from page calls to derive device type (and many data teams use both).
Note: RudderStack automatically captures screen sizes within the screen object.
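To make these definitions concrete, here is a minimal sketch of how unique users and bounce rate could be queried in Snowflake. It assumes RudderStack's default pages table and a simplified sessions table with session_start_time and session_end_time columns; the dbt models in Step 3 build the production versions of this logic:

```sql
-- Unique users per day: distinct anonymousIds that generated at least one page() call.
-- Assumes RudderStack's default PAGES table.
select
    date_trunc('day', timestamp) as event_date,
    count(distinct anonymous_id) as unique_users
from pages
group by 1
order by 1 desc;

-- Bounce rate: zero-duration sessions divided by all sessions.
-- SESSIONS stands in for the session-level table the dbt models derive.
select
    count_if(session_start_time = session_end_time) * 100.0 / count(*) as bounce_rate
from sessions;
```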
Step 2. Capture first-party traffic data (events) from your website
1. The first step is to sign up for a free RudderStack account if you haven’t already.
2. Next, we need to get event data from our website into our data warehouse of choice. We’ll set up an Event Stream source for our website and connect it to a Snowflake destination.
3. The different page views, events, and user interactions will be written into separate tables in the data warehouse. RudderStack makes this easier by automatically creating tables that match the type of event using a pre-built warehouse schema.
4. Once we start streaming events into our data warehouse, we want to make sure the following tables were created (a quick check query follows below):
- Tracks
- Pages
- Identifies
Note: For a detailed guide on getting started with RudderStack Event Stream, you can check out this link.
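As a quick sanity check, you can confirm the tables above exist and are receiving rows with a query like the one below (a sketch assuming RudderStack's default table names in the schema you connected):

```sql
-- Confirm the core RudderStack tables exist and are loading events.
select 'tracks' as table_name, count(*) as row_count from tracks
union all
select 'pages', count(*) from pages
union all
select 'identifies', count(*) from identifies;
```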
Step 3. Run dbt to generate metrics tables in your data warehouse
Once we have metrics defined and the first-party event data from our website in our warehouse, we need to run models on that data that will materialize the metrics as tables in the data warehouse. On top of those metrics tables we can build visualizations for our marketing team or even run more advanced queries to understand things like attribution for user acquisition. Since we are only looking at web traffic analytics in this guide, we can start by creating tables for the metrics defined in Step 1.
1. This GitHub repo contains the code that we will run in dbt to generate our web analytics models. To make things simpler, we’ll walk through and break down the contents of the repo. The models folder is where we keep the models we intend to run; you’ll notice it has three subfolders:
- Sources: This folder includes a .sql file named rs_stg_all_events.sql, which creates an ephemeral table that combines all the events from the pages, tracks, and identifies tables, applies time-based filtering, maps the UTM details for each row to a channel, and derives the device type. Think of this table as a pre-processing layer.
Note: reporting_dims.sql in the macros folder includes functions that extract the metrics we identified in Step 1, including the formulas for calculating the channel, utm_source, utm_medium, referrer, and device fields (see the sketch after this list).
- Intermediate: this is where all the staging models live. There are two .sql files in this subfolder that will be used to build user-level tables:
- rs_stg_session_metrics.sql: all user-level data, including channel, referrer, source, and medium, extracted from the rs_stg_all_events.sql table. We will use this to capture session-based metrics like number of sessions, bounce rate, and average session length.
- rs_stg_user_first_touch.sql: user-level data that includes the first-touch timestamp, channel, and UTM source/medium. It will also be used to extract other aggregate metrics, such as DAU split by channel, referrer, and source details.
- Metrics: this includes the aggregate tables that are built on top of the intermediate staging tables:
- rs_metric_dau_based: this contains Daily Active Users (DAU) aggregated by event date.
- rs_metric_session_based: this contains total sessions, bounce rate, and average session length aggregated by event date.
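To illustrate the kind of logic these models and macros encapsulate, here is a simplified sketch of deriving the channel and device type dimensions. The column names follow RudderStack's warehouse flattening conventions but are assumptions here; the real, customizable expressions live in reporting_dims.sql:

```sql
-- A simplified sketch of channel and device-type derivation.
-- Column names (context_campaign_medium, referring_domain, context_screen_width)
-- are assumptions based on RudderStack's flattened pages schema.
select
    anonymous_id,
    timestamp,
    -- Channel: prefer the UTM medium, fall back to the referring domain.
    case
        when context_campaign_medium ilike '%paid%' then 'Paid'
        when context_campaign_medium = 'email' then 'Email'
        when referring_domain ilike any ('%facebook%', '%twitter%', '%linkedin%') then 'Social'
        when referring_domain ilike '%google%' then 'Organic Search'
        when referring_domain is null then 'Direct'
        else 'Referral'
    end as channel,
    -- Device type: a rough screen-width heuristic.
    case
        when context_screen_width < 768 then 'mobile'
        when context_screen_width < 1024 then 'tablet'
        else 'desktop'
    end as device_type
from pages;
```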
2. Now that we know what the models do, we’ll go ahead and create a dbt project that will build those models as tables in our data warehouse.
- After we have chosen our warehouse and set up our data warehouse connection, we’ll want to set up our code repository. Here, we’ll point to a forked version of this GitHub repo.
- With the project set up, we’ll select Develop, which will take us to the IDE. We’ll want to make sure the project assets are listed under File Explorer.
- Locate dbt_project.yml and find the “vars” section in that file, which contains all the variable definitions.
- Make sure to update rs_database and rs_schema to the names of the database and schema that you configured the connection for above (see the snippet after this list).
- At this point, we are ready to build the project, which will materialize the models as tables in our data warehouse. We will run these two commands in sequence; the first updates and configures the dependencies, and the second runs all the components of the project:
- `dbt deps`
- `dbt build`
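For reference on the vars step above, the relevant section of dbt_project.yml looks roughly like this (a sketch with placeholder values, not the exact file contents):

```yaml
# dbt_project.yml (excerpt) -- replace the placeholders with your own values
vars:
  rs_database: YOUR_DATABASE  # database your Snowflake destination writes to
  rs_schema: YOUR_SCHEMA      # schema containing the pages/tracks/identifies tables
```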
Step 4. Build the BI layer with data visualization tools
Now that our models have been built, we want to create visualizations that help us drill into the data, share it with relevant teams, and use it to identify areas of improvement and make data-driven decisions.
There are different tools that enable us to build BI reports, ranging from modern data analytics tools like Hex, Metabase, and Sigma to more traditional tools such as Tableau, Power BI, and Data Studio. In this step, we will use Hex and Tableau as two different approaches to data visualization.
Hex
Hex is an end-to-end data analytics tool that helps you connect to a number of different data sources to explore, transform, and visualize the data. You can use SQL, Python, and no-code methods to get started and build your analytics into apps that present the data in rich visualizations.
In this example, we’ll create a Hex app that turns the metrics we identified and built as dbt models into visualizations.
1. After signing up for an account in Hex and creating a new project, we’ll create a data warehouse connection to the database where our dbt models live.
2. On the logic side of the app, we’ll start by choosing the connection, then adding the SQL code that queries the rs_stg_session_metrics table for session-related metrics and the rs_stg_user_first_touch table for user-level metrics.
To make this easier, we have built this app to show visualizations of the most common metrics. Looking at the logic side of the app, we can see the SQL queries used to calculate the metric behind each of the visualizations. As an example, to calculate the bounce rate by day metric, we used the following SQL query:
```sql
select
    event_date,
    source_medium,
    channel,
    referrer,
    device_type,
    sum(bounced_sessions) * 100 / sum(n_sessions) as bounce_rate
from "<your-database>"."<your-schema>"."RS_STG_SESSION_METRICS"
group by 1, 2, 3, 4, 5
order by event_date desc
```
From this query, we build the following visualization for the bounce rate:
And the same for traffic by channel:
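For reference, the traffic by channel query follows the same pattern. Here is a sketch, assuming the staging table exposes the event_date, channel, and anonymous_id fields that the Tableau sheets below also use:

```sql
-- Unique visitors per day, broken down by channel.
select
    event_date,
    channel,
    count(distinct anonymous_id) as unique_users
from "<your-database>"."<your-schema>"."RS_STG_USER_FIRST_TOUCH"
group by 1, 2
order by event_date desc
```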
Note: The connection to the data source in Hex is not live, so we have to perform a “Run all” on all cells to get the most up-to-date data. Check out this doc to learn more about scheduled runs in Hex.
Tableau
Some users may prefer more traditional BI tools like Tableau to build their visualization layer. In this example, we’ll show how to create a dashboard that showcases the metrics from our dbt models. We’ll start with a new workbook in Tableau and connect it to our data source, which is Snowflake.
1. We’ll start by selecting, from the dropdowns, the warehouse and the database we built our dbt tables into.
2. Once that’s selected, we’ll want to drag the staging tables onto the canvas and create a relationship between the two tables based on their anonymous ID:
- rs_stg_session_metrics
- rs_stg_user_first_touch
3. With the data source set up, we’ll now want to add a new sheet for each of the metrics we defined in Step 1, using the following formulas:
| Sheet | Column | Row | Filters/Marks |
|---|---|---|---|
| Sessions by day | DAY(Event Date) | SUM([N Sessions]) | Filters: DAY(Event Date) |
| Session duration | DAY(Event Date) | SUM([Total Session Length])/SUM([N Sessions]*60) | Filters: DAY(Event Date) |
| Bounce rate by day | DAY(Event Date) | SUM([Bounced Sessions])*100/SUM([N Sessions]) | Filters: DAY(Event Date) |
| Unique users by day | DAY(Event Date) | COUNTD([Anonymous Id (Rs Stg User First Touch)]) | Filters: DAY(Event Date) |
| Traffic by device | Device Type | COUNT([Anonymous Id]) | Filters: Event Date |
| Traffic by channel | YEAR(Event Date) | COUNTD([Anonymous Id]) | Marks: Channel; Filters: YEAR(Event Date) |
| Traffic by referrer | YEAR(Event Date) | COUNTD([Anonymous Id]) | Marks: Referrer; Filters: YEAR(Event Date) |
| Traffic by UTM Source/Medium | YEAR(Event Date) | COUNTD([Anonymous Id]) | Marks: SOURCE_MEDIUM; Filters: YEAR(Event Date) |
4. Lastly, we can create a dashboard by dragging in the sheets we created earlier to form two different views that contain all of those visualizations as tiles:
- Web analytics traffic data: includes channel, referrer, and UTM source traffic data
- Web analytics session data: includes all session-related data and aggregated views
Conclusion
Properly understanding which marketing efforts are effective starts with analyzing the traffic and user interactions on your web app, which helps you identify the channels you want to invest more resources in.
With this simple end-to-end solution, we have taken ownership of our user event data in our data warehouse and unlocked many useful analytics insights, from real-time event stream data to fully structured warehouse data. From there, we can use customizable, open-source dbt models to transform the warehouse data into rich web analytics insights that can be visualized using our BI tool of choice.
If you would like to continue the conversation or learn more about how RudderStack can help address your company’s data engineering needs, join us on Slack.