It is 2020, and discussions around data-ownership and privacy have finally crossed the Atlantic and reached the American boardroom, thanks to CCPA coming into force.
Companies, both big and small are – or should be – taking stock of all their data-sharing activities, and the first thing usually discussed is the use of data analytics. Products like Google Analytics, Amplitude and MixPanel are near omnipresent, but businesses are finally wondering if there are alternatives. Beyond the data sharing & privacy issue, a couple of other drivers for that thought are:
- Cost: The SaaS products can cost hundreds of thousands of dollars beyond their free tier (typically sub-10m events). Per-user unit economics in some domains (e.g. free to play mobile games) often don’t justify these costs.
- Flexibility: The SaaS analytics products are great, but often restrict the kind of queries you can run. Full SQL is not supported, which makes it hard to combine event data with other internal data. Warehouse dumps, even if supported, cost extra. Also, they are quite time-delayed and restricted in terms of functionality (for e.g. Google Analytics only gives a data dump into BigQuery).
- Open-source and Vendor lock-in: Finally, there is an increasing drive to adopting open-source tech as much as possible to avoid vendor lock-in.
This blog describes the design and implementation of an open-source analytics stack in the context of a mobile casino game. We used RudderStack and Apache SuperSet, with Amazon Redshift as the underlying data-warehouse. This game had all the pain points discussed above – they wanted to bring their data in-house, and run complex analytics queries which weren’t possible on Amplitude’s UI. At the same time, they wanted to continue to use Amplitude for some of their other reporting tasks, but only forward the relevant events to it, thereby limiting data sharing and saving on the event budget.
RudderStack’s transformation module is designed to address this use-case – the events can be removed, sampled or even combined before they are sent to the required destinations. However, that is not the focus of this blog and is covered in detail in a separate blog.
Understanding the mobile game data metrics
Mobile game developers make extensive use of data to increase user engagement and ensure effective monetization. Most mobile games adopt a ‘freemium’ approach, wherein the game is made free to users with the expectation that users will pay for in-app purchases in order to use certain features. Analytics plays a very important role in identifying such features. At the same time, it helps in optimizing the UX (User Experience) so that they don’t lose interest in the game.
Different teams within a mobile game company work with different data metrics. For example:
- Executive teams typically track overall metrics around the health of the business such as total users, revenue, etc.
- Growth teams track campaign or group level metrics, such as the activity of users who have performed a particular event – for e.g. downloaded the app from a specific campaign or made an in-app purchase
- Customer teams track individual user-level metrics such as identifying most active users, or users who had a significant drop in activity.
All these metrics demonstrate the diverse nature and complexity of the queries that any analytics stack has to support, beyond just the traditional reporting.
Next, we give a quick overview of the tools involved.
RudderStack (https://github.com/rudderlabs/rudder-server/) is an open-source alternative to platforms like Segment or mParticle for collecting and routing event data.
Note: You can read this article on clickstream analytics to learn more about events, and how event data is processed.
Why use RudderStack?
RudderStack offers many features that make it a good choice for building an analytics stack. Some of them are:
- RudderStack provides for a powerful transformation framework to process your event data on the go
- RudderStack supports a plethora of hosting options for the backend, such as Docker, Kubernetes, Terraform, etc. It runs on all cloud environments including AWS, GCP and Azure, as well as on a private cloud or a bare-metal environment
- RudderStack has integrations into multiple data warehouses like Amazon Redshift, S3, Google BigQuery, Snowflake and more, in addition to cloud destinations like Google Analytics, Amplitude, MixPanel, HubSpot, UserIQ and Salesforce (Over 20 as of now)
- Finally, RudderStack is open-source
How the mobile game uses RudderStack
In this specific use-case, the mobile game integrated RudderStack’s Unity SDK for generating events. Events were then sent to the RudderStack’s open-source data plane, from where they were dumped into Amazon Redshift.
Some of the events generated by the game were:
spin_result, corresponding to casino spins
Following is a typical event structure generated by application using RudderStack’s SDK. It includes attributes specific to the event as well as specific to the user – all of which needs to be accurately captured in order to make the information comprehensive.
In addition, the RudderStack SDK automatically captures contextual information regarding the type/Operating System/screen dimensions of the device, time of the day etc.
The following section shows how Amazon Redshift loads the data received from RudderStack.
Loading Data into Amazon Redshift
RudderStack has native connectors to analytics databases like Redshift, one of the most popularly used data warehousing services in the cloud. All the events routed by RudderStack are stored in Redshift’s track table, with the entire JSON payload as a column. In addition, we also create separate tables for each event type (e.g. track event) with event properties as columns. The following images demonstrate two examples – one for a particular event type and the other is for all events:
The table for a specific event contains attributes specific to the event, whereas the second table contains information that is event-agnostic. The two tables can be linked by the id field.
RudderStack does automatic schema management. It creates the table schema based on the event structure and keeps it updated if new fields are added to the event JSON or data types of existing fields are changed.
Redshift and other analytics databases store the table in columnar format. All the column values are stored together. This is very efficient for queries that only touch a small number of columns. For example, getting the total revenue requires fetching the revenue column data and summing it up. Adding a filter (
SELECT) on a few columns is pretty efficient too. For example, to get the total revenue from a particular game, we need to fetch fields such as the
revenue columns and sum the revenue values for matching games.
Once the data is loaded into Redshift, we can then perform analytics on it using a BI tool on top of Redshift.
Performing Visual Analytics in Redshift using Apache SuperSet
Apache SuperSet provides an intuitive, code-free interface for creating enterprise-ready dashboards and charts. It also provides a rich visual interface to write SQL queries.
For analyzing the data, we will use Apache Superset to use the data from Redshift, and visualize the results of our queries. The following are some examples of this:
Identifying top users by event count
The query is as follows:
Execution Time : 0.98 seconds
Pie chart for the top 100 events
The query is as follows:
Execution Time : 09.01 seconds
Find people who have churned
As a more complex example, finding paying users whose activity count dropped by 90% week-over-week can be a bit more tricky as it requires:
- Computing the user level week-over-week activity level by doing a GROUP-BY on user_id & week on the ‘tracks’ table
- Doing a self JOIN of the above table to compute the activity drop from week X to week X+1
- Filtering the above to those users which have > 90% activity drop and are paying users
The query is as below:
The corresponding graph is as shown below:
In the above example, we identified users who are churning week-over-week. That’s great, but can we engage with these users to get them to play the game again? Something that allows the game developers to send a push notification to these users, or showing them an ad on Facebook would be an invaluable feature to have.
To address such a problem, RudderStack will also support taking the output of a SQL query as a data source, and send it to cloud destinations like Facebook Ads, Mailchimp, etc. so that the users can be notified. This feature will be available in RudderStack soon.
We saw how RudderStack can be easily integrated with tools such as Redshift and SuperSet to build a seamless, efficient and enterprise-grade analytics stack.
While this blog focuses only on Redshift, the concept of building the analytics stack is equally applicable to any data warehouse which offers a SQL interface, such as Google BigQuery, Azure, etc. or their open-source equivalents like ClickHouse or Apache Druid.
To learn about how RudderStack can help you build your own analytics stack, schedule a demo.