Load data from Mixpanel to PostgreSQL

Extract data from Mixpanel

Mixpanel is an analytics-as-a-service application, so naturally it needs data to offer its analytics features. We usually think of it as a consumer of data rather than a place to get data for our own analysis. But Mixpanel collects a lot of data about how your customers use your product, and if you want to do anything that also involves data from other sources, you really have two choices.

The first is to enrich the data in Mixpanel with data coming from other sources. The second is to extract the data Mixpanel holds for you and load it into a data warehouse for further analysis. This post will consider the second case.

Mixpanel is evolving into a platform where, apart from the analytics services it offers, you will also be able to build applications that integrate with it. In this post, we will work only with the Export API, whose purpose is to let us export our data from Mixpanel.

As a web API, it can be accessed with tools like curl or Postman, or with your favorite HTTP client for the language or framework of your choice. Some options are the following:

  • Apache HttpClient for Java
  • Spray-client for Scala
  • Hyper for Rust
  • Ruby rest-client
  • Python http.client

Or you can use the libraries/SDKs that Mixpanel offers for the following languages:

  • Python
  • PHP
  • Ruby
  • JavaScript

As a RESTful API, it offers the following resources that you can interact with:

Annotations

  • annotations – list the annotations for a specified date range
  • create – create an annotation
  • update – update an annotation
  • delete – delete an annotation

Export

  • export – get a “raw dump” of tracked events over a time period

Events

  • events – get total, unique, or average data for a set of events over a time period
  • top – get the top events from the last day
  • names – get the top event names for a time period

Event Properties

  • properties – get total, unique, or average data from a single event property
  • top – get the top properties for an event
  • values – get the top values for a single event property

Funnels

  • funnels – get data for a set of funnels over a time period
  • list – get a list of the names of all the funnels

Segmentation

  • segmentation – get data for an event, segmented and filtered by properties over a time period
  • numeric – get numeric data, divided up into buckets, for an event segmented and filtered by properties over a time period
  • sum – get the sum of a segment’s values per time unit
  • average – get the average of a segment’s values per time unit
  • Segmentation Expressions – a detailed overview of what a segmentation expression consists of

Retention

  • retention – get data about how often people are coming back (cohort analysis)
  • addiction – get data about how frequently people are performing events

People Analytics

  • engage – get People Analytics data

Let’s assume that we want to export our raw data from Mixpanel. To do so, we’ll need to send requests to the export endpoint. A request that returns raw events from Mixpanel looks like this:

SH
https://data.mixpanel.com/api/2.0/export/?from_date=2012-02-14&expire=1329760783&sig=bbe4be1e144d6d6376ef5484745aac45&to_date=2012-02-14&api_key=f0aa346688cee071cd85d857285a3464&where=properties%5B%22%24os%22%5D+%3D%3D+%22Linux%22&event=%5B%22Viewed+report%22%5D
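The query string above is easier to get right when it is generated rather than typed by hand. The following is a minimal sketch, assuming only the parameters visible in that request; the function name `build_export_url` and the `auth_params` argument are illustrative, and the credentials (`api_key`, `expire`, `sig`) are simply passed through because how they are produced depends on your project.

```python
import json
from urllib.parse import urlencode

EXPORT_URL = "https://data.mixpanel.com/api/2.0/export/"


def build_export_url(from_date, to_date, events=None, where=None, auth_params=None):
    """Return an export URL for a date range, optionally filtered."""
    params = {"from_date": from_date, "to_date": to_date}
    if events:
        # The event filter is sent as a JSON-encoded array of event names.
        params["event"] = json.dumps(events)
    if where:
        params["where"] = where
    # Hypothetical pass-through for credentials such as api_key, expire, and sig.
    params.update(auth_params or {})
    # urlencode percent-encodes each value and turns spaces into "+".
    return EXPORT_URL + "?" + urlencode(params)


url = build_export_url(
    "2012-02-14",
    "2012-02-14",
    events=["Viewed report"],
    where='properties["$os"] == "Linux"',
)
print(url)
```

The encoded `where` and `event` values produced this way match the ones in the request shown above.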

The result is always returned as JSON, with one event per line, sorted by increasing timestamp. It looks like the following sample:

JSON
{"event":"Viewed report", "properties":{"distinct_id":"foo","time":1329263748,"origin":"invite", "origin_referrer":"https://mixpanel.com/projects/","$initial_referring_domain":"mixpanel.com", "$referrer":"https://mixpanel.com/report/3/stream/","$initial_referrer":"https://mixpanel.com/", "$referring_domain":"mixpanel.com","$os":"Linux","origin_domain":"mixpanel.com","tab":"stream", "$browser":"Chrome","Project ID":"3","mp_country_code":"US"}}
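Because each line is a separate JSON document, the response body has to be parsed line by line rather than as one JSON value. Here is a small sketch of that, using a shortened version of the sample event; `parse_export` is an illustrative name, not part of any Mixpanel SDK.

```python
import json


def parse_export(body):
    """Yield one dict per non-empty line of a newline-delimited JSON body."""
    for line in body.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


# Shortened copy of the sample event, followed by a trailing newline.
sample = (
    '{"event":"Viewed report","properties":{"distinct_id":"foo",'
    '"time":1329263748,"$os":"Linux","$browser":"Chrome"}}\n'
)

events = list(parse_export(sample))
print(events[0]["event"], events[0]["properties"]["distinct_id"])
```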

Mixpanel Data Preparation for PostgreSQL

To populate a PostgreSQL database instance with data, first, you need to have a well-defined data model or schema that describes the data. As a relational database, PostgreSQL organizes data around tables.

Each table is a collection of columns, each with a predefined data type such as INTEGER or VARCHAR. PostgreSQL, like any other SQL database, supports a wide range of data types.

A typical strategy for loading data from Mixpanel into a PostgreSQL database is to create a schema where you map each API endpoint to a table. Each key inside the Mixpanel API endpoint response should be mapped to a column of that table, and you should ensure the right conversion to a PostgreSQL-compatible data type. For example, if an endpoint from Mixpanel returns a value as a string, you should convert it into a VARCHAR with a predefined maximum size or into the TEXT data type. Tables can then be created in your database using the CREATE TABLE statement.
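As an illustration of this mapping for the export endpoint, the sketch below defines a hypothetical `mixpanel_events` table and converts one event into a row. The table and column names are assumptions, as is the choice to promote only a few keys to typed columns and keep the remaining properties in a single JSONB column; mapping every key to its own column, as described above, works the same way with more columns. The `time` value is treated as epoch seconds in UTC, which is also an assumption you should check against your project settings.

```python
import json
from datetime import datetime, timezone

# Hypothetical table for exported events; names and types are illustrative.
CREATE_EVENTS_TABLE = """
CREATE TABLE IF NOT EXISTS mixpanel_events (
    event        VARCHAR(255) NOT NULL,
    distinct_id  TEXT,
    event_time   TIMESTAMPTZ  NOT NULL,
    properties   JSONB
)
"""


def to_row(event):
    """Map one exported event to (event, distinct_id, event_time, properties)."""
    props = dict(event["properties"])
    distinct_id = props.pop("distinct_id", None)
    # Assumption: "time" is epoch seconds, interpreted here as UTC.
    event_time = datetime.fromtimestamp(props.pop("time"), tz=timezone.utc)
    # Everything else is stored as one JSON document.
    return (event["event"], distinct_id, event_time, json.dumps(props, sort_keys=True))


row = to_row(
    {
        "event": "Viewed report",
        "properties": {"distinct_id": "foo", "time": 1329263748, "$os": "Linux"},
    }
)
print(row)
```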

Of course, because the data types returned by the Mixpanel API might change, you will need to adapt your database tables accordingly. There’s no such thing as automatic data type casting.

After you have a complete and well-defined data model or schema for PostgreSQL, you can move forward and start loading your data into the database.

Load data from Mixpanel to PostgreSQL

Once you have defined your schema and you have created your tables with the proper data types, you can start loading data into your database.

The most straightforward way to insert data into a PostgreSQL database is by creating and executing INSERT statements. With INSERT statements, you add data row by row directly to a table. It is the most basic way of adding data to a table, but it doesn’t scale very well to larger data sets.
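A row-by-row load can be sketched as follows. This assumes a DB-API cursor that uses `%s` placeholders, as psycopg2 does, and a hypothetical `mixpanel_events` table with the four columns shown; the `RecordingCursor` class is only a stand-in so the example runs without a database.

```python
INSERT_EVENT = (
    "INSERT INTO mixpanel_events (event, distinct_id, event_time, properties) "
    "VALUES (%s, %s, %s, %s)"
)


def insert_rows(cursor, rows):
    """Execute one parameterized INSERT per row and return how many were sent."""
    count = 0
    for row in rows:
        # Values are passed separately so the driver handles quoting.
        cursor.execute(INSERT_EVENT, row)
        count += 1
    return count


class RecordingCursor:
    """Stand-in for a real cursor; it only records what it was asked to run."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, params=None):
        self.calls.append((sql, params))


cursor = RecordingCursor()
sent = insert_rows(cursor, [("Viewed report", "foo", "2012-02-14 23:55:48+00", "{}")])
print(sent, "row(s) sent")
```

With a real connection, the cursor would come from the driver and the transaction would be committed after the loop.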

The preferred way to add larger datasets to a PostgreSQL database is the COPY command. COPY loads data from a file on a file system that is accessible by the PostgreSQL instance. In this way, much larger datasets can be inserted into the database in less time.
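A bulk load along these lines might look like the sketch below. Note that it uses the `COPY ... FROM STDIN` variant, which streams CSV data from the client instead of reading a file on the server, and it assumes a psycopg2 cursor, whose `copy_expert` method accepts a COPY statement and a file-like object. The table and column names are hypothetical.

```python
import csv
import io

COPY_EVENTS = (
    "COPY mixpanel_events (event, distinct_id, event_time, properties) "
    "FROM STDIN WITH (FORMAT csv)"
)


def rows_to_csv(rows):
    """Serialize row tuples into an in-memory CSV buffer, rewound to the start."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    buffer.seek(0)
    return buffer


def copy_rows(cursor, rows):
    """Send all rows in a single COPY; assumes a psycopg2 cursor."""
    cursor.copy_expert(COPY_EVENTS, rows_to_csv(rows))


buffer = rows_to_csv(
    [("Viewed report", "foo", "2012-02-14 23:55:48+00", '{"$os": "Linux"}')]
)
print(buffer.getvalue())
```

The CSV writer takes care of quoting, which matters here because the JSON column contains double quotes and commas.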

You should also consult the documentation of PostgreSQL on how to populate a database with data. It includes a number of very useful best practices on how to optimize the process of loading data into your PostgreSQL database.

Loading from a file with COPY requires that the file be on a file system the PostgreSQL server can read. Nowadays, with cloud-based, fully managed databases, getting direct access to that file system is not always possible. In that case, COPY ... FROM STDIN (or psql's \copy) can stream the data from the client instead. If you cannot use COPY at all, another option is to use PREPARE together with INSERT, to end up with more performant INSERT queries.
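The PREPARE approach can be sketched like this: the INSERT is prepared once per session and then run with EXECUTE for each row. The statement name `insert_event`, the table, and the column types are assumptions, and the cursor is expected to use `%s` placeholders as psycopg2 does; a recording stand-in is used so the example runs on its own.

```python
PREPARE_INSERT = (
    "PREPARE insert_event (text, text, timestamptz, jsonb) AS "
    "INSERT INTO mixpanel_events (event, distinct_id, event_time, properties) "
    "VALUES ($1, $2, $3, $4)"
)
EXECUTE_INSERT = "EXECUTE insert_event (%s, %s, %s, %s)"


def insert_prepared(cursor, rows):
    """Prepare the INSERT once, execute it per row, then release it."""
    cursor.execute(PREPARE_INSERT)
    for row in rows:
        cursor.execute(EXECUTE_INSERT, row)
    # Prepared statements live for the session; drop it when done.
    cursor.execute("DEALLOCATE insert_event")


class RecordingCursor:
    """Stand-in for a real cursor; it only records the statements it receives."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, params=None):
        self.calls.append((sql, params))


cursor = RecordingCursor()
insert_prepared(
    cursor,
    [
        ("Viewed report", "foo", "2012-02-14 23:55:48+00", "{}"),
        ("Viewed report", "bar", "2012-02-14 23:56:10+00", "{}"),
    ],
)
print(len(cursor.calls), "statements issued")
```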

Updating your Mixpanel data on PostgreSQL

As you generate more data on Mixpanel, you will need to update your older data on PostgreSQL. This includes new records as well as updates to older records that have, for any reason, been changed on Mixpanel.

You will need to periodically check Mixpanel for new data and repeat the process described previously, updating your currently available data where needed. Updating an existing row in a PostgreSQL table is done with UPDATE statements.
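One way to organize the periodic check is to export only the days that have not been loaded yet and to write rows so that re-loading a day updates existing records instead of failing. The sketch below is one possible arrangement, not a prescribed one: `next_window` is an illustrative helper that picks complete days only, and the statement uses PostgreSQL's `INSERT ... ON CONFLICT DO UPDATE` in place of separate UPDATE statements, assuming a unique constraint on `(event, distinct_id, event_time)` in a hypothetical `mixpanel_events` table.

```python
from datetime import date, timedelta

# Assumes a unique constraint on (event, distinct_id, event_time).
UPSERT_EVENT = (
    "INSERT INTO mixpanel_events (event, distinct_id, event_time, properties) "
    "VALUES (%s, %s, %s, %s) "
    "ON CONFLICT (event, distinct_id, event_time) "
    "DO UPDATE SET properties = EXCLUDED.properties"
)


def next_window(last_loaded, today):
    """Return (from_date, to_date) for the next export, or None if up to date.

    Only complete days are requested: the day after the last load
    through yesterday.
    """
    start = last_loaded + timedelta(days=1)
    end = today - timedelta(days=1)
    if start > end:
        return None
    return start.isoformat(), end.isoformat()


window = next_window(date(2012, 2, 14), date(2012, 2, 17))
print(window)
```

The two strings returned by `next_window` are in the same `YYYY-MM-DD` form as the `from_date` and `to_date` parameters of the export request.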

Another issue you need to take care of is identifying and removing duplicate records in your database. Duplicates might be introduced either because Mixpanel does not have a mechanism to identify new and updated records or because of errors in your data pipelines.
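Duplicates can also be filtered out before the data reaches the database. The following sketch drops repeated events from a batch, assuming that the combination of event name, `distinct_id`, and `time` is enough to identify an event; that key is an assumption and may be too coarse or too strict for your data.

```python
def dedupe(events):
    """Return events with repeats removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for event in events:
        props = event["properties"]
        # Assumed identity of an event: name, user, and timestamp.
        key = (event["event"], props.get("distinct_id"), props.get("time"))
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


batch = [
    {"event": "Viewed report", "properties": {"distinct_id": "foo", "time": 1329263748}},
    {"event": "Viewed report", "properties": {"distinct_id": "foo", "time": 1329263748}},
    {"event": "Viewed report", "properties": {"distinct_id": "bar", "time": 1329263748}},
]

unique_events = dedupe(batch)
print(len(unique_events), "unique event(s)")
```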

In general, ensuring the quality of the data that is inserted into your database is a big and difficult issue, and PostgreSQL features like transactions can help tremendously. However, they do not solve the problem in the general case.

The best way to load data from Mixpanel to PostgreSQL

So far, we have just scratched the surface of what you can do with PostgreSQL and how to load data into it. Things can get even more complicated if you want to integrate data coming from different sources.

Are you striving to achieve results right now?

Instead of writing, hosting, and maintaining a flexible data infrastructure, use RudderStack, which can handle everything automatically for you.

RudderStack, with one click, integrates with sources or services, creates analytics-ready data, and syncs your Mixpanel data to PostgreSQL right away.
