How to load data from MailChimp to Snowflake
Extract your data from MailChimp
First of all, Mailchimp updated its API to v3 recently, so make sure that anything you do will be with this version as the previous are all deprecated although still supported. For more information, visit MailChimp API v3.0 documentation. MailChimp was always a promoter of APIs and encouraged integration with other systems. It has a rich API that exposes a large number of endpoints for interacting with the resources of the applications. More specifically, there are endpoints for the following resources:
- Automations – offers functionality related to automated tasks that we define on MailChimp
- Batch operations – for managing batch processes on our MailChimp account
- Campaign folders – helps you organize your campaigns into groups
- Campaigns – for managing your campaigns
- Conversations – helps you track threads of emails with specific users
- File manager files – for managing your static assets like images
- File manager folders – for creating folders to organize your assets
- Lists – CRUD operations on lists of users
- Reports – for accessing reports with statistics on your campaigns
- Template folders – operations on creating folders for organizing your templates for your emails
- Templates – operations on templates for your emails
Something interesting to note about the MailChimp API is that we don’t see a root-level resource for users anywhere. This makes sense of course, if we take into consideration that MailChimp is all about mail campaigns so users or to put in the correct context, subscribers, are not a standalone resource but instead they exist only inside the lists we manage.
It is important to understand that every service perceives the world from a different perspective which of course it’s relevant to the value it offers, so even if what we care about is information about our users, it makes complete sense for someone like MailChimp to organize everything around lists and campaigns.
If we check the model of another service, like Intercom, for example, we’ll notice that the user is at the top resources, which again makes sense because Intercom is all about one to one communication with users.
MailChimp and any other service that you might be using has figured out (hopefully) the optimal model for its operations, but when we fetch data from them, we usually want to answer questions or do things that are not part of the context that these services operate, something that makes these models sub-optimal for your analytics needs.
For this reason, we should always keep in mind that when we work with data coming from external services, we need to remodel it and bring it to the right form for our needs.
Interacting with the MailChimp REST API can be done by using tools like CURL or Postman or by using HTTP clients for your favorite language or framework. A few suggestions:
- Apache HttpClient for Java
- Spray-client for Scala
- Hyper for Rust
- Ruby rest-client
- Python http-client
The MailChimp REST API supports OAuth 2.0 authentication, more information can be found in the Authorised apps section of the API documentation. After you successfully authenticate with the REST API, you have to start interacting with its resources and start fetching data from it in order to load them on your warehouse.
Extract your user data from the MailChimp API
So let’s assume that we want to get all the information we got on MailChimp for our users in order to enrich our user records inside our data warehouse. To do that, we need to do the following.
First, we need to fetch all the lists that we have created on MailChimp, we can do this by performing a GET request to the appropriate endpoint.
JAVASCRIPT
curl --request GET--url 'https://usX.api.mailchimp.com/3.0/lists'--user 'anystring:apikey'—include
We should get back a response like the following:
JAVASCRIPT
{"lists": [{"id": "57afe96172","name": "Freddie's Jokes","contact": {"company": "MailChimp","address1": "675 Ponce De Leon Ave NE","address2": "Suite 5000","city": "Atlanta","state": "GA","zip": "30308","country": "US","phone": ""},
Using the ID we get from the response for each list, we iterate through all the lists and make requests to the appropriate end-points to get the members for the list.
JAVASCRIPT
curl --request GET--url 'https://usX.api.mailchimp.com/3.0/lists/57afe96172/members'--user 'anystring:apikey'—include
and we should get a response back like the following:
JAVASCRIPT
{"members": [{"id": "f777bbffab8d1ceca8b757df63c47cb8","email_address": "urist.mcvankab+1@freddiesjokes.co","unique_email_id": "882e9bec19","email_type": "html","status": "subscribed","status_if_new": "","merge_fields": {"FNAME": "","LNAME": ""},"interests": {"9143cf3bd1": true,"3a2a927344": false,"f9c8f5f0ff": false,"f231b09abc": true,"bd6e66465f": false},
Keep in mind that a user might appear in more than one lists, which means that you also have to reduplicate your results based on this fact. After we get the results we need and collect all the user-related information from various other resources we ought to map it to the model of our Data Warehouse repository before we do the actual loading.
MailChimp Data Preparation for Snowflake
The first step, before you start ingesting data into a Snowflake data warehouse instance, is to have a well-defined schema of any data you own.
Data in Snowflake is organized around tables with a well-defined set of columns, with each one having a specific data type.
Snowflake supports a rich set of data types. It is worth mentioning that a number of semi-structured data types is also supported. With Snowflake, it is possible to directly load data in JSON, Avro, ORC, Parquet, or XML format. Hierarchical data is treated as a first-class citizen, similar to what Google BigQuery offers
There is also one notable common data type that is not supported by Snowflake. LOB or large object data type is not supported. Instead, you should use a BINARY or VARCHAR type. But these types are not that useful for data warehouse use cases.
A typical strategy for loading data from MailChimp to Snowflake is to create a schema where you will map each API endpoint to a table.
Each key inside the MailChimp API endpoint response should be mapped to a column of that table, and you should ensure the right conversion to a Snowflake data type.
Of course, you must ensure that as any data types from the MailChimp API might change, you will adapt any database tables you have accordingly. There’s no such thing as automatic data type casting.
After you have a complete and well-defined data model or schema for Snowflake, you can move forward and start loading data into the database.
Load data from MailChimp to Snowflake
Usually, data is loaded into Snowflake in a bulk way, using the COPY INTO command. Files containing data, usually in JSON format, are stored in a local file system or in Amazon S3 buckets. Then a COPY INTO command is invoked on the Snowflake instance and data is copied into a data warehouse.
The files can be pushed into Snowflake using the PUT command into a staging environment before the COPY command is invoked.
Another alternative is to upload data directly into a service like Amazon S3, from where Snowflake can access data directly.
Finally, Snowflake offers a web interface as a data loading wizard where someone can visually set up and copy every data into the warehouse. Just keep in mind that the functionality of this wizard is limited compared to the rest of the methods.
Snowflake, in contrast to other technologies like Redshift, does not require a data schema to be packed together with the data that will be copied. Instead, the schema is part of the query that will copy data into the data warehouse. This simplifies the data loading process and offers more flexibility on data type management.
Updating your MailChimp data on Snowflake
As you will be generating more data on MailChimp, you will update your older data on Snowflake. This includes new records together with updates to older records that for any reason have been updated on MailChimp.
You must periodically check MailChimp for new data and repeat the process that has been described previously, while updating your currently available data if needed. Updating an already existing row on a Snowflake table is achieved by creating UPDATE statements.
Snowflake has a great tutorial on the different ways of handling updates, especially using primary keys.
Another issue that you need to take care of is the identification and removal of any duplicate records on your database. Either because MailChimp does not have a mechanism to identify new and updated records or because of errors on your data pipelines, duplicate records might be introduced to your database.
In general, ensuring the quality of data that is inserted in your database is a big and difficult issue.
The best way to load data from MailChimp to Snowflake
So far we just scraped the surface of what you can do with Snowflake and how to load data into it. Things can get even more complicated if you want to integrate data coming from different sources.
Are you striving to achieve results right now?
Instead of writing, hosting, and maintaining a flexible data infrastructure use RudderStack that can handle everything automatically for you.
RudderStack, with one click, integrates with sources or services, creates analytics-ready data, and syncs your MailChimp to Snowflake right away.