How to load data from Stripe to Google BigQuery
Extract data from Stripe
Stripe is an API-first product with a unified set of APIs and tools that instantly enables businesses to accept and manage online payments. It is a Web API following the RESTful principles. Stripe tries to use as many as HTTP built-in features as possible to make it accessible to off-the-shelf HTTP clients, and the serialization they support for their responses is JSON. They also have two different types of keys used for authentication, one for testing mode and one for live mode, using the testing mode key, it becomes easy to test every aspect of the API without messing with your real data. For security reasons, the calls you make to the Stripe API have to be over HTTPS only. Plain HTTP and non-authenticated calls fail. In case you want to experiment with the API, do not forget to use your testing mode key.
Currently, Stripe API is built around the following ten core resources:
- Balance – an object that represents your stripe balance.
- Charges – to charge a credit or debit card you create a charge
- Customers – Customer objects allow you to perform recurring charges and track multiple charges associated with the same customer.
- Dispute – A dispute occurs when a customer questions your charge with their bank or credit card company.
- Events – Events are our way of letting you know when something interesting happens in your account.
- File uploads – There are various times when you’ll want to upload files to Stripe (for example, when uploading dispute evidence).
- Refunds – Refund objects allow you to refund a charge that has previously been created but not yet refunded.
- Tokens – Tokens can be created with your publishable API key.
- Transfers – When Stripe sends you money or you initiate a transfer to a bank account
- Transfer reversals – A previously created transfer can be reversed if it has not yet been paid out.
All of the above resources support CRUD operations by using HTTP verbs on their associated endpoints. As a web API, you can access it using tools like CURL or Postman or your favorite HTTP client for the language or framework of your choice. Some options are the following:
Many libraries wrap around the Stripe API and offer an easier way to interact with it, both community-developed and from Stripe. For more information, you can check the libraries section in the API documentation.
Stripe and any other service you might be using have figured out (hopefully) the optimal model for its operations. Still, when we fetch data out of them we usually want to answer questions or do things that are not part of the context that these services operate in, making these models suboptimal for your analytic needs. For this reason, we should always keep in mind that when we work with data coming from external services, we need to remodel it and bring it to the right form for our needs.
So let’s assume that we want to perform some churn analysis for our company and to do that we need customer data that indicates when they have canceled their subscriptions. To do that we’ll have to request the customer objects that Stripe holds for our company. We can do that with the following command:
SH
curl https://api.stripe.com/v1/charges?limit=3 -u sk_test_BQokikJOvBiI2HlWgH4olfQ2:
and a typical response will look like the following:
JSON
{ "object": "list", "url": "/v1/charges", "has_more": false, "data": [ { "id": "ch_17SY5f2eZvKYlo2CiPfbfz4a", "object": "charge", "amount": 500, "amount_refunded": 0, "application_fee": null, "balance_transaction": "txn_17KGyT2eZvKYlo2CoIQ1KPB1", "captured": true, "created": 1452627963, "currency": "usd", "customer": null, "description": "thedude@grepinnovation.com Account Credit", "destination": null, "dispute": null, "failure_code": null, "failure_message": null, "fraud_details": { }, …….
Inside the customer object there’s a list of subscription objects that look like the following JSON document:
JSON
{ "id": "sub_7hy2fgATDfYnJS", "object": "subscription", "application_fee_percent": null, "cancel_at_period_end": false, "canceled_at": null, "current_period_end": 1455306419, "current_period_start": 1452628019, "customer": "cus_7hy0yQ55razJrh", "discount": null, "ended_at": null, "metadata": { }, "plan": { "id": "gold2132", "object": "plan", "amount": 2000, "created": 1386249594, "currency": "usd", "interval": "month", "interval_count": 1, "livemode": false, "metadata": { }, "name": "Gold ", "statement_descriptor": null, "trial_period_days": null }, "quantity": 1, "start": 1452628019, "status": "active", "tax_percent": null, "trial_end": null, "trial_start": null }
These objects together with part of the customer contain the information we need to perform churn analysis. Of course, we’ll have to extract all the information we need, the schema of our data warehouse schema, repository and then load the data to it following this post’s instructions.
Stream Data Using the Stripe API to Your Data Warehouse
It is also possible to set up a streaming data infrastructure that will collect data out of Stripe and push them into the data warehouse in a streaming fashion. This can be achieved by using the webhooks functionality that Stripe supports, you register some events to it and every time something happens, Stripe will push a message to your webhook. For more information about that, check the API documentation on webhooks.
Prepare your Stripe Data for Google BigQuery
Before you load any data into BigQuery, you should make sure that it is presented in a format supported by it, so for example if the API you pull data out returns XML you have to first transform it into a serialization that BigQuery understands. Currently, two data formats are supported:
You also need to make sure that the data types you are using are the ones supported by BigQuery, which are the following:
- STRING
- INTEGER
- FLOAT
- BOOLEAN
- RECORD
- TIMESTAMP
For more information please check the Preparing Data for BigQuery page on the documentation.
Load from Stripe to Google BigQuery
If you want to load data by using Stripe to BigQuery, you have to use one of the following supported data sources.
- Google Cloud Storage
- Sent data directly to BigQuery with a POST request
- Google Cloud Datastore Backup
- Streaming insert
- App Engine log files
- Cloud Storage logs
From the above list of sources, 5 and 6 are not applicable in our case.
For Google Cloud Storage, you first have to load data into it, there are a few options on how to do this, for example, you can use the console directly as it is described here and do not forget to follow the best practices. Another option is to post data through the JSON API, as we see again APIs play an important role in both the extraction but also the loading of data in our data warehouse. In its simplest case, it’s just a matter of one HTTP POST request using a tool like CURL or Postman. It should look like the following example.
MARKDOWN
POST /upload/storage/v1/b/myBucket/o?uploadType=media&name=myObject HTTP/1.1 Host: www.googleapis.com Content-Type: application/text Content-Length:number_of_bytes_in_fileAuthorization: Beareryour_auth_token your Stripe data
and if everything went ok, you should get something like the following as a response from the server:
MARKDOWN
HTTP/1.1 200 Content-Type: application/json { "name": "myObject" }
Working with Curl or Postman is good only for testing. If you would like to automate the process of loading every data into Bigquery, you should write some code to send data you own to Google Cloud Storage. In case you are developing on the Google App Engine you can use the library that is available for the languages that are supported by it:
If you are using one of the above languages and you are not coding for the Google App Engine, you can use it to access the Cloud Storage from your environment. Interacting such a feature-rich product like Google Cloud Storage can become quite complicated depending on your use case, for more details on the different options that exist you can check Cloud Storage documentation. If you are looking for a less engaged and more neutral way of using Cloud Storage, you can consider RudderStack.
After you have loaded data to Cloud Storage, you have to create a Load Job for BigQuery to actually load the data into it, this Job should point to the source data in Storage on the Cloud that have to be imported, this happens by providing source URIs that point to the appropriate objects.
The previous method described, used a POST request to the Google Storage API on the Cloud for storing the data there and then load it into BigQuery. Another way to go is to do a direct HTTP POST request to BigQuery with the data you would like to query. This approach is similar to how we loaded the data to the Cloud in Google Storage through the JSON API, but it uses the appropriate end-points of BigQuery and loads the data there directly. The way to interact with it is quite similar, for more information can be found on the BigQuery API Reference and on the page that describes how you can load data by BigQuery using POST. You can interact with it using the HTTP client library of the language or framework of your choice, a few options are:
The best way to load data from Stripe to BigQuery
So far we just scraped the surface of what you can do with BigQuery and how you can load data into it. Things can get even more complicated if you want to integrate data coming from different sources.
Are you striving to achieve results right now?
Instead of writing, hosting, and maintaining a flexible data infrastructure use RudderStack that can handle everything automatically for you.
RudderStack with one click integrates with sources or services, creates analytics-ready data, and syncs your Stripe to BigQuery right away.