How to load data from Shopify to Google BigQuery

How to extract my data from Shopify?

Shopify exposes its complete platform to developers through its API. It is used by thousands of developers to create applications that are then sold through the Shopify marketplace.

As a Web API following the RESTful architecture principles, it can be accessed through HTTP. As a RESTful API, interacting with it can be achieved by using tools like cURL or Postman, or by using HTTP clients for your favorite language or framework.

Shopify also offers a number of SDKs that are officially supported and maintained, and that can be used to access the platform for different purposes. For example, by using the iOS and Android Buy SDKs, it is possible to add Shopify checkout capabilities to your mobile application. Official SDKs are also offered for a number of other platforms and languages.

Shopify API Authentication

There are two different types of applications in Shopify when it comes to authentication, private and public apps.

Private apps will only function on individual stores, so you don't need to authenticate them through OAuth. You can get private app credentials through your Partner dashboard or by logging into any Shopify admin. You can also use this API key to manipulate your store using the API console without needing to build a fully functional app.

The apps you create in your Partners dashboard function as public apps that can be made available for download in Shopify’s App Store if you meet the criteria.
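As an illustration, a private app can call the API with HTTP Basic Authentication, using the API key as the username and the app password as the password. Here is a minimal sketch in Python using the requests library; the shop name and credentials are placeholders:

PYTHON
import requests

# Placeholder private app credentials (replace with your own).
API_KEY = "your_api_key"
PASSWORD = "your_api_password"
SHOP = "your-shop-name"

# Private apps authenticate with HTTP Basic Auth: the API key is the
# username and the app password is the password.
response = requests.get(
    f"https://{SHOP}.myshopify.com/admin/products.json",
    auth=(API_KEY, PASSWORD),
)
response.raise_for_status()
print(response.json())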

Shopify rate limiting

The API call limit operates using a “leaky bucket” algorithm as a controller. This allows for infrequent bursts of calls while letting your app make an unlimited number of calls over time.

The bucket size is 40 calls (which cannot be exceeded at any given time), with a “leak rate” of 2 calls per second that continually empties the bucket. If your app averages 2 calls per second, it will never trip a 429 error (“bucket overflow”).
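A client that respects this limit simply needs to back off and retry when it receives an HTTP 429. Here is a minimal sketch in Python; the use of the Retry-After header and the fallback delay are assumptions about the response, not guaranteed behavior:

PYTHON
import time
import requests

def shopify_get(url, auth, max_retries=5):
    """GET with a simple backoff when the leaky bucket overflows (HTTP 429)."""
    for attempt in range(max_retries):
        response = requests.get(url, auth=auth)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # On a 429, wait before retrying; fall back to a fixed delay
        # if no Retry-After header is present.
        delay = float(response.headers.get("Retry-After", 2.0))
        time.sleep(delay)
    raise RuntimeError("Rate limit retries exhausted")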

Endpoints and available resources

Shopify exposes 35+ endpoints covering all the possible touchpoints of e-commerce. Some of the most important resources that can be accessed through these endpoints are the following:

  • Abandoned checkouts: used to return abandoned checkouts. A checkout is considered abandoned when a customer has entered their billing & shipping info but has yet to complete the purchase.
  • ApplicationCharge: Request to charge a shop a one-time fee by issuing this call.
  • Article: Operations concerning articles on your Blog.
  • Asset: files that make up the theme of a shop.
  • Blog: In addition to its e-commerce features, Shopify also offers an environment where merchants can create a blog for their shop.
  • CarrierService: A Carrier Service (also known as a Carrier Calculated Service or Shipping Service) provides real-time shipping rates to Shopify.
  • Collect: An object that connects a product to a custom collection.
  • Customer: A customer resource instance represents a customer account with the shop.
  • Event: Events are generated by specific Shopify resources when specific things happen, such as the creation of an article.
  • Order: An order is a customer’s completed request to purchase one or more products from a shop.
  • Product: A product is an individual item for sale in a Shopify shop.
  • Transaction: Transactions are created for every order that results in an exchange of money.

For a complete list of endpoints, see the Shopify API reference.

It is clear that with such a rich platform and API, the data that can be pulled out of Shopify is both valuable and comes in large quantities. So, let’s assume that we want to pull all events out of Shopify to use them for further analysis.

To do so, we need to make a request to the Event endpoint like this:

HTTP
GET /admin/events.json?filter=Product,Order

This request will return all the events that are related to Products and Orders for our shop. The response will be in JSON and will look like this:

HTTP
HTTP/1.1 200 OK
{
  "events": [
    {
      "id": 677313116,
      "subject_id": 921728736,
      "created_at": "2008-01-10T08:00:00-05:00",
      "subject_type": "Product",
      "verb": "create",
      "arguments": [
        "IPod Touch 8GB"
      ],
      "body": null,
      "message": "created a new product: <a href=\"\/admin\/products\/921728736\">IPod Touch 8GB<\/a>.",
      "author": "Shopify",
      "description": "created a new product: IPod Touch 8GB.",
      "path": "\/admin\/products\/921728736"
    },
    {
      "id": 365755215,
      "subject_id": 632910392,
      "created_at": "2008-01-10T07:00:00-05:00",
      "subject_type": "Product",
      "verb": "create",
      "arguments": [
        "IPod Nano - 8GB"
      ],
      "body": null,
      "message": "created a new product: <a href=\"\/admin\/products\/632910392\">IPod Nano - 8GB<\/a>.",
      "author": "Shopify",
      "description": "created a new product: IPod Nano - 8GB.",
      "path": "\/admin\/products\/632910392"
    }
  ]
}

Inside the response, there is an array of event objects, each one representing an event related to an Order or a Product.

Events are generated for the following resources:

  • Articles
  • Blogs
  • Custom Collections
  • Comments
  • Orders
  • Pages
  • Products
  • Smart Collections

By using the “limit” and “page” parameters, it is possible to paginate your results when you need to pull a large number of events. Additionally, there are several ways to filter the results, for example based on date, so in a continuous data extraction process it is possible to pull only new data and avoid duplicates.

This can be achieved by using the “created_at_min” and “created_at_max” parameters in your GET request, as in the sketch below.
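Here is a minimal sketch in Python of such an incremental extraction, combining “created_at_min” with the page-based pagination described above (note that newer API versions replace the “page” parameter with cursor-based pagination); the shop, credentials, and timestamp are placeholders:

PYTHON
import requests

def fetch_new_events(shop, auth, last_run_timestamp):
    """Pull only the events created since the last run, page by page."""
    url = f"https://{shop}.myshopify.com/admin/events.json"
    page = 1
    events = []
    while True:
        response = requests.get(url, auth=auth, params={
            "filter": "Product,Order",
            "created_at_min": last_run_timestamp,  # e.g. "2008-01-10T08:00:00-05:00"
            "limit": 250,   # maximum page size
            "page": page,
        })
        response.raise_for_status()
        batch = response.json()["events"]
        if not batch:
            break
        events.extend(batch)
        page += 1
    return events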

After you have successfully pulled your data out of the Shopify API, you are ready to prepare it for loading into BigQuery. Of course, the above process covers only one of the available resources. If you would like a complete view of all the available data, you will have to create a much more complex ETL process that includes the majority of the 35+ resources that Shopify exposes.

How can I prepare my data to be sent from Shopify to Google BigQuery?

Before you load data into BigQuery, you should make sure that it is presented in a format supported by it. For example, if the API you pull data from returns XML, you first have to transform it into a serialization that BigQuery understands. Currently, two data formats are supported:

  • CSV, and
  • JSON (newline-delimited)

You also need to make sure that the data types you are using are the ones supported by BigQuery, which are the following:

  • STRING
  • INTEGER
  • FLOAT
  • BOOLEAN
  • RECORD
  • TIMESTAMP

For more information, please check the Preparing Data for BigQuery page in the documentation.
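Since BigQuery’s JSON loader expects newline-delimited JSON, the events array we pulled earlier has to be flattened to one object per line. A minimal sketch in Python:

PYTHON
import json

def events_to_ndjson(events, output_path):
    """Write Shopify event objects as newline-delimited JSON for BigQuery."""
    with open(output_path, "w") as f:
        for event in events:
            # One JSON object per line, as BigQuery's JSON loader expects.
            f.write(json.dumps(event) + "\n")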

Data load from Shopify to Google BigQuery

If you want to load data from Shopify to BigQuery, you have to use one of the following supported data sources:

  1. Google Cloud Storage
  2. Sending data directly to BigQuery with a POST request
  3. Google Cloud Datastore Backup
  4. Streaming insert
  5. App Engine log files
  6. Cloud Storage logs

From the above list of sources, 5 and 6 are not applicable in our case.

For Google Cloud Storage, you first have to load your data into it. There are a few options for doing this; for example, you can use the console directly, as described in the Google Cloud documentation, and don’t forget to follow the best practices.

Another option is to post the data through the JSON API; as we see, APIs play an important role in both the extraction and the loading of data into our data warehouse. In its simplest case, it’s just a matter of one HTTP POST request using a tool like cURL or Postman. It should look like the following example:

HTTP
POST /upload/storage/v1/b/myBucket/o?uploadType=media&name=myObject HTTP/1.1
Host: www.googleapis.com
Content-Type: application/text
Content-Length: number_of_bytes_in_file
Authorization: Bearer your_auth_token

your Shopify data

and if everything went OK, you should get a response from the server like the following:

HTTP
HTTP/1.1 200 OK
Content-Type: application/json

{
  "name": "myObject"
}

Working with cURL or Postman is good only for testing. If you would like to automate the process of loading your data into BigQuery, you should write some code to send the data to Google Cloud Storage.

In case you are developing on Google App Engine, you can use the client libraries that are available for the languages it supports.

If you are using one of those languages and you are not coding for Google App Engine, you can still use the same libraries to access Cloud Storage from your environment. Interacting with such a feature-rich product as Google Cloud Storage can become quite complicated depending on your use case; for more details on the different options that exist, you can check the Google Cloud Storage documentation.
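For example, in Python, the official google-cloud-storage client library wraps the API calls shown earlier. A minimal upload sketch, assuming your credentials are already configured and with placeholder bucket and object names:

PYTHON
from google.cloud import storage  # pip install google-cloud-storage

# Assumes credentials are configured, e.g. via GOOGLE_APPLICATION_CREDENTIALS,
# and that the bucket already exists.
client = storage.Client()
bucket = client.bucket("my-shopify-bucket")   # placeholder bucket name
blob = bucket.blob("shopify/events.ndjson")   # placeholder object name
blob.upload_from_filename("events.ndjson")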

If you are looking for a simpler, less hands-on way of using Cloud Storage, you can consider a solution like RudderStack.

After you have loaded your data into Google Cloud Storage, you have to create a load job for BigQuery to actually import the data. This job should point to the source data in Cloud Storage by providing source URIs that reference the appropriate objects.
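With the google-cloud-bigquery client library, such a load job looks roughly like the following sketch; the project, dataset, table, and bucket names are placeholders:

PYTHON
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the data
)

# Placeholder URI and table; the URI points at the object uploaded earlier.
load_job = client.load_table_from_uri(
    "gs://my-shopify-bucket/shopify/events.ndjson",
    "my_project.shopify.events",
    job_config=job_config,
)
load_job.result()  # wait for the job to complete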

The method described above uses a POST request to the Google Cloud Storage API to store the data there and then load it into BigQuery. Another way is to make a direct HTTP POST request to BigQuery with the data you would like to load. This approach is similar to how we loaded the data into Cloud Storage through the JSON API, but it uses the appropriate BigQuery endpoints to load the data there directly.

The way to interact with it is quite similar; more information can be found in the BigQuery API Reference and on the page that describes how to load data into BigQuery using POST. You can interact with it using the HTTP client library of the language or framework of your choice, as in the sketch below.
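For example, the Python client wraps this direct-load path with load_table_from_file, which uploads a local file straight to BigQuery without staging it in Cloud Storage first. A minimal sketch, with placeholder file and table names:

PYTHON
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)

# Upload a local newline-delimited JSON file directly to BigQuery.
with open("events.ndjson", "rb") as f:
    job = client.load_table_from_file(
        f, "my_project.shopify.events", job_config=job_config
    )
job.result()  # wait for the load to finish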

The best way to load data from Shopify to Google BigQuery

So far, we have just scratched the surface of what you can do with BigQuery and how you can load your data into it. Things can get even more complicated if you want to integrate data coming from different sources.

Are you striving to achieve results right now?

Instead of writing, hosting, and maintaining a flexible data infrastructure, use RudderStack, which can handle everything automatically for you.

RudderStack, with one click, integrates with sources or services, creates analytics-ready data, and syncs your Shopify to BigQuery right away. Don't want to go through the pain of direct integration? RudderStack’s Shopify to Google BigQuery integration makes it easy to send data from Shopify to Google BigQuery.

Sign Up For Free And Start Sending Data
Test out our event stream, ELT, and reverse-ETL pipelines. Use our HTTP source to send data in less than 5 minutes, or install one of our 12 SDKs in your website or app.