How to load data from Magento to SQL Data Warehouse

Extract your data from Magento

Magento exposes its platform through both a REST and a SOAP interface. Both can be used to pull data from it, which is also the scope of this article, but also to interact with the platform. By using these interfaces, developers create rich applications and plugins for Magento. In this post we will use the REST version of the Magento platform.

As a Web API following the RESTful architecture principles, it can be accessed through HTTP. As a RESTful API, interacting with it can be achieved by using tools like CURL or Postman or by using http clients for your favourite language or framework. A few suggestions:

Magento does not publish official SDKs but by using the SOAP interface it is possible to automatically generate clients that can act as SDKs for your favourite language or platform. For example in Java, you can create a client in Eclipse by providing the WSDL file that Magento exposes after you set up the platform.

Magento API Authentication

Magento is a self hosted platform, unless you are using the Enterprise cloud edition, so you have much more control over its access than other solutions, but if you want to access its data through the REST API that it has, then you will have to use oAuth for authentication which is supported by Magento.

Magento rate limiting

As a platform hosted on your own premises it doesn’t really imposes any rate limiting. In any case as you would like to avoid stressing your e-commerce platform that is facing your customers you should make sure that your pipeline process does not over stress your Magento installation. But this is completely at your discretion.

Endpoints and available resources

Magento exposes the following resources:

  • Products. Retrieve the list of products, create, update, delete a product.
  • Product categories. Retrieve the list of categories assigned to a product, assign and unassign the category from a product.
  • Product images. Retrieve the list of websites assigned to a product, assign, unassign a website to/from a product.
  • Customers. Retrieve the list of customers, create, delete a customer, and update the customer information.
  • Customer Addresses. Retrieve the list of customer addresses, create, update, and delete the customer address.
  • Inventory. Retrieve the list of stock items, update required stock items.
  • Sales Orders. Retrieve the list of sales orders with detailed information on order addresses, items, and comments.
  • Order Addresses. Retrieve information on the specified order comments.
  • Order Items. Retrieve information on specified order items.

The API is possible to return either JSON or XML responses, this is something that you can control by providing the appropriate Accept header content type.

For all the above resources we can request from the Magento platform to pull out a list of results with all the associated data, so ideally we would like to pull all the data and make sure that we keep them up to date on our analytics platform of choice for further analysis. For this post we will just see how we can pull data for one resource, the Sales Orders, the process is the same for all other resources.

To pull data for the Sales Orders, we need to execute a get request to the following endpoint:

TEXT
http://magentohost/api/rest/orders

As a platform hosted by you, you need to replace the “magentohost” part of the URL with the actual URL of the host that has Magento running. The rest of the URL is the same as the above. The default response is in XML and looks like the following:

MARKDOWN
<?xml version="1.0"?>
<magento_api>
<data_item_1>
<customer_id>3</customer_id>
<base_discount_amount>0.0000</base_discount_amount>
<base_shipping_amount>455.0000</base_shipping_amount>
<base_shipping_tax_amount>0.0000</base_shipping_tax_amount>
<base_subtotal>13650.0000</base_subtotal>
<base_tax_amount>0.0000</base_tax_amount>
<base_total_paid></base_total_paid>
<base_total_refunded></base_total_refunded>
<tax_amount>0.0000</tax_amount>
<total_paid></total_paid>
<total_refunded></total_refunded>
<base_shipping_discount_amount>0.0000</base_shipping_discount_amount>
<base_subtotal_incl_tax>13650.0000</base_subtotal_incl_tax>
<base_total_due>14105.0000</base_total_due>
<total_due>14105.0000</total_due>
<base_currency_code>USD</base_currency_code>
<tax_name></tax_name>
<tax_rate></tax_rate>
<addresses>
<data_item>
<region>Palau</region>
<postcode>19103</postcode>
<lastname>Doe</lastname>
<street>2356 Jody Road Philadelphia, PA 19103</street>
<city>PA</city>
<telephone>610-634-1181</telephone>
<country_id>US</country_id>
<firstname>John</firstname>
<address_type>billing</address_type>
<prefix></prefix>
<middlename></middlename>
<suffix></suffix>
<company></company>
</data_item>
<data_item>
<region>Palau</region>
<postcode>19103</postcode>
<lastname>Doe</lastname>
<street>2356 Jody Road Philadelphia, PA 19103</street>
<city>PA</city>
<telephone>610-634-1181</telephone>
<country_id>US</country_id>
<firstname>John</firstname>
<address_type>shipping</address_type>
<prefix></prefix>
<middlename></middlename>
<suffix></suffix>
<company></company>
</data_item>
</addresses>
<order_items>
<data_item>
<sku>Sunglasses_1</sku>
<price>150.0000</price>
<base_price>150.0000</base_price>
<base_original_price>150.0000</base_original_price>
<tax_percent>0.0000</tax_percent>
<tax_amount>0.0000</tax_amount>
<base_tax_amount>0.0000</base_tax_amount>
<base_discount_amount>0.0000</base_discount_amount>
<base_row_total>13650.0000</base_row_total>
<base_price_incl_tax>150.0000</base_price_incl_tax>
<base_row_total_incl_tax>13650.0000</base_row_total_incl_tax>
</data_item>
</order_items>
</data_item_1>
<data_item_2>
<customer_id>3</customer_id>
<base_discount_amount>0.0000</base_discount_amount>
<base_shipping_amount>95.0000</base_shipping_amount>
<base_shipping_tax_amount>0.0000</base_shipping_tax_amount>
<base_subtotal>3350.0000</base_subtotal>
<base_tax_amount>0.0000</base_tax_amount>
<base_total_paid>2445.0000</base_total_paid>
<base_total_refunded>1845.0000</base_total_refunded>
<tax_amount>0.0000</tax_amount>
<total_paid>2445.0000</total_paid>
<total_refunded>1845.0000</total_refunded>
<base_shipping_discount_amount>0.0000</base_shipping_discount_amount>
<base_subtotal_incl_tax>3350.0000</base_subtotal_incl_tax>
<base_total_due>1000.0000</base_total_due>
<total_due>1000.0000</total_due>
<base_currency_code>USD</base_currency_code>
<tax_name></tax_name>
<tax_rate></tax_rate>
<addresses>
<data_item>
<region>Palau</region>
<postcode>19103</postcode>
<lastname>Doe</lastname>
<street>2356 Jody Road Philadelphia, PA 19103</street>
<city>PA</city>
<telephone>610-634-1181</telephone>
<country_id>US</country_id>
<firstname>John</firstname>
<address_type>billing</address_type>
<prefix></prefix>
<middlename></middlename>
<suffix></suffix>
<company></company>
</data_item>
<data_item>
<region>Palau</region>
<postcode>19103</postcode>
<lastname>Doe</lastname>
<street>2356 Jody Road Philadelphia, PA 19103</street>
<city>PA</city>
<telephone>610-634-1181</telephone>
<country_id>US</country_id>
<firstname>John</firstname>
<address_type>shipping</address_type>
<prefix></prefix>
<middlename></middlename>
<suffix></suffix>
<company></company>
</data_item>
</addresses>
<order_items>
<data_item>
<sku>Sunglasses_1</sku>
<price>150.0000</price>
<base_price>150.0000</base_price>
<base_original_price>150.0000</base_original_price>
<tax_percent>0.0000</tax_percent>
<tax_amount>0.0000</tax_amount>
<base_tax_amount>0.0000</base_tax_amount>
<base_discount_amount>0.0000</base_discount_amount>
<base_row_total>1350.0000</base_row_total>
<base_price_incl_tax>150.0000</base_price_incl_tax>
<base_row_total_incl_tax>1350.0000</base_row_total_incl_tax>
</data_item>
<data_item>
<sku>Sun_glasses</sku>
<price>200.0000</price>
<base_price>200.0000</base_price>
<base_original_price>200.0000</base_original_price>
<tax_percent>0.0000</tax_percent>
<tax_amount>0.0000</tax_amount>
<base_tax_amount>0.0000</base_tax_amount>
<base_discount_amount>0.0000</base_discount_amount>
<base_row_total>2000.0000</base_row_total>
<base_price_incl_tax>200.0000</base_price_incl_tax>
<base_row_total_incl_tax>2000.0000</base_row_total_incl_tax>
</data_item>
</order_items>
</data_item_2>
</magento_api>

As we can see, we get back a list of items with each one representing an order, that contains all the information that we would like to use for further analysis. information like the discount that we might have applied, the taxes paid, the base price of the order, etc. As we might have many order objects to retrieve, we should paginate through the results. To do that, we need to provide the “page” and “limit” parameters to our GET request.
Now that we have the results from our Magento shop, we can further process them before we are able to load them into the BI platform of our choice.

Load Data from Magento to SQL Data Warehouse

SQL Data Warehouse support numerous options for loading data, such as:

  1. PolyBase
  2. Azure Data Factory
  3. BCP command-line utility
  4. SQL Server integration services

As we are interested in loading data from online services by using their exposed HTTP APIs, we are not going to consider the usage of BCP command-line utility or SQL server integration in this guide. We’ll consider the case of loading our data as Azure storage Blobs and then use PolyBase to load the data into SQL Data Warehouse.

Accessing these services happens through HTTP APIs, as we see again APIs play an important role in both the extraction but also the loading of data into our data warehouse. You can access these APIs by using a tool like CURL or Postman. Or use the libraries provided by Microsoft for your favorite language. Before you actually upload any data you have to create a container which is something similar as a concept to the Amazon AWS Bucket, creating a container is a straightforward operation and you can do it by following the instructions found on the Blog storage documentation from Microsoft. As an example, the following code can create a container in Node.js.

JAVASCRIPT
blobSvc.createContainerIfNotExists('mycontainer', function(error, result, response){
if(!error){
// Container exists and allows
// anonymous read access to blob
// content and metadata within this container
}
});

After the creation of the container you can start uploading data to it by using again the given SDK of your choice in a similar fashion:

JAVASCRIPT
blobSvc.createBlockBlobFromLocalFile('mycontainer', 'myblob', 'test.txt', function(error, result, response){
if(!error){
// file uploaded
}
});

When you are done putting your data into Azure Blobs you are ready to load it into SQL Data Warehouse using PolyBase. To do that you should follow the directions in the Load with PolyBase documentation. To summarize, the required steps to do it, are the following:

  • create a database master key
  • create a database scoped credentials
  • create an external file format
  • create an external data source

PolyBase’s ability to transparently parallelize loads from Azure Blob Storage will make it the fastest tool for loading data. After configuring PolyBase, you can load data directly into your SQL Data Warehouse by simply creating an external table that points to your data in storage and then mapping that data to a new table within SQL Data Warehouse.

Of course, you will need to establish a recurrent process that will extract any newly created data from your service, load them in the form of Azure Blobs and initiate the PolyBase process for importing the data again into SQL Data Warehouse. One way of doing this is by using the Azure Data Factory service. In case you would like to follow this path you can read some good documentation on how to move data to and from Azure SQL Warehouse using Azure Data Factory.

The best way to load data from Magento to SQL Data Warehouse and possible alternatives

So far we just scraped the surface of what can be done with Microsoft Azure SQL Data Warehouse and how to load data into it. The way to proceed relies heavily on the data you want to load, from which service they are coming from, and the requirements of your use case. Things can get even more complicated if you want to integrate data coming from different sources. A possible alternative, instead of writing, hosting and maintaining a flexible data infrastructure, is to use a product like RudderStack that can handle this kind of problems automatically for you.

RudderStack integrates with multiple sources or services like databases, CRM, email campaigns, analytics and more. Don't want to go through the pain of direct integration? RudderStack’s Magento to SQL Data Warehouse integration makes it easy to send data from Magento to SQL Data Warehouse.

Sign Up For Free And Start Sending Data
Test out our event stream, ELT, and reverse-ETL pipelines. Use our HTTP source to send data in less than 5 minutes, or install one of our 12 SDKs in your website or app.