How to Load data from Salesforce to MS SQL Server
Pull data from Salesforce
You can’t use a Data Warehouse without data, so the first and most important step is to extract the data you want from Salesforce.
Salesforce has many products and it’s also a pioneer in cloud computing and the API economy. This means that it offers a plethora of APIs to access the services and the underlying data. In this post, we’ll focus only on Salesforce CRM, which again exposes a large number of APIs.
More specifically, when choosing which API to use, we have the following options.
- REST API
- SOAP API
- Chatter REST API
- Bulk API
- Metadata API
- Streaming API
- Apex REST API
- Apex SOAP API
- Tooling API
Pull Data using Salesforce REST API
From the above list, the complexity and feature richness of Salesforce API is more than evident. Both APIs are exposing the same functionalities but using different protocols. Interacting with the REST API can be done by using tools like CURL or Postman or by using HTTP clients for your favorite language or framework. A few suggestions:
- Apache HttpClient for Java
- Spray-client for Scala
- Hyper for Rust
- Ruby rest-client
- Python http-client
The Salesforce REST API supports oAuth 2.0 authentication, more information can be found in Salesforce’s Understanding Authentication article. After you successfully authenticate with this API, you have to start interacting with its resources and start fetching its data in order to load them on a data warehouse.
It’s easy to get a list of all the resources we have access to, for example using curl we can execute the following:
SH
curl https://na1.salesforce.com/services/data/v26.0/ -H "Authorization: Bearer token"
A typical response from the server will be a list of available resources in JSON or XML, depending on what you have asked as part of your request.
JSON
{"sobjects" : "/services/data/v26.0/sobjects","licensing" : "/services/data/v26.0/licensing","connect" : "/services/data/v26.0/connect","search" : "/services/data/v26.0/search","query" : "/services/data/v26.0/query","tooling" : "/services/data/v26.0/tooling","chatter" : "/services/data/v26.0/chatter","recent" : "/services/data/v26.0/recent"}
Salesforce REST API is very expressive, it also supports a language called Salesforce Object Query Language (SOQL) for executing arbitrarily complex queries. For example, the following curl command will return the name fields of accounts:
SH
curl https://na1.salesforce.com/services/data/v20.0/query/?q=SELECT+name+from+Account -H "Authorization: Bearer token"
and the result will look like the following:
JSON
{"done" : true,"totalSize" : 14,"records" :[{"attributes" :{
Again, the result can be server objects either in JSON or XML serialization. We would recommend using JSON as it will make the whole data connecting process easier because the most popular data warehousing solutions natively support it.
With XML you might have to transform it first into JSON before loading any data to the repository. More information about SOQL can be found on a Salesforce Object Query Language specification page.
If for any reason you would prefer to use SOAP, then you should create a SOAP client first: for example, you can use the force.com Web Service Connector (WSC), the client. Or create your own using the WSDL using the information provided by this guide.
Despite the protocol changes, the architecture of the API remains the same, so again you will be able to access the same resources.
After you have your client ready and you are able to connect to Salesforce you need to perform the following steps:
- decide which resources to extract from the API
- map these resources to the schema of each data warehouse repository that you will use
- transform data into it and
- load the transformed data on the repository based on the instructions below
As you can see, accessing the API alone is not enough for ensuring the operation of a pipeline that will safely and on time deliver your data on a data warehousing solution for analysis.
Pull Data using Salesforce Streaming API
Another interesting way of interacting with SalesForce is through the Streaming API.
With it, you define queries, and every time something changes to data that register to this query you get notifications. So for example, every time you get a new account created the API will push notification about the event to your desired service. This is an extremely powerful mechanism that can guarantee almost real-time updates on a Data Warehouse repository.
In order to implement something like that though, you need to take into consideration the limitations of both ends, while ensuring the delivery semantics that your use case requires for any data management infrastructure that you will build.
For more information, you can read the documentation of the Streaming API.
Salesforce Data Preparation for Microsoft SQL Server
As in every relational database, SQL Server requires a well-defined database schema before we start populating with data. Data is organized into schemas, which are distinct namespaces where database objects belong to.
The most common database objects are of course tables which have a number of columns with each one having a declared data type. MS SQL Server supports a large number of different data types. This gives us great flexibility in expressing the data that we have and at the same time optimizing our data warehouse.
When working with data coming from web services, where data is usually serialized in JSON, it is important to correctly map any data to the right data types. As changing the data type is a process that may cost in downtime of your database, it is important to spend enough time thinking about the proper data type assignments.
For example, dates in JSON are just strings, but when storing date objects in a database, we can enhance analytics with great capabilities by transforming the raw string data into an appropriate data type. A typical strategy for loading data from SalesForce to SQL Server database is to create a schema where you will map each API endpoint to a table. Each key inside the SalesForce API endpoint response should be mapped to a column of that table and you should ensure the right conversion to a SQL Server compatible data type.
Of course, you will need to ensure that as data types from SalesForce API might change, you will adapt your database tables accordingly since there’s no such thing as automatic data typecasting. After you have a complete and well-defined data model or schema for Microsoft SQL Server, you can move forward and start loading your data into the database.
Load data from SalesForce to MS SQL Server
As a feature-rich and mature product, MS SQL Server offers a large and diverse set of methods for loading data into it. One way of importing data to your database is by using the SQL Server Import and Export Wizard. With it, you will be able to bulk load data using a number of data sources that are supported through a visual interface.
You can import from another SQL Server database, an Oracle database, Flat Files, and Access Data Source, PostgreSQL, MySQL, or Azure Blob Storage. Especially if you are using a managed version of MS SQL Server on Azure, you should definitely consider utilizing the Azure Blob Storage connection: you can load data as Blobs on Azure and your MS SQL Server database will sync with it through the Import and Export Wizard.
Another way for importing bulk data into a SQL Server database, both on Azure and on-premises, is by using the bcp utility. This is a command-line tool that is built specifically for bulk loading and unloading of data using an MS SQL database.
Finally and for compatibility reasons, especially if you are managing databases from different vendors, you can use BULK INSERT SQL statements.
In a similar way as with other databases, you can also use the standard INSERT statements, where you will be adding data row-by-row directly to a table. It is the most basic and straightforward way of adding data in a table, but it doesn’t scale very well with larger datasets.
Updating your SalesForce data on MS SQL Server
As you will be generating more data on SalesForce, you will need to update your older data on an MS SQL Server database. This includes new records together with updates to older records that for any reason have been updated on SalesForce.
You will need to periodically check SalesForce for new data and repeat the process that has been described previously while updating your currently available data if needed. Updating an already existing row on a SQL Server table is achieved by creating UPDATE statements.
Another issue that you need to take care of is the identification and removal of any duplicate records on your database, which can come from Salesforce’s lack of a mechanism to identify new and updated records, or from errors on your data pipelines.
In general, ensuring the quality of data that is inserted in your database is a big and difficult issue; MS SQL Server features like TRANSACTIONS can help, although they do not solve the problem in the general case.
The best way to load data from Salesforce to MS SQL Server
So far we just scraped the surface of what you can do with MS SQL Server and how to load data into it. Things can get even more complicated if you want to integrate data coming from different sources.
Would you prefer to achieve results right now?
Instead of writing, hosting, and maintaining a flexible data infrastructure use RudderStack that can handle everything automatically for you.
RudderStack with one click integrates with sources or services, creates analytics-ready data, and gives you a SQL Server integration with Salesforce immediately. Don't want to go through the pain of direct integration? RudderStack’s Salesforce to MS SQL Server integration makes it easy to send data from Salesforce to MS SQL Server.