How to load data from Salesforce to PostgreSQL

Extract your data from Salesforce

You can’t use a Data Warehouse without data, so the first and most important step is to extract the data you want from Salesforce. Salesforce has many products, and as a pioneer in cloud computing and the API economy, it offers a plethora of APIs for accessing its services and the underlying data. In this post, we’ll focus only on Salesforce CRM, which itself exposes many APIs. More specifically, as described in this excellent post from their Helpdesk about which API to use, we have the following options:

  • REST API
  • SOAP API
  • Chatter REST API
  • Bulk API
  • Metadata API
  • Streaming API
  • Apex REST API
  • Apex SOAP API
  • Tooling API

Pull data from the Salesforce REST API

The above list makes the complexity and feature richness of the Salesforce API more than evident. The REST API and the SOAP API expose the same functionality, just over different protocols. You can interact with the REST API using tools like cURL or Postman, or with an HTTP client for your favorite language or framework. A few suggestions:

  • Apache HttpClient for Java
  • Spray-client for Scala
  • Hyper for Rust
  • rest-client for Ruby
  • http.client for Python (standard library)

The Salesforce REST API supports OAuth 2.0 authentication; more information can be found in the Understanding Authentication article. After you successfully authenticate with the REST API, you can start interacting with its resources and fetching data to load into a data warehouse. It’s easy to get a list of all the resources we have access to. For example, using curl, we can execute the following:

SH
curl https://na1.salesforce.com/services/data/v26.0/ -H "Authorization: Bearer token"

A typical response from the server will be a list of available resources, in JSON or XML depending on which serialization you requested.

JSON
{
  "sobjects" : "/services/data/v26.0/sobjects",
  "licensing" : "/services/data/v26.0/licensing",
  "connect" : "/services/data/v26.0/connect",
  "search" : "/services/data/v26.0/search",
  "query" : "/services/data/v26.0/query",
  "tooling" : "/services/data/v26.0/tooling",
  "chatter" : "/services/data/v26.0/chatter",
  "recent" : "/services/data/v26.0/recent"
}
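
The Bearer token in the requests above comes from an OAuth flow. Here is a minimal sketch in Python using the requests library and the OAuth 2.0 username-password flow; the consumer key and secret come from a Salesforce Connected App, and every credential value below is a placeholder:

PYTHON
import requests

# All credential values are placeholders; the consumer key/secret come
# from a Salesforce Connected App, and Salesforce expects the account's
# security token appended to the password.
payload = {
    "grant_type": "password",
    "client_id": "YOUR_CONSUMER_KEY",
    "client_secret": "YOUR_CONSUMER_SECRET",
    "username": "user@example.com",
    "password": "your_password" + "your_security_token",
}

response = requests.post("https://login.salesforce.com/services/oauth2/token", data=payload)
response.raise_for_status()
auth = response.json()

# instance_url and access_token are used for every subsequent API call
token = auth["access_token"]
instance_url = auth["instance_url"]

resources = requests.get(
    f"{instance_url}/services/data/v26.0/",
    headers={"Authorization": f"Bearer {token}"},
).json()
print(resources)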

The Salesforce REST API is very expressive, and it also supports a language called Salesforce Object Query Language (SOQL) for executing arbitrarily complex queries. For example, the following curl command will return the Name field of every Account:

SH
curl "https://na1.salesforce.com/services/data/v26.0/query/?q=SELECT+name+from+Account" -H "Authorization: Bearer token"

and the result will look like the following:

JSON
{
  "done" : true,
  "totalSize" : 14,
  "records" : [
    {
      "attributes" : {
      ...

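Query results that exceed the batch size are paginated: while "done" is false, each response carries a "nextRecordsUrl" pointing at the next batch. Here is a Python sketch that follows those pages, reusing the token and instance_url from the authentication sketch above:

PYTHON
import requests

headers = {"Authorization": f"Bearer {token}"}
url = f"{instance_url}/services/data/v26.0/query/"
params = {"q": "SELECT Id, Name, CreatedDate FROM Account"}

records = []
while url:
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    page = response.json()
    records.extend(page["records"])
    # Only the first request carries the query string; later pages are
    # addressed entirely by nextRecordsUrl.
    params = None
    url = instance_url + page["nextRecordsUrl"] if not page["done"] else None

print(f"Fetched {len(records)} accounts")
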
Again, the result can be serialized as either JSON or XML. We recommend JSON, as it makes the whole data-moving process easier: the most popular data warehousing solutions natively support it, while XML results would first have to be transformed into JSON before loading the data into the repository. More information about SOQL can be found on the Salesforce Object Query Language specification page.

If you would prefer to use SOAP, then you should create a SOAP client first. You can, for example, use the Force.com Web Service Connector (WSC) client, or generate your own from the WSDL using the information provided by this guide. Although the protocol changes, the API architecture remains the same, so you will be able to access the same resources. After your client is ready and you can communicate with Salesforce, you need to perform the following steps:

  1. Decide which resources to extract from the API.
  2. Map these resources to the schema of the data warehouse repository that you will use.
  3. Transform the data to match that schema (see the sketch after this list).
  4. Load the transformed data into the repository, following the instructions below.
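
To make step 3 concrete, here is a small sketch that flattens one Account record from the query sketch above into a row matching a hypothetical accounts(id, name, created_date) table; the target schema is illustrative, not prescribed by Salesforce:

PYTHON
from datetime import datetime

def account_to_row(record: dict) -> tuple:
    # Drop the "attributes" metadata and convert Salesforce's ISO-8601
    # timestamp string into a Python datetime for a TIMESTAMPTZ column.
    created = datetime.strptime(record["CreatedDate"], "%Y-%m-%dT%H:%M:%S.%f%z")
    return (record["Id"], record["Name"], created)

# `records` is the list fetched in the SOQL pagination sketch
rows = [account_to_row(r) for r in records]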

As you can see, accessing the API alone is not enough to operate a pipeline that delivers your data to a data warehousing solution safely and on time for analysis.

Pull Data using the Salesforce Streaming API

Another interesting way of interacting with Salesforce is through the Streaming API. With it, you define queries, and every time the data matching a query changes, you get a notification. So, for example, every time a new account is created, the API will push a notification about the event to your desired service. This is an extremely powerful mechanism that can guarantee near real-time updates of a Data Warehouse repository. To implement something like this, though, you need to consider the limitations of both ends while ensuring the delivery semantics that your use case requires from the data management infrastructure you will build.
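
With the Streaming API, the queries to watch are defined as PushTopic records: sObjects whose Query field holds a SOQL query. Here is a hedged sketch of creating one through the REST API, reusing token and instance_url from the earlier sketches; note that the exact set of PushTopic fields depends on your API version, and you still need a CometD (Bayeux) client to actually subscribe to the resulting channel:

PYTHON
import requests

# A PushTopic named NewAccounts that emits events for Account changes.
# Name, Query, and ApiVersion are the core fields; notification flags
# such as NotifyForOperationCreate vary by API version, so check the
# Streaming API docs for your org.
push_topic = {
    "Name": "NewAccounts",
    "Query": "SELECT Id, Name FROM Account",
    "ApiVersion": 26.0,
}

response = requests.post(
    f"{instance_url}/services/data/v26.0/sobjects/PushTopic/",
    headers={"Authorization": f"Bearer {token}"},
    json=push_topic,
)
response.raise_for_status()
print(response.json())  # contains the id of the new PushTopic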

For more information, you can read the documentation of the Streaming API.

Salesforce Data Preparation for PostgreSQL

To populate a PostgreSQL database instance with data, you need to have a well-defined data model, or schema, that describes the data. As a relational database, PostgreSQL organizes data around tables. Each table is a collection of columns with a predefined data type like INTEGER or VARCHAR; PostgreSQL, like any other SQL database, supports a wide range of data types.

A typical strategy for loading data from Salesforce to a PostgreSQL database is to create a schema where you map each API endpoint to a table. Each key inside a Salesforce API endpoint response should be mapped to a column of that table, with the right conversion to a PostgreSQL-compatible data type. For example, if an endpoint returns a value as a string, you should convert it into a VARCHAR with a predefined maximum size, or a TEXT data type. Tables can then be created on your database using the CREATE TABLE SQL statement, as in the sketch below. Of course, since the data types from the Salesforce API might change, you will need to adapt your database tables accordingly; there’s no such thing as automatic data type casting. After you have a complete, well-defined data model or schema for PostgreSQL, you can move forward and start loading your data into the database.
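
As a minimal sketch of this mapping, here is how the hypothetical accounts table from the earlier transformation sketch could be created with Python's psycopg2 driver; the connection parameters are placeholders:

PYTHON
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=etl password=secret host=localhost")

# Salesforce record Ids are at most 18 characters; string fields map to
# VARCHAR/TEXT and timestamps to TIMESTAMPTZ.
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS accounts (
            id           VARCHAR(18) PRIMARY KEY,
            name         TEXT NOT NULL,
            created_date TIMESTAMPTZ
        )
    """)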

Load data from Salesforce to PostgreSQL

Once you have defined your schema and created your tables with the proper data types, you can start loading data into your database. The most straightforward way to insert data into a PostgreSQL database is by creating and executing INSERT statements, which add data row by row directly to a table. This is the most basic way of adding data to a table, but it doesn’t scale very well with larger datasets.
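
A sketch of this row-by-row approach with psycopg2, reusing the rows list built in the transformation sketch:

PYTHON
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=etl password=secret host=localhost")

# One INSERT per row; psycopg2 safely fills in the %s placeholders.
# `rows` is the list of (id, name, created) tuples built earlier.
with conn, conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO accounts (id, name, created_date) VALUES (%s, %s, %s)",
        rows,
    )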

The preferred way to add larger datasets into a PostgreSQL database is the COPY command, which loads data from a file on a file system that is accessible by the PostgreSQL instance. This way, much larger datasets can be inserted into the database in less time. You should also consult the PostgreSQL documentation on how to populate a database with data; it includes a number of very useful best practices for optimizing the loading process. Note that COPY from a server-side file requires physical access to the file system, and nowadays, with cloud-based, fully managed databases, such direct access is not always possible. If that is the case and you cannot use a server-side COPY statement, another option is to use PREPARE and INSERT to end up with optimized, more performant INSERT queries.
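
Conveniently, the COPY FROM STDIN variant streams the data over the client connection itself, so it also works against managed instances where you cannot touch the server's file system. A sketch using psycopg2's copy_expert with an in-memory CSV buffer:

PYTHON
import csv
import io

import psycopg2

conn = psycopg2.connect("dbname=warehouse user=etl password=secret host=localhost")

# Serialize the previously transformed rows as CSV in memory, then
# stream the buffer to the server with COPY ... FROM STDIN.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
buf.seek(0)

with conn, conn.cursor() as cur:
    cur.copy_expert(
        "COPY accounts (id, name, created_date) FROM STDIN WITH (FORMAT csv)",
        buf,
    )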

Updating your Salesforce data on PostgreSQL

As you generate more data in Salesforce, you will need to update your older data in PostgreSQL. This includes both new records and older records that have been updated in Salesforce for any reason. You will need to periodically check Salesforce for new data and repeat the process described previously, updating your currently available data if needed. Updating an already existing row in PostgreSQL is achieved with UPDATE statements.

Another issue you need to take care of is identifying and removing any duplicate records from your database. Whether because your setup has no mechanism for identifying new and updated Salesforce records or because of errors in your data pipelines, duplicates might be introduced. In general, ensuring the quality of the data inserted into your database is a big and difficult problem, and PostgreSQL features like transactions can help tremendously; however, they do not solve the problem in the general case.
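
One common way to handle new records, updated records, and accidental duplicates in a single idempotent statement is PostgreSQL's upsert, INSERT ... ON CONFLICT, available since version 9.5; note this is a general PostgreSQL technique, not something Salesforce-specific. A sketch keyed on the table's primary key:

PYTHON
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=etl password=secret host=localhost")

# Insert each fetched row; if a row with the same id already exists,
# overwrite its columns with the freshly fetched values instead.
with conn, conn.cursor() as cur:
    cur.executemany(
        """
        INSERT INTO accounts (id, name, created_date)
        VALUES (%s, %s, %s)
        ON CONFLICT (id) DO UPDATE
        SET name = EXCLUDED.name,
            created_date = EXCLUDED.created_date
        """,
        rows,
    )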

The best way to load data from Salesforce to PostgreSQL and possible alternatives

So far, we have just scratched the surface of what can be done with PostgreSQL and how to load data into it. The way to proceed depends heavily on the data you want to load, the service it comes from, and the requirements of your use case.

Things can get even more complicated if you want to integrate data coming from different sources. A possible alternative, instead of writing, hosting, and maintaining a flexible data infrastructure, is to use an ETL-as-a-service product like RudderStack that can handle this kind of problem automatically for you.

RudderStack integrates with multiple sources and services like databases, CRMs, email campaigns, analytics, and more.

Sign Up For Free And Start Sending Data
Test out our event stream, ELT, and reverse-ETL pipelines. Use our HTTP source to send data in less than 5 minutes, or install one of our 12 SDKs in your website or app.
Don't want to go through the pain of direct integration? RudderStack's Salesforce integration makes it easy to send data from Salesforce to PostgreSQL.