ClickHouse Destination

Sync data from RudderStack to ClickHouse.

ClickHouse is an open-source, column-oriented database management system mainly used for online analytical processing (OLAP). It is fast, and allows for real-time analysis of your data.

info
Refer to the Warehouse Schema guide for more information on how the events are mapped to the tables in ClickHouse.
success
Find the open source code for this destination in the GitHub repository.

Setting user permissions in ClickHouse

Make sure your ClickHouse user has read access to system.columns table for fetching the table schemas.

You also to need to set date_time_input_format to best_effort for ClickHouse to parse all ISO 8601 date and time formats:

set date_time_input_format = 'best_effort';
info
RudderStack uses the UInt8 datatype to set Boolean values and map UInt8to Boolean internally. So when a schema is fetched from ClickHouse, RudderStack treats UInt8 as Boolean.
warning
If you are creating tables in the same database where RudderStack loads, it is highly recommend to not use UInt8as a datatype except for Boolean values (0,1).

Configuring ClickHouse in RudderStack

To send event data to ClickHouse, you first need to add it as a destination in RudderStack and connect it to your data source. Once the destination is enabled, events will automatically start flowing to ClickHouse via RudderStack.

To configure ClickHouse as a destination in RudderStack, follow these steps:

  1. In your RudderStack dashboard, set up the data source. Then, select ClickHouse from the list of destinations.
  2. Assign a name to your destination and then click Next.

Connection Settings

  • Host: The host name of your ClickHouse database.
  • Port: The TCP port of your ClickHouse host. If you want the connection to be secure, use the secure TCP port 9440. Refer to the ClickHouse guide for more information.
  • Database: The database name in your ClickHouse instance where the data gets loaded.
  • Cluster: The name of your ClickHouse cluster. If you are running a single host ClickHouse cluster, leave this field blank.
  • User: The name of the user with the required read/write access to the above database.
  • Password: The password for the above user.
  • Secure: Enable this setting to establish a secure connection.
  • Sync Frequency: Specify how often RudderStack should sync the data to your ClickHouse database.
  • Sync Starting At: This optional setting lets you specify the particular time of the day (in UTC) when you want RudderStack to sync the data to the warehouse.
  • Exclude Window: This optional setting lets you set a time window when RudderStack will not sync the data to your database.

Object storage settings

RudderStack lets you configure the following object storage configuration settings while setting up your ClickHouse destination:

  • Use RudderStack-managed object storage: Enable this setting to use RudderStack-managed buckets for object storage.
warning
This option is applicable only for RudderStack-hosted data planes. For self-hosted data planes, you will have to specify your own object storage configuration settings.

See How RudderStack stores data in an object storage platform for more information.

Advanced settings

  • Warehouse Append: This setting is turned on by default - RudderStack appends your incoming Event Stream data to the existing data in your warehouse. Turning it off causes RudderStack to merge your incoming data into your warehouse to ensure 100% non-duplicate data.
info

The append operation helps to achieve faster data syncs while reducing warehouse costs. However, note that it may increase the number of duplicates in the warehouse, especially if the existing data is older than two weeks. A common scenario where duplication might occur is when the SDKs retry sending events in case of failures.

A merge strategy ensures deduplication but can lead to longer sync times and increased warehouse costs.

  • Skip User Table: Toggle on this setting to send events exclusively to the identifies table and skip the users table, eliminating the need for a merge operation on the users table.
  • Skip Tracks Table: Toggle on this setting to skip sending events to the tracks table.

IPs to be allowlisted

To enable network access to RudderStack, allowlist the following RudderStack IPs depending on your region and RudderStack Cloud plan:

Plan
Region
US
EU
Free, Starter, and Growth
  • 3.216.35.97
  • 18.214.35.254
  • 23.20.96.9
  • 34.198.90.241
  • 34.211.241.254
  • 52.38.160.231
  • 54.147.40.62
  • 3.123.104.182
  • 3.125.132.33
  • 18.198.90.215
  • 18.196.167.201
Enterprise
  • 3.216.35.97
  • 34.198.90.241
  • 44.236.60.231
  • 54.147.40.62
  • 100.20.239.77
  • 3.66.99.198
  • 3.64.201.167
  • 3.123.104.182
  • 3.125.132.33
info
All the outbound traffic is routed through these RudderStack IPs.

FAQ

How does RudderStack de-duplicate the events that are loaded into the warehouse?

RudderStack creates tables with the engine ReplacingMergeTree order by (received_at, id) and column dataType as Nullable(dataType).ReplacingMergeTree replaces the latest event which has the same received_at, id while merging. Note that Nullable is not applicable for sortKeys.

How does RudderStack merge the user properties in the user’s table?

For the user’s table, RudderStack creates a table with an engine AggregatingMergeTree ordered by id and a column dataType as SimpleAggregateFunction(anyLast, Nullable(dataType)). Merging the columns with the sameidpicks the last value which is not null. Note that Nullable is not applicable for sortKeys.

How does RudderStack handle cases when loading data into ClickHouse?

RudderStack converts the event keys into the lower case before exporting the data into ClickHouse. This is so that it does not create two tables in case the event name has two different cases.

info
For a more comprehensive FAQ list, refer to the Warehouse FAQ guide.

Questions? Contact us by email or on Slack