In our previous blog post, we described our experience setting up an open-source analytics stack using RudderStack, Apache Superset, and Redshift. When choosing a data warehouse for such a stack, it is common to pit the options against one another: Redshift vs. BigQuery, Redshift vs. Snowflake, and so on.

In this post, we will go into the reasoning behind choosing AWS Redshift over other popular data warehousing solutions such as Google BigQuery and Snowflake.

Background

AWS Redshift and Google BigQuery are among the market leaders in cloud-based data warehousing today. Both use columnar storage and therefore offer improved performance over traditional row-oriented RDBMSs, and both provide mature, intuitive management consoles along with a rich set of programmatic interfaces. Snowflake, on the other hand, has emerged as a modern data warehouse solution offering ease of use, instant scalability, and strong analytics and security features.

For this comparison, we used a data set that consists of approximately 20 million event records, triggered directly or indirectly by the users of a mobile casino game. The mobile application contains multiple games – both free and paid – that mimic the casino slot machines.

Data for comparison

We used the data from the following three tables:

  • revenue – details for purchase events
  • spin_result – details for events triggered by automatic or manual player spin action
  • tracks – parent table containing common details of all events
The revenue table schema (column name and data type):

  • abtest_tags (STRING)
  • abtest_values (STRING)
  • anonymous_id (STRING)
  • category (STRING)
  • clan_id (STRING)
  • clan_name (STRING)
  • clan_revenue_share_eligible (BOOL)
  • coin_balance (FLOAT64)
  • context_app_build (STRING)
  • context_app_name (STRING)
  • context_app_namespace (STRING)
  • context_app_version (STRING)
  • context_device_id (STRING)
  • context_device_manufacturer (STRING)
  • context_device_model (STRING)
  • context_device_name (STRING)
  • context_ip (STRING)
  • context_library_name (STRING)
  • context_library_version (STRING)
  • context_locale (STRING)
  • context_network_bluetooth (BOOL)
  • context_network_carrier (STRING)
  • context_network_cellular (BOOL)
  • context_network_wifi (BOOL)
  • context_os_name (STRING)
  • context_os_version (STRING)
  • context_screen_density (FLOAT64)
  • context_screen_height (FLOAT64)
  • context_screen_width (FLOAT64)
  • context_timezone (STRING)
  • context_traits_anonymousid (STRING)
  • context_useragent (STRING)
  • current_module_name (STRING)
  • current_vip_points (FLOAT64)
  • event (STRING)
  • event_text (STRING)
  • fb_profile (STRING)
  • game_name (STRING)
  • gem_balance (FLOAT64)
  • graphicsquality (STRING)
  • id (STRING)
  • idfa (STRING)
  • internetreachability (STRING)
  • islowenddevice (STRING)
  • level (FLOAT64)
  • lifetime_gem_balance (FLOAT64)
  • player_total_battles (FLOAT64)
  • player_total_shields (FLOAT64)
  • price (FLOAT64)
  • productid (STRING)
  • quantity (FLOAT64)
  • received_at (DATETIME)
  • revenuetype (STRING)
  • sent_at (DATETIME)
  • start_date (DATETIME)
  • timestamp (DATETIME)
  • total_payments (FLOAT64)
  • user_id (STRING)
  • uuid_ts (DATETIME)
  • versionsessioncount (FLOAT64)
  • vip_days_left (FLOAT64)
  • vip_level (FLOAT64)

The spin_result table schema (column name and data type):

  • abtest_tags (STRING)
  • abtest_values (STRING)
  • additional_bet_index (FLOAT64)
  • anonymous_id (STRING)
  • battle_id (STRING)
  • bet_amount (FLOAT64)
  • bet_level (FLOAT64)
  • bet_multiplier (FLOAT64)
  • bingo_dropped_numbers (STRING)
  • category (STRING)
  • coin_balance (FLOAT64)
  • context_app_build (STRING)
  • context_app_name (STRING)
  • context_app_namespace (STRING)
  • context_app_version (STRING)
  • context_device_id (STRING)
  • context_device_manufacturer (STRING)
  • context_device_model (STRING)
  • context_device_name (STRING)
  • context_ip (STRING)
  • context_library_name (STRING)
  • context_library_version (STRING)
  • context_locale (STRING)
  • context_network_bluetooth (BOOL)
  • context_network_carrier (STRING)
  • context_network_cellular (BOOL)
  • context_network_wifi (BOOL)
  • context_os_name (STRING)
  • context_os_version (STRING)
  • context_screen_density (FLOAT64)
  • context_screen_height (FLOAT64)
  • context_screen_width (FLOAT64)
  • context_timezone (STRING)
  • context_traits_anonymousid (STRING)
  • context_useragent (STRING)
  • current_module_name (STRING)
  • current_vip_points (FLOAT64)
  • days_in_game (FLOAT64)
  • event (STRING)
  • event_text (STRING)
  • extra_param (STRING)
  • fb_profile (STRING)
  • featuregametype (STRING)
  • game_id (STRING)
  • game_name (STRING)
  • gem_balance (FLOAT64)
  • graphicsquality (STRING)
  • id (STRING)
  • idfa (STRING)
  • internetreachability (STRING)
  • is_auto_spin (STRING)
  • is_hyper_bonus (STRING)
  • is_turbo (STRING)
  • isf (STRING)
  • ishighroller (STRING)
  • islowenddevice (BOOL)
  • jackpot_win_amount (FLOAT64)
  • jackpot_win_type (STRING)
  • level (FLOAT64)
  • lifetime_gem_balance (FLOAT64)
  • no_of_spin (FLOAT64)
  • player_total_battles (FLOAT64)
  • player_total_shields (FLOAT64)
  • received_at (DATETIME)
  • sent_at (DATETIME)
  • start_date (DATETIME)
  • timestamp (DATETIME)
  • total_payments (FLOAT64)
  • total_win_amount (FLOAT64)
  • tournament_id (STRING)
  • user_id (STRING)
  • uuid_ts (DATETIME)
  • versionsessioncount (FLOAT64)
  • vip_days_left (FLOAT64)
  • vip_level (FLOAT64)
  • win_amount (FLOAT64)

The tracks table schema (column name and data type):

  • anonymous_id (STRING)
  • context_app_build (STRING)
  • context_app_name (STRING)
  • context_app_namespace (STRING)
  • context_app_version (STRING)
  • context_device_id (STRING)
  • context_device_manufacturer (STRING)
  • context_device_model (STRING)
  • context_device_name (STRING)
  • context_ip (STRING)
  • context_library_name (STRING)
  • context_library_version (STRING)
  • context_locale (STRING)
  • context_network_bluetooth (BOOL)
  • context_network_carrier (STRING)
  • context_network_cellular (BOOL)
  • context_network_wifi (BOOL)
  • context_os_name (STRING)
  • context_os_version (STRING)
  • context_screen_density (FLOAT64)
  • context_screen_height (FLOAT64)
  • context_screen_width (FLOAT64)
  • context_timezone (STRING)
  • context_traits_anonymousid (STRING)
  • context_useragent (STRING)
  • event (STRING)
  • event_text (STRING)
  • id (STRING)
  • received_at (DATETIME)
  • sent_at (DATETIME)
  • timestamp (DATETIME)
  • user_id (STRING)
  • uuid_ts (DATETIME)

Evaluation Methodology

We compared Redshift, BigQuery, and Snowflake on two major aspects – performance and cost. For the performance evaluation, we ran a set of 11 queries – used extensively by the product managers of the gaming application – on each of the data warehouses. As the pricing model varies across the three data warehouses, we also attempted to define a common, usage-based standard for cost evaluation.

Some other considerations that guided this performance comparison exercise were:

  • Data loading time was not considered as part of this exercise.
  • Performance comparisons were done using two Redshift cluster configurations: one with a single dc2.large node and another with two dc2.large nodes. Larger configurations were deemed too costly for this setup, as discussed later.
  • All queries were executed from a Python program. We used the psycopg2 module for accessing Redshift and the google.cloud module for BigQuery. For Snowflake, we used SnowSQL, the Snowflake command-line interface.
  • Multiple readings were taken to balance one-off factors such as network disruptions.
  • No platform optimizations, vendor-provided or otherwise, were implemented during the exercise.
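The measurement procedure above can be sketched in pure Python. The warehouse-specific client (psycopg2, google.cloud, or SnowSQL) is abstracted behind a `run_query` callable, which is our own hypothetical stand-in rather than code from the original exercise:

```python
import statistics
import time

def benchmark(run_query, runs=5):
    """Time `run_query` `runs` times. Report the first (cold) execution
    separately, and summarize the remaining runs with the average,
    median, and standard deviation, all in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # e.g. cursor.execute(sql) for psycopg2
        timings.append(time.perf_counter() - start)
    first, rest = timings[0], timings[1:]
    return {
        "first_run_s": first,
        "avg_s": statistics.mean(rest),
        "median_s": statistics.median(rest),
        "stdev_s": statistics.stdev(rest),
    }

# Stand-in workload instead of a real warehouse query:
result = benchmark(lambda: sum(range(100_000)))
print(result)
```

Taking the first run separately makes caching effects visible: a warehouse that caches will show a large gap between the first reading and the remaining four.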

Cost Comparison: Redshift vs BigQuery vs Snowflake

When comparing the cost of data storage for running the queries, the following are the key takeaways for each data warehouse:

AWS Redshift

  • For dc2.large nodes with 160 GB of storage per node, the cost is $0.25/node/hour, i.e., $180/node/month, with no separate charge for processing queries.
  • As our cluster uses two dc2.large nodes, the total cost comes to $360/month.

Google BigQuery

  • BigQuery charges $20/TB/month for storage and $5/TB for queries.
  • The total cost of running the 11 queries (only once) came to roughly $0.56.
  • The dataset being queried is 39.30 GB, that is, 0.0393 TB, so the storage cost came to around $0.79/month.

Snowflake

Snowflake offers several editions that can be chosen based on usage; for this exercise, we used the Business Critical edition for our data storage. Storage is billed in one of two ways:

  • On-demand: Snowflake charges $40/TB per month for storage. Although our total cost of storage comes to $40 * 0.0393 TB = $1.57/month, there is a $25 minimum monthly charge for on-demand customers.
  • Pre-purchased: Snowflake offers a pre-purchased plan of $23/TB/month for data storage.

For querying, Snowflake Credits pay for compute usage; cloud services are billed only for consumption that exceeds 10% of the daily compute usage. The total number of credits used to execute the 11 queries was 1.866761388. Given that the cost per credit on the Business Critical edition is $4, the total cost of querying comes to $4 * 1.866761388 ≈ $7.5.
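The credit arithmetic above can be sketched as follows; the helper name is ours, and the 10% cloud-services threshold and per-credit price are as quoted in this post:

```python
# Snowflake billing sketch: cloud-services credits are billed only for
# the portion exceeding 10% of the day's compute-credit consumption.
def billable_cloud_services(compute_credits, cloud_services_credits):
    return max(0.0, cloud_services_credits - 0.10 * compute_credits)

# Query cost for this exercise: 1.866761388 credits at $4/credit
# (Business Critical edition).
CREDIT_PRICE_USD = 4.0
credits_used = 1.866761388
query_cost = CREDIT_PRICE_USD * credits_used
print(f"query cost: ${query_cost:.2f}")  # ≈ $7.47, quoted as $7.5 in the post

# The 10% rule: with 10 compute credits and 1.5 cloud-services credits
# consumed in a day, only 0.5 cloud-services credits are billed.
print(billable_cloud_services(10, 1.5))
```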

Key Takeaways

  • If the queries are run just once a month, Redshift is approximately 267 times more expensive than BigQuery.
  • Storing the data and running the queries once cost far less on BigQuery ($1.35) than on Snowflake ($7.5).

Performance Comparison: Redshift vs BigQuery vs Snowflake

The following table shows the performance comparison when it comes to Redshift vs BigQuery vs Snowflake, as per our exercise:

Each query was executed five times on each warehouse. The tables below report the first (cold) execution time, along with the average, median, and standard deviation of the remaining four runs (all values in seconds).

Redshift (1 dc2.large)

Query   1st run    Avg of next 4   Median    Std dev
Q1      17.66056   0.2816          0.2500    0.6006
Q2      20.70354   0.2637          0.2534    0.0295
Q3      17.46567   0.2683          0.2482    0.0415
Q4      29.12982   0.5341          0.5384    0.0365
Q5      7.10846    0.2470          0.2471    0.0048
Q6      2.50200    0.26749         0.2518    0.0369
Q7      2.47661    0.2643          0.2501    0.0395
Q8      2.52171    0.256           0.2416    0.0285
Q9      11.16986   0.2612          0.2556    0.0124
Q10     28.49286   0.2741          0.2750    0.0340
Q11     8.57887    0.8989          0.8791    0.0368

Redshift (2 dc2.large)

Query   1st run    Avg of next 4   Median    Std dev
Q1      4.39874    0.5031          0.2509    0.4390
Q2      3.93964    0.5189          0.2827    0.4307
Q3      3.91162    0.2465          0.2462    0.0015
Q4      4.70708    0.7002          0.6641    0.1830
Q5      3.58540    0.2472          0.2473    0.0010
Q6      3.47770    0.4200          0.3334    0.2030
Q7      3.42400    0.2928          0.2860    0.2030
Q8      3.59205    0.2606          0.2501    0.0204
Q9      4.63495    0.2547          0.2497    0.0116
Q10     10.89786   0.5594          0.2518    0.5352
Q11     14.84757   0.6693          0.5452    0.2285

Snowflake

Query   1st run    Avg of next 4   Median    Std dev
Q1      4.352      1.5266          0.387     1.591
Q2      1.733      0.6088          0.333     0.5621
Q3      1.187      0.5088          0.349     0.339
Q4      1.784      0.6556          0.399     0.5649
Q5      1.847      1.0298          1.267     0.5730
Q6      4.598      1.4046          0.354     1.6529
Q7      0.365      0.3472          0.365     0.0287
Q8      0.853      1.0672          1.264     0.423
Q9      9.727      2.7792          1.346     3.4952
Q10     7.381      2.1494          1.240     2.646
Q11     2.992      0.89            0.358     1.051

BigQuery

Query   1st run    Avg of next 4   Median    Std dev
Q1      2.08194    0.6006          0.4390    0.0519
Q2      0.68288    0.5473          0.5465    0.0422
Q3      0.60482    0.5309          0.5229    0.0481
Q4      0.63746    0.5203          0.5173    0.0282
Q5      0.70232    0.8076          0.7428    0.2236
Q6      0.55107    0.6096          0.6158    0.1063
Q7      0.63842    0.5221          0.6105    0.1585
Q8      0.58993    0.5653          0.5460    0.0448
Q9      0.64536    0.5552          0.5297    0.0473
Q10     0.54803    0.5445          0.5492    0.0204
Q11     0.61591    0.6673          0.6370    0.1444

Key Takeaways

  • In terms of first-run query execution times, Redshift with one node is much slower than both Snowflake and Redshift with two nodes.
  • Redshift appears to cache data between runs, while BigQuery does not by default.

Conclusion

Based on the performance and cost evaluation of the three warehouses, Redshift makes sense as our preferred data warehouse solution. Given the large number of real-time or near-real-time dashboards accessed by many users, Redshift gives us the best ROI on both the performance and cost fronts.

For a detailed appendix that captures the actual times recorded during the execution of queries, please click here.

Editor’s note: This article was originally published in December 2019. It has been completely rewritten for accuracy.