In our previous blog post, we described our experience setting up an Open-Source Analytics Stack using RudderStack, Apache SuperSet, and Redshift. In this post, we will go into the reasoning behind choosing AWS Redshift over other popular data warehousing solutions such as Google BigQuery. Here’s what we think of AWS Redshift vs Google BigQuery.

Background

AWS Redshift and Google BigQuery are market leaders for cloud-based data warehouse solutions available today. As they both utilize columnar storage, they offer improved performance over traditional RDBMS. These platforms provide mature, intuitive management consoles and a rich set of programmatic interfaces.

Evaluation Methodology

The dataset used for this comparison exercise consists of ~20M event records triggered directly or indirectly by users of a mobile casino game. The mobile application contains multiple games, both free and paid, that mimic various kinds of slot machines available in casinos.

The performance evaluation was done for a set of 11 queries that were used extensively by the product managers of the gaming application. We also did a cost evaluation for the warehouse solutions.

The following are a few considerations that guided this performance comparison exercise:

  • Data loading time was not considered as part of the exercise.
  • Performance comparisons were done using two Redshift cluster configurations (one with single dc2.large instance and another with two dc2.large instances). Higher configurations were deemed to be too costly for the given setup as discussed later.
  • All the queries were executed from a Python program. The module used for accessing Redshift was psycopg2 while we used the google.cloud module for BigQuery.
  • Multiple readings were taken to balance one-off factors such as network disruptions.

Performance Comparison: AWS Redshift Vs Google BigQuery

The following table shows the performance comparison in terms of speed between Redshift and BigQuery, as per our exercise:

Redshift (1 dc2.large) Redshift (1 dc2.large) Redshift (1 dc2.large) Redshift (1 dc2.large) Redshift (2 dc2.large) Redshift (2 dc2.large) Redshift (2 dc2.large) Redshift (2 dc2.large) BigQuery BigQuery BigQuery BigQuery
Query
1st Query Execution Time(seconds)Average of Remaining 4 columns (seconds)Median  of Remaining 4 columns (seconds)Standard Deviation  of Remaining 4 columns (seconds)
1st Query Execution Time(seconds)Average of Remaining 4 columns (seconds)Median  of Remaining 4 columns (seconds)Standard Deviation  of Remaining 4 columns (seconds)
1st Query Execution Time(seconds)Average of Remaining 4 columns (seconds)Median  of Remaining 4 columns (seconds)Standard Deviation  of Remaining 4 columns (seconds)
Q117.660560.28160.25000.6006 4.398740.50310.25090.43902.081940.6006 0.43900.0519
Q220.703540.26370.25340.02953.93964 0.51890.28270.43070.682880.54730.54650.0422
Q317.465670.26830.24820.04153.911620.24650.24620.00150.604820.53090.52290.0481
Q429.12982 0.53410.53840.03654.707080.70020.66410.18300.637460.52030.51730.0282
Q57.108460.24700.24710.00483.585400.24720.24730.00100.702320.80760.74280.2236
Q62.502000.26749 0.25180.03693.477700.42000.33340.20300.551070.60960.61580.1063
Q72.47661   0.26430.25010.03953.42400 0.29280.28600.20300.638420.52210.61050.1585
Q82.521710.2560.24160.02853.592050.26060.25010.02040.589930.56530.54600.0448
Q911.169860.26120.25560.01244.634950.25470.24970.01160.645360.55520.52970.0473
Q1028.492860.27410.27500.034010.897860.55940.25180.53520.548030.54450.54920.0204
Q118.578870.89890.87910.036814.847570.66930.54520.22850.615910.66730.63700.1444

Pricing Comparison: AWS Redshift vs Google BigQuery

In this section, let us do a comparative cost analysis of storage and querying in Redshift as against BigQuery.

AWS Redshift 

For dc2.large with 160GB/node storage the cost is $0.25/node/hour, i.e., $180/month + no cost for processing queries.

As we are using 2 dc2.large nodes for each cluster, the total cost will be $360/month. 

Type of Node UsedNo. of Nodes UsedCost Per Node Per Hour ($)Total Cost Per Month ($)[Cost Per Node Per Hour x No. of Nodes x days per month x hours per day]
dc2.large (2 vCPU, 15 GiB Memory, 0.16 TB SSD)20.250.25 x 2 x 30 x 24 =360

BigQuery:

BigQuery charges $20/TB/month for storage and $5/TB for queries.

Query Data billed in TB Per Query Cost
Q10.00042$0.0021
Q20.00028$0.00141
Q30.00028$0.00141
Q40.0005$0.00281
Q50.0000097$0.0000485
Q60.0000097$0.0000485
Q70.0000097$0.0000485
Q80.0000097$0.0000485
Q90.0012 $0.0063
Q100.0966$0.483
Q110.0032$0.061

The total cost of these 11 queries: $0.55822 ~ $0.56 (rounded off).

Size of the dataset schema being queried is: 39.30 GB = 0.0393 TB.

Therefore, the total storage cost per month will be $0.786 ~ $.079 (rounded off).

If the queries are run just ONCE in a month, Redshift will be $360 / $ (0.56 + 0.79) ~ 266.67 times more expensive than BigQuery.

Put it differently, one can afford to execute the queries approximately 267 times in a month i.e. about 9 times a day before BigQuery becomes costlier than Redshift.

Queries can be run multiple times, for example, when there are a large number of dashboards that are being accessed by a number of users. The mobile game had this exact scenario so Redshift made more sense. BigQuery was faster for most of the queries but RedShift was good enough for the “near real-time” requirements of the analytics team.

Conclusion

  • Redshift is a more viable solution because a large number of users are accessing near real-time dashboards.
  • Redshift allows for caching of queries as seen by the improved latency of query execution with each run. However, BigQuery doesn’t show a significant improvement in the query execution times with each run.
  • For the first run of query execution in both Redshift and BigQuery, a 2-node setup delivers better performance than a single-node setup.

Appendix

This appendix captures the actual times recorded during the execution of queries on AWS Redshift vs Google BigQuery. The statistical measures such as average, median, and standard deviation were calculated based on these times. The measurements also help in substantiating the conclusions drawn earlier. We have presented a brief overview of each query. For each query, the performance was compared followed by measuring the execution time. As a single-node configuration for Redshift was used initially followed by a 2-node one – measurements for both are provided. These two measurements are compared further with BigQuery measurements.

Query 1: Top N games by the total bet amount 

Users can associate a bet amount with each spin in a slot game. This query seeks to list top N (in this case, all 59) games by the cumulative bet amount associated with the game within a certain time range. This query involved a single table with 21887554 records.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
117.660564.398742.08194
20.386891.263640.53393
30.239650.247110.62359
40.249400.253650.57320
50.250710.248170.67189

Query 2: Count of auto Vs non-auto spins

Users can choose for an auto spin and specify the number of times they want the wheel to be spun automatically instead of spinning it manually. The query was on the same table as above with 21887554 records.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
120.703543.939640.68288
20.258431.263460.56000
30.248540.246980.60644
40.235120.318220.48988
50.312900.247310.53312

Query 3: Count of turbo vs non-turbo Spins

Find how many spin events are of turbo category and how many are not. The table involved was the same as the previous two queries and had 21887554 records.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
117.465673.911620.60482
20.237280.245340.60386
30.243510.247210.50556
40.339640.244890.54029
50.253000.248740.47426

Query 4: Count of users by bins of the average number of days “In” game 

Different users would download and start playing the game at different points of time. The engagement level of different users will be different – some might be quite addicted to the game while others might play only occasionally. Hence, the number of days that the user has been “in” the game or, in other words, the number of days a user has been accessing the game would vary from user to user. This query creates bins of the average number of days “in” game and then calculates the number of users per such bin. This query also runs on the same table of 21887554 records as above.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
129.129824.707080.63746
20.568440.979620.51152
30.508370.591760.48408
40.488320.493080.56262
50.571420.736520.52310

Query 5: Total payment by a Facebook profile

Users can choose to make payments in the game via their Facebook profiles. This query calculates total payments per Facebook profile. This query runs on a single table of 18158 records.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
17.108463.585400.70232
20.253180.246170.83950
30.249840.248281.15934
40.244530.246400.58569
50.240510.248340.64613

Query 6: Total payment by VIP level

There are various VIP levels that the users can be associated with. This query lists the total payments per VIP level. This is not at the user level but rather at the VIP level so payments made by all users at that VIP Level would be summed up. The source data resides in the same table of 18158 records.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
12.502003.477700.55107
20.241800.310180.51948
30.261830.765230.71858
40.329380.356800.71219
50.236970.247930.48819

Query 7: Top N price points by payment volume

The volume of payments varies by the price point. This query captures the top N price points by the payment volume (no. of payments). It is noted that the ranking is not by cumulative payments but rather by the number of payments made. This query also works on the single table of 18158 records.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
12.476613.424000.63842
20.243960.303950.61991
30.249750.268190.24754
40.250460.351730.61166
50.313170.247600.60948

Query 8: Top N products by revenue

The mobile casino game application allows for in-app purchases. This query lists the top N products by revenue. The data source is the same single table of 18158 records as above.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
12.521713.592050.58993
20.305060.248340.54918
30.236670.246660.52736
40.246660.295880.64178
50.236350.251870.54301

Query 9: Top N events

Various events are triggered within the mobile game application either directly or indirectly through user action. This query captures top N such events by their occurrence across all installations. This data is calculated from a single table of 78021396 records.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
111.169864.634950.64536
20.282570.247040.63723
30.251020.245030.53023
40.256220.252550.52926
50.255170.274300.52424

Query 10: Top N users by events

Various events are triggered within the mobile game application either directly or indirectly through user action. This query enumerates top N users on the basis of the count of events generated directly or indirectly by them. The source of data is the same single table of 78021396 records.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
128.4928610.897860.54803
20.242700.248490.55039
30.308910.247730.56804
40.307381.486560.54808
50.237570.255130.51166

Query 11: Date-wise average (per user) coin balance for past 7 days

Users can win “coins” in the course of playing the game. This query captures the date-wise, per-user average coin balance (considering users for whom such coin balance has been reported in the past 7 days). If coin balance has not been reported for a user in the past 7 days, such user records won’t be considered. The source of data is the same single table of 78021396 records.

Run Vs Execution Time (seconds)

RunRedshift (1 dc2.large)Redshift (2 dc2.large)BigQuery
18.5788714.847570.61591
20.883460.554110.71461
30.874920.522330.55951
40.962480.536370.51352
50.874811.064680.88164