SQL Database, Table and Data Partitioning: When and How to Do It
What is SQL partitioning
First things first, let’s start with some naming conventions. A partition is a small piece (object) of a database table. That object is created by partitioning a table. Partitioning allows tables, indexes, or index-organized tables to be subdivided into smaller, manageable pieces (partitions). Each partition has its name and possibly its storage characteristics.
Why you need SQL partitioning
So the first question is when investing time in table partitioning seems an excellent way to go.
Let’s see an example of a typical case you need SQL partitioning:
“A company maintains a large database that stores all of the data produced due to its activity. As time goes by, the velocity of the data increases more and more, and the database queries become slower and slower, as whole tables need to be scanned. But what happens in cases where there is no need for a full scan?
Imagine the compilation of monthly business intelligence reports.
The only data that are needed are those produced during the last month. It becomes evident that there are cases where it would be beneficial to have control over the data that our queries take into consideration while being evaluated. ”
From what I have figured out, I would say that before creating any partitions, you should try to exhaust all other alternative options, including table indexing and revision of queries.
If you conclude that the only solution is table partitioning, you have to pay special attention for implementing it.
The benefits of SQL partitioning
The benefits such an implementation provides are constrained by selecting the partition key and the granularity.
Regarding the first factor, you should keep in mind that partitioning can only occur in a single column and that your queries must include that column. If, for example, you have created a partitioned table to run a query over the “eliminated” data, the partition indicator must be included in that query. Otherwise, a full scan will be performed.
Consequently, it is essential to review how your queries access the table to choose the most suitable column for partitioning.
As for the granularity, if your partitions are large, you won’t see any particular improvement in the performance. On the other hand, small partitions can be hard to handle.
Furthermore, even in the case of good design, you won’t be able to see significant improvement in performance unless you are dealing with really huge tables.
If all the above concerns are being evaluated and you have concluded that table partitioning serves your needs, then the benefits you are going to gain include:
- The relative speedup of queries that require only portions of large data sets. In this case, the optimizer eliminates searching in partitions that do not have relevant information.
- Faster data load
- Faster deletion of old data limited to specific partitions if they are no longer needed.
- Faster archival of rarely used or old data can be migrated to cheaper and slower storage media.
SQL partitioning: Step by Step
With these in mind, let’s move on and see how table partitioning can be implemented in PostgreSQL, MS SQL Server, and Google BigQuery.
Regarding the actual implementation, the main idea behind table partitioning is to create a “parent” table and several “children” tables responsible for holding the data. The number of children is not necessarily constant and can grow as time goes by.
Of course, creating partitions does not mean that the “global” table stops existing. You can still query it for events that span the whole period.
For the sake of simplicity, in this post, we are going to work with a table that contains only two columns.
Over this, we are going to make daily partitions. In real life, databases include much more columns, but the idea remains exactly the same.
PostgreSQL databases table partitioning
So, let’s first create our PostgreSQL sandbox table:
PostgreSQL: Create table
JSX
CREATE TABLE testing_table(receipt_id BIGINT, date DATE);
Now that we have our table, we need to define a function triggered during inserting to check whether a new partition needs to be created based on some criterion.
The date field is the one controlling the name of the newly created partition.
PostgreSQL : Create Function
JSX
CREATE OR REPLACE FUNCTION partition_function() RETURNS trigger AS$BODY$DECLAREpartition_date TEXT;partition TEXT;BEGINpartition_date := to_char(NEW.date,'YYYY_MM_DD');partition := TG_RELNAME || '_' || partition_date;IF NOT EXISTS(SELECT relname FROM pg_class WHERE relname=partition) THENRAISE NOTICE 'A partition has been created %',partition;EXECUTE 'CREATE TABLE ' || partition || ' (check (date = ''' || NEW.date || ''')) INHERITS (' || TG_RELNAME || ');';END IF;EXECUTE 'INSERT INTO ' || partition || ' SELECT(' || TG_RELNAME || ' ' || quote_literal(NEW) || ').* RETURNING receipt_id;';RETURN NULL;END;$BODY$LANGUAGE plpgsql VOLATILECOST 100;
Now we need to declare a trigger that will activate the above function each time an insert action is going to be executed.
PostgreSQL: Create Trigger
JSX
CREATE TRIGGER partition_triggerBEFORE INSERT ON testing_tableFOR EACH ROW EXECUTE PROCEDURE partition_function();
There are, of course, many ideas regarding how you can improve this reasonably simple implementation.
For example, you can create the necessary tables ahead of time (one by day/month/year), and instead of a trigger, use an appropriate rule over them.
A cron job that automatically creates the next X daily/monthly tables would be a good idea in this direction.
You can also check this interesting maintained extension for partition management in PostgreSQL, which can be found here.
MS SQL Server databases table partitioning
When it comes to Microsoft SQL Server, things are a bit different. This database system does not support dynamic partitions, so manually partitioning the table can be a huge maintenance issue.
That being said, to create a partitioned table, a similar procedure to the one previously presented must be followed. This time we will create a monthly partition.
MS SQL Server: Create table
JSX
CREATE TABLE testing_table(receipt_id BIGINT, date DATE);
Now we need to specify exactly how the table is going to be partitioned by the partitioning column. In this case, the date, along with the range of values included in each partition. Regarding partition boundaries, you can specify either the LEFT or the RIGHT.
MS SQL Server: Create Function
JSX
CREATE PARTITION FUNCTION partition_function (int)AS RANGE LEFT FOR VALUES (20170630, 20170731, 20170831);
In this example, we have defined the following four partitions:
Partition No. 1 ⇒ All records with date ≤ 2017-06-30
Partition No. 2 ⇒ All records with date > 2017-06-30 & date ≤ 2017-07-31
Partition No. 3 ⇒ All records with date > 2017-07-31 & date ≤ 2017-08-31
Partition No. 4 ⇒ All records with date > 2017-08-31
MS SQL Server: Create a Partition Scheme
JSX
CREATE PARTITION FUNCTION partition_function (int)AS RANGE LEFT FOR VALUES (20170630, 20170731, 20170831);
Having followed these steps, if you want to determine the partition at which a record will be placed, you can do the following:
JSX
SELECT ‘20170613’ date, $PARTITION.partition_function(20170613) PartitionNumberUNIONSELECT ‘20170713’ date, $PARTITION.partition_function(20170713) PartitionNumberUNIONSELECT ‘20170813’ date, $PARTITION.partition_function(20170813) PartitionNumberUNIONSELECT ‘20170913’ date, $PARTITION.partition_function(20170913)
BigQuery databases table partitioning
For those using BigQuery, partitioning of a table can be done from within the Google Cloud’s BigQuery interface without using any SQL code. From there, you define how to split large tables into smaller ones, where each partition contains monthly or daily data only.
A detailed explanation of the whole procedure can be found in Google’s official documentation here.
Outro
As has been noted, table partitioning may come at a cost.
Nevertheless, if implemented in the right way under the right conditions, the whole concept can be a real lifesaver. You can improve your query performance over huge database tables and handle data in the sense of easier roll-in and roll-out operations and data migration.