Split or leave frequently updated columns in PostgreSQL

I have a database that was migrated from MySQL to PostgreSQL (I had my good reasons, but this post is not about that). In MySQL, because of how MVCC behaves, it makes sense (and is actually a recommendation) to split frequently updated columns out of large tables. This is especially true with a web framework like Django, which always updates the full row. So even though MySQL stores TEXT columns off-page (look at InnoDB BLOB storage for more details), the TEXT columns will still be read from the separate extent and written to the doublewrite buffer and to the undo space (and to the binary logs, if enabled). To avoid this, it's beneficial to keep the counters in a separate table.

Example table design

MySQL

CREATE TABLE post (
    id int unsigned PRIMARY KEY AUTO_INCREMENT,
    title varchar(255),
    body text
) Engine=InnoDB;

CREATE TABLE post_counters (
    post_id int unsigned PRIMARY KEY,
    `read` int unsigned NOT NULL DEFAULT 0,
    commented int unsigned NOT NULL DEFAULT 0,
    CONSTRAINT `post_id` FOREIGN KEY (`post_id`) REFERENCES `post` (`id`)
) Engine=InnoDB;
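With this layout, recording a page view only touches the narrow counter row. Here is a minimal sketch of the access pattern, using SQLite as a stand-in (the MySQL DDL above doesn't run unmodified elsewhere); the table and column names follow the schema above, everything else is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    body TEXT
);
CREATE TABLE post_counters (
    post_id INTEGER PRIMARY KEY REFERENCES post(id),
    "read" INTEGER NOT NULL DEFAULT 0,
    commented INTEGER NOT NULL DEFAULT 0
);
""")

conn.execute("INSERT INTO post (title, body) VALUES (?, ?)",
             ("Hello", "a very long body..."))
post_id = conn.execute("SELECT id FROM post WHERE title = 'Hello'").fetchone()[0]
conn.execute("INSERT INTO post_counters (post_id) VALUES (?)", (post_id,))

# A page view updates only the small counter row; the wide row
# holding `body` is never rewritten.
conn.execute('UPDATE post_counters SET "read" = "read" + 1 WHERE post_id = ?',
             (post_id,))
```
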

Alternatively, if we want to sort on popularity, for example, we want those metrics in the main table, so we can split the blob into its own table instead. This has its own benefit: we can list all the posts with their comment and read counters using one table only, and we join the details table only when someone reads an individual post.

Alternative design

CREATE TABLE post (
    id int unsigned PRIMARY KEY AUTO_INCREMENT,
    title varchar(255),
    `read` int unsigned NOT NULL DEFAULT 0,
    commented int unsigned NOT NULL DEFAULT 0
) Engine=InnoDB;

CREATE TABLE post_details (
    post_id int unsigned PRIMARY KEY,
    body text,
    CONSTRAINT `post` FOREIGN KEY (`post_id`) REFERENCES `post` (`id`)
) Engine=InnoDB;

Having said all this, I was wondering whether the same holds true for PostgreSQL. So I set up a benchmark.

I ran the test at over 1000 transactions/sec with three different designs:

Normal design, where everything is in one table

Split design, where the counters are in their own table

Alternate design, where the counters are in the main table and the blobs are separate

I used the open source Shakespeare database for that. Every query was run against the DB 100 times to have a large enough sample.
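The per-query statistics can be collected with a small harness along these lines. This is a sketch, not the exact script used for the benchmark; the 100-sample count comes from the text above, and the workload passed in is a placeholder:

```python
import statistics
import time

def benchmark(run_query, samples=100):
    """Time `run_query` `samples` times and summarise the timings."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    avg = statistics.mean(times)
    sd = statistics.stdev(times)
    # 95% interval assuming roughly normal timings: avg +/- 1.96 * stddev
    return {
        "avg": avg, "stddev": sd,
        "min": min(times), "max": max(times),
        "lo95": avg - 1.96 * sd, "hi95": avg + 1.96 * sd,
    }

# Placeholder workload; in the real benchmark this would execute
# the UPDATE or SELECT against the database.
stats = benchmark(lambda: sum(range(1000)))
print(f"avg={stats['avg']:.6f}s over 100 runs")
```
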

Results (update times)

Although the split design has the lowest average query time, taking the standard deviation into account you can see there is no significant difference between the designs.

            Avg    StdDev  Min    Max    95% lower  95% upper
Normal      0.658  0.160   0.459  1.886  0.345      0.970
Split       0.613  0.121   0.444  1.200  0.377      0.850
Alternate   0.675  0.144   0.479  1.551  0.393      0.956
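The 95% bounds in these tables appear to be the plain normal-approximation interval, avg ± 1.96 · stddev (this is an inference from the numbers, not stated in the post). Checking the "Normal" row:

```python
# "Normal" row from the table above
avg, sd = 0.658, 0.160

lo = avg - 1.96 * sd
hi = avg + 1.96 * sd

# Close to the 0.345 / 0.970 in the table; the small difference
# comes from the avg and stddev being rounded to three decimals.
print(round(lo, 3), round(hi, 3))
```
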

Updating the full row

What if we mimic what Django would do and update the full row, not just the counters?
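By default Django's Model.save() issues an UPDATE that sets every column (unless you pass update_fields), which is what this run mimics. A toy illustration of the difference in the generated SQL; the helper below is hypothetical, not Django code:

```python
def update_sql(table, columns):
    """Build an ORM-style UPDATE that sets the given columns."""
    sets = ", ".join(f"{c} = %s" for c in columns)
    return f"UPDATE {table} SET {sets} WHERE id = %s"

# Django's default save(): every column, including the wide `body`
print(update_sql("post", ["title", "body", "read", "commented"]))

# save(update_fields=["read"]) would touch only the counter
print(update_sql("post", ["read"]))
```
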

Results (update times)

            Avg    StdDev  Min    Max    95% lower  95% upper
Normal      0.860  0.181   0.606  1.695  0.504      1.215
Split       0.645  0.164   0.458  1.697  0.324      0.966
Alternate   0.748  0.137   0.541  1.477  0.480      1.017

Here you can definitely notice a slight difference between the split design and the other two. The split design yields 25% better average performance compared to the large table. But the confidence intervals show that the difference is still not statistically significant: although unlikely, it is possible that the normal design actually performs the same as the split design.

The alternate design gives roughly 13% better average performance compared to the large table. The same caveat applies as above: the difference is still not statistically significant.

Select performance

Obviously, joining two tables instead of querying one has its own penalty on your selects. So let's look at an example where I fetch the top 10 most-read posts.

Please note that with the alternative design I didn't need to join the details table unless I wanted the full body of the post.
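The query shapes differ per design. Here is a sketch using SQLite in place of PostgreSQL, with toy data where both layouts hold the same counts; the table and column names follow the schemas above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT,
                   "read" INTEGER NOT NULL DEFAULT 0);
CREATE TABLE post_counters (post_id INTEGER PRIMARY KEY,
                            "read" INTEGER NOT NULL DEFAULT 0);
""")
for i in range(1, 21):
    conn.execute("INSERT INTO post VALUES (?, ?, ?)", (i, f"post {i}", i * 3))
    conn.execute("INSERT INTO post_counters VALUES (?, ?)", (i, i * 3))

# Split design: the top-10 list needs a join back to the counter table
split_top = conn.execute("""
    SELECT p.id, p.title, c."read"
    FROM post p JOIN post_counters c ON c.post_id = p.id
    ORDER BY c."read" DESC LIMIT 10
""").fetchall()

# Alternate design: the counters live in `post`, so no join is needed
alt_top = conn.execute(
    'SELECT id, title, "read" FROM post ORDER BY "read" DESC LIMIT 10'
).fetchall()

print(alt_top[0])  # → (20, 'post 20', 60)
```
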

Results (query times)

            Avg    StdDev  Min    Max    95% lower  95% upper
Normal      1.011  0.216   0.862  3.044  0.587      1.435
Split       1.536  0.109   1.405  2.380  1.321      1.751
Alternate   0.890  0.083   0.766  1.499  0.728      1.052

With the split design the query time increased significantly, by almost 52% (and this difference is statistically significant), while the alternative design even improved performance by 12% (though this is not statistically significant).

Disclaimer

You can achieve pretty much the same query time with the normal large-table design by limiting the columns in your select. I intended to mimic a web framework's behaviour, so I haven't done that, but it's an absolutely viable option too.

Conclusion

PostgreSQL handles large blobs quite well, so unless your tables see updates on the order of magnitude of 1000 or more per second, you won't see a significant benefit from splitting your tables based on update frequency, but you will lose some performance on retrieval.

If you want to squeeze the last drop of performance out of your system and don't mind paying the complexity price, try the alternative split design, which boosts both read and write performance.

Taking the normal (large table) design as the baseline, here are the statistics on what each design improves.

About charlesnagy

I'm many things: mostly an automation expert, database specialist, system engineer and software architect with a passion for data, searching it, analyzing it, learning from it. I learn by experimenting, and this blog is the result of these experiments and some other random thoughts I have from time to time.

Nice one! I think your second sentence is somehow a mixture of two. It is not very clear where you refer to PostgreSQL and where to MySQL.

The more important remark is that with those frequently updated columns there is a real issue in PostgreSQL: if the updates are frequent enough (this could be specified in a bit more detail, I admit), autovacuum cannot keep up. This is also the case when there is a chance that certain tuples on the physical pages are never updated – they will sit there alone in the middle of an 8 kB page, bloating the table by an enormous factor. We used to have, for example, a table where the average row count was around 1 at any given time, while the physical table size was seldom less than 60 MB (!).

You’re absolutely right. Thank you for the feedback! Reading it back indeed it wasn’t clear. The first part was about MySQL and design decisions in the MySQL database which got migrated to PG. I updated the post so hopefully it got clarified.

Very good point! Fortunately my database is not there yet. Only having updates in the order of magnitude of hundreds per day max per row on popular items so vacuuming is not an issue yet. I have a bunch of order by popularity, last_viewed, last_commented queries and those really benefit from having the two tables merged back into one. I will keep an eye on the bloating. Thanks!
