
You provided some good information. If you need a solution, provide complete information as instructed in the tag info for [postgresql-performance]. In particular, the definition and purpose of the functions jsonb_array_lower() and score_rank() may be relevant. Show some sample rows and describe what the column tags can contain. Is it just a simple JSON array of tags? What are the cardinalities and value frequencies? Normalization may be the answer.
– Erwin Brandstetter Jul 26 '18 at 13:03

2 Answers

Sorting and paging

Your function score_rank() produces a text value from an integer score with the PK id appended. That's not helpful for sorting. Replace it completely; I suspect you do not need it at all. Instead, use the two columns score and id directly for sorting:
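A minimal sketch, assuming the table posts_lists and the columns from the question (the index name is an assumption):

```sql
-- Multicolumn index matching the sort order:
CREATE INDEX posts_lists_score_id_idx ON posts_lists (score DESC, id DESC);

-- Sort on the columns directly instead of score_rank():
SELECT *
FROM   posts_lists
WHERE  jsonb_array_lower(tags) ? lower('Qui')
ORDER  BY score DESC, id DESC
LIMIT  2;
```

With both columns in the ORDER BY, ties on score are broken deterministically by id, which is what the concatenated rank string was trying to achieve.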

You later mentioned a new function concatenating strings with base256 etc. All that clever trickery is not going to increase performance. Sorting on an integer is faster than sorting on strings in Postgres. Using integer (or bigint) instead of varchar(20) would actually help in multiple ways.

Statistics and query plan (a.k.a.: Why?)

The main issue is the lack of statistics for values nested in the jsonb column. Postgres consequently sometimes misjudges the selectivity of the predicate jsonb_array_lower(tags) ? lower('Qui') and chooses a bad query plan. In your example with LIMIT 2 the logic of the query planner can be illustrated like this - let's call this "Plan 1":

Only two rows with the highest scores? Let's scan the index posts_lists_idx_score_desc starting with the highest scores. With any luck we'll have the result in no time!

It's a reasonable plan for most cases with at least moderately common tags. But the tag 'qui' turns out to be very rare, and with low scores, too. The worst case. Postgres ends up scanning close to 4 million rows, just to keep 2. A colossal waste of time:

Rows Removed by Filter: 3847383

If the query planner had any idea how rare that tag actually is, it would start with the other index posts_lists_idx_tags like we see in your second example with LIMIT 100 - let's call this "Plan 2":

Find matching rows, then sort by score and take the top N.

Plan 1 is more favorable the smaller the LIMIT and the more frequent the tag. (And if qualifying rows happen to sort on top.)

Plan 2 is more favorable the bigger the LIMIT and the less frequent the tag.
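You can see which of the two plans the planner picks, and how many rows each step actually touches, with EXPLAIN (ANALYZE). A sketch against the query shape from the question:

```sql
-- ANALYZE executes the query and reports actual row counts;
-- BUFFERS adds I/O detail:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   posts_lists
WHERE  jsonb_array_lower(tags) ? lower('Qui')
ORDER  BY score DESC, id DESC
LIMIT  2;
```

Compare the output for different LIMIT values and tags: a large "Rows Removed by Filter" under an index scan on the score index indicates Plan 1 going wrong.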

Postgres currently has no statistics about nested values in document types like jsonb, and no combined frequencies at all. See:

Helps Postgres to generate better query plans in some constellations, but it's no silver bullet. For starters, only the most common elements are stored in the statistics, so Postgres still knows nothing about the rarest elements.

2. Or normalize your db design and move tags to a separate 1:n table with a single tag per row. That increases the disk footprint because of the added row overhead per tag. (But changing tags becomes much cheaper with less table bloat.) If your tags are stable, consider a full n:m relationship between posts_lists and a new table tags. That's a bit smaller for lots of common tags, too. And it's the "clean" way. You have more detailed statistics and should see fewer bad query plans.
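A sketch of the normalized design; the table and column names below are assumptions, only posts_lists and id come from the question:

```sql
-- 1:n variant, one row per (post, tag):
CREATE TABLE post_tag (
  post_id bigint NOT NULL REFERENCES posts_lists (id),
  tag     text   NOT NULL,
  PRIMARY KEY (post_id, tag)
);

-- n:m variant, each distinct tag stored once:
CREATE TABLE tags (
  tag_id serial PRIMARY KEY,
  tag    text NOT NULL UNIQUE
);
-- post_tag would then reference tags.tag_id instead of storing the text.
```

Per-column statistics on post_tag.tag (or tags.tag) give the planner exact frequency estimates for every tag, including the rare ones.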

3. Since Postgres 10 there is a variant of to_tsvector() that processes json(b) values. So it's simple to create a text search index and work with text search operators now.
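A sketch of that variant; the text search configuration 'simple' and the index name are assumptions:

```sql
-- Expression index using the jsonb variant of to_tsvector() (Postgres 10+):
CREATE INDEX posts_lists_tags_fts_idx ON posts_lists
USING  gin (to_tsvector('simple', tags));

-- The query must repeat the same expression to use the index:
SELECT *
FROM   posts_lists
WHERE  to_tsvector('simple', tags) @@ to_tsquery('simple', 'qui');
```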

Although the jsonb_path_ops operator class supports only queries with the @> operator, it has notable performance advantages over the default operator class jsonb_ops. A jsonb_path_ops index is usually much smaller than a jsonb_ops index over the same data, and the specificity of searches is better, particularly when queries contain keys that appear frequently in the data. Therefore search operations typically perform better than with the default operator class.
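The operator class from the quote would be applied like this (sketch; the index name is an assumption):

```sql
CREATE INDEX posts_lists_tags_path_idx ON posts_lists
USING  gin (tags jsonb_path_ops);

-- Only containment queries (@>) can use this index:
SELECT *
FROM   posts_lists
WHERE  tags @> '["qui"]';
```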

But I do not expect much for your particular case.

5. Use a regime of "granulated" indexes, combined with a procedural solution. See:

Wow, I moved tags to a separate 1:n table with a single tag per row and then created an index (score DESC, id DESC, tag DESC), which solved the problem but increased the disk space.
– robxyy Jul 28 '18 at 13:40

Disk space has increased 3x; this way is expensive. When using the gin(tags) index, a query with WHERE tags @> ARRAY['x'] ORDER BY score DESC, id DESC LIMIT z is very slow if x does not exist or z is greater than the actual number of matching rows. Why?
– robxyy Jul 29 '18 at 3:18

Why does this happen when the limit is greater than the actual number of rows? And if so, how do I estimate the limit size?
– robxyy Jul 27 '18 at 3:43

I don't think this is caused by the limit. The query planner needs statistics about the data you are working with to decide on the best plan. The use of functions is a bit of a problem here, because there are no statistics for the result of a function.
– Thomas Berger Jul 27 '18 at 10:46

You are right. When I tested the uneven distribution of tags, I found that the limit was not the cause.
– robxyy Jul 28 '18 at 2:19