Monday, April 16, 2012

Caching...!!, its little bit hard to go in brief with single article. But will try to share my knowledge learnt from Heikki / Robert Haas / Bruce Momjian in short. In PostgreSQL, there are two layers, PG shared buffers and OS Page cache, any read/write should pass through OS cache(No bypassing till now). Postgres writes data on OS Page Cache and confirms to user as it has written to disk, later OS cache write's to physical disk in its own pace. PG shared buffers has no control over OS Page Cache and it not even know what's in OS cache. So, most of the recommendation's given by Postgres DBA's/Professional's to have faster DISK / better cache.

Caches/buffers in PostgreSQL are stronger like other databases and highly sophisticated. As am from Oracle background (mindset also…:) ), so, my question's from whom I learnt was how/when/what/why etc., regarding Database buffer cache, pinned buffers, Flushing database buffers cache, preloading database etc., I got all my answers from them, however, the approach is bit different. Though my questions were bugging, they answered with great patience and clarifying me to good extent which in result you are reading this blog.... :)..

On some learnings(still learning), I drawn a small overview of how data flow between Memory to Disk in Postgres and also some of the important tools and NEW patch by Robert Haas(pg_prewarm).

pgfincore
It has a functionality to give the information about what data in OS Page Cache. Pgfincore, module become's very handy when it is clubbed with pg_buffercache, now one can get PG buffer cache & OS Page Cache information together. Thanks to Cerdic Villemain. Pgfincore, backbone is fadvise, fincore which are linux ftools. You can also use fincore/fadvise by installing source. Two thing, you can use pgfincore contrib module or ftools both result the same. I tried both, they are simply awesome.

Installation:
Download the latest version: http://pgfoundry.org/frs/download.php/3186/pgfincore-v1.1.1.tar.gz
As root user:
export PATH=/usr/local/pgsql91/bin:$PATH //Set the path to point pg_config.
tar -xvf pgfincore-v1.1.1.tar.gz
cd pgfincore-1.1.1
make clean
make
make install
Now connect to PG and run below command
postgres=# CREATE EXTENSION pgfincore;

pg_prewarm
Preloading the relation/index into PG buffer cache. Is it possible in PostgreSQL? oh yes, Thanks to Robert Haas, who has recently submitted patch to community, hopefully it might be available in PG 9.2 or PG 9.3. However, you can use the patch for your testing on PG 9.1.

dstat
A combination of vmstat,iostat,netstat,top,etc., tool together in one "dstat" linux command. When database behaving unusually, to know the cause from OS level, we open couple of terminals to pull process, memory,disk read/writes, network informations, which is little bit pain to shuffle between windows. So, dstat has serveral options with in it, which helps to show all commands in one output one window.

Linux ftools
Its designed for working with modern linux system calls including, mincore, fallocate, fadvise, etc. Ftools, will help you to figure out what files are in OS cache. Using perl/python scripts you can retrieve OS page cache information on object files (pg_class.relfilenode). pg_fincore is based on this. You can use pgfincore or ftools scripts.

I have written small query by clubbing pgfincore and pg_buffercache to pull information from PG Buffer & OS Page cache. I will be using this query through out my example, only pasting this query outputs.

select rpad(c.relname,30,' ') as Object_Name,
case when c.relkind='r' then 'Table' when c.relkind='i' then 'Index' else 'Other' end as Object_Type,
rpad(count(*)::text,5,' ') as "PG_Buffer_Cache_usage(8KB)",
split_part(pgfincore(c.relname::text)::text,','::text,5) as "OS_Cache_usage(4KB)"
from pg_class c inner join pg_buffercache b on b.relfilenode=c.relfilenode
inner join pg_database d on (b.reldatabase=d.oid and d.datname=current_database() and c.relnamespace=(select oid from pg_namespace where nspname='public'))
group by c.relname,c.relkind
order by "PG_Buffer_Cache_usage(8KB)"
desc limit 10;
object_name | object_type | PG_Buffer_Cache_usage(8KB) | OS_Cache_usage(4KB)
-------------+-------------+----------------------------+---------------------
(0 rows)
Note: I have bounced the cluster to flush PG buffers & OS Page Cache. So, no data in any Cache/buffer.

Preloading relation/index using pg_prewarm:
Before, bouncing the cluster I have fired full table sequential scan query on "Cache" table, and noted the time which is before warming the relation/index.

How to flush/prewarm relation/index in OS cache ?
Using pgfadvise, you can preload or flush the relation from the OS cache. For more information, type \df pgfadvise* in terminal for all functions related to pgfadvise. Below is example of flushing the OS cache.

While these things are going on in one window you can check the read/write ratio by using dstat. For more options use dstat --list
dstat -s --top-io --top-bio --top-mem

Preloading Range of block's using pg_prewarm range functionality.
Assume,due to some reason, you want to bounce the cluster, but one of big table which is in buffer is performing well. On bouncing, your table no more in buffers, to get back to original state as it was before bouncing then you have to know how many table blocks were there in buffers and preload them using pg_prewarm range option.

I have created a table by querying pg_buffercache and later I have sent block range information to pg_prewarm. By this, shared buffers is back with the table earlier loaded in it. See the example.

Firstly, it might be delayed index creation or if you have shared sequences in a database, sharing on two different Primary key Tables might be the cause while restoring the data into table (pg_restore). Secondly, if any huge transaction is taking place on that table and at the backend someone has abruptly stopped the instance, which might also fail the index(primary key) to point to the right row.

How to fix it ?

Well,as common practice, when we encounter a duplicate rows in a table (despite of any reason), we first filter the duplicate rows and delete them, and later by doing REINDEX should fix the issue.

Query for finding duplicate rows:

select count(*),primary_column from table_name group by primary_column having count(*) > 1;

Even after deleting the duplicate rows REINDEX or CREATE UNIQUE INDEX fails, it means your index is not cleaned properly. Above query might not be giving 100% result oriented output what you are expecting, because the query is going to pick the index which is already corrupted with duplicate rows. See the explain plan below.