One of the easiest ways to monitor slave lag when using streaming replication is to turn on hot standby on your slave and use pg_last_xact_replay_timestamp() and/or the other recovery information functions. Here’s an example query to run on the slave systems to get the number of seconds each one is behind:
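The query itself did not survive in this copy of the post, so what follows is only a minimal sketch consistent with the description above, not necessarily the exact statement originally shown. The column alias slave_lag_seconds is a made-up name for illustration.

```sql
-- Run on the hot standby (slave), not on the master.
-- pg_last_xact_replay_timestamp() returns the commit time of the last
-- transaction replayed during recovery; subtracting it from now() gives
-- an interval, and extract(epoch ...) turns that into seconds.
-- The result is NULL if no transaction has been replayed yet.
SELECT extract(epoch FROM now() - pg_last_xact_replay_timestamp()) AS slave_lag_seconds;
```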

The issue with this query is that even when your slave(s) are 100% caught up, the interval it returns keeps increasing until new write activity occurs on the master that the slave can replay. If your monitoring is set up to ensure your slaves are no more than a few minutes behind, this can produce false positives saying a slave is falling behind. A side effect of this monitoring query is that it can also indicate that writes to your master have stopped for some reason.
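One common way to soften that false positive, offered here as an assumption and not as the approach this post ends up taking, is to report zero whenever the slave has replayed everything it has received. The sketch uses the PostgreSQL 9.x function names; in PostgreSQL 10 and later they were renamed to pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn(). The alias slave_lag_seconds is again hypothetical.

```sql
-- Run on the hot standby (slave).
-- If every WAL record received so far has also been replayed, treat the
-- slave as caught up and report 0 instead of an ever-growing interval.
SELECT CASE
           WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
               THEN 0
           ELSE extract(epoch FROM now() - pg_last_xact_replay_timestamp())
       END AS slave_lag_seconds;
```

Note the trade-off: a slave that has lost its connection to the master will also have replayed everything it received, so this variant reports 0 in that case too and should not be the only check.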

One of our clients has a smaller database that doesn’t get quite as much write traffic as our typical clients do. But it still has failover slaves, and they still need to be monitored, just like those of our larger clients, to ensure they don’t fall too far behind. So, my coworker introduced me to the pg_stat_replication view that was added in PostgreSQL 9.1. Querying this from the master returns information about the streaming replication slaves connected to it.

postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 16649
usesysid         | 16388
usename          | replication
application_name | walreceiver
client_addr      | xxx.xxx.xxx.xxx
client_hostname  | db1-prod-ca
client_port      | 58085
backend_start    | 2013-10-29 19:57:51.48142+00
state            | streaming
sent_location    | 147/11000000
write_location   | 147/11000000
flush_location   | 147/11000000
replay_location  | 147/11000000
sync_priority    | 0
sync_state       | async
-[ RECORD 2 ]----+------------------------------
pid              | 7999
usesysid         | 16388
usename          | replication
application_name | walreceiver
client_addr      | yyy.yyy.yyy.yyy
client_hostname  | db2-prod
client_port      | 54932
backend_start    | 2013-10-29 15:32:47.256794+00
state            | streaming
sent_location    | 147/11000000
write_location   | 147/11000000
flush_location   | 147/11000000
replay_location  | 147/11000000
sync_priority    | 0
sync_state       | async

He also provided a handy query that returns a simple, easy-to-understand numeric value indicating slave lag. The issue I ran into using the query is that the pg_stat_replication view uses pg_stat_activity as one of its sources. If you’re not a superuser, you’re not going to get any statistics on sessions that aren’t your own (and hopefully you’re not using a superuser role as the role for your monitoring solution). So, instead I made a function with SECURITY DEFINER set, made a superuser role the owner, and gave my monitoring role EXECUTE privileges on the function.
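The function definition did not survive in this copy of the post, so the following is a sketch of what such a function could look like, not the original code. It assumes the PostgreSQL 9.1/9.2 column names (sent_location, replay_location), the default 16 MB WAL segment size, and the pre-9.3 layout in which each log file ID spans 255 segments (0xFF000000 bytes). The role names postgres and monitoring are placeholders.

```sql
-- Sketch only: computes the byte difference between what the master has
-- sent and what each slave has replayed.
CREATE FUNCTION streaming_slave_check()
RETURNS TABLE (client_hostname text, client_addr inet, byte_lag bigint)
LANGUAGE sql
SECURITY DEFINER  -- runs with the owner's privileges, not the caller's
AS $$
    SELECT s.client_hostname,
           s.client_addr,
           -- Assumption: 4278190080 = 0xFF000000 bytes per log file ID
           -- (PostgreSQL 9.2 and earlier). On 9.3+ the WAL address space
           -- is continuous, so 4294967296 (2^32) would be used instead.
           ((s.sent_xlog - s.replay_xlog) * 4278190080
            + (s.sent_offset - s.replay_offset))::bigint AS byte_lag
    FROM (
        -- A location looks like '147/11000000': two hex numbers, the log
        -- file ID and the byte offset within it. Convert each to bigint.
        SELECT r.client_hostname,
               r.client_addr,
               ('x' || lpad(split_part(r.sent_location::text,   '/', 1), 8, '0'))::bit(32)::bigint AS sent_xlog,
               ('x' || lpad(split_part(r.replay_location::text, '/', 1), 8, '0'))::bit(32)::bigint AS replay_xlog,
               ('x' || lpad(split_part(r.sent_location::text,   '/', 2), 8, '0'))::bit(32)::bigint AS sent_offset,
               ('x' || lpad(split_part(r.replay_location::text, '/', 2), 8, '0'))::bit(32)::bigint AS replay_offset
        FROM pg_stat_replication AS r
    ) AS s;
$$;

-- Hypothetical role names: a superuser owns the function, and only the
-- monitoring role is allowed to call it.
ALTER FUNCTION streaming_slave_check() OWNER TO postgres;
REVOKE ALL ON FUNCTION streaming_slave_check() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION streaming_slave_check() TO monitoring;
```

Revoking the default PUBLIC execute privilege matters here: without it, any role could call the SECURITY DEFINER function and read the replication details.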

Running this query gives back a few handy columns that should be good enough for most monitoring tools. You can easily add more columns from pg_stat_replication or any other tables you need to join against for more info.

postgres=# select * from streaming_slave_check();
 client_hostname |   client_addr   | byte_lag
-----------------+-----------------+----------
 db1-prod-ca     | xxx.xxx.xxx.xxx |      160
 db2-prod        | yyy.yyy.yyy.yyy |      160

UPDATE: If you’re running PostgreSQL 9.2+, there is a new, built-in function that avoids the hand-written byte arithmetic altogether, so you can just query pg_stat_replication directly.
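The update does not name the function in this copy of the post; the built-in that fits the description is pg_xlog_location_diff(), added in PostgreSQL 9.2. A minimal sketch, assuming the 9.2-era column names (in PostgreSQL 10 and later the function is pg_wal_lsn_diff() and the columns are sent_lsn and replay_lsn):

```sql
-- Run on the master. pg_xlog_location_diff() returns the number of bytes
-- between two WAL locations, so no manual hex conversion is needed.
SELECT client_hostname,
       client_addr,
       pg_xlog_location_diff(sent_location, replay_location) AS byte_lag
FROM pg_stat_replication;
```

The visibility restriction on pg_stat_replication described earlier still applies to non-superuser roles, so a monitoring role may still need to run this through a SECURITY DEFINER wrapper.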