On 2 October 2012 18:16, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
> 1. Why isn't something like md5() on the reported query text an equally
> good solution for users who want a query hash?
Because that does not uniquely identify the entry. The very first
thing that the docs say on search_path is "Qualified names are tedious
to write, and it's often best not to wire a particular schema name
into applications anyway". Presumably, the reason it's best not to
wire schema names into apps is because it might be useful to modify
search_path in a way that dynamically made the same queries in some
application reference what are technically distinct relations. If
anyone does this, and it seems likely that many do for various
reasons, they will be out of luck when using some kind of
pg_stat_statements aggregation.
This was the behaviour that I intended for pg_stat_statements all
along, and I think it's better than a solution that matches query
strings.
> 2. If people are going to accumulate stats on queries over a long period
> of time, is a 32-bit hash really good enough for the purpose? If I'm
> doing the math right, the chance of collision is already greater than 1%
> at 10000 queries, and rises to about 70% for 100000 queries; see
> http://en.wikipedia.org/wiki/Birthday_paradox
> We discussed this issue and decided it was okay for pg_stat_statements's
> internal hash table, but it's not at all clear to me that it's sensible
> to use 32-bit hashes for external accumulation of query stats.
Well, forgive me for pointing this out, but I did propose that the
hash be a 64-bit value (which would have necessitated adopting
hash_any() to produce 64-bit values), but you rejected the proposal. I
arrived at the same probability for a collision as you did and posted
in to the list, in discussion shortly after the normalisation stuff
was committed.
A more sensible way of assessing the risk of a collision would be to
try and come up with the probability of a collision that someone
actually ends up caring about, which is considerably less than the 1%
for 10,000 entries. I'm not being glib - people are very used to the
idea that aggregating information on query costs is a lossy process.
Prior to 9.2, the only way execution costs could reasonably be
measured on at the query granularity on a busy system was to set
log_min_duration_statement to something like 1 second.
I am also unconvinced by the idea that aggregating historical data
(with the same hash value) in a separate application is likely to make
the collision situation appreciably worse. People are going to be
using something like an RRD circular buffer to aggregate the
information, and I can't see anyone caring about detailed information
that is more than a couple of weeks in the past. The point of
aggregation isn't to store more queries, it's to construct time-series
data from snapshots. Besides, do most applications really even have
more than 10,000 distinct queries?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services