PGXN

PostgreSQL Extension Network

bitmap_estimator

This Release

bitmap_estimator1.3.5

Date

2015-08-29

Status

Testing

Other Releases

Abstract

Estimates number of distinct elements in a data set (aggregate and a data type).

Description

Provides an alternative to COUNT(DISTINCT) aggregate, computing an estimate of number of distinct values, and a data type that may be used within a table (and updated continuously). This implementation is based on self-learning bitmap (see the paper 'Distinct Counting with a Self-Learning Bitmap' by Aiyou Chen and Jin Cao, published in 2009).

Extensions

README

Contents

Bitmap Distinct Estimator

This is an implementation of self-learning bitmap, as described in the
paper "Distinct Counting with a Self-Learning Bitmap" (by Aiyou Chen
and Jin Cao, published in 2009).

Contents of the extension

The extension provides the following elements

bitmap_estimator data type (may be used for columns, in PL/pgSQL)

functions to work with the bitmap_estimator data type

bitmap_size(real error, item_size int)

bitmap_init(real error, item_size int)

bitmap_add_item(bitmap_estimator counter, item anyelement)

bitmap_get_estimate(bitmap_estimator counter)

bitmap_get_error(bitmap_estimator counter)

bitmap_get_ndistinct(bitmap_estimator counter)

bitmap_reset(bitmap_estimator counter)

length(bitmap_estimator counter)

The purpose of the functions is quite obvious from the names,
alternatively consult the SQL script for more details.

aggregate functions

`bitmap_distinct(anyelement, real, int)

`bitmap_distinct(anyelement)

where the 1-parameter version uses 0.025 (2.5%) and 1.000.000
as default values for the two parameters. That's quite generous
and it may result in unnecessarily large estimators, so if you
can work with lower precision / expect less distinct values,
pass the parameters explicitly.

Usage

Using the aggregate is quite straightforward - just use it like a
regular aggregate function

And this can be done from a trigger (updating an estimate stored
in a table).

Problems

Be careful about the implementation, as the estimators may easily
occupy several kilobytes (depends on the precision etc.). Keep in
mind that the PostgreSQL MVCC works so that it creates a copy of
the row on update, an that may easily lead to bloat. So group the
updates or something like that.

This is of course made worse by using unnecessarily large estimators,
so always tune the estimator to use the lowest acceptable precision
and lowest expected number of distinct elements (because that's what
increases the estimator size).