This library/module is used to access the IDEA
event database. This database contains the main table for storing
IDEA events as well as some additional
tables that implement thresholding and relapse mechanism for reporting. For
performance reasons it is implemented directly on top of the
psycopg2 Python driver for
PostgreSQL.

This library/module is used to access the metadata database and work with any
other objects like user accounts, groups, reporting filters, etc. For convenience
it is implemented on top of sqlalchemy
Python universal DBAL and ORM toolkit.

Following options were changed in the main configuration file /etc/postgresql/11/main/postgresql.conf:

# Original default value 128MBshared_buffers= 16GB

Shared buffers are used to hold database representation in memory before writing it to disk. In INSERT/UPDATE heavy systems, larger values are beneficial. Setting this value too large does not help performance as OS disk cache is also used by PostgreSQL. Shared buffers smooth the IO operations rate and prevent spikes. Value of 16GB seems enough for this system and typical Mentat load. An increase might be in order if either the INSERT rate or mean event size increase substantially to keep the buffering effect.

# Original default value 4 MBwork_mem= 16MB

Specifies the memory size for use in internal operations such as in-memory sorting and hash table calculation. This limit is per internal operation, in typical query multiple such operations are executed. If the operation would require more memory than this limit, a theoretically slower on-disk variant is used (which might still be performed in disk subsystem cache with no real IO operations to storage, possibly even with better performance). Value of 16MB seems to work right for this system. As the internal sort is used in ORDERBY, some queries where index can not be used for sorting might profit from larger values (as the amount of data to sort is seldom too high in Mentat with typical query using LIMIT100).

# Original default value 64MBmaintenance_work_mem= 256MB

Maintenance work memory is similar to work_mem but used for maintenance operations such as VACUUM or CREATEINDEX, larger values (when compared to work_mem) decrease their run time considerably. Only one such operation can run per database at the same time, value of 256MB is generally enough for the typical table sizes of Mentat and IOps reserves of this system.

# Original default value was commented outwal_buffers= -1

Defines the maximum amount of memory (in shared_buffers) for write ahead log (WAL) data buffering. Typically WAL is written to disk after every transaction (hard requirement, otherwise journaling stops functioning as intended) so this should be large enough to hold all WAL data for a typical transaction for optimum performance. Value of -1 means no limit so transactions of arbitrary size (limited by shared_buffers, so essentialy no limit) are supported optimally. Limiting this setting only makes sense on resource constrained systems with small typical transaction size (few queries per transaction with small row size).

# Original default value 4GBeffective_cache_size= 200GB

Effective cache size is an estimate of the amount of system memory usable as disk cache for PostgreSQL. On a modern Linux system, this value is generally the total memory size minus the memory used by application programs. An efficient way to determine a suitable value is to monitor the targer system under expected load and note the amount of memory dedicated to disk buffers over time. On a system running PostgreSQL only, reasonable effective_cache_size is equal to the disk buffer size. If the DB shares the system with other disk write/read (total size, not IOps) intensive applications, this value should be decreased accordingly. This value represents the amount of on-disk data that PostgreSQL can estimate to be accessible in constant time. If it is larger than the sum of sizes of all tables and indices, disk IO is ammortized to initial read at first affected query after startup. Mentat is IOps intensive but quite light on total amout of data read/written.

Following options were changed in the main configuration file /etc/postgresql/11/main/postgresql.conf:

# Original default value 1GBmax_wal_size= 32GB

Write ahead log serves as a journal of changes against the on-disk snapshot of database that is created by checkpoint. In case of failure, the journal is replayed over the checkpoint to get to the state that includes all commited transactions. As checkpoint is a costly process that requires a table lock, it is best not performed too often (with the downside of a slightly longer database rebuild time after failure). Mentat is an INSERT heavy application as new events are recorded at all time at a rate of ~14 per second. Meanwhile Mentat can tolerate slightly longer database rebuild times. An increase in maximum write ahead log size from 1GB to 32GB increased the time between required consecutive checkpoints considerably, reducing overall system load and improving SELECT latency (as analytic queries are less likely to hit the less frequent checkpoints), furthermore this improves INSERT performance for batches of larger events (up to 250kB as opposed to long time running average of 10kB) which appear at an increasing rate lately.

# Original default value 5mincheckpoint_timeout= 15min

Checkpoint timeout defines the maximum time between two consecutive checkpoints. A value too small cases frequent checkpointing, an disk IO intensive operation. As Mentat writes are mostly INSERTs and batch DELETEs - larger values are sustainable. The increase from 5 minutes to 15 minutes leads to lower load of disk subsystem, leaving more bandwidth for SELECT - improving performance in case of hitting an INSERT heavy timeslot (large events), before all three firing caused disk subsystem overload (extremely poor performance), after the change the probability decreased considerably.

Following options were changed in the main configuration file /etc/postgresql/11/main/postgresql.conf:

As of PostgreSQL 9.6, with significant further improvements in 10.0, individual queries can be performed by parallel subworkers. Default settings are very conservative and allow this type of parallelism only in extreme cases and with very few workers. For any system, configuration has to be tuned individually. List of changes to improve parallelism follows:

# Original default value 8max_parallel_workers=36

Defines maximum number of parallel workers shared by all queries. As with Mentat we generally only have a low number of concurrently executed queries, this value was set to accommodate two queries with full parallelism, see max_parallel_workers_per_gather.

# Original default value 2max_parallel_workers_per_gather=18

Defines the maximum number of parallel workers for one query. To allow the OS task scheduler to place all workers on one physical CPU and to stop threads competing for resources because of HyperThreading, the value was set to number of physical cores on one CPU (18 for Intel(R) Xeon(R) CPU E5-2695 v4).

# Original default value 1effective_io_concurrency=4

Defines the optimal number of parallel IO requests for the disk subsystem. Default value of 1 is aimed at non-RAID HDDs; SATA SSDs perform best with values of 4 - 8; NVMe SSDs can support much higher values. Chosen value of 4 seems representative of example server’s storage capabilities.

# Original default value 1.0seq_page_cost=0.1

PostgreSQL query planner uses costs on arbitrary scale for local optimization of function of suboperation costs. These cost values represent relative cost of suboperations to each other. Due to performant storage, the sequential page cost, a representation of cost of sequential write when compared to CPU processing speed was decreased from 1.0 to 0.1 (10x).

# Original default value 4.0random_page_cost=0.15

Similarly to seq_page_cost, this value represents the cost of random IO. The default ratio of random_page_cost:seq_page_cost = 4:1 is aimed at HDDs. With example server’s storage, random reads are almost equal to sequential (on HW level completely equal, cache manager is able to predict and buffer sequential reads so those are a bit more performant).

# Original default value 0.1parallel_tuple_cost=0.005

The cost for parallel tuple retrieval is set to 1/2 of cpu_tuple_cost (for single threaded retrieval) to make parallel execution preferable, using more workers. This reduces the time it takes to perform the parallel portion significantly, making the total execution time converge to the time of serial only portion. This is important for systems with high CPU counts (60+). If the amount of available CPUs is lower, it does not need to be decreased that much. In any case, altering min_parallel_index_scan_size first is preferable as it brings results sooner.

# Original default value 1000parallel_setup_cost=10.0

Represents the cost to set up worker threads (pass the task to them as they already run). This is generally negligible and the default value of 1000.0 limits the parallel execution to extreme cases only.

# Original default value 512kBmin_parallel_index_scan_size= 128kB

Defines the minimum segment size to partition indices for parallel execution. Lowering this value to 128kB enables parallel index scans over the most used index events_cesnet_storagetime_idx to run with optimal number of workers across a large range of stored event counts when filtering for a whole day worth of events. Affected queries no longer run over threshold (2s). Lowering this value helps most if either the system is equipped with a high parallel I/O capable disk subsystem or the whole DB fits into RAM. Otherwise it is advised to watch out for workers starving on I/O.

Note

Conclusion:

The two most important changes enabling parallelism for queries are those for effective_io_concurrency, which limits storage subsystem queues, and parallel_setup_cost, which makes parallel execution seem viable even for simpler queries (and Mentat uses simple queries almost exclusively). The changes to costs are aimed at better prediction of actual run time (comparison of estimated and real run time is provided by EXPLAINANALYZE), after are the estimations better matching (not 1:1 but linear function). Number of workers was set based on current load, common number of concurrent queries and memory consumption settings from initial configuration. If under higher analytical (OLAP) load (multiple Hawat users), those values might be the best place to start further optimization.

PostgreSQL has two main maintenance tasks to maintain performance. VACUUM invalidates vacated entries in both tables and indices, to prevent bloat. ANALYZE performs frequency analysis on table columns, which are then used by query planner. Both procedures are assigned a command of the same name and need to be performed regularly. Ongoing VACUUM and ANALYZE is performed automatically by the AUTOVACUUM daemon. Default AUTOVACUUM settings are suitable for common use, in comparison to which the Mentat’s event table is very different. It is a table with ~125M rows, with constant INSERTs and batch DELETEs. Therefore the settings for this use case have to be altered.

# Original default value 100default_statistics_target=1000

A multiplicative parameter for depth of analysis for ANALYZE, it allows for a trade-off between ANALYZE speed and relative frequency data precision by defining statistic subset size. Default value of 100 is reasonable for smaller tables or tables with roughly uniform distribution of individual values in columns. As AUTOANALYZE is performed in the background and requires no lock, we can allow for a higher value allowing for better query plans. If the number of rows grows in the future, it might be a good idea to scale this value accordingly to allow for good estimations for the more rare of values.

Automatic VACUUM is performed on a table when the number of invalidated rows (most commonly done by DELETE) gets over “threshold + scale_factor * table_size_in_rows”. The default values of threshold = 50 and scale_factor = 0.2 are suitable for smaller tables but mean serious index bloat in our use case (9M invalidated rows takes about 3 weeks under current load). Current setting instruct to perform autovacuum after invalidated 10000 rows, effectively after each batch DELETE, which keeps table and indices much more compact. Occasional REINDEX (best performed by CLUSTER) is still required after change, but the frequency decreased from daily to about monthly.

The meaning and impact of these parameters is similar to their vacuum counterparts with the difference that new rows (INSERT) are also included in the calculation. Under constant INSERT that we see on Mentat’s events table, 20000 is reached roughly every 15 - 20 minutes. This interval allows for good query plan estimation for queries with time based conditions in recent past (select events detected in last 4 hours) as is often the case. If the number of INSERTs per second increases, it might be a good idea to scale this value accordingly.

These settings limit the automatic VACUUM operation, when the cost limit is reached (same scale as other costs), the operation is postponed for the length of delay. Default values of cost 200 and delay 20 ms are aimed at very low additional load to be generated by this background task to affect normal operation. As this is mostly CPU bound and runs in single thread, we can allow for more aggressive values in our use case. For large tables such as events, this allows the autovacuum to finish in a few minutes instead of tens of minutes with default values. The risk of cancelling background VACUUM due to conflicting operation is therefore lowered and the reclamation of invalidated rows is performed much more promptly what reduces the index and table bloat considerably. The chosen value of delay still enables brief autovacuum process sleeps during the run, what is aimed at preventing disk subsystem overload as VACUUM tends to be an IO heavy operation.

The database schema is generated using the sqlalchemy
from model classess and it will not be described here. Because of the database
use-cases in-depth knowledge of this database should not be necessary. If that is
not your case, please study documentation of appropriate modules.