PgCon 2014 Developer Meeting

From PostgreSQL wiki

A meeting of the most active PostgreSQL developers is being planned for Wednesday 21st May, 2014 near the University of Ottawa, prior to pgCon 2014. In order to keep the numbers manageable, this meeting is by invitation only. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org).

Please note that this year the attendee numbers have been kept low in order to keep the meeting more productive. Invitations have been sent only to developers that have been highly active on the database server over the 9.4 release cycle. We have not invited any contributors based on their contributions to related projects, or seniority in regional user groups or sponsoring companies, unlike in previous years.

This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB.

New Committer

The PGDG congratulates Andres Freund on becoming a committer to PostgreSQL.

Meeting

Planned features for the AXLE project - Simon Riggs

"Analytics on eXtremely Large European data". Working with several univerities. Allows them to receive funding from the EU; a lot of the projects are around PostgreSQL. Something like $1m in research funding. Projects are required to do PR for AXLE. Contributions are aimed at Core; if they don't go into core, they'll be extensions. Everything will be open source. Simon is under contract to complete the research.

Simon will be working with the univerities and other AXLE participants to get involved in contributing to PostgreSQL. Focusing on real-world scalability for BI and DW; called "operational business intelligence". Looking to handle mid-sized data warehouses. Plans to use most funding for projects which nobody is working on elsewhere to get the most out of the funding, which is why they're not working on parallel query.

AXLEs interest in security controls has to do with medical data privacy because of some of the early use cases. Peter asked about benchmarks; Simon said that they would be creating a medical data benchmark which will be open, hopefully. He doesn't have control over licensing from external researchers.

AXLE contributions will be for 9.5 and 9.6. Done some work on security. Other projects are:

BDR/Replication

Security/data privacy/auditing

Mixed workload controls/resource management

VLDB management for bigger than 100TB databases,

maybe new partitioning

vacuuming large tables - especially vacuum freeze

Online upgrade via logical replication

Performance & Tuning

Minmax indexes

Event triggers

Column store/compressed data

Aggregate push-down

Bin function for data mining

Data visualization requirements like approximate queries and sampling

Hardware enabling, such as custom joins and custom scans, for GPUs and FGPAs

Materialized view substitution

Haas mentioned that EDB is also working on resource management.

Jeff brought up some of the special cases for hash joins mentioned on the mailing list. Simon is trying to figure out a generic API. Josh asked about matview substitution; Simon is looking for a simple case we can build up from. Josh asked about pluggable storage, and there was some discussion about pluggable storage APIs. However, it will be very difficult to deal with different heaps because of all of the assumptions built into other areas of the code. cstore_fdw was discussed as a example.

The FDW API is a huge success because a lot of people want to use it to access different data. We need to decide if we're going to extend the FDW API or do something else. With cstore, we have an FDW that just creates a local file. But there's no WAL, repication, backup etc.

Replication features in PostgreSQL core - Andres Freund

In 9.4 we have the logical decoding API. By itself it doesn't help users very much, it just helps replication authors. So what do we want in core. Andres is working on BDR/async MM. We have most of the stuff now, but want to submit the rest to core. As a first step, were thinking of integrating basic logical replication into core. External projects are not well-maintained.

The kind of intergration Andres is talking about is event triggers so that we can replicate DDL. Pure external solutions are bad because users don't trust them and some are not maintained. Josh mentioned that the clustering meetings wanted bulk update/delete in core which would also support BDR.

We need code to be reusable for special-case solutions. We want to enable building external solutions. Simon's user really want stuff included in core; they don't like external projects much. There was discussion about how logical replication is not just one thing.

Josh gave some examples of components which could be shared between replication systems. He also suggested that the "replication upgrade" and "replicate one table" cases would be simple and common. Gsmith suggested that we need to have support for simple sharding infrastructures in our roadmap or it'll really hurt us. Discussion about the value of in-core vs. external ensued.

Other things they need for logical replication were discussed.

Commitfest Management app - Magnus Hagander

Discussed commitfest app last year. We wanted something new. So the question is what we want to do ... so we want it, or is it not worth finishing. Magnus sent out an email on May 5th to review it.

Peter G. said that it pushes too much stuff to the list. Suggested that we can flag some items as don't post. Further discussions about app discussions. We need to be able to un-reject patches.

The new app actively pulls in threads from the archives, so that people don't need to paste them in using the message-ids. We'd like to be able to add some kind of metadata at some point though. There was a bunch of discussion of the current features and interface.

Smith asked about spinning up a copy of the app. The archives API is restricted, but the rest should work locally. There's no developer documentation at this point anyway.

Magnus isn't sure if he can have it ready for the first commitfest for 9.5. Maybe the 2nd.

Pull request model for PostgreSQL - Peter Geoghegan

Current patch model doesn't scale as well as Peter would like. Also as a reviewer can't push changes to other people's in-development patches. The pull model also preserves development history. Andres uses this for his replication stuff.

The major problem with this is the lack of history in the archives. But if there was an approved workflow, this would make things a lot easier for reviewers. We want to constrain how we rebase things. There was a discussion about rebasing. Andres has a procedure for rebasing.

We could manage the archiving issue with git.postgresql.org. There are some issues with that. Peter really only wants it for large patches with long development, maybe only major developers. Currently we don't accept pull requests. We'd need to change the developer and submitting a patch docs. For history reasons, it's important that it's in a git repo that the project controls. Simon brought up licensing issues.

Peter will write up a doc page on the wiki with a procudure for how you would do this. Then we can try it out before we bless it. Haas suggested that we don't need a preferred way to do it, and that one is possible. Josh said that he prefers pulling from git repos to patches for testing. There was more discussion about merge vs. rebase.

What Peter wants are feature branches, which we could reference by commit hash in the CF app. Folks suggested that maybe we need several defined workloads. We have a challenge in how we integrate large patches. Haas thinks that creates as many problems as it solves though. There was a lot of back-and-forth and small fixes four per hour for logical replication. And we don't want all that in the history.

Discussions of various workflows ensued.

Making the buffer manager scalable - Peter Geoghegan, Amit Kapila

Peter has done some benchmarking on the buffer manager. This benchmark needs some development, but already shows that we can saturate the buffer manager. He's been able to use this to test better buffer management algorithms. He wants to coordinate his efforts with Amit. His benchmark used entirely unlogged tables.

What Amit is doing to make the buffer manager more scalable. There are two bottlenecks: the buffreelist lock, and the bufmappinglock. The first has been a topic of discussions; a new algorithm can decrease this contention by chaning how pages are freed. But that shifts contention to the mapping lock. Amit thinks there shouldn't be much contention at low numbers of buffers, but Haas contends that it's a matter of how many backends you have. Amit wants to make the hash map for the buffer much more concurrent. Also make the bufmappinglocks proportional to the number of clients or the size of the buffer.

Andres removed read-only lwlock, which removed the contention on the bufmapping locks entirely. Amit will test that. There's contention on the root page even if we increase the number of partitions.

Peter is trying to make the algorithm better in terms of what to cache and what not to, and Amit is trying to improve freeing of locks. Just increasing the maximum usage count will break high-concurrency workloads.

Josh brought up the ARC cache, and the changes to the Linux cache. Peter mentioned CAR which is a successor to ARC. Andres disputed that Linux was moving in that direction. Simon would like the ability to track buffers outside of the buffer pool so that we can know if we want to increase the size of the cache.

Peter and Andres discussed buffer algorithms at some length. We need to look harder at how our clock-sweep works. We can profile this with perf. Greg suggested that the problem is optimizing when to throw out data we don't need anymore.

Jeff mentioned that pinning shared buffers also needed optimization. Amit says that the major lock he found was lwlocks. Maybe we could have a local recent pinned cache. Peter found that reference period was a much better idea than anything tied to the transaction, based on wall clock time instead of anything else. Checking clock time is expensive, though. A single lookup in the buffer mapping hash is 6000ns, it's way high. The pg_test_timing tool defines this. Greg's only workaround for checking clock time was to have a daemon which cyclically checks clock time in the background.

The alternative is to count accesses in one operation or how many buffers are accessed at the same time. Haas says moving to a slower system should still evict buffers in the same way.

Scalability issues in Postgres - Andres Freund

PostgreSQL doesn't scale as big as we want, because of heavily contended locks. The locks Andres found were different from what he expected; ProcArrayLock wasn't nearly as bad as expected. BufMappingLock was a big problem, though, as discussed. We hold that lock so shortly that it's all lwlock overhead. We want to rejigger the lock allocation, but we need to support atomic ops on the CPU in the future.

Tom questioned whether atomic ops actually worked because of cache lines. Andres said that Intel has optimizations for this. But a benchmark was 400% faster. How do we deal with the portability issues? Andres proposal is that we use spinlocks for the atomic ops we don't have, or semaphores on the platforms which do them. We could just desupport all non-atomic-ops platforms, or we could require no performance regression for non-ops platforms on the other hand. Andres is proposing a compromise.

The platforms we're talking about are ARMv6, which is the old ARM. And HPA-RISC. Those are really old platforms. Some worry that those code paths will get no testing. We're talking about compare-and-swap and atomic increment. How many ops will we require? There are different platforms which support different sets of atomic ops. We don't want too many combinations to support. We need a matrix of what different platforms support and what we want to use. Andres thinks that platforms either support all of them, or test-and-set only.

Mostly non-atomic or limited platforms are older systems. How much do we care about those?

The main issue is the need for barriers in atomic ops, which is not supported by spinlocks. Also, are we going to have atomic store and load?

Andres to put out an email and matrix of supported operations.

Once Andres removes the lwlock, contention on buffer pins was the next congestion. Andres wants to do buffer pins/unpins as atomic ops. Couldn't figure out how to remove the spin locks for all things, but could remove the lock for pinning without replacement.

We should put up wiki docs for testing users how they could trace things to see what's being pinned/unpinned so that they can test these changes. We should get those patches in early because it will uncover other bottlenecks.

Heikki will be working on CSN and replacing how snapshots are formed. There was discussion about who was going to be working on this for what.

Scaling out to Many Nodes

Bruce brought up the fact that PostgreSQL needs to think about scale out in core. Greg said that we can already build dumb non-relational sharding. Others questions about what we want more than that.

Haas discussed stuff from PostgresXC which we might want to bring into mainstream Postgres. Like we might want a GTM in mainstream postgres.

Verifying the integrity of indexes - Peter Geoghegan

Peter submitted a patch called "poor man's normalizing keys", which did text sorting much faster. This is from a research paper. This technique is widely used commercially and is why hash indexes aren't much used anywhere. The way this works is that you use a cheap, broad comparitor, and then only compare items in detail if they show up as approx. equal. It's 3X faster.

This needs to be generalized to all scans and btree operations, not just a tuplesort. We'd have these broad keys in the btree branch nodes for comparison. But there's an aversion to messing with the btree code because you can seriously break stuff.

Having some kind of automated way to verify the integrity of indexes is critically important to that we can find index bugs before they happen. Without such a test, we can't do much experimentation with index optimizations. Also we'll need index testing for UPSERT.

Haas thinks that the poor man's technique will regress for some use cases. Simon suggested some kind of automated function scan. Peter dismissed this because this is meant to be a general optimization which make all btrees faster. What we're discussing here is to have a way to check that the indexes are valid.

We have to make this a switch per index anyway to make pg_upgrade work.

Discussion ensued about the btree indexing patch. Examples were given about cases where it would cause a regression. However, we don't really have data on this; Heikki's worst case was caused by a specific bug.

We would like both a index integrity check as well as a heap validity check. People in Stephen's class thought we should have a background worker which checks data integrity.

Auditing, changes to logging, etc. - Stephen Frost

This is about how to change logging to improve it and meet regulatory requirements. Stephen also wants to meet requirements in 800-53 publication. We need to be able to log information about a particular session without logging everything. We don't want to log all statements because we want to turn logging on all the time. Selects are as important as updates.

pg_audit was be a good start, but just a start. But we need to track whether particular tables have been touched, like credit card information. This needs to happen even if the query goes through a view or function. Also Stephen wants auditing in core eventually. Haas argues that it doesn't need to be in core.

Also, how do you check if a table is flagged for auditing? Stephen wants a reloption here, but that's a tradeoff. Stephen also would like to have syntax, not just functions. What if you want to have different auditing on the replica? The reloption should just be "audit=True". A background worker which injects log messages into a queue would be good too.

Greg says that this is part of a general class of data that we can get from logs or system views which we want to have persistent information for. Like people want snapshots of stats, or last checkpoint data. Users should have a way to send a message of "please log this for me". Where we log it might be different. Greg likes a pg_stat_statements type model where we can see cumulative data in memory. Some users will require persistence of all messages, though.

Kevin suggested using replication, which would be possible since you can write miscellaneos messages to the replication stream. Haas pointed out that "in-memory queue table" was another example of a different kind of storage.

The rest of this can be implemented with event triggers or logging hooks and a background worker. Greg's employer is talking about having him work on 5 different areas:

Do we have access to the write data to log? Some things are not really visible to Postgres.

Filtering for audits at table/user level, whatever. Also look at where filtering is applied.

Low-overhead auditing for selects.

Storing audit data somewhere -- options other than text.

At what point does stuff go into core?

Greg also remarked that the reason people like stuff in core is because it's guaranteed to be maintained, which stuff outside core is not. But it's a matter of ownership, not of which tarball it's in. Further discussion about in/out of core ensued.

Permissions, ROLE attributes vs. GRANT options, etc. - Stephen Frost

First, we need to reduce the need for superuser rights for so many things. Some discussion of the limitations on DEFAULT PRIVILEGES. Setting a role as "read-only" would also be nice. We would also like the ability of non-superusers to be able to create extensions. Or to limit the ability to SET certain GUCs. There is already something for extensions which Heroku uses. Haas suggested an event trigger to override default access policies.

Greg brought up his security use case, which basically amounts to eliminating the superuser by assigning all of the superuser's rights to other users. There's roughly 50 places we call is_superuser. We want to figure out ways around this for all 50 and create 30 or 40 individual permissions.

Haas said there's a more complicated part. What about doing things which are more fine-grained. Stuff which is boolean is easy. But stuff which is not just true or false would require other stuff. Stephen brought up the example of allowing users to use COPY, only from a specific directory.

Noah suggested just using security definer functions. Dave pointed out that users don't want us loading a whole bunch of functions into the database just for monitoring. But read-only-everything for pgdump is a good simple case. What if we removed the ROLE requirement for DEFAULT PRIVs? But there are some object types which don't have permissions.

Josh brought up example of webusers who run webapps as superuser because they ran into issues with permissions. Greg said that we never really examined a lot of these permissions. The Linux kernel went through this for root, and capabilities, etc. And it's still a work in progress.

Ways to avoid the serious bugs that appeared in 9.3 - Bruce Momjian

One of the proposed slogans for 9.4 was "no more multixact bugs". Bruce couldn't figure out whether bugs were new or old or already fixed or something else. And if Bruce can't track it, our users can't track it. But how do we prevent it from happening again.

Kevin mentioned that the bugs were created by race conditions we couldn't get in a simple test. But some bugs just required a bad vacuum. And some were WAL replay bugs, which didn't show up without full page writes. The patch touched heap_am.c on a low level, and that file is a mess so we didn't anticipate the issues.

Noah pointed out that as commitfests get later, people push in things they really shouldn't. Simon pointed out that that particular patch had been hanging around for 2 years, and there wasn't really time pressure. The two years were part of the problem because of incompatible concurrent changes. Simon wants to categorize changes and issues into critical/noncritical. Stuff which can break datastructures is really critical.

Committer trust might have hurt Alvaro's patch. Haas didn't really look at it, nor did Tom because they mostly assumed Alvaro was correct. Like when Tom submits something, Haas doesn't look at it, which can be bad. Noah suggested asking someone else to commit your patch. Simon disagreed that it was an issue of review. The problem is lack of testing.

Josh pointed out that people found the issues as soon as they upgraded. People postpone upgrading until version .4, so they won't test anything.

Stephen has talked about building out a performance farm. People suggested releating in June or July instead, if people are not going to test. People test exciting new features, they don't test under-the-hood stuff. People tested 9.0 because they were waiting for replication.

This was a major feature of 9.3, it was one of the main reasons to upgrade. We wanted the patch. Andres pointed out that the corruption was invisible most of the time. Maybe we should run the heap_am checker at the end of the regression tests. Stephen wants to have a heap_am background worker.

Improving testing and beta testing - Josh Berkus

See above, plus:

Josh asked if we really have no interest in getting additional beta testers. People suggested a bug bounty. This was suggested as unrealistic. Haas asked where are our worst potential bugs in 9.4, and mentioned some patches. Maybe we should have some regression tests for them.

Simon suggested crediting people on the footer of the release notes for bug fixes. As well as reviewers. Andrew suggested that we need ways to get customers involved, not just random folks on their laptops.

Stephen wants to put together a performance farm, or a high-stress testing environment. We could gather sample data sets and queries to run to get more of a variety of stuff.

Heikki's wal testing tool would be helpful if it could be automated. And a heap sanity checker. Noah suggest that when a test infrastructure is in the tree, people use it, like clobber_cache_always.

Maybe we need a fuzz tester for PostgreSQL like the one Linux had which they spent a year fixing. It's a little harder with PostgreSQL, because they're more flexible.

We also have no way for Django and other downstream projects to report test suite results. We could run those too, but it would be good to get the results back. And people QA stuff against PostgreSQL versions.

Kevin suggested running DBT2 for a week on a machine. Josh suggested a performance farm could be used for this together with checkers. We should also look at the coverage of our current regression tests. We should allow adding new tests during beta. Especially additional tests which are run under different suites. Stephen suggested that we're too restrictive about accepting new regression tests. We should also untangle regression tests to that they are more idempotent. Peter suggested throwing out tests which have been passing for two years, or moving them out of the main suite.

Stephen asked about getting test suites from people, like EDB. But most of the EDB tests are for EDBAS.

Funding developers for maintenance - Josh Berkus

One of the ways which other projects have dealt with not having a reviewer/maintainer time is by paying a few maintainers to work on the project full-time reviewing and maintaining. People asked where we would get the money. Haas pointed out that the Linux foundation pays people quite well. Haas loves the idea, but is dubious about the money.

People generally approved the idea. Not sure about where the foundation would be. We do have a few people who get to work on Postgres full time for their employers. Also there are issues of governance. We would need to have trustworthy people to manage that person.

This works for Linux because they have a benevolent overlord. Tom isn't quite that, do we want to give him more power. The Linux Foundation is a good example and could also be an incubator for us. Companies might not want to pay the amounts required.

This bears investigation, as to whether we could make it work. It could also fund testing.

9.5 Schedule

Do we want to release 9.4 early? Once concern is whether the JSONB format is going to change at all, we want to make sure that it's right. How do we know when we're done with that.

Simon asked about the release date and the close date of the CommitFests. It would be better if we had more distance between the final and the first release. So for 9.5, maybe we should talk about having different schedule. Robert suggested a year-and-a-half release schedule. Fewer releases would be fewer to support. But people wouldn't like that.

Having release dates we know about is useful. Getting the release out in June would require terminating CF4 sooner. But there's too much pressure to get everyone's patch in.

Simon would like to do more in the summer. Right now we just use 6 months out of the CF. We can't move it earlier because of summer, though. We work on stuff in June. Robert suggested that we have enough people on our project to do a release in the summer. Peter says, no, we don't.

Maybe we should add another commitfest, in Feburary then? Isn't there some overlap with the beta? Not really. We'd need to branch at the same time we do beta.

We also talked about moving CF4 to Feb 15th instead of Jan 15th. There was debate and discussion about whether or not that was a good thing. Or we could have 5 commitfests.

Here is the commitfest schedule for this year:

June 15

August 15

October 15

December 15

February 15

Beta in June for 9.5

Other Business

Robert asked about inviting Jeff Janes and Dean Rasheed.

Clustering meeting report -- Josh Berkus

End of Meeting

Proposed Agenda Items

Please list proposed agenda items here:

Planned features for AXLE project (Simon) - details of which features are planned, when

Verifying the integrity of indexes (e.g. structural invariants, that index comports with the heap) with a utility command (Peter Geoghegan)(Directly related to Bruce's item. I Would mostly like to improve the regression tests, to make other new improvements around the B-Tree AM more palatable.)

Auditing, changes to logging, etc. (Stephen Frost)

Permissions, ROLE attributes vs. GRANT options, etc. (Stephen Frost)

Pull Request Model for PostgreSQL Contributions (Geoghegan). There is something to be said for the LKML model, and I believe we should adopt some aspects of that model. Note that I am not proposing that we adopt merge commits, nor that we fundamentally alter our workflow in any other way; I am proposing an approved workflow that some contributors can opt for as an alternative to patch files where that makes sense. Contributors are disinclined to submit code using a workflow that is not officially approved of. I would like to have a process formalized and approved outlining how large patches may be developed in a feature branch of a remote under the control of the author. We can version revisions of the patches using commit hash references, preserving most of the advantages of patch files (perhaps snapshotting with the CF app, so there is a patch file for each version in the archives). With large, complex patches, preserving the history of a patch as it is worked on, and the author's commit message commentary seems quite valuable. Hopefully we can come up with something that weighs everyone's concerns here.