PgCon 2011 Developer Meeting

From PostgreSQL wiki

A meeting of the most active PostgreSQL developers and senior figures from PostgreSQL-developer-sponsoring companies is being planned for Wednesday 18th May, 2010 near the University of Ottawa, prior to pgCon 2011. In order to keep the numbers manageable, this meeting is by invitation only. Unfortunately it is quite possible that we've overlooked important code developers during the planning of the event - if you feel you fall into this category and would like to attend, please contact Dave Page (dpage@pgadmin.org).

This is a PostgreSQL Community event. Room and refreshments/food sponsored by EnterpriseDB. Other companies sponsored attendance for their developers.

Move To Git Recap

Robert, Stephen: things are good, nothing necessary. Andrew thinks that it would be nice to be able to pull a big patch and build. Fetter wants buildfarm to subscribe to alternate repos. Maybe we should bring back the Hudson server. Buildfarm isn't right for this.

Real issue with Hudson was maintaining it. Nobody was using it or cared about it.

Some projects haven't switched. pg-translation hasn't switched yet; being worked on now. Do we need to get other projects to switch? JDBC. CVS is a bottleneck. pg-translation needs some work, it's documented but nobody does it but Peter. Should we merge it with the main repo?

Should we be mailing patches around? Haas finds it less work for him. Smith agrees. With branches it's hard to figure out diff. Could use github compareview. For now, we want people to submit the link to the patch and the github repo. Maybe add something to commitfest app with a link to repo?

The real problem is that the archives mangles patches. Need to find out why mhonarc mangles patches. Inlining of patch text mangles them.

Build and Test Automation

What problem are we trying to solve? Performance regressions, for one. Also, proving performance gains from specific patches.

Haas doesn't think we can find regressions. But we've definitely had cases. There are tools which automate specific query regressions. Josh can work on a tool for regression tests.

Regression tests are a separate case than performance progress. pgbench doesn't test the things we care about very much. Greg Smith has build the first incarnation of a performance test farm. Has measured in-memory select-only performance. Greg gets 56K selects/second on an in-memory database for key lookups. Frost and Smith hacked up the the buildfarm structure to do performance tests. Once they have the code, people will provide hardware.

Good progress on server-side code, but need to integrate with the buildfarm server for performance test results. Client part is done, but need server part. Andrew can put some time into integrating the server. Need to specify data structures. The webapp side isn't done.

Bruce wants something which allows testing configuration changes and similar. Andrew has just completed a feature for the buildfarm which allows adding new modules, like running other things, pulling from repos, building drivers.

What about cloud servers? Too much variability? In the future, we could measure things like CPU ticks to measure efficiency. But not now. Clouds could be used to test different memory sizes.

SE-Linux/PG

Kaigai prepared handout [link]. Last year tried to add many things, today will be more focused. Kaigai wants to share information from the SE-Linux community, and talk about row-level security.

In the 9.2 development cycle, Kaigai wants to implement a userspace cache for authorization. The cache needs to be validated.

The named type_transition rule to assign default labels for new objects, such as temporary tables. Doesn't require any core code according to Stephen & Haas.

Leaky views and row-level-security (RLS). This is an old problem. Functions can be used to find things which the user isn't entitled to. This isn't just user-defined functions, but all functions. For example, division-by-zero errors can be used to expose numeric values. Or casting values to the wrong data types. So invisible columns aren't so invisible.

Need to discriminate what are problematic and not problematic scenarios. For example, if you prevent pushing down quals to relations, it improves security, but you have to decide what to restrict. Heikki proposed that indexable operators get pushed down, and not anything else. We have 2000 system-defined functions.

Tom thinks it's not worth trying to fix UDFs, since there's too many ways to circumvent. What about contrib modules? We could look at indexes on that specific table, those are the only ones we care about. We're trying to get a view which isn't leaky and has reasonable performance. So we need some push-down.

We need infrastructure in the core to decide which push-downs are OK. We need the checking function to be applied before the qual functions get applied. Checking if there is an index which could use the operator wouldn't be workable according to Tom. We could check if the user has permissions on everything underlying the view, but this is not the use case we care about.

Also, none of this gets us tagged RLS instead of views. For that matter, it would be good to have RLS which wasn't dependant on SE-Postgres. Haas suggests imposing mandatory filtering conditions per user. It could be using the same framework as security views. There's a big use-case for virtual private databases.

Haas warns that this can be complicated for managing security for large numbers of objects. Predicate-based RLS can be used to implement label-based. Kaigai says that they can be reconciled.

Could we use triggers to apply labels to rows? Kaigai says yes. Need to beware of multiple before-insert triggers.

Stark suggests that the user could declare with quals can be pushed down on a per-view basis for security views. This is rejected as unworkable. Kaigai originally proposed supressing error messages. This only eliminates one side channel though, not all of them. Also, supressing error messages is a bad idea.

There was a lot of further discussion about possible approaches to prevent side channels.

Kaigai suggest that it's not worth pursuing covert channel supression or preventing probing. Heikki gave the example of a user/password table and they explored this a bit. It was suggested that rows you can't see be nullified. But this doesn't solve the qual push-down issue.

Stephen suggested that we can push WHERE clauses down into set-returning functions. That would be useful anyway. This is like the FDW API, but it's different. Maybe we could just have FDWs to local tables. But it doesn't actually help.

Kaigai summarized. We don't have a solution for leaky views, though.

MaxAllocSize

Issue is that we'd like to be able to allocate more than 1GB for some things. Hashtables, sorts, maintenance memory. Hash aggregates don't spill to disk. Stephen has a solution for this with doing additional palloc requests. He'll work on this soon.

Currently palloc is fairly inefficient for vaccum; we palloc based on table size, so often we overallocate a lot. Maybe we could make multiple palloc calls, but that would increase overhead. This will be a bigger issue if users can allocate 8GB to vacuum.

9.2 Schedule

Discussion on pgsql-hackers didn't reach any conclusions. Could we make a decision about what the schedule would be here?

Committers don't want to change the format of the CFs. When should the first CF start though? We could do the first one earlier than July 15 this year. But will that pull people off getting the release out? People are already working on 9.2 features anyway. Shooting for a slightly earlier branch/initial 9.2 CommitFest in June helps some with patch developer bit-rot, and may let developers who are focused on new features be productive for more of the year.

If we want people to work on the 9.1 beta, we have to give them specific things to do. Most people don't know what to do for 9.1 now. And the list of open items hasn't been addressed.

Part of the issue is that we don't have any formal structure to the beta process. We'd have a lot more to do then.

Last CF of the release (January 15) is tough to reschedule usefully due to concerns about December/beginning of the year holidays.

Work in August is particularly difficult to line up with common summer schedules around the world. Having the other >1 month gap in the schedule go there makes sense.

Should we do more than four? Can't make that work. Hard to adopt without more active volunteers working on review (both at the initial and committer level) and an increase in available CF manager time. Should we reject large patches submitted for the last CF? Discussion of that later.

The first CF goes very quickly, so we don't need to optimize for that. So the new schedule is:

June 15

September 15

November 15

January 15

Need to publicize it this year, send to announce etc. Greg, Selena to update web pages.

Cluster Meeting Summary

Addition: For parser export, it was suggested that the lexer is enough for a lot of cases. We just need to take the parts of the psql lexer and bundle it as a library. Or we could generalize the ECPG hack for scanning the grammar to support what pgPool needs.

DDL triggers would also be useful for SE-Postgres.

Improving Logging

MySQL has the ability to log stuff to different files rather than all going to one big file which is nice. Stephen proposing making our log tag-based which get sent to specific files based on filtration. We need to decide a set of tags, and put multiple tags on each log line.

Magnus mentioned that he proposed something similar which got blocked because of STDERR messages. We'd need to not send everything through stderr, then. The logging collector would need to accept data structures. Would also support third-party filters.

One of the problems with this is that it makes the log_collector more complex, and thus less reliable. We could have a default log, though, and thus only add fix error messages a little at a time.

Marko wants to send the log directly to a network syslog instead of a local syslog.

Greg Smith thinks this is going down the wrong path, we're just making a bad system more complicated. He'd rather that everything go to a table. Or to pipes. Magnus has something working for logging to pipes.

Tags are still useful, but where it goes is a separate question. Several people think that sending stuff to different log files is not that much of a problem. We could just split stuff into a default log and a tagged log.

What Josh really wants is a table. There are issues with that.

If we log to a pipe, then people can do what they want with the output. People will use the CSV format. Do we want to make the csv format configurable? Josh and Dave Page think it's not that useful to make it configurable. This is the lowest priority. The text of the query is the dominating factor.

Slave-only Base Backups

Pre-9.0 we could make backups based only on a slave for PITR. But we can't do that for streaming replication. The issue is that the marker for the ending location doesn't get sent over the stream.

Treat wants to be able to take backups on slave machines without touching the master at all. We can't do this because we can't run pg_stop_backup on the slave. Heikki thinks this ought to work now. But this doesn't work when you want to promote the standby to a standalone.

Treat will post his testing information to hackers. The real thing is to support pg_basebackup off the standby, but that requires cascading replication. Heikki and Treat discussed this problem for a while. The issue seems to be the backup labels. They will follow up on this.

Resource Control

How we control multiple queries executing in the same environment. We have issue with IO and memory. One issue is because work_mem is locally settable we can overallocate. Plus it's hard to count work memory. For disk IO, it's common to want to run large queries in a slow mode so they don't have a big query having an impact on shorter, more important queries.

The way we solved that in Greenplum was resource queues. There are other possible implementations though. Global resource pools is how you did it in the old days.

Jeff comments that our operators don't obey work mem even locally.

Josh discussed that admissions control for queries at estimate time would actually work, or more that it would actually improve things. The issue is that we'd have to replan the query, which would be costly.

Kevin suggested that we would queue queries rather than replanning them. This seemed generally a good idea, much better than replanning.

We also don't track the amount of memory used, but we could do that. Drawing from the pool at estimate time appeals.

One thing to minimize effects on disk-io is to do "query_delay" like we do vacuum_delay. Also for DDL operations, which can take a really long time. Greg Smith tried to build this once. The problem is how do we accumulate costs? The stuff in vacuum is pretty buried and nonportable.

The main point is to get resource control on the agenda so that the idea doesn't get kicked off pgsql-hackers.

Simon's experience in priorities is from Teradata (low, medium, high) is that that's a terrible model which doesn't work in practice.

I/O and WAL traffic are resources we need to control. Replication delay is very spiky based on what's happening on the master, which is a problem due to data loss. Need to discuss on lists.

DB2 allocates work_mem out of a shared pool. This is relevant to parallel query, because it would require shared memory for sorts.

David Fetter wants to look at sort algorithms for SSD or ramdisk. Discussion about algorithms ensued.

Lunch

Sandwiches, salad, cake.

Authorization Issues

Are we able to drop priviliges at appropriate times? One thing is that the SQL standard does not have RESET ROLE, so they don't cover this problem. SET LOCAL is limited to the current transations but not to subtransactions. Security definer functions which call ordinary functions after a SET ROLE don't work correctly ... they can RESET to the higher ROLE.

One possibility is to have an actual stack of ROLEs. We could extend what we did for autovacuum.

Alvaro suggested removing RESET ROLE. Or disable it within Security Definer. You'd need a stack for the current session.

This is definitely an issue. Haas suggested making RESET ROLE a protocol-level call. Marko objected that you don't want to be invalidating the cache every time you change ROLE.

DBT-2 I/O Performance

A comparison of DBT-2 with PostgreSQL against certain other databases. Our performance is superior on many real-world cases, but the other database really outstrips us on DBT-2. One critical issue is the amount of IO we do.

Particularly, the amount of WAL writing we do is almost three times as much. Full_page_writes is one big cause of the additional logging. Turning off full_page_writes (FPW) decreased WAL logging by 70% and increased throughput by 25%.

We also do a lot more writing of the database files, and checkpoint sheduling improvements might help that.

The DBT2 benchmark includes a table which is too big to be cached, so its dominated by I/O performance. Since the table isn't cached, full_page_writes are more frequent. TPCC spec only checkpoints once per benchmark run.

Folks in the NTT group are worried about the I/O performance, and are nervous about using PostgreSQL in I/O-bound workloads.

Have we considered compressing full_page_writes? No, not yet. We could test it with no WAL logging at all.

One, we can make full_page_writes configurable per table. This doesn't work if we don't have a recovery strategy. One way to solve this is checksums. Also a problem is that we don't detect the corruption immediately, we'll have the corruption and not encounter it for a long time.

Heikki suggested that if it was always safe to replay the log without FPW multiple times. This might bloat the logs just as much as FPW, though. The big issue is that FPW are occurring more often than we expect them to, and we should figure out why.

Koichi wants to solve write order for checkpoints. This speeds up recovery by 5X.

Stark suggests that the main issue is that our database is just larger, and that's the source of a lot of I/Os. We don't have the data for this.

One thing we do is that when we log the WAL for a row, we log the whole row. We also don't have compressed indexes.

DBT-2 is completely I/O bound, so you can sacrifice CPU to improve I/O, like by doing file compression.

Our row header is 24bytes. This isn't a good area of performance.

Archive logs are uncompressed. pg_lesslog + compression can shrink the archive logs by 85%. pg_clear_xlog_tail is more safe.

Haas suggested writing only the page header and pointers instead of the full page.

We can't assume that the OS is passing full 8Ks to the storage, which is why you can't turn off FWP even if you have BBWC. Heikki's suggestion would fix this, though. InnoDB does "double-writes", where full pages get written to a separate file.

Various strategies were discussed for fixing full page writes.

Tom liked the idempotent writes approach. The NTT group plans to work on these problems and submit some patches.

Increasing the num_buffer_partitions might scale us to more cores. We've been hung up on lack of performance testing for these changes though. Kevin cited a real-world case where increasing drove performance down. There is also CPU sharing on the LWLocks, which means that we have a cacheline for each buffer partition. We could improve this.

Why are the LWlocks in one huge array? Why not parititon them? One issue is looking up all the locks belonging to one partition. There was more discussion about possible LWLocks structures.

We can't make this configurable at runtime because we'd lose a lot of compiler optimizations, athough there may be a way around that. We can't ignore the one-core use case though.

Clustering Tables Concurrently

For 9.1 Simon proposed a command which allowed you to switch which file a table was associated with at runtime. The use for that was clustering a table concurrently. The syntax of this switch operation was problematic.

This is important because it affects the amount of I/O we do on large tables. For example, master/child table setups (like invoice and invoice_items) are very common. It's really good to keep child records together.

The way you do this, is you replicate the database via londiste, you make all your changes to cluster on the copy, and recopy it over. But you need to capture all the intermediate changes. Simon wants a Queueing operation built into the core which will capture intermediate changes.

Simon wants to put the queueing mechanism into core, for cluster concurrently, for materialized views, for replication. This isn't a queueing mechanism, so much as logical replication buffer. This is kind of like putting pg_reorg into core, only according to Treat pg_reorg doesn't work this way.

Haas suggested that what we really want is index-organized tables rather than this whole mechanism. Heikki suggested that a background daemon which rewrites a bit of the table at a time. But that sounds like vacuum full.

This is two patches, one is generalized queueing mechanism, the second is cluster concurrently. There was more discussion of alternatives. Koichi tested multiprocess clustering of tables, but it didn't work very well.

Database Federation Support

This is another proposal for contributing PostgresXC code. PostgresXC contains a bunch of code for doing queries on remote database nodes, including cross-node joins. We can submit patches, but they will be slow. For one thing, PostgresXC will be very busy with alternative development. Probably for 9.3.

Code is appropriately licensed, just not documented very well. Maybe be some issues with old EDB code, but working that out.

Haas mentioned the need to break this up into a lot of small patches.

User-configurable Daemons

User-defined functions could have a loop and execute for many hours, or forever. Basically what we need is just startup and shutdown.

What about pgAgent? This is supposed to be something different. And pgAgent is C++/GUI tool.

Another idea is the stored procedures we're talking about, which aren't one big transaction. We probably need that first. Some people want to write a helper in C do help replication or whatever. This sounds like writing your own application server.

There was some discussion of how this relates to parallel query.

This is essentially a backend which doesn't have a frontend attached to it.

If you didn't have to have a single long-running transaction, then you could ad-hoc this with init scripts. That's ugly, though. You could have two processes with a pipe, Heikki suggested.

Stark said that this whole thing could be developed as an extension. Listen/notify is a big use case for this.

None of this works without a non-transaction backend code segment. That has to come first, but it's easier than general stored procedures.

Moving Contrib Around

Smith would like to move several contrib extensions into core. He actually wants to this for 9.1 for packaging. People don't like to run things in contrib because they don't trust them. Greg has moved six and they're so low impact that we could move them over even in beta.

If you rearrange the source tree, it doesn't modify the packages at all. You need to modify the build scripts. The place where the extensions end up is /share/postgresql/extension. The PLs are there. Doing that is a matter of getting the packagers to change things.

Some extensions are ready for production and safe, others are not. We need to break down contrib into a way sysadmins can understand it. Extensions with external dependancies need to be separate packages. The only contribs which have external dependencies are UUID, and a couple others. So we can add the stuff without dependencies.

We need to clearly document this for the packagers. There is a precedent for this with Perl and Python etc. Dave, Magnus, and David Wheeler were talking about the PGXN site, and the Perl modules are a PITA. Users hate having loads and loads of packages. We do need to be careful not to oversplit it.

We need to flag extensions as maintained by core postgresql. But sysadmins are not going to install anything optional.

First we need to agree on the categorization, then we need to communicate that to packagers. Moving the source code around is a last step.

Adminpack is a borderline case. Should we allow it or not? Some people don't like it because you can write files. Probably defer for now.

We need regression tests, documentation improvements, and Greg's patch. Not useful to set up a make target for the packagers. And we need documentation for packagers.

Make Install should maybe build those seven as well.

There was consensus to go ahead with Greg's patch.

Managing Release Schedules

Dave would like to make the schedule more regular without having crappy releases. Stephen suggested having more information about patches in the CF app. Haas thinks we should just have a time-based schedule, absolutely. Getting new features is not a problem.

The problem is at the end of every cycle we end up slipping. Or at the end of the cycle we get into major bikeshedding on an issue right at the end of a schedule. We could have release manager who would keep things on schedule.

Haas says the real problem is huge stuff getting submitted at the end of the cycle. We had 6 major features contributed for the last CF. The problem is people starting from scratch at the beginning of every cycle. People arrange their own schedules based on the time they have available.

Fetter pointed out that no other DB releases a major release every year. So we're not doing that badly Haas suggested that he could stop watching the schedule. He doesn't like being the schedule jerk.

Stark asked why we're landing large patches at the end instead of a bit at a time. Linux has a tree called Linux-next where they accept a lot of interim patches. Or you could bump the patches to the next release.

Tom pointed out that 9.1 is coming out more-or-less on time. But that's partly because Haas and Tom were working on other people's code for 3 months. Collations also should not have gone in when it did.

Selena suggested that we actually need a process for dealing with bad patches and reversing stuff. The problem with Collations was not that we wanted to reject it right away, we didn't know until later it was an issue and it would take a lot of effort to back out.

We need to distinguish patches which need more review and patches which the author hasn't fixed. Haas is unhappy with system where some people's patches are treated differently.

Simone pointed out that we also argue about patches until the deadline as well as people working until the end. Fetter proposed again that we have a release manager. Josh suggested that the schedule is not the problem, the problem is the use of Haas's and Tom's time. Haas points out that people argue against rejecting patches.

There was a lot of further discussion regarding patch arguement process, which would continue over beer.

Other Business

Marc Fournier has retired from the Core Team. Marc is the fourth person who has left. It's really hard when people leave -- it's usually because their lives change. Marc was unbelievably important in providing the community with infrastructure at the beginning of the project, without with PostgreSQL could not have been successful. But he couldn't even make it to pgCon this year. We have a good infrastructure team, so the handoff is fine. We should thank him for the time when he was there when we had nothing.

Marc will still be contributing to our infrastructure and we will be using hub.org for some things for a while. Mainly he's not going to be around day-to-day, and and won't be building the tarballs.