professional automator

When you are coding in a source tree as big as PostgreSQL’s, you will
at some point want to look into some kind of source code indexing.
It’s often convenient not to bother, since git grep is actually
superfast. But when you want to find where a function is defined
among all the call sites, some more intelligence is useful.

The traditional tools for this are ctags and etags, which create
index files intended for use by vi and Emacs, respectively. The
PostgreSQL source tree has some customized support for these in the
tools src/tools/make_ctags and src/tools/make_etags. Because
these tools operate on a directory level, those wrapper scripts create
a single tag file (named tags or TAGS respectively) in the
top-level directory and symlink it to all the other directories. This
allows you to easily look for entries across the entire source tree.
But it’s clearly a hack, and at least Emacs is often somewhat confused
by this setup.

But there is something much better that works very similarly:
GNU GLOBAL. A main difference
is that GNU GLOBAL works on a project basis not on a directory basis,
so you don’t need to do contortions to create and manage tags files
all over your source tree. Also, GLOBAL can be used from the command
line, so you don’t need to be an editor wizard to get started with it.
Plus, it appears to be much faster.

The whole thing is very simple. Install the package, which is usually
called global and available in most operating system distributions.
To start, run

$ gtags

in the top-level directory. This creates the files GPATH, GRTAGS,
and GTAGS.

Then you can use global to search for stuff, like

$ global elog
src/include/utils/elog.h

Or you can look for places a function is called:

$ global -r write_stderr

You can run global in any directory.

Or how about you want to look at the code where something is defined:

$ less -t elog

Note no file name is required. (See the manual for the required setup
to make this work with less.)

Here is some fine-tuning for use with the PostgreSQL source tree.
Generally, I don’t want to index generated files. For example, I
don’t want to see hits in gram.c, only in gram.y. Plus, you don’t
want to index header files under tmp_install. (Super annoying when
you use this to jump to a file to edit and later find that your edits
have been blown away by make check.) But when you run gtags in a
partially built tree, it will index everything it finds. To fix that,
I have restricted gtags to only index files that are registered in
Git, by first running

git ls-files >gtags.files

in the top-level directory. Then gtags will only consider the
listed files.

This will also improve the workings of the Emacs mode, which will at
random times call global -u to update the tags. If it finds a
gtags.files file, it will observe that and not index random files
lying around.

I have a shell alias pgconfigure which calls configure with a
bunch of options for easier typing. It’s basically something like

Here are some tweaks to git log that I have found useful. It might
depend on the workflow of individual projects how applicable this is.

Git stores separate author and committer information for each commit.
How these are generated and updated is sometimes mysterious but
generally makes sense. For example, if you cherry-pick a commit to a
different branch, the author information stays the same but the
committer information is updated. git log defaults to showing the
author information. But I generally care less about that than the
committer information, because I’m usually interested in when the
commit arrived in my or the public repository, not when it was
initially thought about. So let’s try to change the default git log
format to show the committer information instead. Again, depending on
the project and the workflow, there can be other preferences.

To create a different default format for git log, first create a new
format by setting the Git configuration item pretty.somename. I
chose pretty.cmedium because it’s almost the same as the default
medium but with the author information replaced by the committer
information.

I was idly wondering what was taking pg_dump so long and noticed
that it always seemed to be pegged at 100% CPU usage on the client.
That was surprising because naively one might think that the
bottleneck are the server’s or the client’s disk or the network.
Profiling quickly revealed that the compression library zlib was
taking most of the run time on the client. And indeed, turning
compression off caused pg_dump to fly without getting anywhere near
100% CPU.

When using the custom output format in pg_dump (-Fc), the output
is automatically compressed, using the same default level that gzip
uses. By using the option -Z, one can select a compression level
between 0 (off) and 9 (highest). Although it is not documented, the
default corresponds to level 6.

Some simple testing has shown that lowering the level from 6 to 1 can
speed up the dump run time by a factor of 3 or more while only
increasing the output size by 10%. Even the levels in between give
significant speed increases with only minimal differences in output
size. On the other hand, increasing the compression level to 9 only
decreases the output size by about 1% while causing slow downs by a
factor of 2. (In this test, level 1 was about twice as slow as no
compression, but the output size was about 40% of the uncompressed
size. So using at least some compression was still desirable.)

I encourage experimentation with these numbers. I might actually
default my scripts to -Z1 in the future.

So I had this PostgreSQL database that was getting a bit too big, and
since it was really only for analytics, I figured it would be a good
fit for putting in Hadoop+Hive instead.

(For those not completely familiar with this: Hadoop is sort of a job
tracker and distributed file system. Hive is an SQL-like layer on top
of that. I know the cool kids are now using Spark. Maybe for another
day.)

The first thing you need to learn about the Hadoop ecosystem is its
idiosyncratically fragmented structure. With PostgreSQL, you
basically have the community website, the community mailing lists, the
community source code distribution, the community binaries, and a
handful of binaries made by Linux distributions. If you search the
web for a problem with PostgreSQL, you will normally gets hits on one
or more of: the documentation, the mailing lists, third-party mirrors
of the mailing lists, or Stack Overflow. With Hadoop, you have the
resources provided by the Apache Software Foundation, including the
source distribution, bug tracker, documentation, and then bunch of
commercial vendors with their parallel universes, including their own
mutually incompatible binary distributions, their own copy of the
documentation, their own mailing lists, their own bug trackers, etc.
When you search for a problem with Hadoop, you will typically get hits
from three separate copies of the documentation, about eight mailing
lists, fifteen tutorials, and one thousand blog posts. And about 20
unanswered posts on Stack Overflow. Different vendors also favor
different technology extensions. So if, say, you read that you should
use some storage method, chances are it’s not even supported in a
given distribution.

The next thing to know is that any information about Hadoop that is
older than about two years is obsolete. Because they keep changing
everything from command names to basic architecture. Don’t even
bother reading old stuff. Don’t even bother reading anything.

So Hive. The basic setup is actually fairly well documented. You set
up a Hadoop cluster, HDFS, create a few directories. Getting the
permissions sorted out during these initial steps is not easy, but it
seldom is. So you can create a few tables, load some data, run a few
queries.

Nevermind that in its default configuration hive spits out about a
dozen warnings on every startup about deprecated parameters and jar
file conflicts. This is apparently well known. Look around in the
internet for hive examples. They show the same output. Apparently
the packaged versions of Hadoop and Hive are not tuned for each other.

Then you learn: In the default configuration, there can only be one
Hive session connected at once. It doesn’t tell you this. Instead,
when the second session wants to connect, it tells you

followed by hundreds of lines of exception traces. This is Hive-speak
for: “there is already one session connected”.

You see, Hive needs a, cough, cough, relational database to store its
schema. By default, it uses embedded Derby, which allows only one
connection at a time. If you want to connect more than one session at
once, you need to set up an external “Hive metastore” on a MySQL or
PostgreSQL database.

Nevermind that Derby can actually run in server mode. That’s
apparently not supported by Hive.

So I had a PostgreSQL database handy and tried to set that up. I
installed the PostgreSQL JDBC driver, created an external database,
changed the Hive configuration to use an external database.

At this point, it turned out that the PostgreSQL JDBC driver was
broken, so I had to downgrade to an older version. (The driver has
since been fixed.)

After I got one that was working, Hive kept complaining that it
couldn’t find a driver that matches the JDBC URL
jdbc:postgresql://somehost/hive_metastore. The PostgreSQL JDBC
driver explains in detail how to load the driver, but how do I get
that into Hive?

The first suggestion from the internet was to add something like this
to .hiverc:

add jar /usr/share/java/postgresql-jdbc.jar;

That doesn’t work. Remember, don’t believe anything you read on the
internet.

In between I even tried download the MySQL JDBC driver (no, I don’t
want to sign in with my Oracle account), but it had the same problem.

hive is actually a shell script which loads another shell script
which loads a bunch of other shell scripts, which eventually starts
java. After randomly poking around I determined that if I did

export HIVE_AUX_JARS_PATH=/usr/share/java/

it would pick up the jar files in that directory. OK, that worked.

Now I can create tables, load data, run simple queries, from more than
one session. So I could do

SELECT * FROM mytable;

But as soon as I ran

SELECT count(*) FROM mytable;

it crapped out again:

java.io.FileNotFoundException: File does not exist: hdfs://namenode/usr/share/java/jline-0.9.94.jar

So it’s apparently looking for some jar file on HDFS rather than the
regular file system. Some totally unrelated jar file, too.

The difference between the two queries is that the first one is
answered by just dumping out data locally, whereas the second one
generates a distributed map-reduce job. It doesn’t tell you that
beforehand, of course. Or even afterwards.

After a while I figured that this must have something to do with the
HIVE_AUX_JARS_PATH setting. I changed that to

export HIVE_AUX_JARS_PATH=/usr/share/java/postgresql-jdbc.jar;

so it would look at only one file, and sure enough it now complains

java.io.FileNotFoundException: File does not exist: hdfs://namenode/usr/share/java/postgresql-jdbc.jar

Apparently, the HIVE_AUX_JARS_PATH facility is for adding jars that
contain user-defined functions that you need at run time. As far as I
can tell, there is no separate setting for adding jars that you only
need locally.

There are workarounds for that on the internet, of varying
bizarreness, none of which worked. Remember, don’t believe anything
you read on the internet.

In the end, I indulged it and just uploaded that jar file into HDFS.
Whatever.

I then put my data loading job into cron, which quickly crapped out
because JAVA_HOME is not set in the cron environment. After that
was fixed, I let my data loading jobs run for a while.

Later, I wanted clear out the previous experiments, drop all tables,
and start again. Apparently, dropping a table in Hive takes a very
long time. Actually, no. When you use PostgreSQL for the Hive
metastore, any attempt to drop a table will hang indefinitely.

You are the first person I have heard of using postgres. I commend
you for not succumbing to the social pressure and just installing
mysql. However I would advice succumbing to the social pressure and
using either derby or mysql.

The reason I say this is because jpox
“has support” for a number of data stores (M$ SQL server) however,
people have run into issues with them. Databases other then derby
and mysql ‘should work’ but are generally untested.

Not that actually testing it would take much work. It’s not like Hive
doesn’t have any tests. Just add some tests.

It’s funny that they didn’t write “You are the first person I have
heard of using hive”. Clearly, nobody has ever actually used this.

Anyway, somehow I ended up creating the metastore schema manually by
copying and pasting various pieces from the internet and raw files.
Shudder.

How about more fun? Here is a run-of-the-mill SQL parse error:

12345678910111213141516171819202122232425262728293031323334

NoViableAltException(26@[221:1: constant : ( Number | dateLiteral | StringLiteral | stringLiteralSequence | BigintLiteral | SmallintLiteral | TinyintLiteral | DecimalLiteral | charSetStringLiteral | booleanValue );])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:116)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.constant(HiveParser_IdentifiersParser.java:4377)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.partitionVal(HiveParser_IdentifiersParser.java:8444)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.partitionSpec(HiveParser_IdentifiersParser.java:8283)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.tableOrPartition(HiveParser_IdentifiersParser.java:8161)
at org.apache.hadoop.hive.ql.parse.HiveParser.tableOrPartition(HiveParser.java:31397)
at org.apache.hadoop.hive.ql.parse.HiveParser.insertClause(HiveParser.java:30914)
at org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:29076)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28968)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28762)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1238)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1000)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:348)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:446)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:456)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:737)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 4:20 cannot recognize input near 'year' '(' 'event_timestamp' in constant

In PostgreSQL, this might say

1

syntax error at or near "("

with a pointer to the actual query.

I just put a function call somewhere where it didn’t belong. The
documentation is very terse and confusing about a lot of these things.
And the documentation is kept as a series of wiki pages.

So now I have a really slow distributed version of my PostgreSQL
database, which stores its schema in another PostgreSQL database.
I forgot why I needed that.

It took a few more hours and days after this to refine some details,
but I have now tagged the first
release of
this extension. Give it a try and let me know what you think. Bug
reports and feature requests are welcome.

(I chose to name the data type uri instead of url, as originally
suggested, because that is more correct and matches what the parsing
library calls it. One could create a domain if one prefers the other
name or if one wants to restrict the values to certain kinds of URIs
or URLs.)

Unfortunately, different versions of PgBouncer return a different
number of columns for some commands. Then you will need different
view definitions. I haven’t determined a way to handle that
elegantly.

2001: PostgreSQL 7.1: write-ahead log

PostgreSQL 7.1 introduced the write-ahead log (WAL). Before that
release, all open data files had to be fsynced on every commit, which
is very slow. Slow fsyncing is still a problem today, but now we’re
only worried about fsyncing the WAL, and fsyncing the data files
during the checkpoint process. Back then, we had to fsync
everythingall the time.

In the original design of university POSTGRES, the lack of a log was
intentional, and contrasted with heavily log-based architectures such
as Oracle. In Oracle, you need the log to roll back changes. In
PostgreSQL, the nonoverwriting storage system takes care of that. But
probably nobody thought about implications for fsyncing back then.

Note that the WAL was really just an implementation detail at this
point. You couldn’t read or archive it.

2004: Slony

2005: PostgreSQL 8.0: point-in-time recovery

PostgreSQL 8.0 added the possibility to copy the WAL somewhere else,
and later play it back, either all the way or to a particular point in
time, hence the name point-in-time recovery (PITR) for this feature.
This feature was mainly intended to relieve pg_dump as a backup
method. Until then, the only backup method was a full dump, which
would get impractical as databases grew. Hence this method to take an
occasional base backup, which is the expensive part, and then add on
parts of the WAL, which is cheaper.

The basic configuration mechanisms that we still use today, for
example the recovery.conf file, were introduced as part of this
feature.

But still no replication here.

2008: PostgreSQL 8.3: pg_standby

Crafty people eventually figured that if you archived WAL on one
server and at the same time “recovered” endlessly on another, you’d
have a replication setup. You could probably have set this up with
your own scripts as early as 8.0, but PostgreSQL 8.3 added the
pg_standby program into contrib, which gave everyone a standard
tool. So, arguably, 8.3 is the first release that contained a
semblance of a built-in replication solution.

The standby server was in permanent recovery until promotion, so it
couldn’t be read from as it was replicating. This is what we’d now
call a warm standby.

I think a lot of PostgreSQL 8.3 installations refuse to die, because
this is the first version where you could easily have a reasonably
up-to-date reserve server without resorting to complicated and
sometimes problematic tools like Slony or DRBD.

2010: PostgreSQL 9.0: hot standby, streaming replication

In PostgreSQL 9.0, two important replication features arrived
completely independently. First, the possibility to connect to a
standby server in read-only mode, making it a so-called hot standby.
Whereas before, a standby server was really mainly useful only as a
reserve in case the primary server failed, with hot standby you could
use secondary servers to spread out read-only loads. Second, instead
of relying solely on the WAL archive and recovery functionalities to
transport WAL data, a standby server could connect directly to the
primary server via the existing libpq protocol and obtain WAL data
that way, so-called streaming replication. The primary use in this
release was that the standby could be more up to date, possibly within
seconds, rather than several minutes with the archive-based approach.
For a robust setup, you would still need to set up an archive. But
streaming replication was also a forward-looking feature that would
eventually make replication setups easier, by reducing the reliance on
the old archiving mechanisms.

PostgreSQL 9.0 was the first release where one could claim that
PostgreSQL “supports replication” without having to make
qualifications or excuses. Although it is scheduled to go EOL later
this year, I expect this release will continue to live for a long
time.

2011: PostgreSQL 9.1: pg_basebackup, synchronous replication

pg_basebackup was one of the features facilitated by streaming
replication that made things easier. Instead of having to use
external tools like rsync for base backups, pg_basebackup would
use a normal libpq connection to pull down a base backup, thus
avoiding complicated connection and authentication setups for external
tools. (Some people continue to favor rsync because it is faster
for them.)

PostgreSQL 9.1 also added synchronous replication, which ensures that
data is replicated to the designated synchronous standby before a
COMMIT reports success. This feature is frequently misunderstood by
users. While it ensures that your data is on at least two servers at
all times, it might actually reduce the availability of your system,
because if the standby server goes down, the primary will also go
down, unless you have a third server available to take over the
synchronous standby duty.

Less widely know perhaps is that PostgreSQL 9.1 also added the
pg_last_xact_replay_timestamp function for easy monitoring of
standby lag.

In my experience, the availability of pg_basebackup and
pg_last_xact_replay_timestamp make PostgreSQL 9.1 the first release
were managing replication was reasonably easy. Go back further, and
you might feel constrained by the available tools. But in 9.1, it’s
not that much different from what is available in the most recent
releases.

2012: PostgreSQL 9.2: cascading replication

Not as widely acclaimed, more for the Slony buffs perhaps,
PostgreSQL 9.2 allowed standbys to fetch their streaming replication
data from other standbys. A particular consequence of that is that
pg_basebackup could copy from a standby server, thus taking the load
off the primary server for setting up a new standby or standalone
copy.

2013: PostgreSQL 9.3: standby can follow timeline switch

This did not even make it into the release note highlights. In
PostgreSQL 9.3, when a primary has two standbys, and one of the
standbys is promoted, the other standby can just keep following the
new primary. In previous releases, the second standby would have to
be rebuilt. This improvement makes dynamic infrastructure changes
much simpler. Not only does it eliminate the time, annoyance, and
performance impact of setting up a new standby, more importantly it
avoids the situation that after a promotion, you don’t have any up to
update standbys at all for a while.

2014: PostgreSQL 9.4: replication slots, logical decoding

Logical decoding got all the press for PostgreSQL 9.4, but I think
replication slots are the major feature, possibly the biggest
replication feature since PostgreSQL 9.0. Note that while streaming
replication has gotten more sophisticated over the years, you still
needed a WAL archive for complete robustness. That is because the
primary server didn’t actually keep a list of its supposed standby
servers, it just streamed whatever WAL happened to be requested if it
happened to have it. If the standby server fell behind sufficiently
far, streaming replication would fail, and recovery from the archive
would kick in. If you didn’t have an archive, the standby would then
no longer be able to catch up and would have to be rebuilt. And this
archiving mechanism has essentially been unchanged since version 8.0,
when it was designed for an entirely different purpose. So a
replication setup is actually quite messy: You have to configure an
access path from the primary to the standby (for archiving) and an
access path from the standby to the primary (for streaming). And if
you wanted to do multiple standbys or cascading, maintaining the
archive could get really complicated. Moreover, I think a lot of
archiving setups have problematic archive_command settings. For
example, does your archive_command fsync the file on the receiving
side? Probably not.

No more: In PostgreSQL 9.4, you can set up so-called replication
slots, which effectively means that you register a standby with the
primary, and the primary keeps around the WAL for each standby until
the standby has fetched it. With this, you can completely get rid of
the archiving, unless you need it as a backup.

2015? PostgreSQL 9.5? pg_rewind?

One of the remaining problems is that promoting a standby leaves the
old primary unable to change course and follow the new primary. If
you fail over because the old primary died, then that’s not an issue.
But if you just want to swap primary and standby, perhaps because the
standby has more powerful hardware, then the old primary, now standby,
needs to be rebuilt completely from scratch. Transforming an old
primary into a new standby without a completely new base backup is a
rather intricate problem, but a tool that can do it (currently named
pg_rewind) is proposed for inclusion into the next PostgreSQL
release.

Beyond

One of the problems that this evolution of replication has created is
that the configuration is rather idiosyncratic, quite complicated to
get right, and almost impossible to generalize sufficiently for
documentation, tutorials, and so on. Dropping archiving with 9.4
might address some of these points, but configuring even just
streaming replication is still weird, even weirder if you don’t know
how it got here. You need to change several obscure configuration
parameters, some on the primary, some on the standby, some of which
require a hard restart of the primary server. And then you need to
create a new configuration file recovery.conf, even though you don’t
want to recover anything. Making changes in this area is mostly a
complex political process, because the existing system has served
people well over many years, and coming up with a new system that is
obviously better and addresses all existing use cases is cumbersome.

Another issue is that all of this functionality has been bolted on to
the write-ahead log mechanism, and that constrains all the uses of the
write-ahead log in various ways. For example, there are optimizations
that skip WAL logging in certain circumstances, but if you want
replication, you can’t use them. Who doesn’t want replication? Also,
the write-ahead log covers an entire database system and is all or
nothing. You can’t replicate only certain tables, for example, or
consolidate logs from two different sources.

How about not bolting all of this on to the WAL? Have two different
logs for two different purposes. This was discussed, especially
around the time streaming replication was built. But then you’d need
two logs that are almost the same. And the WAL is by design a
bottleneck, so creating another log would probably create performance
problems.

Logical decoding breaks many of these restrictions and will likely be
the foundation for the next round of major replication features.
Examples include partial replication and multimaster replication, some
of which are being worked on right now.

What can we expect from plain WAL logging in the meantime? Easier
configuration is certainly a common request. But can we expect major
leaps on functionality? Who knows. At one point, something like hot
standby was thought to be nearly impossible. So there might be
surprises still.

There is a lot of helpful information about screen out there, but I
haven’t found anything about this. I don’t want to “forget” any
screen sessions, so I’d like to be notified when I log into a box and
there are screens running for me. Obviously, there is screen -ls,
but it needs to be wrapped in a bit logic so that it doesn’t annoy
when there is no screen running or even installed.

After perusing the screen man page a little, I came up with this for
.bash_profile or .zprofile:

The second redirection will override the first one. You’d have to use
an external tool to make this work, maybe something like:

program | tee firstfile secondfile

But with zsh, this type of thing actually works. It will duplicate
the output and write it to multiple files.

This feature also works with a combination of redirections and
pipes. For example

ls > foo | grep bar

will write the complete directory listing into file fooand print
out files matching bar to the terminal.

That’s great, but this feature pops up in unexpected places.

I have a shell function that checks whether a given command produces
any output on stderr:

! myprog "$arg" 2>&1 >/dev/null | grep .

The effect of this is:

If no stderr is produced, the exit code is 0.

If stderr is produced, the exit code is 1 and the stderr is shown.

(Note the ordering of 2>&1 >/dev/null to redirect stderr to stdout
and silence the original stdout, as opposed to the more common
incantation of >/dev/null 2>&1, which silences both stderr and
stdout.)

The reason for this is that myprog has a bug that causes it to print
errors but not produce a proper exit status in some cases.

Now how will my little shell function snippet behave under zsh? Well,
it’s quite confusing at first, but the following happens. If there is
stderr output, then only stderr is printed. If there is no stderr
output, then stdout is passed through instead. But that’s not what I
wanted.

This can be reproduced simply:

ls --bogus 2>&1 >/dev/null | grep .

prints an error message, as expected, but

ls 2>&1 >/dev/null | grep .

prints a directory listing. That’s because zsh redirects stdout to
both/dev/null and the pipe, which makes the redirection to
/dev/null pointless.

Note that in bash, the second command prints nothing.

This behavior can be changed by turning off the MULTIOS option (see
zshmisc man page), and my first instinct was to do that, but options
are not lexically scoped (I think), so this would break again if the
option was somehow changed somewhere else. Also, I think I kind of
like that option for interactive use.

My workaround is to use a subshell:

! ( myprog "$arg" 2>&1 >/dev/null ) | grep .

The long-term fix will probably be to write an external shell script
in bash or plain POSIX shell.

In
part 1
and
part 2
I investigated how to use ccache with clang. That was more than
three years ago.

I got an email the other day that
ccache bug 8118,
which I filed while writing part 1, was closed, as ccache 3.2 was
released. The
release notes of ccache 3.2
contain several items related to clang. So it was time to give this
another look.

Basically, the conclusions from part 2 still stand: You cannot use
ccache with clang without using CCACHE_CPP2. And it is now
becoming clear to me that this is an issue that is not going to go
away, and it’s not really even Clang’s fault.

Warnings!

The problem is that clang’s -Wall can cause warnings when
compiling the preprocessed version of otherwise harmless C code.
This can be illustrated by this piece of C code:

You wouldn’t normally write code like this, but the C preprocessor
could create code with self-assignments, self-comparisons, extra
parentheses, and so on.

This example represents the issues I saw when trying to compile
PostgreSQL 9.4 with ccache and clang; there might be others.

You can address this issue in two ways:

Use CCACHE_CPP2, as discussed in part 2. With ccache 3.2, you
can now also put this into a configuration file: run_second_cpp =
true in ~/.ccache/ccache.conf

Turn off the warnings mentioned above: -Wno-parentheses-equality,
-Wno-tautological-compare, -Wno-self-assign (and any others you
might find). One might think that these are actually useful warnings
that one might want to keep, but GCC doesn’t warn about them, and if
you develop primarily with GCC, your code might contain these issues
anyway. In particular, I have found that -Wno-tautological-compare
is necessary for legitimate code.

I think CCACHE_CPP2 is the way to go, for two reasons. Firstly,
having to add more and more options to turn off warnings is obviously
somewhat stupid. Secondly and more importantly, there is nothing
stopping GCC from adding warnings similar to Clang’s that would
trigger on preprocessed versions of otherwise harmless C code. Unless
they come up with a clever way to annotate the preprocessed code to
the effect of “this code might look wrong to you, but it looked OK
before preprocessing, so don’t warn about it”, in a way that creates
no extra warnings and doesn’t lose any warnings, I don’t think
this issue can be solved.

Speed!

Now the question is, how much would globally setting CCACHE_CPP2
slow things down?

To test this, I have built PostgreSQL 9.4rc1 with clang-3.5 and
gcc-4.8 (not gcc-4.9 because it creates some unrelated warnings
that I don’t want to deal with here). I have set export
CCACHE_RECACHE=true so that the cache is not read but new cache
entries are computed. That way, the overhead of ccache on top of
the compiler is measured.

Results:

clang-3.5

Using ccache is 10% slower than not using it at all.

Using ccache with CCACHE_CPP2 on is another 10% slower.

gcc-4.8

Using ccache is 19% slower than not using it at all.

Using ccache with CCACHE_CPP2 is another 9% slower.

(There different percentages between gcc and clang arise because
gcc is faster than clang (yes, really, more on that in a future
post), but the overhead of ccache doesn’t change.)

10% or so is not to be dismissed, but let’s remember that this applies
only if there is a cache miss. If everything is cached, both methods
do the same thing. Also, if you use parallel make, the overhead is
divided by the number of parallel jobs.

With that in mind, I have decided to put the issue to rest for myself
and have made myself a ~/.ccache/ccache.conf containing

run_second_cpp = true

Now Clang or any other compiler should run without problems through
ccache.

Color!

There is one more piece of news in the new ccache release: Another
thing I talked about in part 1 was that ccache will disable the
colored output of clang, and I suggested workarounds. This was
actually fixed in ccache 3.2, so the workarounds are no longer
necessary, and the above configuration change is really the only thing
to make Clang work smoothly with ccache.