Dear pgdevs,
I've just completed the final version of a survey of database schema quality in
open source software. The survey covers 512 projects which use MySQL and/or
PostgreSQL for storing their data. Automatic analyses are performed by querying
the information schema. Statistical validations are computed on the results.
Here are some findings of specific interest to the list.
About MySQL vs PostgreSQL usage:
- MySQL is much more often used than PostgreSQL... not a surprise.
- Even for projects which seems to support PostgreSQL, this support is often
an afterthough and not necessarily functional. For instance, you may find
MySQL-specific syntax in the PostgreSQL-specific script, which were clearly
never tested.
- Projects which use PostgreSQL statistically have a better quality compared
to projects which use MySQL. However, the difference is mostly due to
issues with MySQL.
- Projects which use PostgreSQL are more often maintained than projects
with MySQL.
Some features or default behavior of PostgreSQL seem especially error-prone:
- SERIAL attributes seem to be considered automatically as a primary key,
so that the primary key declaration is often forgotten.
This suggests that:
* the documentation should insist on the potential issue.
* a WARNING should be displayed when SERIAL is used without an associated
PRIMARY KEY, or possibly UNIQUE. It should be very rare to desire a
SERIAL which is not a PK, so this is worth a warning.
- when loading a schema definition from an SQL script, the default behavior of
"psql" is to ignore errors and go on.
This lead to projects with failing declarations to be ignored because the
ERROR is lost in the flow of WARNING and NOTICE. If a table is missing,
the error will be detected because the project is not functional, but if a
constraint is missing, it will just be lost. This occur in about 10% of
pg projects!
In order to avoid this behavior, one must do a "\set ON_ERROR_STOP 1"
at the beginning of the script. However, this is never done. Moreover,
there is no simple way to trigger the safer behavior from the command
line but quite a long "-v ON_ERROR_STOP=1".
It seems to me that:
* the documentation should suggest to use an explicit stop on error setting
in every script.
* a "psql -C foo.sql" (check?) or equivalent short option would help?
* The current client default verbosity is counter productive for quality.
It should be reduced to WARNING and above, but should *not* include NOTICE
which add a lot of noise ignored by the user and which hides more
important messages. For instance, a PK implies an INDEX, a SERIAL implies
a SEQUENCE, fine, but what is the point of telling it over and over?
So I suggest to choose a default "client_min_messages = warning".
Finally, some issues where found and already reported some time ago about the
implementation of the information schema by PostgreSQL. For instance,
auto-generated constraint names are not unique as they should be, which
makes having a "standard" information schema a little bit pointless, as
querying it returns wrong results:-( The summary of the answer was "do not use
the information schema", or "give unique names", which does not make much sense
for me who is analysing existing projects, and as most constraint names are
generated automatically by PostgreSQL.
For those interested in more details about the survey, a preprint of the paper
is available here :
http://www.cri.ensmp.fr/classement/doc/A-488.pdf
And the tool used for analysing the projects is available at :
http://coelho.net/salix/
Comments are welcome.
--
Fabien.