Contents

Add a checksum field to the revision table that reflects the contents of the "old_text" column.

Reasoning:

A checksum field will allow to verify the validity of the XML dump files.

This column could then be checked against at each insertion into the text table in order to avoid saving the same text multiple times. This could reduce the necessary disc footprint required to store an amount of revisions.

Exposing this checksum field via the API would allow developers to know when it is necessary to request the text of a revision. They may already have it a copy of it that was previously requested and can simply use that.

The checksum field would provide a quick mechanism for detecting identity reverts. By identity revert, I mean a reversion of a page to an *exact* previous state such that both revisions would contain text of the same checksum.

I will consider the following hash functions to calculate the proposed checksum: MD5, SHA1 and different flavors of the Murmur hash. Both MD5 and SHA1 are cryptographic hashes while Murmur hash is not.

This is for relative performance and not absolute performance comparison, benchmarks were generated using a 2.8Ghz Intel Core 2 Duo 4Gb Macbook Pro and conducted in Python. The SHA1 and MD5 benchmarks in PHP are slightly slower but I guess they use the same underlying C library.

Rob Lanphier suggested that SHA1 might ease integration with Github, see comment. (Note from Rob: Brion is right; this particular reason isn't a great one).

When you encode the 128 bits of MD5 (example) in base-16 aka hexadecimal, you need CHAR(32). When you use the technique of enoding the 128 bits in base-62 with characters from [A-Za-z0-9], you'll need CHAR(22). See http://de.php.net/manual/en/function.md5.php#83321 for one implementation. --Wikinaut 06:08, 16 September 2011 (UTC)

Total space requirements is an approximation based on 411,257,397 revisions. Total time is an approximation of how time it would take to calculate the checksums. This is based on the fact that the sum of rev_len for all revisions is 7599309679117 bytes. It seems that the length of a murmur3 hash varies.

Two field types are possible for the MD5 and SHA1 hashes: CHAR and BINARY. The murmur hash can be stored either as a BIGINT, DOUBLE a VARBINARY or a VARCHAR. Binary seems to give a minor performance gain and takes less space to store. From the manual [1] this is the main difference:

The BINARY and VARBINARY types are similar to CHAR and VARCHAR, except
that they contain binary strings rather than non-binary strings. That is,
they contain byte strings rather than character strings. This means that
they have no character set, and sorting and comparison are based on the
numeric values of the bytes in the values.

An ALTER TABLE query on Innodb databases makes a full copy of the table. According to Asher Feldman, the current revision table is 79Gb. The current number of rows in revision is around 412M. We would need at least 90Gb on each server to run the ALTER TABLE query and add the new checksum field.