Hi! I am starting this thread because Brion's revision r94541 reverted r94289 [0] stating "core schema change with no discussion" [1]. Bugs 21860 [2] and 25312 [3] advocate for the inclusion of a hash column (either md5 or sha1) in the revision table. The primary use case of this column will be to assist in detecting reverts; I don't think that data integrity is the primary reason for adding it. The huge advantage of having such a column is that it will no longer be necessary to analyze full dumps to detect reverts; instead you can look for reverts in the stub dump file by looking for the same hash within a single page. The fact that there is a theoretical chance of a collision is not very important IMHO; it would just mean that in very rare cases in our research we would flag an edit as reverted while it's not. The two bug reports contain quite long discussions, and this feature has also been discussed internally quite extensively, but oddly enough that discussion hasn't happened yet on the mailing list.

So let's have a discussion!

[0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94541
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
[3] https://bugzilla.wikimedia.org/show_bug.cgi?id=25312

Bumping this: What are the remaining open questions regarding this schema change? As Diederik explains, this has a number of potential benefits especially for research/analysis, and I think it makes sense to target it for 1.19 or 1.20 unless there are scary implications that haven't been surfaced yet. :-)

> Hi!
> I am starting this thread because Brion's revision r94541 reverted
> r94289 [0] stating "core schema change with no discussion" [1].
> [...]
> So let's have a discussion!
>
> [0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289
> [1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94541
> [2] https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
> [3] https://bugzilla.wikimedia.org/show_bug.cgi?id=25312
>
> Best,
>
> Diederik

On 11-09-02 05:20 PM, Asher Feldman wrote:
> Would it be possible to generate offline hashes for the bulk of our revision
> corpus via dumps and load that into prod to minimize the time and impact of
> the backfill?
>
> When using for analysis, will we wish the new columns had partial indexes
> (first 6 characters?)

Bug 2939 is one relevant bug to this, it could probably use an index.

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939

> Is code written to populate rev_sha1 on each new edit?
>
> On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere <dvanliere [at] gmail> wrote:
>> Hi!
>> I am starting this thread because Brion's revision r94541 reverted
>> r94289 [0] stating "core schema change with no discussion" [1].
>> [...]
>> So let's have a discussion!

-- 
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]

On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen <lists [at] nadir-seen-fire> wrote:
> On 11-09-02 05:20 PM, Asher Feldman wrote:
>> When using for analysis, will we wish the new columns had partial indexes
>> (first 6 characters?)
> Bug 2939 is one relevant bug to this, it could probably use an index.
> [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939

I generally suspect that a standard index is going to be a waste for the most urgent uses of this. It will rarely be interesting to search for common hashes between articles. The far more common case will be to search for duplicate hashes within the history of a single article. My understanding is that having a normal index on a table the size of our revision table will be far too expensive for db writes.

This is the first I've heard of partial indexes (/me researches). I don't know if a partial index is going to be cheap enough that we can use it, and useful enough that we'd want to. Would this be a faster query in a world with a partial index on the first six characters?

SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'

...or would we have to run a query like this to get the benefit of the index?

SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1 LIKE '4cdbd8%'

...and would either of these queries be considered too expensive to run without a partial index? How about with a partial index?

I've suggested generating bulk checksums as well, but both Brion and Ariel see the primary purpose of this field as checking the validity of the dump generation process, and so they want to generate the checksums straight from external storage.

In a general sense, there are two use cases for this new field:
1) Checking the validity of the XML dump files
2) Identifying reverts
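As an illustration of use case 2, an identity revert can be detected by looking for a repeated hash within a single page's history. A minimal sketch in Python (hypothetical data; `sha1_of` stands in for whatever computes the proposed rev_sha1 value):

```python
import hashlib

def sha1_of(text):
    """Hex SHA-1 of revision text, as the proposed rev_sha1 column would store it."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def find_identity_reverts(history):
    """Given (rev_id, text) pairs in chronological order for one page, flag
    revisions whose hash matches an earlier revision (identity reverts)."""
    first_seen = {}
    reverts = []
    for rev_id, text in history:
        h = sha1_of(text)
        if h in first_seen:
            reverts.append((rev_id, first_seen[h]))  # (reverting rev, restored rev)
        else:
            first_seen[h] = rev_id
    return reverts

history = [
    (1, "Hello world"),
    (2, "Hello vandalized world"),
    (3, "Hello world"),  # restores revision 1's text exactly
]
print(find_identity_reverts(history))  # [(3, 1)]
```

With only the stub dump (hashes but no text), the same scan works directly on the stored rev_sha1 values, which is exactly the saving described above.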

I have started to work on a proposal for deployment, and while it's still incomplete, it might be a good start for further planning the deployment. I have been trying to come up with some back-of-the-envelope calculations about how much time and space it would take, but I don't have all the required information yet to produce reasonable estimates.

On Sat, Sep 3, 2011 at 2:20 AM, Asher Feldman <afeldman [at] wikimedia> wrote:
> Is code written to populate rev_sha1 on each new edit?
I believe that was part of Aaron's code that got reverted, yes.

Offline generation of hashes is definitely possible, but the only reason you'd do it is to minimize the time during which some rows have an empty hash. It's not strictly offline in that the text will still have to be pulled from ES, but it can be done before the field is added.
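A sketch of what that bulk pass could look like once the text is at hand (the XML fragment here is simplified and hypothetical; real MediaWiki dumps carry namespaces and many more fields):

```python
import hashlib
import xml.etree.ElementTree as ET

# Simplified, hypothetical dump fragment -- real MediaWiki dumps differ.
dump = """<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id><text>Hello world</text></revision>
    <revision><id>2</id><text>Hello world!</text></revision>
  </page>
</mediawiki>"""

hashes = {}
for rev in ET.fromstring(dump).iter("revision"):
    rev_id = rev.findtext("id")
    text = rev.findtext("text") or ""  # missing text hashes as the empty string
    hashes[rev_id] = hashlib.sha1(text.encode("utf-8")).hexdigest()

for rev_id, h in sorted(hashes.items()):
    print(rev_id, h)
```

The resulting (rev_id, hash) pairs could then be loaded into the new column before the live-write code path takes over.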

Diederik van Liere wrote:
> I've suggested to generate bulk checksums as well but both Brion and Ariel see
> the primary purpose of this field to check the validity of the dump generating
> process and so they want to generate the checksums straight from the external
> storage.
>
> [...]
>
> PS: not sure if this proposal should be on strategy or mediawiki...

On Sat, Sep 3, 2011 at 12:33 AM, Rob Lanphier <robla [at] wikimedia> wrote:
> I generally suspect that a standard index is going to be a waste for
> the most urgent uses of this. It will rarely be interesting to search
> for common hashes between articles. The far more common case will be
> to search for duplicate hashes within the history of a single article.
> My understanding is that having a normal index on a table the size of
> our revision table will be far too expensive for db writes.

If it's useful for Wikimedia and the index is expensive to maintain, the index can be kept on only one slave per DB cluster to minimize cost, the way we've always done for RC. If it's not useful for Wikimedia, only third-party researchers, then we could leave it unindexed on the cluster and have an index only on the toolserver or whatnot.

It's not so much that an extra index would be untenably expensive in principle, more that it's not worth the cost if it's not used for anything important. If it's just for statistical analysis and not for the live site, there's no reason to have it on all servers, except maybe administrative simplicity. The toolserver already has a bunch of indexes that Wikimedia doesn't, for exactly this sort of reason.

> This is the first I've heard of partial indexes (/me researches). I
> don't know if a partial index is going to be cheap enough that we can
> use it, and useful enough that we'd want to. Would this be a faster
> query in a world with a partial index on the first six characters?
> SELECT rev_id FROM revision WHERE rev_page=12345 AND
> rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'

I'd think so, yes. MySQL should be smart enough to use partial indexes at least that far -- otherwise there'd be no point in supporting them. I'd think an index on the first several bytes should be almost as effective as one on the whole value, if you just want to do filtering or joining (not sorting). It might be more effective, since more would fit in RAM. However, I don't know for sure offhand.
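For concreteness, a prefix index of the kind under discussion would look roughly like this in MySQL (a sketch only: the index name, and pairing the prefix with rev_page, are my assumptions, not anything decided in this thread):

```sql
-- Hypothetical prefix index on the first 6 characters of rev_sha1
ALTER TABLE revision
  ADD INDEX rev_page_sha1 (rev_page, rev_sha1(6));

-- An equality predicate on the full hash can still use the prefix index:
-- MySQL narrows candidates by the first 6 characters, then compares the
-- remainder against the row data, so no LIKE rewrite should be needed.
SELECT rev_id
FROM revision
WHERE rev_page = 12345
  AND rev_sha1 = '4cdbd80be15fcfff139fb8a95f2ca359520939ee';
```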

> ...and would either of these queries be considered too expensive to
> run without a partial index? How about with a partial index?

Without a suitable index, running either of these queries would scan the entire revision history of the article in question. That would certainly not be acceptable on the cluster if the query is to be run with any frequency on arbitrary articles. It would be okay if it were just an occasional thing run on a limited number of articles, or only on articles with few revisions in their histories.

All of the above should be interpreted with the understanding that I stand a decent chance of knowing what I'm talking about, but have no say at this point and shouldn't be taken too seriously. :)

For some applications, I use the technique of representing the 128 bits of MD5 or other checksums as base-62 character strings instead of hexadecimal (base-16) strings.

When you need a non-binary, i.e. character, representation for displaying, storing or transmitting checksums, you should always consider using base-62 instead of base-16 representations (for MD5 you need 22 bytes instead of 32 bytes).

When you encode the 128 bits of MD5 (for example) in base-16 aka /hexadecimal/, you need CHAR(32). When you use the technique of encoding the 128 bits in /base-62/ with characters from [A-Za-z0-9], you'll need CHAR(22).
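A generic positional encoder illustrates the widths involved (this is a sketch for the arithmetic, not MediaWiki's own converter):

```python
# Digit set for bases up to 62; base 36 uses only the first 36 characters.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base(n, base):
    """Encode a non-negative integer as a string in the given base (2..62)."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

# Worst-case widths for a 128-bit value such as an MD5 digest:
widest = 2**128 - 1
print(len(to_base(widest, 16)))  # 32 characters -> CHAR(32)
print(len(to_base(widest, 36)))  # 25 characters
print(len(to_base(widest, 62)))  # 22 characters -> CHAR(22)
```

The base-36 width of 25 versus base-62's 22 is where the "saving 3 bytes per checksum" figure discussed below comes from.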

On 16.09.2011 11:24, Roan Kattouw wrote:
> For some applications, I use the technique of representing the 128 bits
> of MD5 or other checksums as base-62 character strings instead of
> hexadecimal (base-16) strings.

> MediaWiki already uses a similar technique, storing SHA-1 hashes of
> images in base 36.
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?

This format was chosen for hashes to be used as filenames for uploaded file storage (currently used only for storing deleted files, I think, but there's long been a plan to switch primary image storage to this as well some day).

For greatest compatibility with all filesystems, we only use characters that are safe (ASCII digits and letters) and don't rely on case distinctions which are not always preserved (Windows and Mac OS X systems default to case-insensitive filesystems).

The reason we're not using hex here is that a more compact representation makes the filenames, and thus any URL references including them in the path, shorter. On img_sha1 I guess we just kept using it for compatibility/similarity with the deleted file archives?

On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries <mail [at] tgries> wrote:
> Was there a certain reason to choose base 36?
> Why not recode to base 62 and save 3 bytes per checksum?
I don't know, this was way, way before my time. But then, why use base 62 if you can use base 64? Encoders/decoders for that are much more widely available. PHP's base_convert() will go up to 36, I think, so you'd have to write a base 62 encoder/decoder yourself, but PHP has built-in functions to deal with base 64.

Roan Kattouw wrote:
> On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries <mail [at] tgries> wrote:
>> Was there a certain reason to choose base 36?
>> Why not recode to base 62 and save 3 bytes per checksum?
> I don't know, this was way, way before my time. But then, why use base
> 62 if you can use base 64? Encoders/decoders for that are much more
> widely available. PHP's base_convert() will go up to 36, I think, so
> you'd have to write a base 62 encoder/decoder yourself, but PHP has
> built-in functions to deal with base 64.

Brion explained it in a previous mail. And yes, we have our own converter in MediaWiki.

On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde <rarohde [at] gmail> wrote:
> Is there a good reason to prefer SHA-1?
>
> Both have weaknesses allowing one to construct a collision (with
> considerable effort)

Considerable effort? I can create an MD5 collision in a few minutes on my home computer. Is there anything even remotely like this for SHA-1?

> MD5 is shorter and in my experience about 25% faster to compute.
>
> Personally I've tended to view MD5 as more than good enough in offline analyses.

For offline analyses, there's no need to change the online database tables.

On Sat, Sep 17, 2011 at 4:56 PM, Anthony <wikimail [at] inbox> wrote:
> On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde <rarohde [at] gmail> wrote:
>> Is there a good reason to prefer SHA-1?
>>
>> Both have weaknesses allowing one to construct a collision (with
>> considerable effort)
>
> Considerable effort? I can create an MD5 collision in a few minutes
> on my home computer. Is there anything even remotely like this for
> SHA-1?

If I've been keeping up to date, the collision complexity for MD5 is about 2^21 operations, and runs in a few seconds (not minutes); and for SHA-1 down to about 2^52 with current results. The latter represents about 100 cpu-years, which is within the realm of supercomputers. That time will probably continue to come down if people find ways to improve the attacks on SHA-1. (The existing attacks usually require the ability to feed arbitrary binary strings into the hash function. Given that both browsers and MediaWiki will tend to reject binary data placed in an edit window, I'm not sure if any of the existing attacks could be reliably applied to MediaWiki editing.)

If collision attacks really matter we should use SHA-1. However, do any of the proposed use cases care about whether someone might intentionally inject a collision? In the proposed uses I've looked at, it seems irrelevant. The intentional collision would get flagged as a revert and the text leading to that collision would be discarded. How is that a bad thing?

It's not a big deal, but if I understand prior comments correctly, most of the existing offline infrastructure uses MD5, so I'm wondering if there is a distinct use case for favoring SHA-1.

>> MD5 is shorter and in my experience about 25% faster to compute.
>>
>> Personally I've tended to view MD5 as more than good enough in offline analyses.
>
> For offline analyses, there's no need to change the online database tables.

Need? That's debatable, but one of the major motivators is the desire to have hash values in database dumps (both for revert checks and for checksums on correct data import / export). Both of those are "offline" uses, but it is beneficial to have that information precomputed and stored rather than frequently regenerated.

>> For offline analyses, there's no need to change the online database tables.
>
> Need? That's debatable, but one of the major motivators is the desire
> to have hash values in database dumps (both for revert checks and for
> checksums on correct data import / export). Both of those are
> "offline" uses, but it is beneficial to have that information
> precomputed and stored rather than frequently regenerated.

If we don't have it in the online database tables, this defeats the purpose of having the value in there at all, for the purpose of generating the XML dumps.

Recall that the dumps are generated in two passes: in the first pass we retrieve from the db and record all of the metadata about revisions, and in the second (time-consuming) pass we re-use the text of the revisions from a previous dump file if the text is in there. We want to compare the hash of that text against what the online database says the hash is; if they don't match, we want to fetch the live copy.
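That check could be sketched like this (hypothetical helper names; the real dump scripts are considerably more involved):

```python
import hashlib

def reuse_or_refetch(prev_dump_text, stored_sha1, fetch_live):
    """Second-pass sketch: reuse text from the previous dump only if its SHA-1
    matches the hash recorded in the database metadata; otherwise fetch the
    live copy."""
    if prev_dump_text is not None:
        if hashlib.sha1(prev_dump_text.encode("utf-8")).hexdigest() == stored_sha1:
            return prev_dump_text
    return fetch_live()

good = "stable revision text"
stored = hashlib.sha1(good.encode("utf-8")).hexdigest()
print(reuse_or_refetch(good, stored, lambda: "live copy"))        # reused from the old dump
print(reuse_or_refetch("corrupted", stored, lambda: "live copy")) # mismatch, so refetch
```

Without a stored hash, the only alternatives are trusting the old dump blindly or refetching every revision's text, which is what makes the precomputed column valuable here.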

I refer folks to bug 23264 [1] as proof that mismatches between the metadata and the text have crept in in the past; changes to MW code in places other than the backup scripts can cause quite subtle breakage.
