Messages customization statistics

I went to a few Wikimedia projects, in English and in other languages, and browsed Special:AllMessages.

There are several repeating phenomena:

Messages are often modified, even though the wording is identical to the source. These can probably be deleted in the project, because otherwise they won't be updated when they are updated in the software.

Modified messages often differ very little from the original. For example, the original may have a period or a colon at the end and the modified message doesn't, but otherwise they are identical. This often happens with input box labels and error messages. Many of these probably should be deleted too.

Sometimes groups of messages (not necessarily related to TranslateWiki groups) have the same word changed, because the project decided to change the wording. For example, in en.wp "content page" is replaced with "article" and a similar thing is done in he.wp and probably in many other projects. In fr.wp, the abuse filter is called "filtre anti-erreur", but in the source it's called "filtre antiabus". This causes a lot of messages to count as modified, even though only a small part of each is actually changed, so maybe this could be parametrized. And in some cases changing the message in the source may be better.

In some languages, for example Ossetian (os), a lot of messages are properly translated in the project, but the translations are not kept in the source. If they are good enough for the functioning project, they should probably be imported to the source.

In older, but medium-traffic projects, such as he.wikisource, many translations are outdated, because they were manually modified before the days of proper MediaWiki localization and Betawiki.

Can anyone create statistics of modified messages in all the 700+ Wikimedia projects? These are the curious things that I can think of now:

How many messages are modified?

How many are modified, but identical to the source?

What is the most frequently modified message (I would bet on Common.css or aboutsite...)?

If it's not too hard: how many messages are only slightly modified (let's say, differing from the source in no more than 15% of the characters)?
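A rough sketch of how the questions above could be answered once the source and project messages are available as key/text pairs. Everything here is an assumption made for illustration: the function names, the sample messages, and the use of a `difflib` similarity ratio of 0.85 as a stand-in for "no more than 15% different characters" (the two are close but not the same measure). How the messages are fetched from each project is left out.

```python
from collections import Counter
from difflib import SequenceMatcher


def classify(source: str, local: str, threshold: float = 0.85) -> str:
    """Compare one customised message with the source text.

    The threshold is a hypothetical cut-off on difflib's similarity
    ratio, used here to approximate "only slightly modified".
    """
    if local == source:
        return "identical"
    ratio = SequenceMatcher(None, source, local).ratio()
    return "slight" if ratio >= threshold else "substantial"


def summarise(source_msgs: dict, local_msgs: dict) -> Counter:
    """Count the customised messages of one project by category."""
    counts = Counter()
    for key, local in local_msgs.items():
        if key in source_msgs:
            counts[classify(source_msgs[key], local)] += 1
        else:
            # Pages with no software default at all, e.g. Common.css.
            counts["local-only"] += 1
    return counts


# Made-up sample data, not taken from any real project.
source = {
    "login": "Log in:",
    "search": "Search",
    "aboutsite": "About {{SITENAME}}",
}
local = {
    "login": "Log in",
    "search": "Search",
    "aboutsite": "Everything you ever wanted to know about this wiki",
    "common.css": "/* custom */",
}

stats = summarise(source, local)
for category, number in sorted(stats.items()):
    print(category, number)
```

Running the same summary over every project and adding up a second counter keyed by message name would give the most frequently modified message.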

Messages are often modified, even though the wording is identical. These can probably be deleted in the project, because otherwise they won't be updated when they are updated in the software.

These may be modifications that admins were waiting for too long.
Best make them aware, I'd suggest.

Modified messages often differ very little from the original. For example, the original may have a period or a colon at the end and the modified message doesn't ...

Probably worth investigating. May happen with translations made here, too, since adding/deleting a colon often does not lead to fuzzying. So it may be translatewiki.net that errs.

For example, in en.wp "content page" is replaced with "article" ..., so maybe this could be parametrized.

I think it is worth thinking this over. Rotem Liss, Gangleri, me, and possibly others had been proposing this already before translatewiki.net began as betawiki, but it has been deemed "too expensive" by other developers. I am currently contemplating a kind of offline MessageXzz.php file preprocessor or converter that may be able to make such changes as well, a kind of prompted general search and replace operation. This may become increasingly impractical the more messages are amended or added over time, and the more extensions are being used in a wiki. So I am not really happy with this kind of approach.
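To make the search and replace idea concrete, here is a minimal sketch, assuming the messages have already been read from the message file into a dictionary. The function name and the sample messages are hypothetical, and the prompting step (asking a person to confirm each change) is left out: the function only returns the proposed changes.

```python
import re


def replace_term(messages: dict, old: str, new: str) -> dict:
    """Return only the messages in which `old` was replaced by `new`.

    Matching is whole-word and case-insensitive; a capitalised match
    gets a capitalised replacement.
    """
    pattern = re.compile(r"\b" + re.escape(old) + r"\b", re.IGNORECASE)

    def swap(match: re.Match) -> str:
        found = match.group(0)
        if found[0].isupper():
            return new[0].upper() + new[1:]
        return new

    changed = {}
    for key, text in messages.items():
        updated = pattern.sub(swap, text)
        if updated != text:
            changed[key] = updated
    return changed


# Made-up sample messages, not real MediaWiki message keys or texts.
messages = {
    "nocontent": "This content page is empty.",
    "title": "Content page",
    "other": "Nothing here",
}

proposed = replace_term(messages, "content page", "article")
for key, text in proposed.items():
    print(key, "->", text)
```

A plain substitution like this does not adjust the surrounding grammar ("a content page" would become "a article"), and inflected languages make this worse, which is one reason each change would still need to be confirmed by a person.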

Can anyone create statistics of modified messages ...

Imho, this is a typical toolserver project, and it is not hard to make at least a simple working prototype. If I had the time atm, I'd give it a try, but there are likely others who can do that, too. Maybe you want to try the toolserver mailing list?

Messages are often modified, even though the wording is identical. These can probably be deleted in the project, because otherwise they won't be updated when they are updated in the software.

These may be modifications that admins were waiting for too long. Best make them aware, I'd suggest.

The modifications could have been done manually in the project, as suggested above, but I seem to remember that we were told it is a quirk of the 'Localisation Update' extension that it sometimes generates these duplicates. The advice when 'Localisation Update' was first introduced was to delete the duplicate messages in the project manually because there wasn't a way to do it automatically back then, if I remember right.

I have now deleted 370+ translations on Wikimedia Commons where the message was identical to (or less useful than) the default translation. In nearly all of these cases, the translation message was created before 2008. We should create a list of all languages, and check them off as we review all of the custom messages on Commons and/or <lang>.Wikipedia.