KDE almost lost all 1,500 of its Git repositories

One of the administrators of the KDE project's infrastructure has described in detail an incident that took place a few days earlier and that could be called the “Great KDE Disaster of 2013.”

As a result of the incident, the KDE developers nearly lost the contents of the project's 1,500 Git repositories.

It all started with corruption of the Ext4 file system on the primary Git server, after a virtual machine crashed when it was restarted following the installation of updates on one of the project's servers. The file system error broke the integrity of the primary Git repositories: the contents of many of them were destroyed and a great deal of data was lost. The situation began to look like a disaster when the administrators set about restoring the data from a backup, because the backup in use was a mirror of the Git repositories. When they looked at the mirrors, the administrators were horrified: the mirroring system had already automatically synchronized the corrupted data to all of the secondary servers, where the contents of the repositories had likewise become unusable or had been removed.

The story had a happy ending: a copy of the data was found, and the contents of the repositories were fully recovered. Had it not been for a coincidence, there might well have been no such copy, and the data would have had to be restored bit by bit. The day before the incident, while the contents of one of the servers were being moved to new hardware, cloning of the Git repositories had additionally been set up to a new server that was not yet in service. Synchronization was configured to run every 20 minutes, and the start of the next cycle coincided with the reboot of the problem server. The full synchronization script therefore terminated on a timeout. The script that ran next, which simply fetches the latest revisions from the repositories on the failed server, also failed, because the server could not produce a valid data set. As a result, the copy on the new server was left in the state the repositories had been in before the Git server was rebooted.

The main design mistake was over-reliance on the primary server, git.kde.org, which was treated as the reference copy and as a result became a single point of failure. The backup system was designed for full recovery after complete data loss or a server crash, but it had not been evaluated against partial damage to data and metadata. The developers learned a good lesson and were quick to warn their colleagues about the dangers of over-confidence in the distributed nature of Git and of using “git clone --mirror” as a backup method.
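One way to guard against a mirror faithfully copying a partly damaged primary is to check whether the upstream state looks plausible before synchronizing. The sketch below is only an illustration and not the KDE administrators' actual tooling: the function name safe_to_sync and the 5% threshold are assumptions. It refuses to proceed when too many previously known repositories have vanished from the primary's list.

```python
def safe_to_sync(previous, current, max_loss_ratio=0.05):
    """Decide whether a mirror run should proceed.

    previous: repository names seen on the last successful run
    current:  repository names the primary reports now
    max_loss_ratio: hypothetical threshold; tune for the real setup
    """
    previous = set(previous)
    current = set(current)
    if not previous:
        # First run: there is nothing to compare against.
        return True
    missing = previous - current
    # Refuse to sync if an implausibly large share of repositories vanished.
    return len(missing) / len(previous) <= max_loss_ratio
```

A mirror script could call this before deleting or overwriting anything and alert an administrator when it returns False, instead of propagating the change.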

It turned out that the self-verification mechanism that runs during ordinary git operations does not work for “git clone --mirror”. With the --mirror option, mirroring is performed without verifying integrity and without any warning, even when commit objects in the repository are damaged. An error is reported only by a git operation that walks the entire commit tree, or when “git fsck” is run.
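Since “git fsck” does report this kind of damage, a mirroring job could run it on the source repository before trusting it. The following is a minimal sketch under that assumption. The helper name repo_is_healthy and the injectable run parameter are hypothetical, and the check relies on git fsck exiting with a non-zero status when it finds errors.

```python
import subprocess


def repo_is_healthy(repo_path, run=subprocess.run):
    """Return True only if `git fsck` exits with status 0 for repo_path.

    `run` defaults to subprocess.run and can be replaced in tests.
    """
    result = run(
        ["git", "-C", repo_path, "fsck", "--full"],
        capture_output=True,  # keep fsck's report out of the job's own output
        text=True,
    )
    return result.returncode == 0
```

A full fsck can be slow on large repositories, so it may be more practical to run it on a schedule than on every 20-minute synchronization cycle.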