We have recently encountered several occurrences of corrupted multipart objects. In every case, the corrupted object is partially retrievable: the corruption takes the form of a single zero-byte _multipart rados object, while the tail _shadow objects are complete. In other words, we can retrieve all parts up until the zero-byte multipart object, and each affected object is missing exactly 4 MiB of data. We have observed this corruption in objects as small as 18 MB and as large as 19 GB. In every case, only a single multipart object out of the entire object has been zeroed, and it has always been one of the first few parts that is affected. If we stat all of the parts, we'll see something akin to the following:

The corruption has occurred across multiple S3 clients and chunk sizes.

All instances of corruption occurred long enough ago that we no longer have the corresponding rgw logs. We do feed logs from our system into an ELK stack, so I can revisit some data, but the rgw logs are not part of the ELK feed. Using these data, I can correlate the times of corruption with periods when the cluster was reporting slow requests (the longest slow requests were 120s). None of the occurrences correspond to any network or software stack failures. Typically, we are processing multiple object PUTs per second, and we've found no corruption in objects written to the cluster at the same time (where 'same time' means within seconds/milliseconds).

We haven't been able to identify any other patterns or clues to reproduce the issue, although we are actively working to do so. If and when we can reproduce, we will attach details on reproduction and include rgw logs with debugging enabled. If there are any additional details we can provide or reproduction steps we can attempt, please ask.

This would point at a retry scenario, where an upload took too long so the client library retried sending the part. One other thing to note is that the first object in the part (the one prefixed by _multipart) has an mtime that is later than that of the last object in the part. This would make sense for regular objects, but shouldn't happen for multipart objects, so it might be the key to understanding the issue.

There was no timeout or retry on the client's end -- I have the complete set of haproxy logs to verify that. We spotted the same gap between the completion of the third part and the beginning of the PUT for the fourth, but there were no errors returned and no retry attempts from the client. But yes, we also noticed the mtime difference and found it very interesting.

As for the prefix, it was the same for this object, but it contains some data that I needed to scrub. Here's another that requires less sanitization:

This showed the same pattern of behavior in the haproxy logs (read: no errors, no retries), albeit with a far smaller gap between parts 2 and 3.

As for the mtime being more recent on the multipart object, this is common across nearly all multipart objects in our system. I haven't dug through the code; I assumed the mtime adjustment was due to rgw touching/updating an xattr on the object.

The mtime of the first object in the part is later because we finish by writing its metadata, so the first object of every part is expected to be touched twice, once when we write its data, and once when we finish writing the part. My guess is that it was written correctly, something removed it, then we finished by writing the metadata.

I did manage to spot a bug in the code that might cause the issue, but it requires that radosgw receive multiple requests with the same uploadId. If it's possible that haproxy replays requests somehow, that might explain it.

The issue I'm seeing is as follows: there is code that guards against multiple writers to the same rados object, using the multipart (head) object of each part as the guard. However, there's a bug there: when we discover that such an object has already been written, we end up removing it. Usually that doesn't matter, because we then continue to write the entire part from the beginning (with different object names for the part). However, there is a case where the original write actually ends up finishing, and if it finishes after the second writer, the part really does end up missing its first object (as we can see here).

Our haproxy installation is not set up with any retry logic, so retries would have to be initiated by the client. According to the logs for the 'backup-2016.04.18/4/__2c' object, the client did resend part 2. Logs follow:

I've another observation to share. In trying to recreate this issue, I've been trying various permutations of uploading and re-uploading parts of a multipart object. In none of the cases have I been able to reproduce the issue. In observing the object in the cluster, though, I've noticed something interesting that doesn't appear in any of the broken objects.

This is the output of radosgw-admin object stat for my test object (part 3 was reuploaded):

So what's interesting to me is that re-uploading a part in my tests causes two things: 1) the prefix for the object is recalculated; 2) a different override prefix is added to the re-uploaded part, while the non-re-uploaded parts retain the object's original prefix. The effect this has on the object in the cluster is that the rados objects are "scattered" around the cluster -- i.e. one has to search for the pieces by following the various override_prefix values. In all of the broken objects in our cluster, no override prefix has ever been calculated; all parts (including those of zero size) are stored under the original prefix (i.e. <bucket_id>__multipart_<object_name>.<multipart_prefix>.<multipart_id> and <bucket_id>__shadow_<object_name>.<multipart_prefix>.<multipart_id>_<part_id>).
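For reference, locating and stat-ing the pieces looks something like this (the pool name and object names are placeholders following the naming scheme above):

rados -p .rgw.buckets ls | grep '<object_name>'
rados -p .rgw.buckets stat '<bucket_id>__multipart_<object_name>.<multipart_prefix>.<multipart_id>'
rados -p .rgw.buckets stat '<bucket_id>__shadow_<object_name>.<multipart_prefix>.<multipart_id>_<part_id>'

In the broken objects, the _multipart head object stats as 0 bytes while the _shadow objects stat at their full size.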

Right. This reaffirms my understanding of the bug. The problem happens if there are two uploads of the same part, and the first upload completes successfully (as observed by the client) after the second upload. So basically it's like this:
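1. Upload A of a part starts; rgw writes the part's head (_multipart) object and its tail (_shadow) objects.
2. A second upload B of the same part arrives (same uploadId); rgw finds A's head object already in place, removes it, and writes the entire part again from the beginning under different object names.
3. Upload A finishes after B: it finishes by writing its head object's metadata, presumably leaving a zero-byte head object, and returns success to the client.
4. Since A finished last, its copy of the part is the one the completed object ends up with, and that copy is missing its head object's 4 MiB of data.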

This patch is NOT working for me. I have been able to successfully deploy and install it in some test environments, and while the patch allows the provided test case to pass, it fails when actually trying to retrieve multipart files. To reproduce (and I assume you can create a valid s3cfg and point things to a particular rgw):
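Roughly the following, with placeholder bucket/file names (the source file just needs to be larger than one multipart chunk):

dd if=/dev/urandom of=testfile bs=1M count=20
s3cmd mb s3://testbucket
s3cmd --multipart-chunk-size-mb=5 put testfile s3://testbucket/testfile
s3cmd get s3://testbucket/testfile testfile.out
ls -l testfile testfile.out   # the retrieved copy comes back short against the patched rgw
md5sum testfile testfile.out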

Can someone else perform this testing and let me know if this is a bug or if I've done something very, very wrong here? My RGW is built from 0.94.6 with the addition of the patches for this issue and http://tracker.ceph.com/issues/15745. I suspect building with 0.94.6 + this patch will be sufficient to reproduce it. I had missed this earlier because I wasn't verifying the file size and because I could see that the parts were in the cluster.

To add some clarity to the above, the issue seems to be on insertion. If I use an unpatched rgw to retrieve from the bucket populated by the patched rgw, the resulting files will also be truncated to 4MiB. If I use a patched rgw to retrieve data from a bucket populated by an unpatched rgw, the retrieved object has the appropriate size.

Furthermore, if I attempt to put files that are not multipart, the put fails completely. Similar tests against an unpatched rgw work as expected.

Brian, I was trying to reproduce the issue you described, and couldn't. Just to confirm, you're running 0.94.6 + the following commits:

fec62882726beed8e7455bba2be6f72cdbba0542
54910f843a0c78ab7ae1e2c225e8ce39624e23c5

and have an EC (7+2) backed .rgw.buckets pool. You also mentioned a 4MB stripe size; did you mean the 4MB chunk size? How did you configure that?

I grabbed commits b7a7d41839558b1a8786a66e27d9d80e606f61c7 (issue 15745) and 8c469997e4f03e316487770b8e22d27f7e4930ec (this issue); the commits were from the hammer PRs listed in the corresponding issues. Both of my test clusters are using EC-backed .rgw.buckets pools; one is 7+2, the other is 3+1. Both clusters are running 0.94.6-1 with those patches (except the "unpatched" RGW, which is 0.94.6-1 with the 15745 patch).

RGW is using a 4 MiB stripe size (the default setting, I believe). The chunk size is the multipart chunk size, which is configurable via your S3 client; in s3cmd, this is the --multipart-chunk-size-mb flag. In my tests, I am setting the S3 chunk size to 5 MiB. When I write a multipart object with a patched rgw, though, 'radosgw-admin object stat' shows that the "part_size" is 4 MiB instead of the expected 5 MiB. It also sets the "part_size" of the final part to 0 instead of whatever the expected remainder should be.
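For reference, the check is along these lines (bucket and object names are placeholders):

radosgw-admin object stat --bucket=testbucket --object=testfile | grep part_size

With the patched rgw, this is where the 4 MiB and 0 values show up.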

Brian, when you specify 0.94.6-1, do you mean vanilla 0.94.6 or some other distro's version? Commit b7a7d41839558b1a8786a66e27d9d80e606f61c7 cannot be applied cleanly on hammer; I assume you took another commit (maybe the one that I specified)? Compare your branch to this branch, https://github.com/yehudasa/ceph/tree/wip-15886-hammer-test, so that we're sure we're talking about the same code base.
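Something like this should work for the comparison (assuming your build tree is a git checkout):

git remote add yehudasa https://github.com/yehudasa/ceph.git
git fetch yehudasa wip-15886-hammer-test
git diff FETCH_HEAD HEAD -- src/rgw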

I appreciate the sanity check here. I had pulled the correct patch for 15745, but I definitely hosed up what I merged in for this issue. I'm going to fix this, kick off a build, and test again in the morning. My apologies for any false alarms.