If I interrupt the download, I can see that the object has exactly 524288 bytes, corresponding to the head_size in the manifest. So it seems the head object has been written correctly and can be read, but the next tail object has gone missing. If I check the rgw log, however, the PUT for that object finished successfully, returning a 200 response.

The cluster is running version 0.94.10-1trusty on Ubuntu Trusty, with 388 OSDs, 494 TB of data and about 1300 Mobjects. It is unclear how many objects are affected; we see errors for about 100 per day.

History

It seems this happens when an object is copied: the copy has the same prefix for its shadow objects as the original. So when the copied object is deleted, garbage collection removes the shared shadow objects and leaves the original object truncated.

A copy operation should increase the tail's refcount. The garbage collection will try to remove the tail by dropping the refcount, but it will not remove a tail that is still shared by multiple entries. There was a bug, since fixed, that dealt with objects that were copied across buckets; maybe that's the issue you're seeing?

The bug I mentioned dealt with copying objects that resided on different pools. Do you have a scenario where this reproduces easily? If so, can you set 'debug rgw = 20' and 'debug ms = 1' and provide logs?

Do you have a reference to that bug? I haven't been able to find anything that looks like it would match. Also, the copies that I am seeing take place within a single bucket.

I have now verified multiple occurrences, and in each of them I can see N copies being made and the N copies being deleted again, but N+1 matching gc:process removal entries, each time with one pair of them having timestamps only a couple of milliseconds apart.

Note that this affects only very few objects, maybe 100 per day, out of about 100M processed in total. Most of them are more than a week old, though, so it's difficult to gather the appropriate information from the logs.

Looking at the code, I don't see anything obvious that could explain this; it could be either the object being placed onto the gc queue twice or the gc job processing a single entry twice.

And no, with the volume of requests we are dealing with, we are already at about 50 GB of logs per day, so there is no way to increase the logging level beyond the default. We could consider adding some specific logging code, though, if you have an idea for that.

Also, I haven't been able to reproduce this on a non-production cluster yet, sadly.

So I can reproduce this issue now, after finding and working around #20234:

1. Run the python script below, which uploads 4000 objects, copies each one and deletes the copy again.
2. Wait for all gc timers to expire.
3. Run radosgw-admin gc process to handle all 4000 removals in one batch.
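The script itself is attached to the tracker rather than included here; a hypothetical reconstruction of what it does (endpoint, credentials and bucket name are placeholders, and the object size is simply chosen to exceed the 512 KiB head so a tail/shadow object exists) might look like:

```python
# Hypothetical reconstruction of the reproduction script; not the
# original attachment. All names and credentials are placeholders.

def reproduce(s3, bucket, count=4000, size=4 * 1024 * 1024):
    """Upload `count` objects large enough to have a shadow tail
    (> 524288-byte head), copy each within the same bucket, then
    delete the copy so gc later removes the copy's reference."""
    body = b"x" * size
    for i in range(count):
        key, copy = "orig-%04d" % i, "copy-%04d" % i
        s3.put_object(Bucket=bucket, Key=key, Body=body)
        s3.copy_object(Bucket=bucket, Key=copy,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=copy)

# To run against a real cluster (placeholder endpoint/credentials):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="http://rgw.example.com",
#                     aws_access_key_id="ACCESS_KEY",
#                     aws_secret_access_key="SECRET_KEY")
#   reproduce(s3, "gc-repro")
```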

As a result, I'm seeing about 400 of the removals being processed twice as seen in the log in comment #2, resulting in the corresponding objects being truncated.

Please confirm if my understanding of how refcounting works is reasonable:

1. PUT obj1 => shadow object is created with no refcount xattr.
2. COPY obj1 obj2 => shadow object gets a refcount xattr with tag_for_obj2 plus a wildcard entry.
3. DEL obj2 => when GC runs, tag_for_obj2 is removed from the refcount, but since the wildcard entry remains, the shadow object is kept in place.
4. GC processes the same entry a second time, the wildcard is removed, and the now-empty refcount causes the shadow object to be deleted.
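A toy model of these semantics (illustrative names, not the actual cls_refcount implementation) shows why a duplicate GC pass deletes the tail:

```python
# Toy model of the tail refcount semantics as I understand them;
# names are illustrative, not the real cls_refcount code.
WILDCARD = "*"

class ShadowObject:
    def __init__(self):
        # PUT: no refcount xattr yet; an implicit wildcard stands in
        # for the original writer's reference.
        self.refs = None

    def get(self, tag):
        """COPY: materialize the refcount with the new tag plus the
        wildcard entry covering the original reference."""
        if self.refs is None:
            self.refs = {WILDCARD}
        self.refs.add(tag)

    def put(self, tag):
        """Drop one reference; returns True when the shadow object
        should be deleted. Note this is NOT idempotent."""
        if self.refs is None:
            return True
        if tag in self.refs:
            self.refs.discard(tag)
        else:
            self.refs.discard(WILDCARD)  # unknown tag eats the wildcard
        return not self.refs

shadow = ShadowObject()
shadow.get("tag_for_obj2")                 # COPY obj1 obj2
assert shadow.put("tag_for_obj2") is False # first GC pass: wildcard survives
assert shadow.put("tag_for_obj2") is True  # duplicate pass: tail deleted
```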

After noticing that there is always a gc_list operation between the duplicated gc actions, I think I found the bug:

gc fetches up to 100 tags for processing. Deleting these tags is done in batches of 16 (or 17?), though. So when fetching the next batch of tags, some tags have already been processed but not yet deleted from the list, resulting in duplicates. The fix should be pretty easy:

Yeah, your analysis seems right. I think the big issue here is that we don't set a tag on the first write, but rather rely on a wildcard entry. IIRC, the wildcard entry was added for backward compatibility when refcounting was introduced, and we shouldn't rely on it for newer objects (though it has been a while, so I might be wrong).

Actually, I take that backward compatibility claim back a bit. Looking at the code again, I think this is how it was designed to work. The problem, as you pointed out, is that the removal is not idempotent. One way to fix it would be to drop the use of the wildcard, which I think will be tricky in terms of backward compatibility. Another way would be to keep a list of the references that have already been removed and check that list on every removal. We also need to fix the gc, but that alone will not solve all cases, since it is not atomic; it will probably just make the race harder to hit.
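A minimal sketch of that second approach (tombstone list; illustrative names, not the actual cls_refcount code) makes the duplicate drop a no-op:

```python
# Sketch of idempotent reference removal via a tombstone list of
# already-removed tags; names are illustrative, not the real code.
WILDCARD = "*"

class RefCount:
    def __init__(self, refs):
        self.refs = set(refs)
        self.removed = set()   # tombstones for already-dropped tags

    def put(self, tag):
        """Drop `tag`; returns True when no references remain."""
        if tag in self.removed:
            return not self.refs          # duplicate drop: no-op
        if tag in self.refs:
            self.refs.discard(tag)
        else:
            self.refs.discard(WILDCARD)   # fall back to the wildcard
        self.removed.add(tag)
        return not self.refs

rc = RefCount({WILDCARD, "tag_for_obj2"})
assert rc.put("tag_for_obj2") is False  # first GC removal
assert rc.put("tag_for_obj2") is False  # duplicate no longer deletes
```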