We see a segmentation fault when trying to export rgw bucket using nfs-ganesha (with latest master and ceph v 12.1.0) It happens only on Intel arch. Looking at the core dump, the segmentation fault occurs at __lll_unlock_elision (). It occurs because unlock is called on object on which no lock was taken. From man page of pthread_mutex_unlock: "f a thread attempts to unlock a mutex that it has not locked or a mutex which is unlocked, undefined behavior results." With lock elision unlocking a free lock is not tolerated anymore. Hence it occurs only on intel arch.

History

Releasing lat.lock having taken fh->mtx is not per-se a smoking gun, as we have taken lat.lock at find_latch unconditionally wrt the value of fh, at l. 997. Whether we take (and maybe hold) fh->mtx deals with serialization on the new handle that callers may require. By the time we return from this routine, latch will be unlocked, but fh->mtx may not be. fh_cache.insert_latched should be releasing the lock taken at 997.

This looks like it could be a bug in using Mark's export by bucket syntax? Could you share more of your config (e.g., ganesha.conf, indicate whether demo-demo exists and should be visible to the export credential)--feel free to share by email.

One thing that would cause this is simple corruption of the filehandle table, which I guess would have to have happened almost immediately. Just to rule it out, could you maybe run with valgrind (default engine)?

Releasing lat.lock having taken fh->mtx is not per-se a smoking gun, as we have taken lat.lock at find_latch unconditionally wrt the value of fh, at l. 997. Whether we take (and maybe hold) fh->mtx deals with serialization on the new handle that callers may require. By the time we return from this routine, latch will be unlocked, but fh->mtx may not be. fh_cache.insert_latched should be releasing the lock taken at 997.

As you said we are unlocking the lock on "lat" unconditionally in function find_latch (as we pass flag FLAG_LOCK when calling fh_cache.find_latch()). Then why do we need to again call unlock on latch in fh_cache.insert_latch?

This looks like it could be a bug in using Mark's export by bucket syntax? Could you share more of your config (e.g., ganesha.conf, indicate whether demo-demo exists and should be visible to the export credential)--feel free to share by email.

I think we lock lat.lock unconditionally in find_latch, but should NOT be unlocking it. we would do the unlock if the FLAG_UNLOCK is given, but the intent is rather to hold the partition lock until the caller either fails or has inserted positionally into the already-locked partition. then we want to unlock lat.lock.

I think we lock lat.lock unconditionally in find_latch, but should NOT be unlocking it. we would do the unlock if the FLAG_UNLOCK is given, but the intent is rather to hold the partition lock until the caller either fails or has inserted positionally into the already-locked partition. then we want to unlock lat.lock.

One thing that would cause this is simple corruption of the filehandle table, which I guess would have to have happened almost immediately. Just to rule it out, could you maybe run with valgrind (default engine)?

Matt

I will try to run again on hardware where we see the issue and report back.

One thing that would cause this is simple corruption of the filehandle table, which I guess would have to have happened almost immediately. Just to rule it out, could you maybe run with valgrind (default engine)?

Matt

I will try to run again on hardware where we see the issue and report back.