https://tracker.ceph.com/ https://tracker.ceph.com/favicon.ico 2019-02-04T22:18:06Z Ceph
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=128712 2019-02-04T22:18:06Z Greg Farnum gfarnum@redhat.com
<ul><li><strong>Project</strong> changed from <i>Ceph</i> to <i>RADOS</i></li></ul>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=128716 2019-02-04T22:23:13Z Darius Kasparavičius daznis@gmail.com
<ul><li><strong>File</strong> <a href="/attachments/download/3948/ceph-osd.tar.gz">ceph-osd.tar.gz</a> added</li></ul><p>Hello,</p>
<p>I have collected the additional information Sage asked for. The attached log has debug_osd=20 set.</p>
<p>How this happened:<br />1. One of the nodes had all its OSDs set to out so they could be cleaned up for replacement.<br />2. I noticed that a lot of snaptrim was running.<br />3. I set the nosnaptrim flag on the cluster.<br />4. Once the mon_osd_snap_trim_queue_warn_on warning appeared, I removed the nosnaptrim flag.<br />5. All OSDs on the cluster crashed and started flapping, so I set the nosnaptrim flag again.</p>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=128910 2019-02-06T22:13:36Z Neha Ojha nojha@redhat.com
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>High</i></li></ul>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=128912 2019-02-06T22:42:00Z Greg Farnum gfarnum@redhat.com
<ul></ul><p>I was theorizing in a bug scrub that maybe the PG was running behind on OSDMaps and so missing the nosnaptrim flag update, but that isn't the case — the OSD doesn't look at it directly at all, just the PG when it activates a map.</p>
<p>However, since the crash came from the WaitTrimTimer state's timer triggering a transition into NotTrimming and posting a KickTrim event, I think it's safe to say there's some race or missed timer cleanup that causes this when the flag changes state. Since the timer is cleaned up when you exit the WaitTrimTimer state, that also seems a bit odd, but maybe...oh, in fact, I don't see anything that directly kills it. Maybe we do a wide reset() somewhere? Kinda looks like we only do that in PrimaryLogPG::on_change() and that is only triggered in interval changes, though.</p>
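<p>To make the hypothesis above concrete, here is a minimal toy model in Python of a timer that stays armed after its state has been exited. It is only a sketch: the class <code>ToyTrimmer</code>, its methods, the helper <code>timer_fires_after_flag_flip</code>, and the generation-counter guard are hypothetical names invented for illustration. They are not Ceph code and not necessarily what any eventual fix does. Only the state names WaitTrimTimer and NotTrimming come from the discussion.</p>

```python
from collections import deque


class ToyTrimmer:
    """Toy model of a snap-trim state machine (hypothetical, NOT Ceph code)."""

    def __init__(self, guard_stale_timers):
        self.state = "NotTrimming"
        self.guard_stale_timers = guard_stale_timers
        self.generation = 0            # bumped on every state change
        self.pending_timers = deque()  # timers that are armed but have not fired
        self.log = []

    def _enter(self, new_state):
        self.state = new_state
        self.generation += 1
        self.log.append(new_state)

    def kick_trim(self):
        # Enter the waiting state and arm a timer tagged with the
        # generation in which it was armed.
        self._enter("WaitTrimTimer")
        self.pending_timers.append(self.generation)

    def flag_changed(self):
        # Assumption of this model: a flag flip leaves WaitTrimTimer
        # but does not cancel the timer that is already armed.
        self._enter("NotTrimming")

    def fire_next_timer(self):
        armed_in = self.pending_timers.popleft()
        if self.guard_stale_timers and armed_in != self.generation:
            # The state that armed this timer is gone; drop the callback.
            self.log.append("stale timer ignored")
            return
        if self.state != "WaitTrimTimer":
            # Stands in for the assertion failure that takes the OSD down.
            raise RuntimeError("timer fired outside WaitTrimTimer")
        self._enter("NotTrimming")


def timer_fires_after_flag_flip(guard_stale_timers):
    """Return True if the late timer callback blows up in this model."""
    machine = ToyTrimmer(guard_stale_timers)
    machine.kick_trim()     # timer armed in WaitTrimTimer
    machine.flag_changed()  # state exited, timer still pending
    try:
        machine.fire_next_timer()
    except RuntimeError:
        return True
    return False
```

<p>In this model, arming the timer, flipping the flag, and then letting the timer fire raises an error when the guard is off and is quietly ignored when it is on. Tagging each timer with the generation that armed it is one common way to make late callbacks harmless; cancelling the timer on state exit is another.</p>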
<p>(Also interesting: the WaitRepops state goes into WaitTrimTimer if !can_trim() when the replies come back.)</p>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=131120 2019-03-08T08:19:14Z Darius Kasparavičius daznis@gmail.com
<ul></ul><p>Hello,</p>
<p>Are there any updates regarding this bug? I would love a patch to resolve this issue ASAP. One of my monitors just died and I can't add a new one, as it throws slow I/O errors while trying to synchronise.</p>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=132251 2019-03-18T14:22:11Z Erikas Kučinskis
<ul></ul><p>Hello, any updates about this?</p>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=133413 2019-04-02T13:02:21Z Erikas Kučinskis
<ul></ul><p>Hello, it's been two months now. Is there any update about this bug?</p>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=135582 2019-04-26T23:19:24Z David Zafman dzafman@redhat.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Assignee</strong> set to <i>David Zafman</i></li></ul><p>I am able to reproduce this, so I'll work on a fix.</p>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=135589 2019-04-27T23:26:28Z David Zafman dzafman@redhat.com
<ul></ul><p>The following script sometimes hits the race and crashes an OSD. I've removed the assert and the script has been running in a loop without seeing any other core dumps.</p>
<pre>
#! /bin/bash -x
../src/stop.sh
MGR=1 MON=1 MDS=0 OSD=5 ../src/vstart.sh -l -d -n -o osd_snap_trim_sleep=5.0 2&gt; /dev/null
sleep 5
bin/ceph osd pool create test 1 1 2&gt; /dev/null
sleep 5
sleep 2
bin/ceph pg dump pgs 2&gt; /dev/null
for s in $(seq 1 20)
do
dd if=/dev/urandom bs=1048576 count=1 of=data
for i in $(seq 1 100)
do
bin/rados -p test put obj$i data 2&gt; /dev/null
done
bin/rados -p test mksnap snap${s} 2&gt; /dev/null
done
while(true); do bin/ceph osd set nosnaptrim; sleep 1; bin/ceph osd unset nosnaptrim; done &#38;
for s in $(seq 1 20)
do
bin/rados -p test rmsnap snap$s
sleep 3
done
sleep 60
bin/ceph status
bin/ceph osd dump
kill %%
wait
bin/ceph status
bin/ceph osd dump
</pre>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=135590 2019-04-28T00:25:50Z David Zafman dzafman@redhat.com
<ul><li><strong>Pull request ID</strong> set to <i>27830</i></li></ul>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=135591 2019-04-28T00:29:12Z David Zafman dzafman@redhat.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Need Review</i></li></ul>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=136215 2019-05-07T10:15:58Z Erikas Kučinskis
<ul></ul><p>Hi, is there any ETA when the bug will be live?</p>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=136216 2019-05-07T10:16:40Z Erikas Kučinskis
<ul></ul><p>Erikas Kučinskis wrote:</p>
<blockquote>
<p>Hi, is there any ETA when the bug fix will be live?</p>
</blockquote>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=136357 2019-05-08T21:33:14Z Greg Farnum gfarnum@redhat.com
<ul><li><strong>Status</strong> changed from <i>Need Review</i> to <i>Pending Backport</i></li><li><strong>Backport</strong> set to <i>mimic, nautilus</i></li></ul><p>No ETA; it'll have to wend its way through the backports process. I don't think any releases are imminent so it should be the next point release though.</p>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=136369 2019-05-09T07:10:14Z Erikas Kučinskis
<ul></ul><p>Greg Farnum wrote:</p>
<blockquote>
<p>No ETA; it'll have to wend its way through the backports process. I don't think any releases are imminent so it should be the next point release though.</p>
</blockquote>
<p>Thank you for the information.</p>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=136510 2019-05-10T11:00:15Z Nathan Cutler ncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/39698">Backport #39698</a>: mimic: OSD down on snaptrim.</i> added</li></ul>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=136512 2019-05-10T11:00:22Z Nathan Cutler ncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/39699">Backport #39699</a>: nautilus: OSD down on snaptrim.</i> added</li></ul>
RADOS - Bug #38124: OSD down on snaptrim. https://tracker.ceph.com/issues/38124?journal_id=140527 2019-07-12T12:33:54Z Nathan Cutler ncutler@suse.cz
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>