2184522017-04-07 04:49:38 +0000I can cause an assertion with a dtrace script and when it does happen, dtrace fails until the module dtraceall is reloaded2019-03-21 02:54:43 +0000111UnclassifiedBase Systemkern11.0-STABLEAnyAnyNew---Affects Some People---1kungfujesus06markjfklinimonmarkjmartinpeteoldest_to_newest9293730181559kungfujesus062017-04-07 04:49:38 +0000Created attachment 181559
Dtrace script that eventually fails with an assertion
I've attached a dtrace script where this problem happens reliably for me within a stable VM. The assertion is about a timestamp.
Here's the assertion:
Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file /usr/src/cddl/lib/libdtrace/../../../cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3328.9293941fk2017-04-07 10:44:56 +0000In ElectroBSD I replaced this and another timestamp-related assertion
with warnings:
https://www.fabiankeil.de/sourcecode/electrobsd/ElectroBSD-r312620-6cfa243f1516/0129-dtrace_consume-Warn-if-cpu-cores-are
https://www.fabiankeil.de/sourcecode/electrobsd/ElectroBSD-r312620-6cfa243f1516/0165-libdtrace-Replace-another-timestamp-related-assert-wi.diffn-t-completely-in.diff
Obviously that's not a real fix but in many cases the timestamp mismatch
doesn't matter and in these cases it seemed preferable to prematurly
aborting the trace and dumping core.9293972fk2017-04-07 11:27:09 +0000Copy&paste is complicated but thanks to redirects the links should work now.9294233kungfujesus062017-04-07 14:16:10 +0000(In reply to Fabian Keil from comment #2)
Based on your description of what this assertion really means I wonder if the hypervisor I'm using exacerbates this issue. I'm using VirtualBox.9294414avg2017-04-07 16:00:26 +0000I've seen an issue like this in the past. It was on a physical machine where TSC locks were noticeably different between processors. Unfortunately, right now FreeBSD measures TSC differences only once and doesn't recalibrate deltas periodically like illumos does.9294445fk2017-04-07 16:30:57 +0000I'm seeing the issue on a pysical system as well.
dmesg.boot excerpt:
CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ (2542.18-MHz K8-class CPU)
Origin="AuthenticAMD" Id=0x60fb2 Family=0xf Model=0x6b Stepping=2
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x2001<SSE3,CX16>
AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
AMD Features2=0x11f<LAHF,CMP,SVM,ExtAPIC,CR8,Prefetch>
SVM: NAsids=64
real memory = 4294967296 (4096 MB)
avail memory = 3984990208 (3800 MB)
Event timer "LAPIC" quality 100
/var/log/messages frequently contains messages like:
Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 2321 usec to 2192 usec for pid 43082 (getty)
Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 33837 usec to 32484 usec for pid 46789 (sshd)
Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 236 usec to 233 usec for pid 45792 (g_eli[0] gpt/swap-a)
Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 878257 usec to 829154 usec for pid 88073 (perl)
Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 1228 usec to 1159 usec for pid 70123 (nginx)
Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 930 usec to 878 usec for pid 68524 (nginx)
Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 1297 usec to 1225 usec for pid 16461 (casperd)
which I believe is another symptom of the same problem.
My other systems use different hardware and don't trigger the DTrace sanity check.
I don't know if VirtualBox exacerbates the problem or if it simply
inherits it from the system it's running on.
I don't use VirtualBox and I'm not seeing the issue when using bhyve,
but then again, I can only run bhyve on systems where the host system
doesn't trigger the sanity checks either.9294516kungfujesus062017-04-07 17:34:48 +0000(In reply to Fabian Keil from comment #5)
Interesting, I don't see those in /var/log/messages but maybe you're building with more warning/debug stuff turned on (perhaps the invariant options?). I'm encountering this issue with just STABLE and the generic kernel and no additional build flags. It's possible it's machine and or virtual machine specific, it's also possible that machines on release aren't seeing this because assertions are turned off. Then again, I thought that was the case for everything but CURRENT, but evidently I'm wrong on that part.9294637kungfujesus062017-04-07 20:54:19 +0000(In reply to Andriy Gapon from comment #4)
Andriy - given that this is happening, can I expect these microsecond timestamp deltas to be reasonably accurate or am I screwed without a fix for this?9295128fk2017-04-08 10:34:21 +0000The calcru complaints I'm getting don't depend on debug flags:
sys/kern/kern_resource.c-914- } else { /* tu < ruxp->rux_tu */
sys/kern/kern_resource.c-915- /*
sys/kern/kern_resource.c-916- * What happened here was likely that a laptop, which ran at
sys/kern/kern_resource.c-917- * a reduced clock frequency at boot, kicked into high gear.
sys/kern/kern_resource.c-918- * The wisdom of spamming this message in that case is
sys/kern/kern_resource.c-919- * dubious, but it might also be indicative of something
sys/kern/kern_resource.c-920- * serious, so lets keep it and hope laptops can be made
sys/kern/kern_resource.c-921- * more truthful about their CPU speed via ACPI.
sys/kern/kern_resource.c-922- */
sys/kern/kern_resource.c:923: printf("calcru: runtime went backwards from %ju usec "
sys/kern/kern_resource.c-924- "to %ju usec for pid %d (%s)\n",
sys/kern/kern_resource.c-925- (uintmax_t)ruxp->rux_tu, (uintmax_t)tu,
sys/kern/kern_resource.c-926- p->p_pid, p->p_comm);9641379commit-hook2017-12-18 17:26:37 +0000A commit references this bug:
Author: markj
Date: Mon Dec 18 17:26:25 UTC 2017
New revision: 326935
URL: https://svnweb.freebsd.org/changeset/base/326935
Log:
Avoid CPU migration in dtrace_gethrtime() on x86.
dtrace_gethrtime() may be called outside of probe context, and in
particular, from the DTRACEIOC_BUFSNAP handler.
Disable interrupts rather than using sched_pin() to help ensure that
we don't call any external functions when in probe context.
PR: 218452
MFC after: 1 week
Changes:
head/sys/cddl/dev/dtrace/amd64/dtrace_subr.c
head/sys/cddl/dev/dtrace/i386/dtrace_subr.c96413910markj2017-12-18 17:27:42 +0000I haven't been able to reproduce this in bhyve, but would be interested in knowing whether it still occurs after r326935.97822911linimon2018-02-19 01:36:40 +0000To submitter: does this problem still recur?97824312kungfujesus062018-02-19 02:16:15 +0000(In reply to Mark Linimon from comment #11)
I do still have the VM this occurs in, but I'm presently out of town for the week. I may have time to test it out remotely sometime during the week, though.97888913martin2018-02-22 11:35:52 +0000(In reply to Mark Johnston from comment #10)
The change in r326935 does not fix it for me ('dtruss pwd' triggers the original problem for me sometimes, but I've never seen any calcru messages).
Initially I was running the GENERIC 10.3-RELEASE from the iso, updated to p26 with freebsd-update (no virtualization). I then recompiled dtruss.ko with the patched dtrace_gethrtime, but the problem persists. I checked the code in the new module using kgdb.
My hardware is similar to comment #5:
CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ (2216.54-MHz 686-class CPU)
Origin="AuthenticAMD" Id=0x20f32 Family=0xf Model=0x23 Stepping=2
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x1<SSE3>
AMD Features=0xe2500800<SYSCALL,NX,MMX+,FFXSR,LM,3DNow!+,3DNow!>
AMD Features2=0x3<LAHF,CMP>
real memory = 2147483648 (2048 MB)
avail memory = 2081153024 (1984 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <Nvidia AWRDACPI>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
I think this CPU is known to have TSC synchronization issues across cores.98176114markj2018-03-10 16:08:03 +0000Would anyone able to reproduce this problem be able to test the patch here and confirm whether it fixes the problem? https://reviews.freebsd.org/D1463898497815martin2018-03-29 11:46:15 +0000(In reply to Mark Johnston from comment #14)
I applied the D14638 patch to 10.3-RELEASE-p26, but still get the original assertion:
Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file /usr/src/cddl/lib/libdtrace/../../../cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3328.104318916pete2019-03-20 19:01:02 +0000Hi, I would like to bump this ticket. I am running into this issue on an AWS EC2 instance running 12.0-RELEASE-p2. Is additional debugging info needed to help squash this guy?104320917markj2019-03-20 21:25:13 +0000(In reply to pete from comment #16)
I haven't been able to reproduce the issue in a long time despite a fair amount of effort. Could you attach a dmesg from the system and the dtrace invocation you're using?104321018markj2019-03-20 21:30:24 +0000(In reply to pete from comment #16)
You may also wish to try applying r344366, though one user confirmed that it did not solve their problem.104325719commit-hook2019-03-21 02:52:51 +0000A commit references this bug:
Author: markj
Date: Thu Mar 21 02:52:23 UTC 2019
New revision: 345359
URL: https://svnweb.freebsd.org/changeset/base/345359
Log:
Don't attempt to measure TSC skew when running as a VM guest.
It simply doesn't work in general since VCPUs may migrate between
physical cores. The approach used to measure skew also doesn't
make much sense in a VM.
PR: 218452
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Changes:
head/sys/cddl/dev/dtrace/amd64/dtrace_subr.c
head/sys/cddl/dev/dtrace/i386/dtrace_subr.c104325820markj2019-03-21 02:54:43 +0000If anyone able to reproduce the problem is willing, please try r345359. (It can be backported to any stable branch easily.) I don't believe it will fix the problem entirely, but I suspect that the attempt to correct for TSC skew hurts more than it helps when running in a VM.1815592017-04-07 04:49:38 +00002017-04-23 17:39:14 +0000Dtrace script that eventually fails with an assertioncksum.dtext/plain440kungfujesus06IyEvdXNyL3NiaW4vZHRyYWNlIC1zCgojcHJhZ21hIEQgb3B0aW9uIHF1aWV0CiNwcmFnbWEgRCBv
cHRpb24gc3dpdGNocmF0ZT05OQoKQkVHSU4KewogICAgcHJpbnRmKCJFTkRUSU1FKHVzKSBMQVRF
TkNZKG5zKSBUWVBFIFNJWkUoYnl0ZXMpIFpPTkVOQU1FIFBST0NFU1NcbiIpOwogICAgc3RhcnQg
PSB0aW1lc3RhbXA7CiAgICBuID0gMDsKfQoKZmJ0OjpmbGV0Y2hlcl80X25hdGl2ZTplbnRyeQov
ZXhlY25hbWUgIT0gImR0cmFjZSIvCnsKCXNlbGYtPnRzID0gdGltZXN0YW1wOwoJc2VsZi0+c2l6
ZSA9IGFyZ3NbMV07Cn0KCmZidDo6ZmxldGNoZXJfNF9uYXRpdmU6cmV0dXJuCi9leGVjbmFtZSAh
PSAiZHRyYWNlIi8KewoJcHJpbnRmKCJmbGV0Y2hlciB0b29rICVkIG5zIGZvciAlbHUgYnl0ZXNc
biIsIHRpbWVzdGFtcCAtIHNlbGYtPnRzLCBzZWxmLT5zaXplKTsKfQo=