On Jun 6, 2012, at 8:01 AM, Sašo Kiselkov wrote:
> On 06/06/2012 04:55 PM, Richard Elling wrote:
>> On Jun 6, 2012, at 12:48 AM, Sašo Kiselkov wrote:
>>
>>> So I have this dual 16-core Opteron Dell R715 with 128G of RAM attached
>>> to a SuperMicro disk enclosure with 45 2TB Toshiba SAS drives (via two
>>> LSI 9200 controllers and MPxIO) running OpenIndiana 151a4 and I'm
>>> occasionally seeing a storm of xcalls on one of the 32 VCPUs (>100000
>>> xcalls a second).
>>
>> That isn't much of a storm; I've seen >1M xcalls in some cases...
>
> Well, it does make one of the cores 100% busy for around 10-15 seconds,
> so it is processing at the maximum rate the core can sustain. I'd call
> that a sign of "something bad(tm)" going on.

It is unpleasant, for sure...
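If you want to put numbers on it, here is a quick DTrace sketch (plain
sysinfo provider usage -- a starting point, not a polished tool) that
prints per-CPU xcall rates once a second:

  dtrace -n 'sysinfo:::xcalls { @[cpu] = count(); }
      tick-1sec { printa("CPU %d: %@d xcalls/sec\n", @); trunc(@); }'

That will confirm whether the storm always lands on the same CPU.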
>>> The machine is pretty much idle, only receiving a
>>> bunch of multicast video streams and dumping them to the drives (at a
>>> rate of ~40MB/s). Roughly every 1-2 minutes I get a storm of xcalls
>>> that completely eats one of the CPUs, so the mpstat line for that CPU
>>> looks like:
>>>
>>> CPU minf mjf   xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>>>  31    0   0 102191    1    0    0    0    0    0    0     0    0 100   0   0
>>>
>>> 100% busy in the kernel processing cross-calls. When I tried DTracing
>>> this issue, I found the most likely culprit:
>>>
>>> dtrace -n 'sysinfo:::xcalls {@[stack()]=count();}'
>>> unix`xc_call+0x46
>>> unix`hat_tlb_inval+0x283
>>> unix`x86pte_inval+0xaa
>>> unix`hat_pte_unmap+0xed
>>> unix`hat_unload_callback+0x193
>>> unix`hat_unload+0x41
>>> unix`segkmem_free_vn+0x6f
>>> unix`segkmem_zio_free+0x27
>>> genunix`vmem_xfree+0x104
>>> genunix`vmem_free+0x29
>>> genunix`kmem_slab_destroy+0x87
>>> genunix`kmem_slab_free+0x2bb
>>> genunix`kmem_magazine_destroy+0x39a
>>> genunix`kmem_depot_ws_reap+0x66
>>> genunix`taskq_thread+0x285
>>> unix`thread_start+0x8
>>> 3221701
>>>
>>> This happens in the sched (pid 0) process. My fsstat output looks like this:
>>>
>>> # fsstat /content 1
>>>  new  name  name  attr  attr lookup rddir  read  read write write
>>> file remov  chng   get   set    ops   ops   ops bytes   ops bytes
>>>    0     0     0   664     0    952     0     0     0   664 38.0M /content
>>>    0     0     0   658     0    935     0     0     0   656 38.6M /content
>>>    0     0     0   660     0    946     0     0     0   659 37.8M /content
>>>    0     0     0   677     0    969     0     0     0   676 38.5M /content
>>>
>>> What's even more puzzling is that this apparently happens entirely
>>> outside of userland: I see no change in the CPU usage of any process
>>> in prstat(1M) when the xcall storm hits, only a loadavg increase of
>>> +1.00 (the busy CPU).
>>
>> What exactly is the workload doing?
>
> As I wrote above, just receiving multicast video streams and writing
> them to disk files, nothing else. The fsstat lines above show that -
> pure write load.
>
>> Local I/O, iSCSI, NFS, or CIFS?
>
> Purely local I/O via the two LSI SAS controllers, nothing else.
>
>>> I Googled and found that
>>> http://mail.opensolaris.org/pipermail/dtrace-discuss/2009-September/008107.html
>>> seems to have been an issue identical to mine; however, it remained
>>> unresolved at the time, which worries me about putting this kind of
>>> machine into production use.
>>>
>>> Could some ZFS guru please tell me what's going on in segkmem_zio_free?
>>
>> It is freeing memory.
>
> Yes, but why is this causing a ton of cross-calls?
To better understand xcalls, see Adam's excellent blog on the topic:
https://blogs.oracle.com/ahl/entry/debugging_cross_calls_on_opensolaris
The scenario described there probably isn't your problem, but the explanation
of xcalls is quite good.
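In your stack those xcalls are TLB shootdowns: when segkmem_zio_free
unmaps kernel pages, every other CPU has to be cross-called to invalidate
its stale TLB entries (that's the hat_tlb_inval/xc_call pair at the top
of your stack). To see how much memory each reap pass releases, here is a
hedged one-liner -- it assumes the illumos signature
segkmem_zio_free(vmp, inaddr, size), so arg2 is the byte count:

  dtrace -n 'fbt::segkmem_zio_free:entry { @["bytes freed"] = sum(arg2); }
      tick-1sec { printa(@); trunc(@); }'

If the freed-bytes spikes line up with your xcall storms, the kmem reaper
(kmem_depot_ws_reap in your stack) is the thing to chase.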
>>> When I disable the writers to the /content filesystem, this issue goes
>>> away, so it obviously has something to do with disk I/O. Thanks!
>>
>> Not directly related to disk I/O bandwidth. Can be directly related to other
>> use, such as deletions -- something that causes frees.
>
> When I'm not writing to disk it doesn't happen, so my guess is that it
> indeed has something to do with (perhaps) ZFS freeing txg buffers or
> something...
>
>> Depending on the cause, there can be some tuning that applies for large
>> memory machines, where large is >= 96 GB.
>> -- richard
>
> I'll try to load the machine with dd(1) to the max to see whether the
> access patterns of my software have something to do with it.
OK, the next step is to determine whether this is part of normal ARC resizing or not.
If it is, then tuning can help. I've got a few scripts to do this that aren't
posted.
Fortunately, Brendan posted his, so look for the arcreap.d script at
http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/
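If you'd rather not run the whole script, here is a minimal sketch in the
same spirit (it assumes fbt probes on arc_shrink and arc_kmem_reap_now
exist in your build -- verify with dtrace -l first):

  dtrace -n 'fbt::arc_shrink:entry, fbt::arc_kmem_reap_now:entry
      { printf("%Y %s\n", walltimestamp, probefunc); }'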
If you are seeing lots of xcall activity related to arc_shrink, then this can
be tuned.
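For example, on a 128 GB box you might cap the ARC in /etc/system so it
stops bumping into the memory ceiling (the value below is an illustration,
not a recommendation):

  * Example only: cap the ZFS ARC at 96 GB
  set zfs:zfs_arc_max = 0x1800000000

A reboot is needed for /etc/system changes to take effect.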
-- richard
--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422