#394 Ixgbe driver sets affinity_mask for wrong NUMA node
https://sourceforge.net/p/e1000/bugs/394/

**Todd Fujinaka (2014-02-05):**
* status: open --> closed

**Arkadiusz B (2014-01-27):**

Thank you for the explanation. I'll do some tests with the ignore policy.

**Neil Horman (2014-01-26):**

"Hint policy ignore won't resolve the problem of cross NUMA traffic."
It will if that's the only problem you're having. irqbalance, if it can ignore the affinity_hint and is running on a system with a properly populated ACPI SLIT table, will keep each irq on the local NUMA node of the device. It might help to understand here that the affinity_hints from the ixgbe driver, IIRC, do not honor NUMA node locality. Because ixgbe creates a queue per CPU, it sets the affinity hint for the irq tied to each queue to a unique CPU, ignoring any NUMA locality. That's a perfectly reasonable thing to do, as it prioritizes parallel operation over NUMA locality. If that's not the right choice for you, however, the answer is to ignore the affinity hint and let irqbalance perform its default operation, which is to spread irqs in the same way, but keep them all local to the device's reported NUMA node.
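A quick way to see this from a shell is to compare the driver's hints against the device's reported node (a minimal sketch; sysfs paths as on a typical Linux system, and eth0 is a placeholder interface name):

```sh
# Compare each queue IRQ's affinity_hint with the NIC's local NUMA node.
IFACE=eth0                                     # placeholder interface name
cat /sys/class/net/$IFACE/device/numa_node     # node the adapter sits on
cat /sys/class/net/$IFACE/device/local_cpus    # CPU mask local to that node
for irq in $(grep "$IFACE" /proc/interrupts | awk '{sub(":", "", $1); print $1}'); do
    printf 'irq %s hint=%s affinity=%s\n' "$irq" \
        "$(cat /proc/irq/$irq/affinity_hint)" \
        "$(cat /proc/irq/$irq/smp_affinity)"
done
```

With the per-CPU hints described above, roughly half of the hint masks on a two-node box will fall outside local_cpus.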
<p>"I have a question. Assume that there are two NUMA nodes. Ethernet ..."</p>
No, it won't cause packet drops, at least not in and of itself. The answer to your question is really your choice. Certainly transmits from the remote node will be delayed slightly due to the additional PCI bus traversals they have to do, but whether or not that additional delay is better or worse than moving the additional processes to the more local NUMA node (incurring the additional process competition for CPU time and memory) is up to you and your workload requirements.
<p>"Does attached patch make sense if I want to avoid that situation?"</p>
Honestly, no. It's certainly functional, but from a philosophical standpoint I don't like it. As discussed above, enforcing NUMA node locality or maximizing parallel behavior is really a policy decision, which places it in the realm of user space. Adding it to the kernel is really enforcing a policy that isn't in the best interests of all users. You can do the same thing by just ignoring the affinity_hint in irqbalance.
In fact, affinity_hint is becoming something of a holdover from before irqbalance was rewritten to parse sysfs properly. A few years ago, irqbalance wasn't really MSI-aware and had a very hard time determining how to balance these interrupts. Intel at the time solved this problem by creating the affinity_hint interface to drive more correct mappings. Since then irqbalance has been rewritten and can now gather information about irqs from sysfs and make decisions better than the driver can. I should probably change the irqbalance hint policy default to ignore soon.

**Arkadiusz B (2014-01-25):**

Thank you for your response. Yes, I'm using the latest irqbalance and it works as expected. Hint policy ignore won't resolve the problem of cross NUMA traffic.
I have a question. Assume that there are two NUMA nodes. The Ethernet adapter is connected to the first node and queues are on both nodes. All cores/threads on the first node are under heavy load and there is data on the second NUMA node to send to this adapter. Won't this situation cause problems (timeouts, dropped packets, etc.)?
Does the attached patch make sense if I want to avoid that situation?
I have had a lot of problems with NUMA architecture and cross-node traffic before, so I want to be sure that resources on the second NUMA node won't cause more problems in the future.

**Neil Horman (2014-01-24 18:48):**

Hey there, I'm Neil, and I'm the irqbalance maintainer. Emil got in touch with me and asked me to look at this.
It's a recent irqbalance version, I presume? If so, the answer is in the irqbalance man page:
    --hintpolicy=[exact | subset | ignore]
            Set the policy for how irq kernel affinity hinting is treated.
            Can be one of:
            exact  - irq affinity hint is applied unilaterally and never violated
            subset - irq is balanced, but the assigned object will be a subset of
                     the affinity hint
            ignore - irq affinity hint value is completely ignored
If they want irqbalance to ignore the affinity hint provided by the e1000 driver, they should add --hintpolicy=ignore to IRQBALANCE_ARGS in /etc/sysconfig/irqbalance, or in the unit file if using systemd.
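For example (a sketch; the sysconfig path and variable name are distro-specific):

```sh
# /etc/sysconfig/irqbalance -- environment file sourced by the irqbalance
# service on Red Hat-style distributions; path and variable name may differ.
IRQBALANCE_ARGS="--hintpolicy=ignore"
```

Then restart the service (service irqbalance restart, or systemctl restart irqbalance) so the new arguments take effect.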
If the version of irqbalance is sufficiently recent, and you want more fine-grained control than what's detailed above, you can also check out the --policyscript option. That will run a user-specified script for each discovered irq, and that script can return on stdout a series of key=value pairs that specify per-irq configuration overrides. That doesn't support affinity hint honoring levels yet, but it certainly can. Open a feature request at the github project page for irqbalance if that's something you're interested in.
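As a rough sketch of what such a script can look like (the calling convention and the numa_node= override key are as described in the irqbalance man page of this era; verify against your installed version before relying on them):

```sh
#!/bin/sh
# Hypothetical --policyscript hook. irqbalance runs it once per discovered IRQ,
# passing the sysfs device path and the IRQ number, and reads key=value
# overrides from the script's stdout.
DEVPATH="$1"
IRQ="$2"

# Example policy: if the device reports a NUMA node, ask irqbalance to keep
# this IRQ on that node; print nothing to fall back to the default behaviour.
if [ -r "$DEVPATH/numa_node" ]; then
    node=$(cat "$DEVPATH/numa_node")
    [ "$node" -ge 0 ] && echo "numa_node=$node"
fi
exit 0
```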
**Todd Fujinaka (2014-01-24 15:49):**

I think we should close this as a bug and you should pose this question on open mailing lists such as e1000-devel. The cross-node traffic is something you need to manage, but there are degenerate cases where costly memory accesses increase when you put all the queues on one node.
I am also forwarding this question to the TME in charge of performance, and Emil is following up with the maintainer of irqbalance.

**Arkadiusz B (2014-01-24 07:50):**

Yes, but if the device is physically connected to the first NUMA node and half of the queues are allocated on another NUMA node, doesn't that cause cross traffic between NUMA nodes?

**Todd Fujinaka (2014-01-23 17:32):**

The set_irq_affinity script is not meant to be more than a simple way to spread the queues amongst all the available cores; it does not distinguish between packages.
As I said before, the driver doesn't and shouldn't have to know about the system configuration. It provides hints to irqbalance, so you should be asking these questions of the irqbalance maintainers.

**Arkadiusz B (2014-01-23 08:23):**

Doesn't this solution cause a situation where the device needs to communicate with a CPU on another NUMA node?
irqbalance doesn't touch smp_affinity when this warning occurs. It is set to the default (all CPUs on the device's NUMA node).

**Emil Tantilov (2014-01-22):**

By default the ixgbe driver loads with number of queues = number of CPUs, and because ideally we want to spread the queues one per CPU, there will be some queues on a CPU from the opposite node. The way the driver handles this is to allocate memory from the node which is local to the CPU on which the IRQ is handled. This goes along with the set_irq_affinity script provided with the ixgbe driver.
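In spirit the script just walks the interface's queue IRQs and pins one queue to each CPU in turn; a simplified bash sketch of that idea (not the bundled script itself, and it assumes fewer than 64 CPUs so a single hex mask is enough):

```sh
# Pin queue IRQ N to CPU N round-robin, ignoring NUMA boundaries.
# Run as root; eth0 is a placeholder interface name.
IFACE=eth0
cpu=0
ncpus=$(nproc)
for irq in $(grep "$IFACE" /proc/interrupts | awk '{sub(":", "", $1); print $1}'); do
    printf '%x' $((1 << cpu)) > /proc/irq/$irq/smp_affinity
    cpu=$(( (cpu + 1) % ncpus ))
done
```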
The affinity_hints are provided on driver load and irqbalance should be able to override them by setting /proc/irq/&lt;irq&gt;/smp_affinity. Is irqbalance not setting the smp_affinity correctly because of the affinity_hint, or is this just a warning?
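One way to check is to read both values for a queue vector and try writing the mask by hand (95 is a hypothetical IRQ number; pick a real one from /proc/interrupts):

```sh
IRQ=95                                  # hypothetical ixgbe vector number
cat /proc/irq/$IRQ/affinity_hint        # hint exported by the driver
cat /proc/irq/$IRQ/smp_affinity         # mask actually in effect
echo 4 > /proc/irq/$IRQ/smp_affinity    # move the vector to CPU 2 by hand
cat /proc/irq/$IRQ/smp_affinity         # if the write sticks, the hint is only advisory
```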