On Tue, May 05, 2009 at 12:48:48PM +0200, Andreas Herrmann wrote:
> On Tue, May 05, 2009 at 11:35:20AM +0200, Andi Kleen wrote:
> > > Best example is node interleaving. Usually you won't get a SRAT table
> > > on such a system.
> >
> > That sounds like a BIOS bug. It should supply a suitable SLIT/SRAT
> > even for this case. Or perhaps if the BIOS are really that broken
> > add a suitable quirk that provides distances, but better fix the BIOSes.
>
> How do you define SRAT when node interleaving is enabled?
> (Defining same distances between all nodes, describing only one node,
> or omitting SRAT entirely? I've observed that the latter is common
> behavior.)

Either a memoryless node with 10 distance (which seems to be en vogue recently for some reason) or multiple nodes with 10 distance.

> > > Thus you see just one NUMA node in
> > > /sys/devices/system/node. But on such a configuration you still see
> > > (and you want to see) the correct CPU topology information in
> > > /sys/devices/system/cpu/cpuX/topology. Based on that you always can
> > > figure out which cores are on the same physical package independent of
> > > availability and contents of SRAT and even with kernels that are
> > > compiled w/o NUMA support.
> >
> > So you're adding a x86 specific mini NUMA for kernels without NUMA
> > (which btw becomes more and more an exotic case -- modern distros
> > are normally unconditionally NUMA) Doesn't seem very useful.
>
> No, I just tried to give an example why you can't derive CPU topology

First I must say it's unclear to me if CPU topology is really generally
useful to export to the user. If they want to know how far cores
are away they should look at cache sharing and NUMA distances (especially
cache topology gives a very good approximation anyways). For other
purposes like power management just having arbitrary sets (x is shared
with y in a set without hierarchy) seems to work just fine.

Then traditionally there were special cases for SMT and for packages
(for error containment etc.) and some hacks for licensing, but these
don't really apply in your case or can already be expressed in other
ways.

If there is really a good use case for exporting CPU topology I would
argue for not adding another ad-hoc level, but just exporting a SLIT-style
arbitrary distance table somewhere in sysfs. That would make it possible
to express any possible future hierarchies too. But again I have doubts
that's really needed at all.
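To make the idea concrete, here is a small sketch of how such a flat, SLIT-style distance table could encode a multi-level hierarchy without any extra topology files. The four-core layout and the 10/16 values are purely illustrative, following the ACPI SLIT convention that 10 means local:

```python
# Sketch: a SLIT-style distance table expressing a CPU hierarchy.
# Hypothetical values: 10 = local (same internal node), 16 = same
# physical package. Four cores, two per internal node, is assumed
# purely for illustration.

distances = [
    # c0  c1  c2  c3
    [10, 10, 16, 16],  # c0: shares a node with c1, a package with c2/c3
    [10, 10, 16, 16],  # c1
    [16, 16, 10, 10],  # c2
    [16, 16, 10, 10],  # c3
]

def cpus_within(cpu, max_distance):
    """All CPUs whose distance from `cpu` is at most max_distance."""
    return [c for c, d in enumerate(distances[cpu]) if d <= max_distance]

# Each hierarchy level is just a distance cut-off -- no new ad-hoc
# sysfs level is needed per hierarchy step:
same_node    = cpus_within(0, 10)  # -> [0, 1]
same_package = cpus_within(0, 16)  # -> [0, 1, 2, 3]
```

A deeper future hierarchy would simply add further distance values to the same table rather than another named level.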

> > and you're making it even worse, adding another strange special case.
>
> It's an abstraction -- I think of it just as another level in the CPU

It's not a general abstraction, just another ad-hoc hack.

> hierarchy -- where existing CPUs and multi-node CPUs fit in:
>
> physical package --> processor node --> processor core --> thread
>
> I guess the problem is that you are associating node always with NUMA.
> Would it help to rename cpu_node_id to something else?

Nope. It's a general problem; renaming it won't make it better.

> or something entirely different?
>
> > On the other hand NUMA topology is comparatively straight forward and well
> > understood and it's flexible enough to express your case too.
> >
> > > physical package == two northbridges (two nodes)
> > >
> > > and this needs to be represented somehow in the kernel.
> >
> > It's just two nodes with a very fast interconnect.
>
> In fact, I also thought about representing each internal node as one
> physical package. But that is even worse as you can't figure out which
> node is on the same socket.

That's what the physical id is for.

> The best solution is to reflect the correct CPU topology (all levels
> of the hierarchy) in the kernel. As another use case: for power
> management you might want to know both which cores are on which
> internal node _and_ which nodes are on the same physical package.

The powernow driver needs to know this?

The question is really if it needs to be generally known. That
seems doubtful.
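For concreteness, the two groupings under discussion -- cores per internal node and internal nodes per package -- can be sketched from per-CPU IDs. The `topology` mapping and all ID values below are made up for illustration; nothing here reflects actual hardware enumeration:

```python
# Sketch of the two groupings discussed above, on hypothetical IDs:
# which cores share an internal node, and which internal nodes share
# a physical package.

from collections import defaultdict

# cpu -> (physical package id, internal node id); an assumed layout
# of two packages with two internal nodes each, two cores per node.
topology = {
    0: (0, 0), 1: (0, 0),   # package 0, internal node 0
    2: (0, 1), 3: (0, 1),   # package 0, internal node 1
    4: (1, 2), 5: (1, 2),   # package 1, internal node 2
    6: (1, 3), 7: (1, 3),   # package 1, internal node 3
}

cores_per_node = defaultdict(list)    # internal node -> its cores
nodes_per_package = defaultdict(set)  # package -> its internal nodes
for cpu, (pkg, node) in topology.items():
    cores_per_node[node].append(cpu)
    nodes_per_package[pkg].add(node)

# Note the node id alone does not identify the socket: nodes 1 and 2
# are adjacent numbers but sit in different packages, so both IDs are
# needed to recover the full hierarchy.
```

This is just the bookkeeping; whether the kernel needs to expose it generally is the open question above.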

> > > Who needs this additional information?
> >
> > The kernel needs to know this when accessing processor configuration
> > space, when accessing shared MSRs or for counting northbridge specific
> > events.
> >
> > You're saying there are MSRs shared between the two in package nodes?
>
> No. I referred to NB MSRs that are shared between the cores on the
> same (internal) node.