Monday, April 30, 2007

I just committed a change to move 7 previously closed drivers in Nevada to the open source tree under usr/src. This change involved nothing more than Makefile and copyright block editing, so it was pretty much a no-brainer. (Though the heavy lifting of the legal review had already been done.)

Admittedly, none of these are likely to exist on your hardware, but it does help to have more bits open. Hopefully someday /usr/closed will either cease to exist or become its own consolidation separate from Nevada.

Thursday, April 26, 2007

FYI, the afe and dmfe cases I had at PSARC (2007/229 and 2007/221 respectively) were approved. I've already put back the dmfe code. The afe code will be committed by Alan DuBoff. I've got pre-approval to do a follow-up putback to convert afe to GLDv3 afterwards.

Note that as a result of Crossbow, there are some changes coming in GLDv3, so it is still inappropriate to use GLDv3 for unbundled drivers. (The biggest of these changes is support for "polling", where the network stack can disable interrupts on the NIC and run a separate thread to poll the device for inbound packets. On extremely high traffic systems, this can have a big impact on overall system throughput by avoiding the extra context switches.)
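To make the polling idea concrete, here is a toy model in Python. It is not GLDv3 or Crossbow code: the ToyNic class, its method names, and the budget of 64 packets are all invented for this sketch. It only counts how often the stack is entered when every packet raises an interrupt, versus when interrupts are off and a poll thread drains the ring in batches.

```python
from collections import deque


class ToyNic:
    """Hypothetical NIC model: an rx ring plus an interrupt-enable flag."""

    def __init__(self):
        self.rx_ring = deque()
        self.intr_enabled = True
        self.wakeups = 0        # number of times the stack was entered
        self.delivered = []

    def receive(self, pkt):
        """Hardware side: a packet arrives from the wire."""
        self.rx_ring.append(pkt)
        if self.intr_enabled:
            # Interrupt mode: one stack entry per packet in this toy model.
            self._deliver(1)

    def poll(self, budget):
        """Stack side: the poll thread pulls up to `budget` packets at once."""
        if self.rx_ring:
            self._deliver(budget)

    def _deliver(self, limit):
        self.wakeups += 1
        while self.rx_ring and limit > 0:
            self.delivered.append(self.rx_ring.popleft())
            limit -= 1


def run(polling, npkts=1000, budget=64):
    """Feed npkts packets through the NIC; return (wakeups, delivered)."""
    nic = ToyNic()
    nic.intr_enabled = not polling
    for i in range(npkts):
        nic.receive(i)
        if polling and (i + 1) % budget == 0:
            nic.poll(budget)
    while polling and nic.rx_ring:
        nic.poll(budget)        # drain whatever is left
    return nic.wakeups, len(nic.delivered)
```

Real hardware often coalesces interrupts, so the per-packet case here is deliberately pessimistic. The point is only that the per-entry cost gets spread over a batch of packets.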

Wednesday, April 25, 2007

The PSARC fasttrack to integrate afe into Nevada was assigned case number PSARC 2007/229. Notably, this case was not submitted by me (I'm not even on the interest list!), and is being done as a result of the BSD license terms for afe. It will probably be reviewed at next week's PSARC meeting.

Sunday, April 22, 2007

As predicted, the area of biggest risk in my conversion of eri to GLDv3 was in fact the kstat handling. However, I appear to have that all worked out now, and the binary is working flawlessly on my SunBlade 100. Even suspend/resume works fine. However, I've not yet integrated this code properly into a workspace to generate a webrev, but I will do so soon. (Probably tomorrow... I'd like to get my two other RTIs put back first.)

One of the biggest concerns about this effort was the added risk that doing this conversion might bring to the "stable" eri driver. So, I'm asking the community for help. If you want to help out with testing, especially if you have higher end systems or want to do some benchmark comparisons, please let me know.

(I don't have specific test suites to give out at this time... it's of more value, frankly, to have people using their own tests right now; that way we get broader test coverage than we might with a single test suite.)

Please let me know. Thanks! (Oh yeah, if you have an eri you want to try with new GLDv3-based 802.3ad link aggregation features, I'd be game for that, too!)

(PS. An obvious consequence of this effort is that it will be easy to do the work to convert hme, gem, and qfe, which share a lot of common heritage with the eri driver. So, maybe there is yet hope for those, as well.)

Friday, April 20, 2007

I've just finished (still testing!) my port of eri to GLDv3. Between that and eri, and looking at existing GLDv3 drivers (bge, rge, e1000g), I think I have gathered some operational experience that I hope we can use to improve Nemo. (So, anyone who says my time spent on converting eri was wasted is wrong... because if nothing else it gained some more operational experience with GLDv3.)

Executive summary of the takeaways I have gotten so far, that I think are worth noting:

There is still a lot of code duplicated across even GLDv3 drivers (more below).

Some drivers can probably be changed internally to work even better with GLDv3 than a naive port.

So here's the detailed stuff.

Code duplication

The duplicated code falls into three major areas: ioctls (mostly ndd(1M) and loopback handling for SunVTS), kstats, and MII. For now I want to focus on the MII bit. It turns out that pretty much every Ethernet device on the planet talks to a transceiver (whether integrated into the same chip as the MAC controller or not) using MII/GMII. We have tons of logic surrounding MII and GMII replicated across each driver, and frequently the decisions made by one driver are different from those in another.

There exists an old i386 driver called mii, which was an abortive attempt to create a common module/framework for MII and PHY handling. (It is only used by the obsolete dnet driver at present.) I think this should be revived. It's been shown to work well for BSD Unix (at least NetBSD, but I'm pretty sure all of them), and it would really help simplify a lot of code. The eri driver, for example, probably has a couple thousand lines of MII-related auto-negotiation logic in it.

And of course, each of these negotiation frameworks takes a slightly different set of tunables and configuration parameters, exports different statistics, etc.
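As a small example of logic that a common MII module could own once, here is a Python sketch of how a link mode is resolved from the two advertisement registers. It is not taken from the mii module or from eri. The constant and function names are made up for illustration, and the bit values and priority order follow the standard 10/100 MII register layout as I understand it.

```python
# Technology ability bits shared by the local advertisement register
# and the link partner ability register (10/100 only; no gigabit here).
ANAR_10_HD = 0x0020   # 10BASE-T half duplex
ANAR_10_FD = 0x0040   # 10BASE-T full duplex
ANAR_TX_HD = 0x0080   # 100BASE-TX half duplex
ANAR_TX_FD = 0x0100   # 100BASE-TX full duplex
ANAR_T4 = 0x0200      # 100BASE-T4

# Highest priority first: the best mode both ends advertise wins.
PRIORITY = [
    (ANAR_TX_FD, (100, "full")),
    (ANAR_T4, (100, "half")),
    (ANAR_TX_HD, (100, "half")),
    (ANAR_10_FD, (10, "full")),
    (ANAR_10_HD, (10, "half")),
]


def resolve_link(local_adv, partner_adv):
    """Return (speed_mbps, duplex) for the best common mode, or None."""
    common = local_adv & partner_adv
    for bit, mode in PRIORITY:
        if common & bit:
            return mode
    return None
```

With this in one place, every driver would resolve the same register contents to the same answer, instead of each one making its own slightly different decision.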

Lock management is so much simplified

It's really easy to write a GLDv3 driver that doesn't hold locks across GLDv3 routines. I suspect a lot of deadlocks/hangs/panics are going to be solved by moving drivers to GLDv3. (Of course, we've seen locking problems higher in the stack as a result... see recent deadlocks in dls, etc. But we only need to solve those once with GLDv3. Yay.)

The kstat framework for GLDv3 is just plain broken.

There are several problems here.

All kstats for a media type are included, regardless of whether or not they make sense for a specific device. For example, cap_rem_fault is not supported by most of the drivers yet, and when the driver doesn't have support for it in mac_stat(), the statistic is still included in kstat output as 0. However, pretty much any system with an 802.3u compliant MII does in fact support the rem_fault MII field. So in this case, just because the driver isn't exporting the stat, the framework is creating an outright lie. This is probably true of other stats as well. For example, if hardware isn't prepared to report runt_errors, then it doesn't make sense to claim that value as "zero"... because you might be flooding the device with bad packets, which just get dropped on the floor (perhaps getting accounted in some other, less granular "BadPackets" counter or somesuch). Better to say nothing than to tell a lie, IMO.

kstats are normally "snapshotted", so that you can take a snapshot of all stats in time at once. This is common with some hardware devices, too. Getting these stats may be expensive, though. (For example, reclaiming transmit buffers so you can collect transmit status, acquiring locks, etc. With some devices you might even have to do an expensive collection effort that would normally cv_wait for an interrupt.) Having to go through this several times (once for each stat collected) for a single snapshot is ... inefficient. It would be nice to add a mac_stat_update() entry point, which is separate from the mac_stat() entry point. (Even better, also add a mac_stat_done() to release any resources acquired by the first call.) The good news, I think, is that hopefully we aren't going to have to support DLPI DL_GET_STATISTICS_REQ, so it should be safe to cv_wait in mac_stat() related calls now (unlike with older GLDv2). We aren't supporting the DLPI statistics calls, are we? Please say we aren't....

If the driver wants to export any additional driver-specific statistics, it has to do the whole kstat dance itself, in addition to the nemo mac_stat() entry point. Let's try to find a way for drivers to export/register additional driver-specific kstats within the existing nemo framework, please?

Duplication. E.g. for bge, there is a "bge0" kstat, created by dls, as well as a "mac" kstat created by the mac module. Both of these will have some common counters, like ipackets64, brdcstxmt, etc. What's worse, one stat in particular, "unknowns" is counted by the dls framework in the "bge0" stat, but is not counted by the "mac" stat. This can lead to confusion. The duplication also makes worse the snapshot problem already mentioned, since it appears that most of the stats are generated just by calling the mac_stat() a second time for the same values already recorded in the "mac" kstat.

Inadequate list of kstats in the default set. I found several kstats which were missing. Several of them are getting fixed as a result of PSARC 2007/220, but I've since found a few others. E.g. Ethernet devices commonly can detect "jabber timeouts". These should be reported somehow. Also, stats about network-related interrupts are really important, and aren't included by default. I consider this a significant shortcoming. I guess devices should register a KSTAT_TYPE_INTR kstat, but approximately none of them do today.

Stat cleanups in drivers. This is mostly a driver-specific problem, but look at the kstat output on bge and e1000g, and see what I'm talking about. There is a total lack of consistency here.
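Two of the ideas above (collect once per snapshot, and say nothing about stats the driver can't report) fit in a few lines of Python. This is only a model of the proposal: stat_update/stat/stat_done mirror the suggested mac_stat_update()/mac_stat()/mac_stat_done() split, which does not exist in GLDv3 today, and the ToyDriver class and its counter values are invented.

```python
class ToyDriver:
    """Hypothetical driver that declares which stats it can really report."""

    supported = {"ipackets", "opackets"}

    def __init__(self):
        self.hw_reads = 0       # how often we did the expensive collection
        self._snap = None

    def stat_update(self):
        # One expensive pass: take locks, reclaim buffers, read counters.
        self.hw_reads += 1
        self._snap = {"ipackets": 120, "opackets": 75}

    def stat(self, name):
        # Cheap: served from the snapshot taken by stat_update().
        return self._snap[name]

    def stat_done(self):
        # Release whatever stat_update() acquired.
        self._snap = None


def snapshot(driver, wanted):
    """Framework side: one update, then only the stats the driver supports."""
    driver.stat_update()
    try:
        return {name: driver.stat(name)
                for name in wanted if name in driver.supported}
    finally:
        driver.stat_done()
```

Asking for ipackets, opackets, and runt_errors costs one hardware pass and returns two values. The unsupported runt_errors is simply absent rather than reported as a misleading zero.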

From the above, you see the problems with kstats. There are similar problems with NDD. The amount of code scattered around different drivers trying to figure out NIC tuning is boggling. And most of it isn't what you'd call "sterling examples of quality". The eri driver was full of some really, really fragile code in this area. (Deleting one tunable ... the instance ndd parameter... required updating no fewer than 4 different locations in the driver. And they weren't conveniently co-located.)

Interpretation of values, handling, all of it is terribly replicated across so many drivers. I can't wait to eradicate this crufty, horrid code, and replace it with something nice and sane from Brussels.

Some drivers can change internally to work even better with GLDv3.

In eri, for example, I think we can be smart on the transmit side, so that, for example, when a group of mblks comes down, we don't kick the hardware and resync the descriptor rings until all the packets are queued for transmit. This would help amortize some per-packet expenses across multiple packets.
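A rough Python sketch of that transmit idea follows. It is a toy, not eri code: ToyTxRing and both functions are invented names, and "kick" stands in for syncing the descriptor ring and poking the hardware. Returning the packets that did not fit is modeled on my understanding that a GLDv3 transmit routine hands back the unsent remainder of the chain.

```python
class ToyTxRing:
    """Hypothetical transmit descriptor ring with a fixed number of slots."""

    def __init__(self, size):
        self.size = size
        self.pending = []
        self.kicks = 0          # expensive sync + hardware doorbell operations

    def queue(self, pkt):
        """Place a packet in a descriptor; False if the ring is full."""
        if len(self.pending) >= self.size:
            return False
        self.pending.append(pkt)
        return True

    def kick(self):
        """Sync descriptors and tell the hardware to go (toy: sends at once)."""
        self.kicks += 1
        self.pending.clear()


def tx_naive(ring, chain):
    """Kick after every packet; return the packets that were not accepted."""
    for i, pkt in enumerate(chain):
        if not ring.queue(pkt):
            return chain[i:]
        ring.kick()
    return []


def tx_batched(ring, chain):
    """Queue as much of the chain as fits, then kick exactly once."""
    queued = 0
    for pkt in chain:
        if not ring.queue(pkt):
            break
        queued += 1
    if queued:
        ring.kick()
    return chain[queued:]
```

For a five-packet chain the naive version kicks five times and the batched version once, which is the amortization being described.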

Other drivers can benefit from multiaddress support. dmfe falls into that category.

That said, my approach so far has been the naive conversion. I'd like to revisit a few of them to enhance them to take advantage of the superior design in GLDv3, but first I want to get them put back.

I'm now a member of the "battery team". I had a very productive con-call with the folks involved, and I think we are going to soon have a better common framework for battery APIs in the kernel so that SPARC systems can also take advantage of the gnome battery applet. Watch this space!

For the curious, I've posted a webrev containing the changes required to integrate afe into Nevada.

The driver includes changes from the stock AFE driver for Solaris, including some lint fixes, and changes to use the stock Solaris sys/miireg.h.

I'd love to make more changes to this driver, but at the moment I don't want to cause a test reset. Once the driver is integrated, I have a bunch more improvements coming... Nemo, multiple mac address support, VLAN support, link notification support (needed for NWAM), as well as code reduction by using some features that are now part of stock Solaris (like the common MII framework!)

Wednesday, April 11, 2007

Recently, I posted a blog entry where I described that "Not All GigE Are Equal", strongly advocating the use of Broadcom GigE devices when faced with a choice.

However, after spending time in the code, I've discovered that there is quite a range of differences amongst Broadcom gigE devices.

I had considered listing a full table of them, but it seems that would be a bit onerous. Take a look at usr/src/uts/common/io/bge/bge_chip2.c if you want to find out the gory details. But in the mean time, here are my recommendations:

If you have PCI or PCI-X: Choose a bcm5704 if you can. It has pretty much full feature support, but you need to pick a recent revision (newer than A0). Look for pci ids of pci14e4,1646, pci14e4,16a8, or pci14e4,1649. These chips all support PCI-X, full checksum offload, and multiple hardware tx and rx rings.

If you have PCIe: As far as I can tell, all of the PCIe chips that are Solaris supported lack support for multiple hardware tx/rx rings. This is really unfortunate, as it will have a negative impact on Crossbow benefits. But apart from that, it looks like the 5714 and 5715 series are your best bet. They both support jumbo frames, and they both have full checksum offload support. Look for pci ids of pci14e4,1668, pci14e4,1669, pci14e4,1678, or pci14e4,1679.

What this really says is that if you have to choose between a PCI-X card and a PCIe card, surprisingly, you should choose the PCI-X card (if you can get a 5704). Save your PCIe for framebuffers or HBAs. (Or, better, 10G cards like Neptune.)
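If you'd rather check a list of device ids than eyeball them, the recommendations above boil down to a small lookup. This is just a convenience sketch in Python. The ids are exactly the ones listed in this post, the table and function names are made up, and it says nothing about chip revision (you still need newer than A0 on the 5704).

```python
# Device ids recommended in this post, grouped by bus type.
RECOMMENDED_IDS = {
    "PCI/PCI-X": {"pci14e4,1646", "pci14e4,16a8", "pci14e4,1649"},
    "PCIe": {"pci14e4,1668", "pci14e4,1669", "pci14e4,1678", "pci14e4,1679"},
}


def recommended_bus(pci_id):
    """Return the bus type a recommended id belongs to, or None."""
    pci_id = pci_id.strip().lower()
    for bus, ids in RECOMMENDED_IDS.items():
        if pci_id in ids:
            return bus
    return None
```

For example, recommended_bus("pci14e4,16a8") gives "PCI/PCI-X", and an id that isn't on the list gives None.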

As part of setting up the Tadpole project, I tried to use a feed direct from Blogger, but the OpenSolaris tonic infrastructure doesn't like it. Apparently the feed has some problems, which you can see by looking at the output from feedvalidator. Anyway, I was able to work around by using feedburner to convert the blogger Atom feed into a clean RSS feed. Maybe at some point some Blogger staff will look at this and see what the problem is.

Saturday, April 7, 2007

The first review for Tadpole platform support is online now. Please let me know your thoughts, after reading it. There will be more good stuff coming soon, I hope. (Also, if you have a Tadpole platform other than a SPARCLE or UltraBook IIi, and are willing to test, please let me know!)

Friday, April 6, 2007

FYI, I recently proposed a new project to track improvements to support for Tadpole platforms in OpenSolaris. It looks like it got the seconds needed, so I'm just waiting for the infrastructure to be created.

Thursday, April 5, 2007

I've been wondering how many other OpenSolaris users there are out there in the Inland Empire. I recently met one close to me, which surprised me quite a bit. I figured I was the only one within at least a 30 mile radius.

If there are others of you out there, please drop me a line. I'd like to inquire as to whether it makes sense to consider starting a User's Group for the area. Possibly we could join up with any other User Groups for Southern California.

For the record I live in southwest Riverside county, not far from Temecula and Murrieta. (For those of you not familiar with the west coast, the Inland Empire refers to a large region of southern California that is separated from the coastal areas of Orange and Los Angeles counties by a range of coastal mountains. I often have joked that I'm about 65 miles from any natural technology center, but now I'm not so sure. And I think a lot of people who commute to places like San Diego and LA live out here.)

Sunday, April 1, 2007

Funny note. When I came back to Sun (two weeks ago), I discovered that an ancient PSARC case (2002/356) for the removal of the Trivial Name Server (in.tnamed) had never been completed. So for 5-odd years since, we've continued to ship this long-since-obsolete protocol. I'm going to go ahead and drive forward with the actual removal... at the time I did it as a case study in how much process was involved with even a simple EOF. Let's see how long this one takes. (For the record, the IEN-116 protocol was obsolete as far back as 1986, when J. Postel first requested vendors ditch it.)

Those of you using afe (and also mxfe) will be pleased to note that the time is fast approaching when afe will hopefully be integrated into Solaris Nevada. There is a PSARC fasttrack scheduled for it next week if I understand correctly. (I don't have the case number yet.)

There are a few ramifications of this. One of the most immediate is that I'm going to be winding down support for versions of Solaris earlier than 10. In fact, I no longer have any personal installations running anything less than S10u3, and most everything is running Nevada.

The other reason for me to do this is so that I can immediately start taking advantage of some features that are present in Solaris 10 and Nevada. For example, I want to add support for DLPI link notification, and ultimately (in Nevada) port to GLDv3.

The GLDv3 has some compelling features, and as a result afe and mxfe will gain support for features like vlans, jumbo frames, and interrupt blanking. And, they'll also benefit from the increased performance gains afforded by the GLDv3 framework.

It isn't clear to me that I'll be supporting GLDv3 for Solaris 10 (the interfaces are not yet public), but at least in Nevada I will. And even for S10, I'll probably be using new GLDv2 features that are not available to older releases. (Like the DLPI link notification.)

Before I do this, I will be spinning one last significant bug fix release for afe and mxfe, which addresses several significant bugs found by Sun's QA group. (Including the fact that afe has not functioned properly with multicast since it was first written!)