Configuring IPMP on NexentaStor 3

At my day job, we have recently started rolling out NexentaStor 3 as a trial for our VM image storage. If all goes well, the long-term plan is to eventually migrate all of our storage from NetApp to NexentaStor. As we started rolling out the trial, one gap we quickly ran across is the lack of IPMP (IP Multipathing) support. The network configuration interface that Nexenta provides can currently configure aggregated interfaces with the LACP protocol, but it has no mechanism to configure IPMP, which is what you need to group interfaces that span multiple switches. We were able to work out an approach to configure IPMP manually, and received Nexenta’s blessing to use it in our environment. (Important note: if you are going to try this on a licensed copy of NexentaStor, please check with your support team to ensure that they are OK with you making these changes.)

The machine has an 8TB license, with two of the disks configured as hot spares. The Intel SSDs are configured as three log mirrors, and the RealSSDs are configured as cache devices.

Caveats

The only major caveat that I’ve hit with this configuration is that the IPMP interfaces are not visible through the Nexenta utilities. You can still see all of the underlying interfaces; just not the IPMP ones. It’s mostly cosmetic, but it is distracting and annoying.

Of course, YMMV – this worked for me, but no guarantees that it will work for you! :)

General configuration steps

As far as I know, you cannot do VLAN tagging on top of an IPMP group on Solaris, which means that we need to create the VLAN interfaces on top of each of the aggregate interfaces, and then create a separate IPMP interface for each VLAN. Here are the basic configuration steps:

Via NMC: Create VLAN interfaces ‘100’ and ‘200’ on top of both ‘aggr1’ and ‘aggr2’. This will create the interfaces ‘aggr100001’, ‘aggr100002’, ‘aggr200001’, and ‘aggr200002’.

Via NMC: Configure an IP address from within the appropriate VLAN on each of these aggr VLAN interfaces. This will allow IPMP to use ICMP probes in addition to link detection to find failed links.

Via the console: Configure the IPMP interfaces, and add the aggr VLAN interfaces to the proper IPMP groups.

NMC – Create LACP aggregates

This assumes that whatever interface you configured during installation is *not* one of the interfaces you want to be part of an aggregate. If that is not true, you will need to be on the system console (via IPMI, hopefully!) and destroy that interface first. Create the aggregates via the NMC network configuration menus, then confirm that they ended up configured the way you expect:
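
The NMC session itself is interactive, so it isn’t reproduced here. As a rough sketch, once the aggregates exist you can confirm the result with dladm from the raw shell (the next section covers how to get there); the NIC names and the L4/active policy shown below are assumptions, and your hardware and settings will differ:

root@nexenta:/volumes# dladm show-aggr
LINK            POLICY   ADDRPOLICY           LACPACTIVITY  LACPTIMER   FLAGS
aggr1           L4       auto                 active        short       -----
aggr2           L4       auto                 active        short       -----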

Console – Set up IPMP

First, we need to get into expert mode.

nmc@nexenta:/$ options expert_mode=1
nmc@nexenta:/$ !bash
You are about to enter the Unix ("raw") shell and execute low-level Unix command(s). CAUTION: NexentaStor
appliance is not a general purpose operating system: managing the appliance via Unix shell is NOT
recommended. This management console (NMC) is the command-line interface (CLI) of the appliance,
specifically designed for all command-line interactions. Using Unix shell without authorization of your
support provider may not be supported and MAY VOID your license agreement. To display the agreement,
please use 'show appliance license agreement'.
Proceed anyway? (type No to return to the management console) Yes
root@nexenta:/volumes#

The next step is to set up the hostname files for the IPMP interfaces and for each of the underlying aggr VLAN interfaces, so the configuration survives a reboot.
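
As a minimal sketch, assume two IPMP groups named ipmp100 and ipmp200 (one per VLAN) and made-up 10.1.100.x / 10.1.200.x addressing; substitute your own interface names, addresses, and netmasks. NexentaStor 3 is based on OpenSolaris b134, which uses the newer ‘Clearview’ IPMP, so each hostname file simply holds ifconfig arguments (double-check ifconfig(1M) on your build if the keywords below look unfamiliar):

root@nexenta:/volumes# cat /etc/hostname.ipmp100
ipmp 10.1.100.10 netmask 255.255.255.0 broadcast + up
root@nexenta:/volumes# cat /etc/hostname.aggr100001
10.1.100.11 netmask 255.255.255.0 broadcast + group ipmp100 up
root@nexenta:/volumes# cat /etc/hostname.aggr100002
10.1.100.12 netmask 255.255.255.0 broadcast + group ipmp100 up

The ‘ipmp’ keyword creates the IPMP interface that holds the data (failover) address, while the ‘group’ keyword places each underlying aggr VLAN interface into that group; the addresses on the underlying interfaces become the test addresses that in.mpathd probes with. Repeat the same pattern for ipmp200 with aggr200001 and aggr200002. Once the files are in place, unplumb the aggr VLAN interfaces and restart the physical network service so everything comes back up in its IPMP group:

root@nexenta:/volumes# ifconfig aggr100001 unplumb
root@nexenta:/volumes# ifconfig aggr100002 unplumb
root@nexenta:/volumes# ifconfig aggr200001 unplumb
root@nexenta:/volumes# ifconfig aggr200002 unplumb
root@nexenta:/volumes# svcadm restart svc:/network/physical:default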

At this point, all of your interfaces should be up, and all of the IP addresses should be pingable. Make sure that you can ping the individual interface IPs and the IPMP IPs. You should be able to use the ‘ipmpstat’ command to see information about your groups, for example:
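
The output should look something like the following (the group and interface names here match the sketch above; yours will reflect whatever you configured):

root@nexenta:/volumes# ipmpstat -g
GROUP       GROUPNAME   STATE     FDT       INTERFACES
ipmp200     ipmp200     ok        10.00s    aggr200002 aggr200001
ipmp100     ipmp100     ok        10.00s    aggr100002 aggr100001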

Note that this configuration provides failover and outbound load balancing, but it does not provide inbound load balancing. If you would like inbound load balancing, you need to configure an IP alias on each of the IPMP interfaces, and then vary which IP you use from the hosts that connect to your Nexenta machine (or use multipathing if it’s iSCSI!).
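
An alias is just another data address on the IPMP interface. Here is a minimal sketch using the made-up ipmp100 addressing from above (add the same ‘addif’ arguments to /etc/hostname.ipmp100 to make it persistent):

root@nexenta:/volumes# ifconfig ipmp100 addif 10.1.100.20 netmask 255.255.255.0 broadcast + up

IPMP spreads its data addresses across the active interfaces in the group, so clients pointed at 10.1.100.10 and clients pointed at 10.1.100.20 will tend to come in over different physical links.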

One last thing: once everything is configured, you will probably want to define your own ping targets. You can view the ones that IPMP picked automatically by running ‘ipmpstat -t’. In our configuration, on one VLAN, the two Nexenta nodes picked each other, so when we took machine two down (intentionally), machine one marked that interface down; then when we booted machine two back up, it could not reach machine one’s interface, and marked its own interface on that VLAN down. Nice race condition. Oddly, in.mpathd (the daemon that does the probing) does not use a configuration file for ping targets, but instead relies on host routes. What we’ve done is add routes to the individual IP addresses that we would like it to monitor by using the NMC command ‘setup network routes add’, and specifying the IP address to monitor as both the ‘Destination’ and the ‘Gateway’. We picked four to five IPs on each VLAN that were stable hosts (routers, Xen domain-0s, and the like), and added them on both hosts. This gives more consistent results, since multiple core machines would have to go down before an interface would be disabled incorrectly.
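
For reference, here is a sketch of the raw-shell equivalent of that NMC command, using a made-up ‘stable host’ of 10.1.100.5 on VLAN 100 (the -p flag makes the host route persistent); run ‘ipmpstat -t’ afterwards to confirm it shows up as a probe target:

root@nexenta:/volumes# route -p add -host 10.1.100.5 10.1.100.5
root@nexenta:/volumes# ipmpstat -t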

I hope this helps! Please feel free to leave a comment if you run into any trouble getting it working.

Things have been pretty good so far! I’ve pounded the snot out of these a few times, and haven’t been able to get them to really sweat. ;) The only real issue we’ve had is that I initially thought that the LACP config Nexenta offered actually set up IPMP, so I had it set up with 8 ports across two separate switches with LACP turned off. That essentially just randomly moves the MAC around; when we were only testing with one server it worked OK, but once we added more to the mix, things went *BOOM*, and I felt really, really dumb! :)

Plumb vs. unplumb: unplumb is actually correct. I am not actually sure what ‘disabling’ svc:/network/physical:default is supposed to do, but it doesn’t unplumb the interfaces, and enabling the service won’t bring things up properly unless the interfaces are unplumbed first. Hence the unplumb in the middle. Solaris networking still confuses the crap outta me. ;)

Which SuperMicro model did you go with, and which controller? I’m having all sorts of issues with an SC847E26-JBOD, an LSI 9200-8e SAS2 controller, and Seagate Constellation ES 2TB SAS2 disks. In both Solaris and Linux I get weird SAS errors. SuperMicro claims that the LSI SAS2008 chipset is incompatible with their dual-expander setup, even though I’ve only attached one SAS link, and according to them and their “compatibility matrix” only internal SAS2 RAID controllers are qualified and supported; how that’s supposed to work with a JBOD I have yet to figure out (well, basically, SuperMicro f*cked up and won’t own up to it).
There’s more to this story and I plan to write a blog post about my findings. So far I unfortunately can’t recommend SuperMicro to anyone, at least not their SAS2 JBODs.

Great blog post. We are also working on deploying Nexenta at my current employer. We went down the LACP route and it’s worked well so far. I was actually looking for a solution to an FC multipathing issue when I came across your site, and just wanted to thank you for posting this information!

I’m fairly new to all this, but as I understand it, the ZIL should be mirrored, which means he’s likely running 3x32GB (96GB), sized relative to the 48GB of RAM and the amount and rate of synchronous writes he expects. In theory, that setup ensures that he can handle a SERIOUS amount of sustained, synchronous writes, and with 10x1TB SAS drives, that makes sense to me. If anything one might call it a bit overkill, but then again it may be a bit of “futureproofing” in case the storage is grown. Just my 2 cents. I’d love to hear from the author so I can confirm whether I’m learning anything.

I’m running in a virtualized environment and trying to figure out the smartest way to get some form of real trunking in place (so I can break the 1GbE barrier without faster-than-1GbE equipment). IPMP is something I’d like to look at next, but since you have LAG working here, can you tell me whether you’ve observed with NFS that you can exceed 1GbE rates (even by a consistently measurable amount; I’m not looking for 2Gb) with your setup? I ask because you clearly have the I/O to feed it, and with this network setup, if I understand how LAG is implemented in Solaris combined with LAG/LACP on a switch, it seems possible.