Sorry for the Correct Answer tag. I hit the wrong button, and couldn't figure out how to undo.
We've gone as far as to bring up two PSNs specifically for wireless traffic, and the Wireless PSNs are much less busy than the wired PSNs, so it's not a loading issue, per se.
We thought we had it down to EAP-TLS vs EAP Chaining supplicants, but that isn't exactly right. Windows users without NAM behave differently than Macs, and they are both using EAP-TLS.
It appears as though the Macs are throwing out a large number of CoAs, as we are seeing a lot of traffic that is being reported as dropped due to previous packets. What dead/timeout values are you using for your PSNs in the WLC config?

Ever since we updated to patch 3, we have seen a lot of dead radius server messages, especially when users are accessing the network via wireless controllers.
We have a guest portal tied to AD, and the wireless access is practically unusable, due to timeouts in accessing the PSNs. We've created a pair of dedicated PSNs for wireless, and they are the least busy PSNs on the network.

I'd like to be able to define a "VLAN of last resort", which is where a user ends up under the following scenarios:
The user fails to authenticate via either MAB or Dot1X.
Despite being able to authorize, the user deliberately chooses that VLAN.
ISE 2.1 patch 3 and 3850s running 03.07.04E

We have a situation that we are trying to get a handle on, so I thought I'd post here. We are running ISE 2.1 patch 3, and our common switches are 3850s running 03.07.04E. We have six PSNs, and two masters and loggers. Two of the PSNs are dedicated for wireless only. Our workstations are a mixture of Windows machines (AnyConnect with NAM) and OS X machines (AnyConnect but no NAM). At present we are only attempting this for wired connections. We'll expand this particular portal to wireless eventually.
Of our ~35,000 endpoints, approximately 65% are MAB-based, and 35% are based on Dot1X and Active Directory. NAM clients use EAP Chaining, and non-NAM clients use EAP-TLS.
The underlying scenario is that we are piloting two-factor authentication (2FA). The first factor is Active Directory, and the second factor is RSA. Eventually, the entirety of the Dot1X clientele (the 35%) will migrate from AD only to 2FA. At present, we only have a few stacks (~300 endpoints) utilizing 2FA.
The three scenarios are as follows:
1) The user is presented with the Guest Portal in question and enters their credentials to authenticate. Instead of the normal "Success" page, they are presented with a white screen. This occurs with either the built-in Success page or an external success page. We cannot use the option to forward the user to their initial URL, as the browsers on OS X do not appear to support that ability.
2) The user attempts to authenticate and is presented with either a 400 or a 500 error.
3) The user attempts to authenticate and is presented with a second, identical portal page.
The third case appears to be related to an option we've been looking at: re-authentication. We set the session timer to 12 hours, and at the twelve hour mark, the switch state is updated, and the session is presented with the portal URL. We believe that the first screen has a URL from a stale session, and when the user authenticates, either the request goes to a different server or the initial session has timed out. Most of the time, the user is able to authenticate after the second portal page.
In regards to the reauth process, we've also tried using the Idle Timer RADIUS Attribute, thinking that if a user were to log out but remain connected, once the Idle timer expires, the user would be forced to reauth. We do not find that the Idle timer ever decrements, so the session never expires. This occurs whether or not the user remains logged in, or logs out. I am leaning towards the belief that the Idle Timer is not very useful in an Ethernet environment, and is a hold-over from the dial-up days of yore.
We're interested in figuring out how to troubleshoot these conditions. If we disable the reauthentication option, it appears that scenario 3 goes away completely, scenario 2 is greatly diminished, and we haven't seen scenario 1 at all.
There also appears to be a five minute timer involved when the portal is displayed, or possibly when the URL for the portal is presented to the switch port.
We've upgraded to Patch 3, and we are about to implement Policy Sets to try to make the list of rules smaller for any given set. We currently have approximately 45 active rules, and some of them, the MAB rules in particular, could probably be consolidated.
We initially were putting an authentication order of dot1x mab on the switch ports, but that didn't work at all, so we are now using auth order mab dot1x. Priority is set to dot1x mab.
Load-balancing has been an issue, and we've gotten around that by 1) putting the wireless controllers on dedicated PSNs, and 2) adding a batch-size of 1800 to the 3850s that are serving 2FA. The non-2FA switches are not load-balanced, but the first server in the AAA group is rotated around. As we convert switches to support 2FA, we'll add the batch-size.
Cisco's guidance on batch-size is quite outdated, as it states that any value greater than 50 is considered a "large" batch, and any value less than 25 is considered "small". As the variable can accept a value of over 2 billion, I'm at a loss. The value of 1800 we are using was arrived at by trial and error.

There is a rather involved write up for this here.
We're also considering the F5 as an option.
In the meantime, I've determined that almost all of the Cisco background on 802.1X was written prior to CoA, and that the TAC still believes that a batch-size of 50 is considered "large".
If you look at the command on an actual 3850 switch, you'll see that the maximum allowed batch size is 2,147,483,647, so you should have plenty of leeway.
I took the maximum number of transactions seen in a 5 minute period (high load, around 8:30AM) and came up with using 1800 as a batch size. I'm not 100% on the equivalence of a Session vs a Transaction, but this seems to work for us. My goal was to minimize the possibility of a least-busy server transition in a 5 minute period of time, and I believe I have succeeded. To be honest, I couldn't really see much of a downside if the number was too large. We have 4 PSNs in two locations, 8 servers in all.
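As a rough sketch of the sizing approach described above, the peak number of transactions in any 5 minute window can be computed from a list of transaction timestamps. The function name and the sample timestamps below are hypothetical, and how you export the timestamps from your logs is left open; this only illustrates the arithmetic, not anything ISE or IOS does internally.

```python
from collections import deque


def peak_transactions(timestamps, window_seconds=300):
    """Return the largest number of events seen in any sliding window.

    timestamps: iterable of event times in seconds (any common epoch).
    window_seconds: window length; 300 matches the 5 minute period above.
    """
    window = deque()
    peak = 0
    for t in sorted(timestamps):
        window.append(t)
        # Drop events that are a full window (or more) older than t.
        while window[0] <= t - window_seconds:
            window.popleft()
        peak = max(peak, len(window))
    return peak


# Hypothetical sample: seconds since midnight for a handful of transactions.
sample = [30600, 30610, 30650, 30899, 30900, 31000, 31300]
print(peak_transactions(sample))
```

The peak value is then a candidate for the batch size, on the assumption (as noted above, not a confirmed one) that a transaction in the logs corresponds to what the switch counts toward a batch.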

We are using EAP-TLS as part of our AuthZ rule set, and I'm curious if there is a best practice method for dealing with a user certificate. I have a mixture of PCs and Macs. With my PCs, we are also using NAM, so EAP Chaining is available. EAP Chaining does not appear to be an option for the Macs.
We started down the path of having a Microsoft track and a non-Microsoft track for AuthZ, but we're also trying to eliminate two sets of rules where possible.
Suppose my domain is fubar.com.
I have some rules that are set up to check the EAP method and the certificate contents, a la:
NetworkAccess:EapAuthentication EQUALS EAP-TLS AND
CERTIFICATE:Subject Alternative Name CONTAINS fubar.com
I have other rules that are set up that test for the condition:
CERTIFICATE:Subject MATCHES .*(FUBAR).*
Is one way better? Is there a BEST way of examining a user certificate?
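To make the difference between the two conditions concrete, here is a small sketch that mimics them with Python string and regex operations. This assumes that CONTAINS behaves like a substring test and that MATCHES behaves like a case-sensitive regular expression match; ISE's actual semantics should be verified, and the helper names and sample certificate values are made up for illustration.

```python
import re

PATTERN = r".*(FUBAR).*"  # the MATCHES pattern from the rule above


def san_contains(san: str, needle: str) -> bool:
    """Substring test, standing in for a CONTAINS condition (assumption)."""
    return needle in san


def subject_matches(subject: str, pattern: str) -> bool:
    """Whole-string regex test, standing in for a MATCHES condition (assumption)."""
    return re.fullmatch(pattern, subject) is not None


def san_in_domain(san: str, domain: str) -> bool:
    """Hypothetical stricter check: the SAN must end in @domain, ignoring case."""
    return san.lower().endswith("@" + domain.lower())


# A substring test also accepts look-alike domains.
print(san_contains("jdoe@fubar.com", "fubar.com"))      # True
print(san_contains("jdoe@notfubar.com", "fubar.com"))   # True

# A case-sensitive regex misses a lower-case domain component.
print(subject_matches("CN=jdoe,DC=FUBAR,DC=com", PATTERN))  # True
print(subject_matches("CN=jdoe,DC=fubar,DC=com", PATTERN))  # False

# The stricter check rejects the look-alike.
print(san_in_domain("jdoe@notfubar.com", "fubar.com"))  # False
```

Under these assumptions neither rule is strictly better: the SAN rule is looser about what surrounds the domain, while the Subject rule depends on the case used in the certificate.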

I had opened a TAC case with this issue, and their recommendations as a work around leave a few things to be desired, so I thought I would throw this out there.
On our 3850 switches, running 03.07.04E, we have
aaa group server radius ISE
 server name authnad-w2
 server name authnad-b2
 server name authnad-w1
 server name authnad-b1
 ip radius source-interface Vlan255
 load-balance method least-outstanding
aaa server radius dynamic-author
 client 10.3.14.239
 client 10.9.14.241
 client 10.3.14.240
 client 10.9.14.239
 server-key 7 <secret>
We also have an extensive set of ISE rules, rules that include multi-factor authentication and reauthentication. One of our rule flows accepts EAP chaining and the user SUI, and machine credential are authenticated in a first pass, and then a second pass is used to authenticate against an RSA-type token and a guest portal. What we found with v1.3 of ISE was that it was possible that when a particular client had a CoA event, the session ID would get transferred to a different PSN, and that the second PSN, having no record of that particular session ID, would start over. From a user perspective, it looks like the RSA authentication failed (no reason given), and they would be presented with a second portal login screen. From the troubleshooting from the TAC case, we could see that the session ID was constant, and that the servers changed. Cisco's solution was to remove load-balancing altogether, or to use something like an F5. We initially went with the former solution.
The problem appears to me to be that the switch is checking for the least busy server when it does a CoA, and that check should not be made at that point in time because the switch will cause the existing session ID to be ignored by the new PSN.
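The behaviour I suspect can be illustrated with a toy model. This is not the IOS XE algorithm, just a sketch under the assumption that the server is re-selected by least-outstanding at the second pass; the class, function names, and load numbers are all hypothetical.

```python
class PSN:
    """Toy policy node that only remembers which session IDs it has seen."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()
        self.outstanding = 0  # pretend count of in-flight transactions

    def handle(self, session_id):
        """Return 'continue' for a known session, 'restart' for an unknown one."""
        if session_id in self.sessions:
            return "continue"
        self.sessions.add(session_id)
        return "restart"


def least_outstanding(psns):
    """Pick the server with the fewest outstanding transactions."""
    return min(psns, key=lambda p: p.outstanding)


def run_flow(psns, session_id, sticky):
    """First pass, then a second pass after a CoA, with or without stickiness."""
    first = least_outstanding(psns)
    first.handle(session_id)   # first pass creates the session on this PSN
    first.outstanding += 5     # assume this PSN gets busier in the meantime
    second = first if sticky else least_outstanding(psns)
    return second.handle(session_id)


# Re-selecting the server at the second pass lands on a PSN with no session.
print(run_flow([PSN("w1"), PSN("b1")], "session-1", sticky=False))  # restart
# Staying on the original server lets the flow continue.
print(run_flow([PSN("w1"), PSN("b1")], "session-1", sticky=True))   # continue
```

In this model the "restart" outcome is what the user would see as a second portal login page: the session ID is unchanged, but the PSN handling it has no record of it.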
While in a configuration with no load balancing, we found the solution to be quite stable, until the lead server in the ISE group became unavailable. While the server was unavailable, users were still attempting to authenticate to it.
Ignoring that an unavailable server can be one of several states, say during an upgrade from v1.3 of ISE to v2.1, we kind of discovered exactly what no load balancing means, and well, we see how less-than-optimal that solution is. We started down the path of looking at the F5 solution, but there are aspects of that that don't seem practical.
So, I'm of the opinion that this issue with no RADIUS load-balancing is an issue with the switches (and WLCs), and that this is a bug in the IOS XE 03.07.04E code.
We appear to be having the same issue with ISE v2.1 as well. We've tried load-balancing both with and without the ignore preferred-server parameter. Without seems to logically be the right choice, but it no workie.
Just curious if anybody else has had similar issues. We'll probably look at node groups eventually, but I think this is a bug.

The gateway router has sub-interfaces for the various VLANs, and the SVIs corresponding to the VLANs are defined on the access switches. The ip-helpers have worked just fine being on the GW only, but now with the introduction of the 4451, they also appear to be necessary on the layer 2 devices in the SVIs.
If I compare two 3850 switches, one in a closet here, and one out on the WAN, the major differences I see are:
1) The local campus 3850 has only one VLAN, for loopback, and has a default route. It has no helper addresses.
2) The remote 3850 has multiple VLANs defined as SVIs, and has no default route. It doesn't work w/o helper addresses.
I'm guessing that's part of my issue. Too many cooks. I'm just trying to explain the inconsistencies.

We have a number of sites that are using a 3845 as a router and 3750s as access switches. As they are becoming EoS/EoL, we've been replacing them with a 4451 and some 3850s.
We haven't made any major modifications to the configurations. The 4451 serves as the layer 3 gateway, and the access switches are purely layer 2 devices.
The user interfaces on the router have an ip-helper address which points to a number of centralized DHCP servers, and we typically enable DHCP snooping to make sure the users don't get bitten if they do something stupid like set up an unauthorized DHCP server.
The IOSs in question are 15.4.3 for the 4451, and the 3850s are using 03.07.04E. We use the 3850s at our major campuses as well, and have no issues. We do not use the 4451 on our major campuses, just smaller remote sites.
The issue seems to be that users do not get addresses unless we also place an ip-helper address on the VLAN interfaces on the 3850s. We never placed these statements on the 3750s, and we do not place them on the 3850s on our main campuses, so it appears to be an issue that is affected by the presence of the 4451.
I'm in the process of collecting pcap files, but this just jumps out as something stupid that we missed. AnyConnect is also in play, and if we disable AnyConnect/NAM on end-user workstations, they seem to allow traffic. AnyConnect is 4.3.02039.
Anybody else ever seen this?

Interesting observation... We were running Prime 3.0 and I updated some 3850s from 03.07.03E to 03.07.04E. Later, we updated Prime to 3.1. Later still, I updated some more 3850s from their current versions (mostly 03.07.03E) to 03.07.04E, and now, only the first batch of updated switches show up both as WLCs and switches.
Maybe I just need to delete and add them back (pain). It would be nice to know how Prime determines whether a device is just a switch or both a switch and a WLC.

I have a number of C3850s in production. I believe all of them (136) show up as switches in Prime 3.1. However, some of them (74) show up in Prime as Wireless Controllers as well.
None of them are configured to perform WLC functions. Why does Prime see some of them as WLCs?
I'm running v3.1 of Prime. IOS of switches is 03.03.03E or newer.

I'm attempting to pass some L2 traffic through a L3 tunnel, but the means I would normally use does not seem to be implemented.
This would normally look like:
Router 1 has loopback 10.83.251.254, and Router 2 has loopback 10.29.254.241. Routers 1 and 2 have multiple L3 paths between them utilizing EIGRP. On each of the routers:
conf t
ip cef
l2tp-class DMZclass
authentication
password DMZ_l2tpv3
On each of the routers
conf t
pseudowire-class PW_DMZ
encapsulation l2tpv3
protocol l2tpv3 DMZclass
ip local interface Loopback100
On Router 1:
interface Gi2/9
xconnect 10.29.254.241 1 encapsulation l2tpv3 pw-class PW_DMZ
On Router 2
interface Gi2/9
xconnect 10.83.251.254 1 encapsulation l2tpv3 pw-class PW_DMZ
I'm curious as to why this isn't implemented. I thought it was implemented in 12.4 (which is not available for my 6504 with Sup720).
What alternatives exist for accomplishing the same result?