2015-02-17 · http://jon.netdork.net/2015/02/17/vsphere-storage-vmotion-times-out-at-32-percent-when-crossing-sans

A year or so ago we upgraded our vCenter from 4.1 to 5.1, and with this upgrade, combined with some features built into our SAN, we got access to VAAI. For example, removing a VM guest would tell the SAN that the guest had been removed, and if the data store had been thinly provisioned from the SAN, it would (in theory) clean up and shrink down the space used.

Another feature we discovered was something called “fast copy”. In layman’s terms, when a storage vMotion request is created, the SAN is notified of the request and processes the copying of the bits in the background. This is handy because it stops the data from being sent from the SAN to the host and back to the SAN again, which speeds up moving machines around considerably.

There was a caveat to the “fast copy” feature that we stumbled across last year: using vMotion to move machines between SANs would fail. What we didn’t clue in on was that this was because of VAAI and “fast copy”. When we first observed the failures, we didn’t realize they happened between SANs; we just thought they were random. Our VM hosts had storage allocated from 2 different SANs at the time, and our naming convention was a little off, so quickly identifying that a data store was on a different SAN wasn’t obvious at first.

Ultimately the issue presents itself as a vMotion timeout. When you start the vMotion, it zips along until it hits 32%. It then sits there for a few minutes, sometimes up to 5 or 10, and then the guest becomes unresponsive. At this point VMware decides the migration has timed out, and rolls back. Sometimes it can take several minutes for the failed guest to start responding again. If the guest is shut down first, the migration usually hangs around 36% for a few minutes, but eventually completes. The error usually looks like this:

The error generally presented is “Timed out waiting for migration data.” It always happened at 32%. A bit of searching around didn’t really uncover the cause. At the time we originally spotted this issue, we decided to take an outage, shut the guests down, and vMotion them. This killed two birds with one stone: it freed memory on the hosts, and gave the guests a reboot to clear memory and such.

Fast forward to nine months ago, when we discovered one of our SANs had become oversaturated and needed space and load removed from it. By this point we had a third SAN added to the mix, so we presented new data stores and went through the process of trying to vMotion quite a lot of VM guests off of one set of data stores (10, actually) to another set. We hit the same wall as before: timeouts at 32%. We put it down to the load and space issues on the SAN and went with an outage. This was our dev environment anyway, so it was less of an issue, and we didn’t look into it any further.

Jump forward to this past Tuesday. A sudden alert that multiple VMs had gone offline left us puzzled until we realized that one of the data stores had been way overprovisioned, and the backup software had kicked off and, with its guest snapshots, had filled the drive. With a quick bit of work we moved some guests around, and bumped into the same 32% issue again. Shutting down some guests and shuffling them around got us through the pinch, but left me wondering.

After some experimentation, I was able to narrow the cause down to a single action: storage vMotion between SANs. Intra-SAN vMotions were snappy, 100GB in less than 2 minutes. Inter-SAN migrations would hit 32% and time out. That was it, I had the cause of my problem. It had to be a fiber or switch issue… Right?

Not so much. While doing some digging on performance, our fiber switches, and SAN ports, I wasn’t spotting any obvious issues. Searching again on our favourite web search engine, I stumbled across an HP document tucked away in the 3Par area (document was named mmr_kc-0107991, nice name). Bingo! Okay, the details don’t exactly match (the document mentions that it freezes at 10%, for example), but it had all the hallmarks of what we were seeing: inter-SAN vMotion, timeouts, and VAAI.

So the solution was to disable VAAI on the host, do the vMotion, and then re-enable it if you still want to use it. VMware has a nice document on how to do that in KB1033665. With a little PowerCLI we quickly disabled VAAI and tested a vMotion on a live machine, and it worked. As we were working on a single cluster at the time, this is what we ended up with:
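Something along these lines works (the cluster name is a placeholder, and the advanced setting name comes from KB1033665):

```powershell
# Disable VAAI hardware-accelerated move (the "fast copy" primitive)
# on every host in the cluster before the inter-SAN storage vMotion.
Get-Cluster "ProdCluster01" | Get-VMHost | ForEach-Object {
    Get-AdvancedSetting -Entity $_ -Name "DataMover.HardwareAcceleratedMove" |
        Set-AdvancedSetting -Value 0 -Confirm:$false
}

# ... perform the storage vMotion here ...

# Re-enable VAAI afterwards.
Get-Cluster "ProdCluster01" | Get-VMHost | ForEach-Object {
    Get-AdvancedSetting -Entity $_ -Name "DataMover.HardwareAcceleratedMove" |
        Set-AdvancedSetting -Value 1 -Confirm:$false
}
```

Note that this only toggles the fast-copy offload; the other VAAI primitives are controlled by separate advanced settings, so adjust those too if you need them off.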

2014-06-11 · http://jon.netdork.net/2014/06/11/enable-remotemailbox-the-address-is-invalid

In the process of migrating our mailboxes from our on-premise Exchange servers to Office 365, we had to rewrite the mailbox enable scripts. This script keys off of our HR database, does some magic, then calls Enable-Mailbox on Exchange 2010 servers. To update this to support creating mailboxes in Office 365, we needed to set user licenses, and use the Enable-RemoteMailbox command in Exchange 2013[1].

One of the quirks we stumbled upon is a bug in the Exchange 2013 tools that didn’t allow it to identify the domain for the remote routing address. This is what we’d get:

[PS] C:\>Enable-RemoteMailbox Jonathan.Angliss
The address '@mytenant.mail.onmicrosoft.com' is invalid: "@mytenant.mail.onmicrosoft.com" isn't a valid SMTP address. The domain name can't contain spaces and it has to have a prefix and a suffix, such as example.com.
    + CategoryInfo          : NotSpecified: (:) [Enable-RemoteMailbox], DataValidationException
    + FullyQualifiedErrorId : [Server=Exchsvr01,RequestId=190c9764-d8bd-446e-ac43-7c80bcc54eea,TimeStamp=6/3/2014 1:19:33 PM] [FailureCategory=Cmdlet-DataValidationException] 730D5E7F,Microsoft.Exchange.Management.RecipientTasks.EnableRemoteMailbox
    + PSComputerName        : Exchsvr01

According to the Microsoft documentation for Enable-RemoteMailbox you should be able to specify just the sAMAccountName as an argument, the rest should be calculated.

The remote routing address doesn’t need to be specified because mail flow between the on-premises organization and the service has been configured. Using this configuration, the Enable-RemoteMailbox cmdlet automatically calculates the SMTP address of the mailbox to be used with the RemoteRoutingAddress parameter.

As this is part of a script, the content is slightly different, but you can see how it works. We used Get-ADUser earlier in the script to pull other user data to calculate licensing requirements, but if you’re doing this as a one off and are seeing the error then you could just as easily do this:
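As a one-off, something like this should work (the account name and tenant domain are illustrative; the workaround is simply passing -RemoteRoutingAddress explicitly instead of letting the cmdlet calculate it):

```powershell
# Work around the domain-detection bug by building the remote
# routing address ourselves from the sAMAccountName.
$user = Get-ADUser "Jonathan.Angliss"
Enable-RemoteMailbox $user.SamAccountName `
    -RemoteRoutingAddress "$($user.SamAccountName)@mytenant.mail.onmicrosoft.com"
```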

Hat tip goes to Steve Goodman for posting similar work, and getting me back on track.

[1] If you are using a 2010 Exchange environment, you need a 2013 server to act as a Hybrid server to migrate users. ↩

2014-06-10 · http://jon.netdork.net/2014/06/10/exchange-2010-2013-and-o365-DDL-Filters

In our transition to using Office 365 for email services, we’ve had some interesting discoveries. Some of them revolve around Dynamic Distribution Lists (DDLs). These are groups whose members are identified at the time emails are delivered, based on various styles of queries. We usually use PowerShell style queries to build the groups, but LDAP works, as do simple queries based on fixed parameters.

One of the interesting observations is that Exchange will tack extra query parameters into the DDL to exclude system mailboxes. For example the following query string:
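As an illustration (the group and filter are hypothetical, and the appended clauses are reconstructed from memory of a 2013 server, so the exact list may vary by version):

```powershell
# The filter we supply:
New-DynamicDistributionGroup -Name "All Sales" `
    -RecipientFilter "((RecipientType -eq 'UserMailbox') -and (Department -eq 'Sales'))"

# What Get-DynamicDistributionGroup later shows as the stored filter,
# with Exchange's system-mailbox exclusions tacked on the end:
#   ((((RecipientType -eq 'UserMailbox') -and (Department -eq 'Sales'))) -and
#    (-not(Name -like 'SystemMailbox{*')) -and (-not(Name -like 'CAS_{*')) -and
#    (-not(RecipientTypeDetailsValue -eq 'MailboxPlan')) -and
#    (-not(RecipientTypeDetailsValue -eq 'DiscoveryMailbox')) -and
#    (-not(RecipientTypeDetailsValue -eq 'PublicFolderMailbox')) -and
#    (-not(RecipientTypeDetailsValue -eq 'ArbitrationMailbox')))
```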

This forces Exchange to exclude any of the system mailboxes for delivery, which is what you want it to do. The problem is, this additional data varies from version to version, and it’s not always backwards compatible. One of the observations is that in Exchange 2013 they introduced a RecipientTypeDetailsValue of PublicFolderMailbox. This is great, except that value is invalid in 2010. What does that mean?

Let’s try an example. From one of our on-prem 2013 hybrid servers, we’re going to create a new distribution group with the initial query we gave as an example above…

Now we see an error. This is because 2010 doesn’t like PublicFolderMailbox as a RecipientTypeDetailsValue, and as such throws an error. So we have to go back to the 2010 server, edit the query, and reset it to what we wanted originally:
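The reset from the 2010 side is just a matter of re-applying the original filter (same hypothetical group and filter as earlier):

```powershell
# Run from an Exchange 2010 management shell; saving the filter here
# makes 2010 append its own, 2010-compatible, system-mailbox exclusions.
Set-DynamicDistributionGroup "All Sales" `
    -RecipientFilter "((RecipientType -eq 'UserMailbox') -and (Department -eq 'Sales'))"
```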

This same query is happy on the 2013 servers as well; however, it will attempt delivery to a public folder mailbox if your other query parameters allow for it. In the example above, we set the RecipientType to a very specific value, so this shouldn’t happen anyway.

One other observation: when migrating your queries to be Office 365 hybrid compliant, you will also need to include the RecipientType of MailUser. For example:
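Continuing the hypothetical example, the hybrid-friendly version of the filter would be along these lines:

```powershell
# MailUser covers mailboxes that have already been migrated to the service.
Set-DynamicDistributionGroup "All Sales" `
    -RecipientFilter "(((RecipientType -eq 'UserMailbox') -or (RecipientType -eq 'MailUser')) -and (Department -eq 'Sales'))"
```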

Mailboxes that are migrated change their RecipientType to be MailUser.

There are lots of other fun things about DDLs that you’ll have to be aware of, which I shall cover in a separate post, but this is one of the fun gotchas that’ll impact people running Exchange 2010 and 2013 in the same environment.

2014-06-04 · http://jon.netdork.net/2014/06/04/exchange-and-the-case-of-the-missing-counters

While setting up SolarWinds SAM AppInsight for Exchange, I stumbled across a small Exchange setup bug where it doesn’t correctly deploy all the performance counters for the server roles in use. When SAM checks for the performance counter, you’ll see an error like the following:

'Average Document Indexing Time'
'Performance counter not found'

The solution is fairly simple: you have to copy the counters from the install media, and register them using PowerShell. The one caveat is that the counters aren’t on all the install media; if you have CU3 setup files, for example, they are not there. You have to go back to the original install DVD and get them from there. Here are the steps:

Find the missing performance counter files on the install media, usually in <install media>\setup\perf
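Condensed, the process looks something like this (the file name and paths are examples; which definition files you need depends on the counters reported missing, and the snap-in import is the step missing from the linked write-ups):

```powershell
# Load the Exchange setup snap-in that provides New-PerfCounters.
Add-PSSnapin Microsoft.Exchange.Management.PowerShell.Setup

# Copy the counter definition from the original install DVD...
Copy-Item "D:\setup\perf\IndexAgentPerfCounters.xml" `
    "C:\Program Files\Microsoft\Exchange Server\V15\Setup\Perf\"

# ...then register it.
New-PerfCounters -DefinitionFileName `
    "C:\Program Files\Microsoft\Exchange Server\V15\Setup\Perf\IndexAgentPerfCounters.xml"
```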

This is documented on the SAM product blog (here) and in the SolarWinds Knowledge Base (here), with the missing step of Add-PSSnapin.

2014-05-01 · http://jon.netdork.net/2014/05/01/l-vs-r-and-the-importance-of-documentation

This post was going to be one of those rants about people not following instructions, and then I realized this is a common problem that a lot of people have, not just in the IT world. It is the importance of knowing what L and R refer to. By L and R, I mean left and right.

This might seem silly and trivial, because we all know our left hand and our right hand, and in general when somebody says it’s on the left hand side of the cabinet, you know exactly where to look. But what happens if you are working on something that can be reached/used from both sides? Is that the left hand side relative to the front, or the back?

This comes up with other things too. How many times have you gone to a car mechanic and said “my left brake squeals when I apply the brakes”? Is that the left side when looking at the car, or driving the car? This is why you’ll find a lot of mechanics refer to the sides as driver and passenger; there is no confusion there.

The whole point of this post is about the importance of documenting exactly what is referred to as L and R, because it makes a great deal of difference when you are putting rails on a server. Why you might ask? It’s all about server removal…

A lot of rail kits consist of 3 major parts, the server portion, the cabinet/rack portion, and the slider. The server portion is usually screwed/clipped onto the server, and is stationary. The cabinet/rack portion is also stationary, and attached to the cabinet/rack. Then there is the slider portion. This portion is “attached” to the server and cabinet portions, has ball bearings, and allows the server to slide in and out of the rack. It slides on tracks in the cabinet portion, and the server portion slides on bearings to slide out. This allows for people to work on the server without having to completely remove the server.

Also part of the slider is usually a catch. This catch is to stop you from pulling the server completely off the rails and having it come crashing down to the floor. Something most people don’t want to happen. And it is with this part of the rails that it is important to know what is L and what is R. This catch usually has an orientation so that it can “clip” into the slider rail, and pushing the catch down allows the server rail to slide out of the slider rail. If you mount the server rail on the wrong side, the catch either doesn’t work properly, or becomes impossible to remove.

Here is an example of one of those catches…

If looking at this, you cannot figure out how this works, here is another picture with arrows. Arrows make everything easier to understand…

When you pull the server out, it moves in the direction of the top arrow (orange). Near the end of the slider rail is a small block; this block (shown as a green blob) moves along the server rail in the direction of the bottom arrow (green). As it gets to the catch, it pushes it up, and the spring in the catch pushes it back down when the block moves into the void. Because of the shape of the void, the green blob is prohibited from moving any further, and stops the server sliding off the end of the rail.

If you need to actually remove the server from the rails completely, you simply pull the catch up, which moves the blob outside the void of the catch, and pull the server forward. If you put the rail on upside down, instead of the block catching on the void in the catch, it actually stops when it hits the mount point of the catch. This is why it’s important to know which way around to mount the rails (note the little L next to the screw).

This situation caused a co-worker and me some struggles, as we could not get the server unmounted from the rails. Ultimately we ended up having to unscrew the rails from the rack, with the server still attached, fully extend the rails, and then bend them in so that we could pull the server out of the rack. Fortunately this was a server that was well past EoL, so this wasn’t a hard decision to make, or live with.

That all being said, it is important to make documentation as clear and concise as possible. Images are very useful in this situation. A server we put in place of this one had really clear documentation, and the rails themselves even had pictures of the configuration, essentially saying “This rail goes on the left here” with a picture of where the rail was located in relation to the server.

So next time you’re writing documentation for something, and there is an opportunity for ambiguity, clear up the documentation and remove any doubt.

2014-04-29 · http://jon.netdork.net/2014/04/29/unable-to-remove-exchange-mailbox-database

We had an odd issue recently where our Exchange server refused to let us remove a mailbox database, citing that the database had one or more mailboxes. The exact error was this:

This mailbox database contains one or more mailboxes, mailbox plans, archive mailboxes, public folder mailboxes or arbitration mailboxes. To get a list of all mailboxes in this database, run the command Get-Mailbox -Database . To get a list of all mailbox plans in this database, run the command Get-MailboxPlan. To get a list of archive mailboxes in this database, run the command Get-Mailbox -Database -Archive. To get a list of all public folder mailboxes in this database, run the command Get-Mailbox -Database -PublicFolder. To get a list of all arbitration mailboxes in this database, run the command Get-Mailbox -Database -Arbitration. To disable a non-arbitration mailbox so that you can delete the mailbox database, run the command Disable-Mailbox . To disable an archive mailbox so you can delete the mailbox database, run the command Disable-Mailbox -Archive. To disable a public folder mailbox so that you can delete the mailbox database, run the command Disable-Mailbox -PublicFolder. Arbitration mailboxes should be moved to another server; to do this, run the command New-MoveRequest . If this is the last server in the organization, run the command Disable-Mailbox -Arbitration -DisableLastArbitrationMailboxAllowed to disable the arbitration mailbox. Mailbox plans should be moved to another server; to do this, run the command Set-MailboxPlan -Database .

Okay, so thinking we were being stupid and had missed the arbitration mailboxes, we ran the recommended commands, with no luck:

The same was true of mailbox plans, and archive mailboxes. After some head scratching, I stumbled across this post on TechNet. The basic gist is that because Exchange is in a multi-domain forest, the get-mailbox command will usually only search in the domain you are active in. To make Exchange operate outside of the working domain, you have to set the server settings.
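The fix from that thread boils down to this (the database name is a placeholder):

```powershell
# Widen the Exchange shell's AD scope to the whole forest; the hidden
# system mailboxes then show up in the usual queries.
Set-ADServerSettings -ViewEntireForest $true
Get-Mailbox -Database "MBX-DB01" -Arbitration
```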

Sure enough, those system mailboxes were hiding out in the mailbox database. Now that we could see them, we could move them off of the database, and then remove the database.

2014-04-17 · http://jon.netdork.net/2014/04/17/solarwinds-application-monitor-automatically-configuring-appinsight-for-exchange

I’m going to make some assumptions in this post as it’s about a specific product. First, let us assume you are a long time user of SolarWinds Server & Application Monitor (previously known as Application Performance Monitor). Let’s also assume you have a Microsoft Exchange 2010 (or 2013) environment, and that you have WMI monitoring working for your servers. For the final assumption, let’s say you just found out that the latest SAM update (6.1) includes Exchange monitoring ‘out of the box’. After you’ve finished your happy dance, celebrating that you no longer have to tinker with all the extra application monitors, you set about figuring out how to enable this new functionality.

First, there are some caveats. The first is that this new functionality is only targeted at the Exchange Mailbox role. This means that if you have separated roles, such as CAS or Transport, don’t bother trying to point it at those servers, it just won’t find anything[1].

The second caveat is permissions. To let the auto-configuration work (which is what this post will be about), you’ll need to have the account SAM uses have temporary administrative access to the mailbox server.

Now you’ve added your WMI service account to the “Administrators” group on your Mailbox servers, the next step is to make sure the service account has the right access within Exchange. There are 2 roles the account needs: Mailbox Search and View-Only Organization Management. The latter can be handled by adding the service account to the role group that is already defined. The former needs to be created specifically for this purpose.
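From the Exchange management shell, that might look like the following (the service account and assignment names are placeholders):

```powershell
# The view-only role comes via the built-in role group...
Add-RoleGroupMember "View-Only Organization Management" -Member "svc_solarwinds"

# ...while Mailbox Search gets a purpose-built role assignment.
New-ManagementRoleAssignment -Name "SAM Mailbox Search" `
    -Role "Mailbox Search" -User "svc_solarwinds"
```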

Now let’s see what we have to do in SAM. There are 2 ways to do this; I’m going with the one I’m familiar with, which is the same as adding new drives/volumes or extra hardware to a server. Locate your server in the SAM interface, scroll down to the “Management” box, and click on “List Resources”. The other method is to use the Sonar Discovery tool.

Let SAM do its work, and wait patiently. This can take a minute or two depending on the load on both servers. Once it has finished its autodiscovery process, you should see new applications under the AppInsight Applications umbrella; check the box and click “Save”.

Once you’ve done this, your “Application Health Overview” section should now show an application in an “Unknown” status.

Click on the “Unknown” line and you’ll be taken to a view listing the unknown applications. This should (hopefully if you’ve set stuff up right) be just the Microsoft Exchange app. Click on that.

At the top of the page, in the “Management” section, click on “Edit Application”. There are 3 important items on the page that follows. The first is the URLs: if you use the defaults for most of your environment, these should probably be left alone. The Windows URL for PowerShell is for the remoting functionality, which will be configured automatically if you have not already done so. The next is the server credentials used to access Exchange and the server; usually “Inherit Windows credential from node” is good enough, assuming the monitoring service account is the one you want to use for monitoring.

Now we’ve got this far, the last thing to do is hit the “Configure Server” button. This process configures WinRM, and the Exchange PowerShell web services for SAM to access.

Usually this step can take a minute or two, and is a perfect time to go grab yourself a drink. When you return, and everything says it was successful, hit the “Test Connection” button just to make sure.

If you’re curious about what goes on behind the scenes of the “Configure Server” button, I’ll also be writing up the manual process, which is exactly what this does.

You are now ready to enjoy statistics and monitoring for your Exchange Mailbox server, including items such as largest mailbox, quota usage, messages sent, DAG and cluster statuses, and the likes.

So far I’m very happy about getting this deployed. It’s actually giving me some numbers behind some of the concerns I’ve had since I started where I work. For example, we have quotas that restrict sending, but not delivery, so mailboxes can exceed the quota size by a lot, as long as they are receiving only.

Edit (04/22/2014): While answering a question on Thwack, it became apparent that part of the problem I had locating the automatic configuration is that a step was missing from the documentation. It mentions going to the “All Applications” resource, which is not on the node page, but on the Applications / SAM Summary page. The thread with that conversation is here. As of 13:57 US Central time, they have said the documentation will be updated to clear up the confusion.

[1] This gives me a sad face, and I’m hoping that it’ll be added as a new feature in upcoming releases. ↩

2014-02-24 · http://jon.netdork.net/2014/02/24/azure-vms-and-setting-subnets-via-powershell

One of the projects I’ve been working on recently is a POC in Azure to allow us to move a collection of desktop users to lower end laptops, while using high end servers to perform a lot of data processing. The idea is that we can spin up and destroy machines as we see fit. The plan was fairly solid, and we built out our domain controllers and a template machine with all the software on it, before configuration. We then used PowerShell to spin up new machines as we needed them.

One of the issues I stumbled over when working on this was making sure the servers were put into the right network. This was important as they were being joined to a domain. I had originally started with something like this:
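The original pipeline looked something like this (names and credential variables are placeholders; this is the classic Azure Service Management module of the time, not the newer ARM cmdlets):

```powershell
# Build the VM config from the golden image, join it to the domain,
# and provision it into the cloud service... but no subnet is specified.
New-AzureVMConfig -Name "worker01" -InstanceSize "Large" -ImageName $imageName |
    Add-AzureProvisioningConfig -WindowsDomain -Password $pass `
        -AdminUsername "localadmin" -JoinDomain "corp.example.com" `
        -Domain "CORP" -DomainUserName "domainjoin" -DomainPassword $domPass |
    New-AzureVM -ServiceName "poc-workers" -VNetName "POC-VNet" -AffinityGroup "POC-AG"
```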

It seemed to me that the New-AzureVM command should have had some method to define which subnet the VM was to be allocated to, but it wasn’t there. What was even more confusing was that this VNet only had a single subnet, so you’d think it might select that, but no such luck.

The answer lies in the Set-AzureSubnet command, which should have been pretty obvious to me. You can add it as part of your provisioning command like this:
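The same pipeline with Set-AzureSubnet spliced in (placeholder names again; the subnet name is whatever you defined in the VNet):

```powershell
# Identical to before, except the subnet is now set on the VM config
# before New-AzureVM provisions it.
New-AzureVMConfig -Name "worker01" -InstanceSize "Large" -ImageName $imageName |
    Add-AzureProvisioningConfig -WindowsDomain -Password $pass `
        -AdminUsername "localadmin" -JoinDomain "corp.example.com" `
        -Domain "CORP" -DomainUserName "domainjoin" -DomainPassword $domPass |
    Set-AzureSubnet -SubnetNames "POC-Subnet" |
    New-AzureVM -ServiceName "poc-workers" -VNetName "POC-VNet" -AffinityGroup "POC-AG"
```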

All I’ve done is added the extra command to the end, and now Azure is happy. This will spin up a new VM and drop it in the right VNet, Affinity Group, and Subnet. Based on the VNet’s network configurations, and DNS settings, the new machine is provisioned, and joined to the domain immediately.

This makes me very happy because this is a quick sample of how we’d proceed with automating and deploying an undefined number of VMs in Azure based off of our golden image. With some minor tweaks we can loop through and spin up 50 machines with little work.

2013-12-04 · http://jon.netdork.net/2013/12/04/joys-of-a-new-work-environment

So it has been a substantially long time since I’ve posted something, and that’s not because I’m being lazy. Well, okay, partially because I’m lazy. Evernote has about 7 notes in it for things I want to post about, mostly issues I’ve resolved, but I’ve just been super busy recently.

One of the things I’ve thoroughly enjoyed about my change in work places has been the learning experiences I’ve been subjected to. Where I used to work was pretty much the same stuff, day in, day out. There was little change, and even the introduction of new companies being acquired really didn’t change that. They were either sucked into the fold and their technologies changed to ours, or they were kept separate and I had little to do with them.

Since changing companies I’ve gone from just using VMware, with a small amount of administering the infrastructure, to being one of the “go-to” people for it in our environment. Same with the storage infrastructure. Where I used to work there were 2 classes of storage: the big beefy HQ stuff, which I had no control over at all, and the local NAS, which I managed. Now I’m one of the “go-to” people for the storage stuff too.

None of that is to say I didn’t learn anything where I used to work. Due to all the issues we had with code and servers, I have a very broad range of troubleshooting skills that have come in very handy. It helps that I also got a good look at a lot of the code there, so I have a facility for reading and understanding code that I probably wouldn’t have otherwise.

The cool thing about the place I work now is that the development team drives a lot of the changes, working with a very agile development structure. They push the boundaries of our infrastructure, and we adapt and solve for their problems or ideas. This has led to some pretty cool stuff, and melding of technologies. For example, I’m currently reading up on IIS ARR[1]. Last month I was tinkering with Windows Azure.

On my list of new things I’ve been learning and playing with at work:

IBM DataPower Appliances

VMware ESXi

HP 3Par SAN storage

Brocade fiber switches

HP servers (used to work in an all Dell office)

HP Blade chassis

Microsoft Azure

Exchange 2010 (Been away from Exchange for a long time)

More PowerShell than just my “tinkering” scripts

More in-depth IIS work

ISA/TMG

Lync 2010/2013 (I built out the infrastructure and deployed both)

McAfee Mail gateways

HP Rapid Deployment tools

Lots more stuff I am always forgetting…

One thing that did surprise me was becoming a mentor of sorts too. People come to me for guidance and tips on issues. I don’t give out answers, but I’ll guide them in the right direction. This has interested me because I’ve never considered myself an educator in any way, but I apparently seem to be doing okay at guiding people.

I love my job, constantly learning, even when not working with new stuff. As my boss and I constantly say “never a boring day”.

[1] IIS Application Request Routing. It’s being used as a potential replacement for ISA/TMG, but it does much more, including load balancing, content caching (think CDN), reverse proxying, SSL offloading, and so on. ↩

2013-06-06 · http://jon.netdork.net/2013/06/06/lync-and-phone-number-normalization

One of the handy things about Lync is that it parses the Global Address List (GAL) and makes the entries available via the Lync client (using the abserver). This means Lync does all the lookups against its own processed copy of the GAL, rather than hitting the GAL directly. Additionally, that processed address book is cached on the client side, allowing much speedier lookups.

One of the things we’d noticed is that Lync likes the phone numbers formatted in a particular manner, otherwise you end up with some very strange number/calling issues. This is a problem because folks update their own address and phone information, resulting in a myriad of number formats in Active Directory. A couple of examples:

555 555 1234

555.555.1234

(555) 555-1234

(555) 555.1234

555.555.1234 x555

555.555.1234 ext. 555

Lync isn’t very happy with this, and will fail to parse these numbers. That is, unless you create normalization rules. These aren’t the same as the “Voice Routing” normalization rules, which are applied when people make calls.

So how do you know Lync doesn’t like the phone numbers you have in the GAL? Lync logs the failures in a file (creatively) called ‘Invalid_AD_Phone_Numbers.txt’ under the file store location. Open the topology builder, look at the “File Stores” section, and go to that path in Windows Explorer. Under that path you’ll find a directory structure that looks like this:

The directory 1-WebServices-1 may have a different number depending on the number of Lync installations you have that are sharing the same file store, or if you’ve performed a transition between 2010 and 2013.

Using one of the above numbers as an example, you may find errors that look like this:

To fix this error, we need to create a normalization rule. These rules are stored in a text file called Company_Phone_Number_Normalization_Rules.txt, which lives in the 1-WebServices-1\ABFiles directory. This file uses regular expressions to match and reformat the numbers to an E.164 format. In the above example, I want to convert the number to +15555551234;ext=555, so I’d use the following regular expression:

^(\d{3})\D*(\d{3})\D*(\d{4})\D*[xX]\D*(\d+)$
+1$1$2$3;ext=$4

Once added to the file, and saved, we can test using the abserver tool with the right arguments. From the <install path>\Microsoft Lync Server 2013\Server\Core we can run the following:
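Using one of the numbers from the list above, a test run looks something like this (from memory the switch is -testPhoneNorm; check the tool’s help output if your version differs):

```powershell
.\Abserver.exe -testPhoneNorm "555.555.1234 x555"
# Should report the number normalizing to +15555551234;ext=555
```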

Note that UseNormalizationRules must be set to True; if it isn’t, use Set-CsAddressBookConfiguration to change it. Once set, you can leave it to the automated process to pick up the changes at the next cycle (in my case 01:30 the following day), or you can use Update-CsAddressBook to force an update.

This process usually takes a little fiddling to adjust for all the variations in phone numbers, but once set up it makes life a lot better for the users.

2013-04-30 · http://jon.netdork.net/2013/04/30/cross-domain-execution-of-lync-commands

For the last few weeks I’ve been performing all the preparation work for Lync 2013 in our organization. We’ve had a very successful Lync 2010 pilot, and instead of expanding 2010 to production, and later having to do a full environment replacement for 2013, we decided to jump straight to 2013. Part of the process, whether a fresh install or an upgrade, is some Active Directory forest and domain preparation. This can be done either using the installation wizard or via PowerShell.

One of these commands is Grant-CsOUPermission. This command is required if you don’t keep your users/servers/computers in the standard containers in AD (i.e., users in the Users container). In our environment we move the users into a People OU, so we needed to run Grant-CsOUPermission to update some container permissions for Lync to work properly, and to allow us to delegate user management. To save some time, I was executing all the commands from one domain against one of the other child domains in the forest, because I didn’t have access to a 64-bit machine in that environment without spending additional time spinning up a client to test with. The Lync PowerShell cmdlets allow for this; it’s what I was doing, and where I hit issues.

I’d first start a PowerShell prompt as a domain admin in the other domain using the runas command:

runas /profile /user:OTHERCHILD\myadmin powershell

Next is to import the Lync modules:

Import-Module Lync

Then the final step is to enable the domain, and grant the necessary permissions to the OUs I needed to modify.
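A sketch of those commands (the OU path is an example):

```powershell
# Enable the child domain for Lync...
Enable-CsAdDomain -Domain otherchild.domain.tld

# ...then grant permissions on the custom OU. Without -Domain, this call
# binds to the domain I was working from and fails with an access error:
Grant-CsOUPermission -ObjectType "User" -OU "OU=People,DC=otherchild,DC=domain,DC=tld"

# Specifying the target domain explicitly allows it to run:
Grant-CsOUPermission -ObjectType "User" -OU "OU=People,DC=otherchild,DC=domain,DC=tld" `
    -Domain otherchild.domain.tld
```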

Adding the domain allowed execution. Without it, the command was trying to bind to mychild.domain.tld and then access the OU through the link to otherchild.domain.tld, and my account in otherchild.domain.tld didn’t have domain admin access in mychild.domain.tld; hence the error.

So, learning lesson of the day: either execute all the commands on a server in the domain you are working on, or remember to specify the domain. As a side note, the Microsoft documentation is a little fuzzy here, because it says you must sign in to a domain member in the domain where you wish to execute the commands, but then specifies that you can execute the commands from a different domain. It gets a little confusing, but once you wrap your head around the fact that you can do this across domains, and that you must specify the domain even if the OU hints at a different domain, things are a little easier to work with.
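For reference, the working form of the commands looked roughly like this. These are reconstructed with placeholder values (the domain names, OU, and object type are examples, not the real ones); the -Domain parameter is the piece that matters:

```powershell
# Placeholder names; -Domain is what makes cross-domain execution work.
Enable-CsAdDomain -Domain mychild.domain.tld
Grant-CsOUPermission -Domain mychild.domain.tld `
    -OU "ou=People,dc=mychild,dc=domain,dc=tld" -ObjectType "User"
```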

]]>2013-03-21T08:49:00-07:00http://jon.netdork.net/2013/03/21/remote-registryEarlier today while updating some documentation, I noticed 2 of the servers being monitored in SolarWinds SAM were reporting applications in an “unknown” state. When I pulled up the display, and looked at the details of the state, it was throwing an error:

Bad input parameter. HResult: The specified object is not found on the system.

I thought this was a little weird, as the monitors used to work, and the server hadn’t been patched or otherwise changed recently.

The first step was verifying that the counters it was looking for really existed. Logging onto the server, I opened Performance Monitor and tracked down the supposedly missing performance counters. They were there. Maybe it was an issue with accessing them remotely, so I jumped on the SolarWinds server and tested remote counter access from there, repeating the same process but specifying the remote server name. Again, no issue.

This is when I decided to do some searching, and stumbled across a Thwack post that mentioned the same error, but related to Exchange. They had basically done the same testing as I had, but were urged to open a ticket with their support department for more troubleshooting help.

The last post before mine in that thread was the answer I was looking for. They were experiencing an issue with the Remote Registry service, and a simple restart fixed it. I took a look at the services on both of the servers I was seeing issues with, and the problem jumped out immediately: both were using about 600MB of memory and 50k handles, which is very unusual for a service such as Remote Registry. This tripped a light bulb, as I had been working on an issue with a coworker, and he had identified a bug and hotfix for memory leaks in the Remote Registry. KB2699780 details the same behavior, and we were already scheduled to deploy this hotfix on a different set of servers for a similar issue.

A quick restart of the Remote Registry service had the applications successfully polling again; now to schedule some maintenance to get the hotfix applied to these servers.

]]>2013-02-11T19:56:00-08:00http://jon.netdork.net/2013/02/11/exchange-2007-slash-2010-and-hiding-mailboxesLike any large organization, we have automated processes that go and happily disable user accounts on termination. This process looks in our HR database for certain flags, and reacts accordingly. As part of the termination/disabling process, it’ll also flag their email account to be hidden from the Exchange Global Address List (GAL).

In Exchange 2003, hiding accounts from the GAL used to be handled by an Active Directory (AD) user attribute called msExchHideFromAddressLists. When this was set to TRUE, the user would be hidden from the GAL. Our HR applications toggle this flag for disabled users to hide them from other users.

This all worked fine for a long time, until Exchange 2007 rolled around. I guess there was plenty of push to allow hiding a user from all the GALs while still allowing specific GALs to include them, so Microsoft introduced a new AD user attribute called showInAddressBook. The problem is that if you toggle msExchHideFromAddressLists but have a value set for showInAddressBook, the user account is no longer hidden from the GALs listed in the latter attribute.

Can anybody see where this is going? Yup, it appears that all the user accounts were getting the default GALs assigned to the showInAddressBook attribute, so even when the option to hide them was set, they were still showing up1. This was causing problems, as people that were disabled/terminated were still showing up in the GAL, causing some confusion and concern.

I started to poke around, and bashed together a quick PowerShell script that walks through all disabled users that have a showInAddressBook attribute and wipes out that attribute.

The LDAP filter the script uses boils down to (&(objectClass=user)(userAccountControl:1.2.840.113556.1.4.803:=2)(showInAddressBook=*)). If you’ve not seen LDAP queries before, they work by starting with the operator (AND, OR, etc.), followed by the conditions it applies to. So the filter above reads as:

(objectClass=user) AND (userAccountControl:1.2.840.113556.1.4.803:=2) AND (showInAddressBook=*)

It can get a little more complicated when you start stringing together multiple options such as AND and OR operators, and various combinations of them. In this example, we’re going for pretty simple.
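The userAccountControl clause above deserves a note: the long OID is a bitwise matching rule. As a rough illustration in Python (the post’s scripts are PowerShell; the flag values below are the standard AD ones), the rule boils down to a flag check:

```python
# The OID 1.2.840.113556.1.4.803 is LDAP_MATCHING_RULE_BIT_AND, which
# means "all of these bits are set". So the clause
# (userAccountControl:1.2.840.113556.1.4.803:=2) matches accounts where
# bit 0x2 (ACCOUNTDISABLE) is set, i.e. disabled accounts.

ACCOUNTDISABLE = 0x2

def is_disabled(user_account_control: int) -> bool:
    """Mirror the LDAP bitwise-AND rule for the ACCOUNTDISABLE flag."""
    return bool(user_account_control & ACCOUNTDISABLE)

# 512 is NORMAL_ACCOUNT (enabled); 514 is NORMAL_ACCOUNT + ACCOUNTDISABLE.
print(is_disabled(512))  # False
print(is_disabled(514))  # True
```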

I then used the .NET System.DirectoryServices.DirectorySearcher class. This runs the LDAP query specified and returns all matching results. Next was a case of walking through the results and fetching a DirectoryEntry object to edit the properties. In this case, we set the attribute to $null, which removes it.

After letting this script run over about 25k disabled users, it cleared up the fluff in the GAL, and made HR happy.

As a weird side-note to this: if you check the box to hide the user in the Exchange management tools, it removes the showInAddressBook value on its own, and the same goes for the PowerShell options.↩

]]>2013-02-07T19:08:00-08:00http://jon.netdork.net/2013/02/07/octopress-and-openidOne of the things I had completely forgotten about during my migration from WordPress to Octopress was OpenID. I had used one of the few OpenID plugins for WordPress that allowed you to use WordPress as an OpenID provider, giving me the ability to log in to sites using my WordPress site.

This was great, and I’d completely forgotten about it because I rarely used it. That was until yesterday, when somebody on the #Nagios IRC channel asked a question, and then posted the same question to Stack Overflow. I decided to answer the question over there, and remembered I had signed up for the account using OpenID, so I dutifully typed in my site URL, and was stumped when I wasn’t redirected.

This is where I did a little face-meets-desk action. I’d killed my OpenID account by killing off my WordPress site. I tried to think of a way around this, did some quick searching, and stumbled upon a post by Darrin Mison on the exact same topic. Darrin had left his WordPress site active over on WordPress.com but had migrated to his own URL elsewhere. Because of this, Darrin was able to use what is called a delegate, telling anybody making a request to look elsewhere to authenticate.

This sparked a vague memory, and reminded me that when I first started tinkering with OpenID, I used a different site for the authentication, so a quick check, and I was able to login there. Now I just needed to edit my Octopress site to provide the delegate information.

I used myOpenID.com as my delegate, and they have a help article on how to handle using your own URL. Following what Darrin had done, I edited source/_includes/custom/head.html and added the lines that were mentioned in the help doc. So now my head.html template looks like this:
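Roughly, anyway; the original template was lost in the migration, so this is reconstructed from the standard myOpenID delegation pattern, with a placeholder username:

```html
<!-- OpenID delegation: point consumers at myOpenID and identify the
     account there. "username" is a placeholder. -->
<link rel="openid.server" href="http://www.myopenid.com/server" />
<link rel="openid.delegate" href="http://username.myopenid.com/" />
<link rel="openid2.provider" href="http://www.myopenid.com/server" />
<link rel="openid2.local_id" href="http://username.myopenid.com/" />
```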

Pretty simple, and a rebuild of the blog, and my page now includes the delegate headers required to redirect OpenID requests.

]]>2013-01-11T18:06:00-08:00http://jon.netdork.net/2013/01/11/adding-xml-elements-using-powershellIn a follow-up to my previous post on removing XML elements using PowerShell, I decided to figure out how to add elements using PowerShell. I’m working with the same file from Remote Desktop Manager (RDM) and adding remote management configurations based on DNS checks.

In the enterprise licensed version of RDM you are given the ability to add “remote management” interface details to a host configuration. In our environment, that remote management interface is iLO, and is available from a dedicated IP address, over HTTPS, giving you access to a remote console as well as power management features. RDM handles this with a small tweak to the XML file adding another element under the connection meta information.

I’ve removed most of the information, which you can see in the previous post.

As we’re trying to be careful with the file, we need to first validate that the XML has a MetaInformation element, and then an existing ServerRemoteManagementUrl element. If one, or neither, exists, it gets created. Not all hosts, such as virtual machines, have iLO interfaces, so we also need to verify the presence of a DNS record, and only create the entry if one exists.

Again working with a copy of the original file, I use some crafty XPath queries to select only the connections that are RDP. I then loop through the connections/nodes and extract the name. The script tests for the presence of the MetaInformation element, creating it if it doesn’t exist, and then checks for the ServerRemoteManagementUrl element; if that isn’t there, it creates it and proceeds with the DNS validation.

The DNS lookup unfortunately throws an exception rather than returning a $null or empty object, so I had to throw in some quick dummy catch code that doesn’t really do anything. If a DNS record is returned, the script creates the new element and adds it to the MetaInformation element. For the final step, I saved the result to a second file so I could compare the two and make sure it did what I expected.

One thing to note about adding elements to an XML document: the CreateElement calls are not executed against the node you are adding the element to, they are executed against the document root. This is so the element gets all the correct namespace information. You then append the new element to the existing element.
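The original PowerShell didn’t survive the blog migration, but the flow can be sketched in Python with xml.dom.minidom, whose createElement is document-scoped in the same way. The element names come from the RDM file; the iLO URL is a made-up example:

```python
from xml.dom import minidom

# Rough Python analogue of the approach described above: elements are
# created from the document, then appended to the target node.
doc = minidom.parseString("<Connection><MetaInformation/></Connection>")
conn = doc.documentElement

# Find or create the MetaInformation element.
meta_nodes = conn.getElementsByTagName("MetaInformation")
meta = meta_nodes[0] if meta_nodes else conn.appendChild(
    doc.createElement("MetaInformation")
)

# Only add ServerRemoteManagementUrl if it is not already present.
if not meta.getElementsByTagName("ServerRemoteManagementUrl"):
    # Created on the document, not on the meta node itself.
    url = doc.createElement("ServerRemoteManagementUrl")
    url.appendChild(doc.createTextNode("https://ilo-host01.example.com"))
    meta.appendChild(url)

print(doc.toxml())
```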

]]>2013-01-09T19:35:00-08:00http://jon.netdork.net/2013/01/09/removing-xml-elements-using-powershellEvery now and again I have to strip out elements from an XML file. In this case, I was doing some cleanup of my Remote Desktop Manager configuration file. When I first started my current job, to save a lot of discovery, my boss shared his configuration file. Unfortunately, the configuration file had a lot of hosts with duplicate configuration information that wasn’t relevant, because the “duplicate” option had been used to copy existing hosts. This meant stuff like the host description had been copied along.

Remote Desktop Manager (RDM) uses an XML file for its configuration, which makes editing it really easy. To clean up the invalid descriptions, I used a little PowerShell and some XML know-how. Here is an example entry I need to clean up…

Pretty simple, but here is how it works. The first line is pretty obvious: it gets the content of the file1, and then explicitly converts the array object into XML using the [xml] type accelerator. The next bit is where it gets a little harder, and requires a little knowledge of XPath syntax. The code selects a single node named “Description” whose data is ‘HP Command View EVA’. If it’s found, it returns an XmlElement object; otherwise $node ends up being $null. This gives us the ability to wrap the search in a loop and remove the elements we don’t need. To remove an element, you have to tell the parent node to remove it, so you ask the node to go back to its parent to remove itself. A little weird, but it works. The final step is to save it back to a file.
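The original snippet didn’t survive the migration, but the same parent-removes-child dance can be sketched in Python with xml.dom.minidom. The layout below is a trimmed-down stand-in for the RDM example, keeping the Description text from the post:

```python
from xml.dom import minidom

# Locate the Description node whose text matches, then ask its parent
# to remove it, just like the PowerShell does.
doc = minidom.parseString(
    "<Connection>"
    "<Description>HP Command View EVA</Description>"
    "<Url>server01.example.com</Url>"
    "</Connection>"
)

for node in doc.getElementsByTagName("Description"):
    if node.firstChild and node.firstChild.data == "HP Command View EVA":
        node.parentNode.removeChild(node)  # the parent does the removing

print(doc.toxml())
```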

The hardest bit about handling XML is knowing how XPath stuff works, once that is understood, the rest is usually pretty easy. PowerShell treats XML as an object, so it’s easy to figure out what you can do with the objects using Get-Member.

Which I had copied to C:\Temp to make a backup of, instead of working on the real file.↩

]]>2013-01-06T20:10:00-08:00http://jon.netdork.net/2013/01/06/updating-ilo-firmware-using-hponcfg-and-xmlIn the course of updating all of our HP BladeSystem (BL465c) blade servers over the last few weeks, I’ve stumbled across some interesting things. For example, you can update all the iLO cards at once if you have an Onboard Administrator (OA), a TFTP server, and a little XML knowhow…
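The XML file itself was lost in the blog migration; it was a RIBCL script shaped roughly like this (the firmware image path is an example, and the login values are ignored, as noted below):

```xml
<!-- update_firmware.xml: RIBCL script handed to hponcfg. -->
<RIBCL VERSION="2.0">
  <LOGIN USER_LOGIN="ignored" PASSWORD="ignored">
    <RIB_INFO MODE="write">
      <UPDATE_RIB_FIRMWARE IMAGE_LOCATION="tftp://TFTP_SERVER/ilo2_firmware.bin"/>
    </RIB_INFO>
  </LOGIN>
</RIBCL>
```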

This gets saved as an XML file on the TFTP server; I named it update_firmware.xml. The USER_LOGIN and PASSWORD fields do not matter, as single sign-on is used from the OA. The iLO update binary is put on the TFTP server as well (use the version applicable to the hardware you’re updating). Then comes the easy bit. SSH to the Onboard Administrator, and execute the hponcfg command as such:

hponcfg ALL tftp://TFTP_SERVER/update_firmware.xml

If you only need to update a single blade, change ALL to the blade number. Otherwise, this will download the iLO firmware update, push it to each of the iLO cards in the BladeSystem chassis, and then restart them. This will not impact the running server. You should see output like this once it has started:

And that’s it, the magic is done. Using hponcfg is possible from Windows as well when updating the local machines, so it’s quite possible to use the same XML (though I’ve not tested it).

]]>2013-01-05T18:50:00-08:00http://jon.netdork.net/2013/01/05/five-saturdaysNote: I started writing this post at the beginning of December, but due to time issues, and working on blog migrations, I never got around to posting. I’ve still decided to post because it throws in some PowerShell goodies.

The internet is such a gullible place. Really it is. People post nearly everything they see to Facebook because it sounds plausible, usually accompanied by some cool picture to make it seem more important.

An example of one that keeps coming up…

This year, December has 5 Saturdays, 5 Sundays, and 5 Mondays. This only happens once every 824 years.

Along with some blah blah crap about money, and Chinese superstitions. I’m not sure why people don’t stop and think for just a second, and wonder how that could be possible.

Let’s do some mental math and see what happens. December has 31 days, so regardless of the year, there will always be 3 days of the week that occur 5 times that month. What are the chances that any month with 31 days would start on a Saturday? You’d think pretty high, and I’m guessing a little more frequent than once every 824 years.

To prove the point, I threw together some PowerShell to figure out how many might occur within the next 20 years.

So what is this doing? The first line grabs November 1st, and then it loops 300 times. Each loop it adds a month, and figures out what day of the week the month starts on, and the number of days in the month. If there are 31 days in the month, and the day is Saturday, it outputs the year and the month. So how did it look? Did I get no results because I’m inside the 824 years? Far from it…
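The original script and its output were lost in the blog migration, but the loop described above is easy to sketch in Python (the starting month matches the post; the 300-iteration count does too):

```python
from datetime import date

# Walk 300 months forward from November 2012 and record every 31-day
# month that starts on a Saturday (and therefore has five Saturdays,
# Sundays, and Mondays).
MONTHS_WITH_31_DAYS = {1, 3, 5, 7, 8, 10, 12}

matches = []
year, month = 2012, 11
for _ in range(300):
    month += 1
    if month > 12:
        month, year = 1, year + 1
    first = date(year, month, 1)
    # weekday() == 5 is Saturday
    if month in MONTHS_WITH_31_DAYS and first.weekday() == 5:
        matches.append((year, month))

for y, m in matches:
    print(y, m)
```

Running it, December 2012 shows up, and so do plenty of other months, rather than nothing for 824 years.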

So it looks like it occurs quite frequently. Let’s also assume for a second that they meant only December; if we look at the results, we can see it’s pretty consistent: 6 years until the next event, then 11 years, then another 6 years, much more frequent than once every 824 years.

So my tip of the day, if you feel the urge to repost somebody’s random image and something doesn’t seem right, hit Google or Bing, and search for part of the phrase and see what you come up with.

]]>2013-01-04T19:42:00-08:00http://jon.netdork.net/2013/01/04/wordpress-to-octopressWell, it’s only taken me about 3 months of on-off messing about, but I’ve finally cut my blog over to Octopress. It’s been a long ride with WordPress, but the overhead and all the fluff were more than I needed. Octopress is nice and simple, and just gets stuff done.

The hardest part was exporting and converting. Six years of WordPress usage had me tinkering with all sorts of plugins to get various fun things working, plugins that ultimately ended up abandoned. Unfortunately, after abandoning them, I never played clean-up and fixed all the data they left floating around. One major example is the 4+ different code formatting plugins I’ve used.

I don’t remember all the steps I took to get it done, but have most of it documented. I’ll write it up some day. For now, let me know if you spot anything terribly wrong.

]]>2012-07-18T00:00:00-07:00http://jon.netdork.net/2012/07/18/lync-federation-and-dnsOver the last few weeks, I’ve been working on a Microsoft Lync pilot at work. One of the requirements was external federation. This feature basically allows instant messaging (IM) between users at two different organizations. So for example, you are CompanyA, you do business on a regular basis with CompanyB, and both of you are using Lync. Federation allows you to add each other to your Lync clients and talk to each other.

The configuration and implementation went pretty smoothly, but I was having intermittent issues with federation. The problem came up when adding an external company that hadn’t explicitly added us to their federated domains list. Initially we had dismissed the issue as a firewall issue because we got federation working with some consultants, however I was later asked to add a vendor and started seeing the same issues.

After some testing, I wasn’t getting any closer, so I enabled the client logging options in the Lync client. Those are found under Options, General, and “Turn on Logging in Lync”. This writes a log file to a Tracing folder under your user profile directory (C:\Users\username\Tracing). When I started digging into the logs, some errors popped out at me.

ms-diagnostics: 1027;reason="Cannot route this type of SIP request to or from federated partners";

Without doing much digging, the first suggests the request I’d sent to the vendor couldn’t be routed, and the second reports that no error explaining why it couldn’t be routed was returned from the remote side. This made me think it was a potential firewall issue again. By doing the basic testing of validating that our Edge server was accepting incoming connections, and that I could connect to the vendor’s Edge server, I eliminated the firewall as the issue. This got me really scratching my brain.

I ran a SIP stack trace from the Lync Edge server and saw more unusual errors, such as “504 Server time-out”. This was beginning to frustrate me, I had confirmed that both servers could talk to each other, why were they getting timeouts?

I decided to go back to basics and start at the very bottom. The first thing was connectivity, which we’d already established using telnet to the servers. The next was DNS. Lync, like a lot of Microsoft products, takes advantage of service (SRV) records in DNS. This type of record tells the requesting client the protocol, port, and host to connect to. In this case, the Edge server looks up the SRV record _sipfederationtls._tcp.sipdomain.com. The response should look something like this:

So the protocol is TCP, the port is 5061, and the server I need to connect to is sipexternal.sipdomain.com. I ran a check against our domain and the vendor’s domain, and both came back with records. Except, with them so close together on the screen, I immediately spotted an issue.
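For reference, an SRV answer in zone-file form carries priority, weight, port, and target. A quick Python sketch pulls out the two fields federation actually depends on; the record text here mirrors this post’s example values, not real DNS data:

```python
# An SRV record in zone-file form:
#   name TTL class type priority weight port target
record = ("_sipfederationtls._tcp.sipdomain.com. 3600 IN SRV "
          "0 0 5061 sipexternal.sipdomain.com.")

name, ttl, rclass, rtype, priority, weight, port, target = record.split()

print(port)    # the port the remote Edge server must be listening on
print(target)  # the host to connect to
```

A single transposed digit in that port field (5601 instead of 5061) is all it took to break federation.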

Ignoring the difference between 300 and 3600, the Time-To-Live of the record, the next difference was the port numbers. It looks like I had made a simple transposition of numbers. I did a quick test from outside the firewall and confirmed that 5601 was not open. I went back through the firewall change tickets and confirmed I had requested 5061, and the Lync configuration was also set to 5061.

A quick DNS change for the SRV record, fixing the port, and within 10 minutes I received 3 notifications from the vendor that they had staff adding me to their contact lists.

One of the things I’ve come to learn over the years: whenever there is something awfully quirky going on and you cannot quite figure out what’s causing it, take a look at DNS. I’ve had a number of problems that turned out to be a simple DNS issue. In this case it was simple human error, but it boiled down to DNS saying one thing when it should have been saying another.