Monday, November 25, 2013

I recently started fighting an issue where my mouse pointer will become unresponsive on the screen. The right and left click still work. But the touch pad just stops on me. The only time I have this happen is after I wake my computer up. Not sure if it is from sleep or hibernate.

It is kind of hard to troubleshoot mouse issues when you can't use the mouse to help you. I can reboot the computer, but I would rather not. I am bad about leaving a lot of windows open and I try not to reboot if I can help it.

The strange thing was that the Dell touch pad software in the system tray could still see me touching the touch pad. It shows a mini touch pad and highlights where my fingers are. The first thing I decided to do was close that app. I then ran into my first challenge: how do I get focus to the task bar and send the right click command to the app?

I decided it was just easier to fire up PowerShell and kill the process. That was easy enough to do, but it didn't solve anything.

My next idea was to remove the driver and scan for hardware changes. It was a bit of a challenge using Device Manager without the mouse. I removed every mouse, keyboard, and HID driver I could find. I thought this worked once, but I could not get any results the next few times I tried it.

I then tried something else. I decided to kill every process on my computer, one at a time, until it either started working or crashed. I already had PowerShell open from before. I typed in this command:

Get-Process | %{$_.name; $_.close(); sleep 3}

It took every process, printed the name, closed it, and then waited for 3 seconds. I did it this way so I could tell which process was closed just before the mouse started working. Best of all, it worked. It went through about 8 or so processes before it got to csrss. My mouse sprang back to life and I stopped the script before it killed too many processes on my machine.

I still don't know why that process is a problem. The next time this happens, it will be the first place I look.
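For anyone repeating this experiment, here is a more cautious sketch of the same one-at-a-time idea. It is not the command from above; the function name `Invoke-OneAtATime`, its parameters, and the skip list are made up for illustration. One detail worth knowing: on a `System.Diagnostics.Process` object, `Close()` releases the resources PowerShell holds for that object, while `CloseMainWindow()`, `Kill()`, and the `Stop-Process` cmdlet are the calls that ask a process to exit or end it.

```powershell
# Sketch only: walk a list of objects one at a time, print each name,
# run an action on it, then pause so you can see which step changed things.
# The function name, parameters, and skip list are illustrative assumptions.
function Invoke-OneAtATime {
    param(
        [Parameter(Mandatory = $true)] [object[]] $Items,
        [Parameter(Mandatory = $true)] [scriptblock] $Action,
        [int] $DelaySeconds = 3,
        # Names to leave alone; these are core Windows processes.
        [string[]] $Skip = @('System', 'Idle', 'smss', 'csrss', 'wininit',
                             'winlogon', 'services', 'lsass')
    )
    foreach ($item in $Items) {
        if ($Skip -contains $item.Name) { continue }
        $item.Name                      # print the name before acting on it
        & $Action $item
        Start-Sleep -Seconds $DelaySeconds
    }
}

# Dry run: -WhatIf reports what Stop-Process would do without ending anything.
# Invoke-OneAtATime -Items (Get-Process) -Action { param($p) Stop-Process -Id $p.Id -WhatIf }
```

Swapping the action for `{ param($p) $p.CloseMainWindow() }` would be the gentler version of a real run, since it only asks windowed apps to close.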

Wednesday, November 06, 2013

I started to use RRDTool to record and chart performance counters of my servers. Here is a dashboard for one system I deal with. There are about 6 servers involved. Some counters apply to different servers.

There is a lot going on in this chart. It shows terminal server sessions, IOPS, available RAM, CPU, SQL connections, and SQL batches. Together, these are the key counters that really show the health of the system. I don't report network activity because it is not as important in this context.

If you look at this next one for a single file server, I look at different counters. The nice thing about using RRDTool to generate graphs is that you can really build something customized to your environment.

The last one I am going to post is of my Hyper-V cluster. The iSCSI storage connections for each individual node are at the top. Across the middle is the aggregate activity of all hosts and guests to the rest of the network. The bottom shows IOPS on the SAN and available RAM for the cluster.

This is still a work in progress but these charts are already providing a lot of value and insight into our servers.

I jumped right in with this one. Ping the host forever every 14 seconds and record it in our datafile. I know we defined the file as receiving samples every 15 seconds but it is ok if we record more often. RRDTool will aggregate the values for us. Our next step is to produce a graph.
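A loop like the one described can be sketched in PowerShell as follows. This is an illustration under assumptions, not the original script: the function names, the host name `server01`, the file name `ping.rrd`, and the one second timeout are all placeholders. The only RRDTool behavior relied on is that `rrdtool update` accepts `N:<value>` for a sample stamped with the current time and `U` for an unknown value.

```powershell
# Sketch only: names, paths, and the one-second timeout are assumptions.

# Turn a ping result into the value string that "rrdtool update" expects:
# "N:<value>" stamps the sample with the current time, "U" marks it unknown.
function Format-RrdSample {
    param($Milliseconds)
    if ($null -eq $Milliseconds) { return 'N:U' }
    return "N:$Milliseconds"
}

# Ping a host once; return the round trip in ms, or $null if it did not answer.
function Get-PingTime {
    param([string] $HostName, [int] $TimeoutMs = 1000)
    $ping = New-Object System.Net.NetworkInformation.Ping
    try {
        $reply = $ping.Send($HostName, $TimeoutMs)
        if ($reply.Status -eq 'Success') { return $reply.RoundtripTime }
    } catch {
        # Name resolution failures and similar errors count as "no answer".
    } finally {
        $ping.Dispose()
    }
    return $null
}

# Usage (hypothetical host and file names; assumes rrdtool is on the PATH
# and ping.rrd was already created):
# while ($true) {
#     & rrdtool update ping.rrd (Format-RrdSample (Get-PingTime 'server01'))
#     Start-Sleep -Seconds 14
# }
```

Recording an unknown value when the host does not answer keeps a gap in the graph instead of a misleading zero.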

The graph will be 400x300 over the last 15 minutes with the average ping in red. It may look something like the one below. I had to extend it out 4 hours to show a little variety in ping times.

(Image: RRDTool ping graph sample)

I took this further and created a dashboard for all of my servers. I customized the charts so that they looked green if the host is up and red if the host goes offline.

The graph command for something like this gets a little more complicated, but it highlights what you can do with RRDTool. This graph shows 5 minutes of time and goes gray after 24 hours of downtime. Here is the code for those charts if you are interested.

Thursday, July 04, 2013

Storage Spaces is not exactly software RAID. RAID is defined as a redundant array of independent disks. When admins think about RAID, they look at the RAID levels, each of which has a very well defined meaning as to how each disk is involved. The primary goal is to let you lose a disk with minimal impact and also to provide better performance. Storage Spaces has the same goal, but goes about it very differently.

When thinking about RAID, it's usually in terms of whole disks. When thinking about Storage Spaces, it's in terms of 256KB chunks of data spread across a pool of disks. With Storage Spaces, you don't use hot spares; you leave free space in the pool so it can maintain redundancy. Rebuild times are significantly faster and not limited to the work of a single disk.

I want to give you some examples on how each one handles the data differently.

In our RAID 1, disks 1 and 2 will be a mirror pair (500G usable). Every byte will be written to both disks in the same sectors. Reads can happen from either disk, which can give 2x read performance for some workloads. The configuration can survive the failure of one disk. If disk 1 fails, the data from disk 2 is copied to the hot spare (disk 3). Disk 1 can then be replaced and becomes the new hot spare.

In our Storage Spaces mirror, all 3 disks will be added to the pool. While we could allocate 1TB of space, I want to keep the recovery scenario close to the one above. We will allocate 500G to the mirror volume and leave 500G free (instead of a hot spare). Every 256KB chunk of data is written to 2 of the 3 disks, so data will exist on all 3 disks. Reads can happen from any of the 3 disks. I won't compare the read performance of this example, but it evens out to RAID 10's 2x read when more disks are involved. The configuration can survive the failure of one disk. If disk 1 fails, then disks 2 and 3 will copy data between themselves to maintain the mirror (this is why we left the space free).

In our RAID 10, it's common to pair up the disks and then stripe the data across those pairs. When data is written, both disks in a pair write the same data. When a disk fails, the other disk in the pair copies all its data to a hot spare, which becomes its new partner. You can replace the failed disk and have the hot spare hand that data back over to the replacement. That would result in a second full copy of the data.

Storage Spaces would write your data in pairs to any two disks. None of the disks are mirrors of any other disk. It just makes sure that every 256KB chunk of data exists in 2 locations. This process can be enclosure aware so that the data is mirrored across enclosures. When a disk fails, the mirror of that data already exists in 256KB chunks across the other 14 disks. Those 14 disks copy the 256KB chunks to different disks to rebuild the mirror. (This can happen very fast because all remaining disks work to copy the data instead of just one.) When you replace that failed disk, nothing happens. No rebuilds and no recopies. Data is only added back onto that disk when data is written to the volume. There are no disk pairs to micromanage.
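To make the chunk idea concrete, here is a toy model in PowerShell. It is only a sketch of my understanding: the round-robin placement and the "least loaded disk" rebuild rule are assumptions chosen to keep the example small, not how Storage Spaces is documented to place data, and the function names are made up.

```powershell
# Toy model only. The placement and rebuild rules here are assumptions.

# Place each chunk on two different disks, walking the pool round-robin.
function New-MirrorLayout {
    param([int] $DiskCount, [int] $ChunkCount)
    $layout = @{}
    for ($c = 0; $c -lt $ChunkCount; $c++) {
        $layout[$c] = @(($c % $DiskCount), (($c + 1) % $DiskCount))
    }
    return $layout
}

# Drop a failed disk, then give every chunk that lost a copy a new home on
# the surviving disk that currently holds the fewest chunks.
# Assumes at least 3 disks so a target other than the survivor exists.
function Repair-MirrorLayout {
    param([hashtable] $Layout, [int] $DiskCount, [int] $FailedDisk)
    $load = @{}
    foreach ($d in 0..($DiskCount - 1)) {
        if ($d -ne $FailedDisk) { $load[$d] = 0 }
    }
    foreach ($pair in $Layout.Values) {
        foreach ($d in $pair) { if ($d -ne $FailedDisk) { $load[$d]++ } }
    }
    foreach ($chunk in @($Layout.Keys)) {
        $pair = $Layout[$chunk]
        if ($pair -notcontains $FailedDisk) { continue }
        $survivor = $pair | Where-Object { $_ -ne $FailedDisk }
        $target = $load.Keys | Where-Object { $_ -ne $survivor } |
            Sort-Object { $load[$_] }, { $_ } | Select-Object -First 1
        $Layout[$chunk] = @($survivor, $target)
        $load[$target]++
    }
    return $Layout
}

# 15 disks, 300 chunks, then disk 0 fails.
$layout = New-MirrorLayout -DiskCount 15 -ChunkCount 300
$layout = Repair-MirrorLayout -Layout $layout -DiskCount 15 -FailedDisk 0
```

After the repair, every chunk again has two copies on two different surviving disks, and the new copies are spread over many disks rather than landing on one spare.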

* There is not a lot of information on exactly how storage spaces works. This is the way I understand it from the information that I have found. If you have a better understanding of Storage Spaces, I would appreciate any feedback.

Monday, May 27, 2013

I set up a handy script a while back that allows me to right click a script to sign it. I already had the code signing cert worked out. I just needed an easy way to sign things. Once you have the base scripts in place, it's easy to sign .ps1, .vbs, .dll, .exe, and RDP files.

Friday, May 24, 2013

We all know that AppLocker can stop a lot of things we don't want running on the computer. That includes malware. If you are not ready to pull the trigger, audit mode can still be a great asset.

Audit mode tells you about everything that is running on your system. It creates a log entry every time you run a program. That log will tell you whether the app would have been allowed to run or would have been blocked, and why. A log like that can give you a lot of information.

Once you start building rules, it gets even better. Then you can filter on the things that would have been blocked. If you see something that is legit, then you can create a rule for it.

Things like malware just jump out at you in those logs. A quick script like this will show you where it's hiding.
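Here is one way such a script could look. Treat it as a sketch rather than the exact script: the function names are made up, and it assumes each audit message (event 8003) starts with the file path followed by " was allowed to run but would have been prevented ...", so check a few messages in your own log first.

```powershell
# Sketch only. Assumes each 8003 message starts with the file path, followed
# by " was allowed to run but would have been prevented ...".
function Get-AuditedPath {
    param([Parameter(ValueFromPipeline = $true)] $LogEntry)
    process {
        if ($LogEntry.Message -match '^(?<path>.+?) was allowed to run') {
            $Matches['path']
        }
    }
}

# Count would-be-blocked files per folder so odd locations stand out.
function Group-ByFolder {
    param([Parameter(ValueFromPipeline = $true)] [string] $Path)
    begin { $counts = @{} }
    process {
        $cut = $Path.LastIndexOf('\')
        $folder = if ($cut -gt 0) { $Path.Substring(0, $cut) } else { $Path }
        $counts[$folder] = 1 + [int]$counts[$folder]
    }
    end {
        $counts.GetEnumerator() | Sort-Object Value -Descending |
            ForEach-Object { [pscustomobject]@{ Hits = $_.Value; Folder = $_.Key } }
    }
}

# Usage on a Windows machine with AppLocker auditing enabled:
# Get-WinEvent -LogName "Microsoft-Windows-AppLocker/EXE and DLL" |
#     Where-Object { $_.Id -eq 8003 } | Get-AuditedPath | Group-ByFolder
```

Grouping by folder is a design choice: a temp or profile folder with a lot of hits is usually the first thing worth a closer look.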

Friday, May 17, 2013

You already know that I am fast to apply updates and move to new products. I have almost all our WSUS updates set to auto approve and I load them on our servers and workstations the same day. I have a lot of faith and confidence in the patching process. But every once in a while, they bite back.

I can recall a few years ago that Microsoft got a lot of flak for blue screening computers with an update. I was keeping up to date with the situation from various news feeds. I unapproved the update while I investigated it more. We were not seeing any blue screens but other admins were. I saw all kinds of email flying around warning people and reporting issues.

As real details started to pour in, it turned out that the only computers that were blue screening were the ones with rootkit infections. I immediately pushed that patch out to the rest of my computers. If my computers were infected, I wanted to know about it. In the end we had a clean bill of health but not everyone was so lucky.

I had the Bing desktop search bar get deployed once. Turns out I had feature packs on auto approve. I quickly fixed that and recalled it.

Last year Microsoft released a patch that stopped trusting certs with weak keys. I had been following the progress of this issue for a while. The Flame malware had been signed using a Microsoft cert that was weak enough to break. MS revoked that cert and later released this patch, which blocks RSA keys shorter than 1024 bits and so broke all 512 bit certs. I took a quick peek at our central IT's cert server. While I did see a few of those 512 bit certs, I saw many more 1024 and 2048 bit ones. I figured we had nothing to worry about.

Turns out that our email was using one of the weaker certs. So every one of my users was getting an error message that Outlook did not trust our email server. I got on the phone with central IT and pushed them to get an updated cert rolled out. They recommended that the rest of the org not install that patch. It turned out that they needed to update a root cert first, and that is kind of a delicate process when you don't do it very often. That was not something they were going to put a rush job on. Luckily Microsoft had a KB that talked about this issue and offered a command that would trust 512 bit certs again.

I was able to push that command out to everyone with PowerShell and life returned to normal. I was able to revert that setting once the certs were taken care of.

The one update that almost bit us the hardest was the PowerShell 3.0 and remote management update that was released around December 2012. We started to run into some strange issues with remote PowerShell and SCCM (Config Manager). Before we knew it, we realized that we could not remotely PowerShell anything. SCCM was also down and out. I started to deep dive into the internals of WinRM to fix this. Listeners were broken and PowerShell was refusing to re-register settings it needed for remote management.

Something reminded me that PowerShell 3.0 was out, and I found it on our workstations. We started finding reports of compatibility issues between SCCM 2012 and PowerShell 3.0. Config Manager was attempting to repair WMI but would corrupt it instead. We ended up pulling that patch using WSUS and everything returned to normal in a few days. The SCCM server took a little more work to correct.

Not having PowerShell when you need it can be very scary. It is my go-to tool to recover from most issues. It is so handy for forcing a WSUS check-in, a gpupdate, or an ipconfig /flushdns to resolve some issue.

I think patching fast works well in our environment because we have a good team that is flexible and quick to respond to these types of issues. We still get caught off guard from time to time, but we handle it well.

Monday, May 13, 2013

How fast do you deploy updates? If Microsoft released a run of the mill update today, how soon would you see it on your production systems?

I like to patch my systems quickly. Over time, I have gotten quicker and quicker at rolling updates out. WSUS was a great addition to Windows Server. Not only do I auto approve the important updates, I also auto approve just about everything else. I found myself blindly approving them twice a year anyway. That tends to create a monster patch. The problem with monster patches is that people notice them. The login takes longer, so you tell them it's the patches. Then if something like a hard drive goes out, people blame the patches.

I still hold off on service packs, new versions of IE, and feature packs, all for good reason. I have only been caught off guard a few times because of it. So I have everything else on auto approve.

One issue that I did have for a while was patches showing up later in the week than I expected. I would be ready for patch Tuesday expecting that things would be patched Wednesday morning and they were not. It felt like most of our patches hit on Thursday instead. So I started to look into it.

Our machines patch at 3:00 am if they are powered on. I tried to get WSUS to sync just before 3:00 am so things would be ready to go when the computers went to update. It sounded good in theory, but that is not how the client updates. I found out that the computer checks in with WSUS roughly once a day by default (the detection frequency is 22 hours unless you change it by policy). If WSUS was not pulling updates until 3:00 am, then everything was really updating about a day behind.

So now I knew why my updates felt a whole day behind. I increased the number of times WSUS would pull updates from Microsoft to 3 a day and let it run for a long time. My WSUS server was checking for updates at 11:00 am, 7:00 pm, and 3:00 am Central. This way I was catching any other updates that showed up at odd times. I would get a few more machines and servers updated a day ahead, but the bulk of them were still 24 hours behind.

There is a very subtle detail here that I overlooked for the longest time. I had no idea what time Microsoft actually released updates on Tuesdays. I ran with this schedule for a very long time. Then one day I was reading up on an important IE patch that everyone was rushing to load and I saw the expected release time. It was 10:00 am PST. Seeing this time reminded me to check my update schedule.

Sure enough, I had it in my mind that they were released in the evening. I could see my logs showing updates getting pulled at 7:00 pm every patch Tuesday. With 10:00 am PST being 12:00 pm Central, it clicked with me where my issue was. I moved the early 11:00 am sync to 1:00 pm and everything started updating right on schedule. All my servers and workstations were updating as expected with a very clear schedule.
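The arithmetic is easy to check with a few lines of PowerShell. This is just a sketch: the function name is made up, times are plain hh:mm strings, and the Pacific to Central conversion is hard-coded as a two hour offset rather than using real time zone data.

```powershell
# Sketch only: fixed two-hour Pacific-to-Central offset, times as hh:mm strings.
function Get-FirstSyncAfter {
    param([string] $ReleaseTime, [string[]] $SyncTimes)
    $release = [TimeSpan]::Parse($ReleaseTime)
    $day = [TimeSpan]::FromHours(24)
    $SyncTimes |
        ForEach-Object {
            $sync = [TimeSpan]::Parse($_)
            # A sync earlier in the day than the release waits for tomorrow.
            if ($sync -lt $release) { $sync = $sync + $day }
            [pscustomobject]@{
                Sync              = $_
                HoursAfterRelease = ($sync - $release).TotalHours
            }
        } |
        Sort-Object HoursAfterRelease |
        Select-Object -First 1
}

# 10:00 Pacific plus two hours is 12:00 Central.
$releaseCentral = ([TimeSpan]::Parse('10:00') + [TimeSpan]::FromHours(2)).ToString('hh\:mm')

Get-FirstSyncAfter -ReleaseTime $releaseCentral -SyncTimes '11:00', '19:00', '03:00'
Get-FirstSyncAfter -ReleaseTime $releaseCentral -SyncTimes '13:00', '19:00', '03:00'
```

With the old schedule the first sync after a noon release is the 7:00 pm one, seven hours later. With the 1:00 pm sync the gap drops to one hour.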

I also started scheduling Wake-on-LAN to power up our workstations and combined it with a check for updates event. So now all my computers are getting updated as fast as reasonably possible and I know exactly when to expect issues.
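For the Wake-on-LAN part, the magic packet is simple enough to build by hand: 6 bytes of 0xFF followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. The sketch below shows that idea. The function names are made up, the MAC address is a placeholder, and port 9 is just a common convention, so treat all of those as assumptions.

```powershell
# Sketch only: function names are illustrative and the MAC is a placeholder.

# Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC 16 times.
function New-MagicPacket {
    param([string] $MacAddress)
    $mac = [byte[]] ($MacAddress -split '[:-]' |
        ForEach-Object { [Convert]::ToByte($_, 16) })
    if ($mac.Length -ne 6) {
        throw "Expected a 6-byte MAC address, got '$MacAddress'."
    }
    $packet = New-Object byte[] 102
    for ($i = 0; $i -lt 6; $i++) { $packet[$i] = 0xFF }
    for ($i = 0; $i -lt 16; $i++) {
        [Array]::Copy($mac, 0, $packet, 6 + ($i * 6), 6)
    }
    return ,$packet
}

# Broadcast the packet on the local subnet over UDP.
function Send-MagicPacket {
    param([string] $MacAddress, [int] $Port = 9)
    $packet = New-MagicPacket $MacAddress
    $udp = New-Object System.Net.Sockets.UdpClient
    try {
        $udp.EnableBroadcast = $true
        $udp.Connect([System.Net.IPAddress]::Broadcast, $Port)
        [void] $udp.Send($packet, $packet.Length)
    } finally {
        $udp.Close()
    }
}

# Usage (placeholder MAC address):
# Send-MagicPacket '00-11-22-33-44-55'
```

Run from a scheduled task shortly before the patch window, something like this gives the workstations time to boot before they check for updates.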

Tuesday, April 23, 2013

Our reporting needs have outgrown our existing tools. Actually, that's not true. We have all the right tools but are not using them as well as we could be. It all starts with our data. Right now it all sits in our vendor's schema. That works well for the transactional nature of the application, but not so much for reporting.

We have done a lot with what we have. Every night, we take the most recent database backup and load it onto a second server that is used for reporting. I take about a dozen of our core queries and dump them to tables for use the next day. We do the basics like indexes and primary keys. Our issue is that these tables are designed for specific reports. As the demands and needs of the reports change, we put in a good deal of time reworking the queries.

We started building our reports with Reporting Services and have not yet expanded our use of the other tools that SQL Server has to offer. In the meantime, I have gotten more involved in the SQL community, attending user groups, SQL Saturdays, and other Microsoft tech events. I have been introduced to a lot of features and ideas that I was previously unaware of. I think it's time we built a data warehouse.

I don't think our dataset is large enough for me to truly call what I am going to make a data warehouse. My database sits at 30 some gig in size. I also have a huge maintenance window. The core activity of our business ends by 5:00 pm so I have all night to process whatever I want. So my ETL process can process my entire dataset every time. In the beginning anyway. I'll deal with slowly changing dimensions later.

I want to build a star schema for my data and take advantage of Analysis Services. I want to be able to expose my data to PowerPivot and PowerView. I see a lot of power in these tools and there is no better way to learn than to jump into it. Even if I can't get my user base to use these tools, it will help me parse our data and they will still benefit.

Friday, April 19, 2013

I enabled AppLocker in audit mode about 3 months ago for all of our workstations. I spent about 2 weeks checking the logs and adding rules. I put it on the back burner to take care of some other things and almost forgot about it. I ran those scripts I posted previously to check up on my workstations and things look fairly clean. Here are a few things that stand out to me.

There are a handful of things that run out of the user's profile and ProgramData that I need to be aware of. I see a Citrix and a WebEx client pop up on a few machines. Spotify also jumps out in the list. I didn't realize how many of our users used that. I also see a few Java updates being run from the temp internet files folder. Nothing too crazy here that would have impacted much. I expect it would have been a handful of panic calls from people who could not get some web conferences to work.

I did find a custom app that we wrote sitting on some desktops that would have broken. That would have been a big deal. I think I will just sign those apps and place them in the Program Files folder. I can use these logs to track down these users. This app is just an exe, so there is no installer or registry thumbprints to look for.

The last group of findings was just a handful of special machines that had something installed to a folder on the root of the C: drive. I could guess exactly where these machines were based on the names of those folders. I will handle these case by case. I am tempted to just give them local exceptions instead of baking something into the main policy.

Now that we are aware of these things, we can do things right going forward. Primarily, installing everything into Program Files would help the most. I plan on letting this go for another several months to see what else I pick up.

Tuesday, January 15, 2013

I ran AppLocker in audit mode for a few days on a small number of computers. So all that activity is collecting in the "Microsoft-Windows-AppLocker/EXE and DLL" audit log. It creates an event every time an application starts, indicating if it was allowed, blocked, or would have been blocked. That last event type is 8003 and that’s the one I care about.

The PowerShell command to view this log entry is this:

Get-WinEvent -LogName "Microsoft-Windows-AppLocker/EXE and DLL" |
    Where-Object { $_.Id -eq 8003 } |
    Format-Table Message

This will tell me every application that would have failed. I can either make a new rule or ignore it knowing that it would be blocked in the future. I can combine this with PowerShell remoting to check the event log on every computer I manage.

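# Query every domain computer; each remote query runs as a background job.
Get-QADComputer | ForEach-Object {
    Invoke-Command -ComputerName $_.Name -AsJob -ScriptBlock {
        # Keep the output clean on machines with no matching events.
        $ErrorActionPreference = "SilentlyContinue"
        Get-WinEvent -LogName "Microsoft-Windows-AppLocker/EXE and DLL" |
            Where-Object { $_.Id -eq 8003 } |
            Format-Table Message
    }
}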

I use the QuestAD tools to get every computer in the domain and request the log event 8003 from the correct event log. The other stuff just cleans up the output. Give it 60 seconds to finish or time out (for computers that are not powered up). Then run these commands for the results.

This will filter out results we don’t care about and then output all the logs on all the other systems. If you have pages of data, you can process them one computer at a time. This walks the results from the top down.

(Get-Job | Where-Object { $_.HasMoreData -eq $true })[0] | Receive-Job

When it outputs the results, it will reset the HasMoreData flag on that job. So if you see some output and you want to know what job it was from, run Get-Job. In the middle of the list, you will see the HasMoreData flag flip from false to true. The bottom one with a false value is the last computer you pulled output from. This can be very handy when setting up rules.

If you have the admin share open to administrators, you can open Explorer to \\computername\c$ and find files on it. You can also use that remote admin share in the wizard to add new rules.

I saw Google Chrome show up in a user’s profile on a remote computer. I was able to point the AppLocker rule wizard to \\computername\c$\users\john\appdata\.... and it added the needed rules. I was able to add 4-5 needed applications. I also saw some spyware on a few computers that I was able to clean up.

Now that we added some new rules, I wanted to clear the logs so they are cleaner next time. Here is the command to do that.

wevtutil.exe cl "Microsoft-Windows-AppLocker/EXE and DLL"

I plan on repeating this ritual every few days to identify new rules. Eventually I will no longer be adding rules and can look at enforcing them without much risk.

Have you ever ...

Have you ever had a problem that is hard to search on? Some key words generate too many unrelated results. Other problems may be so basic that it’s just expected everyone will know it. I often run into problems that I expect others to have but nobody talks about it or just accepts that’s the way it is.

When I run into something that feels like it was harder to find than it should be, I will post it here. I don't have a set theme and many of my solutions are unrelated. But I hope you are able to find the solution to your problem within the pages of my blog.