.NET, OSS & a little more

Browsers have implemented all sorts of great new security measures to ensure that certificates are actually valid. So, using a self-signed certificate today is more difficult than it used to be. Also, IIS on Win8/Win10 gained support for using a Central Certificate Store. So, here are some scripts that:

Create a Self-Signed Cert

Creates a self-signed cert with a DNS Name (browsers don’t like it when the Subject Alternative Name doesn’t list the DNS Name).

Reimports the certs into the machine's Trusted Root Authority (needed for browsers to verify the cert is trusted)

Adds the 443/SSL binding to the site (if it exists) in IIS
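The steps above can be condensed into a PowerShell sketch like this (the site name "MySite", DNS name "dev.local", and C:\SSL path are placeholders for your own values):

```powershell
# 1. Create a self-signed cert whose Subject Alternative Name lists the DNS name
$cert = New-SelfSignedCertificate -DnsName "dev.local" `
    -CertStoreLocation "Cert:\LocalMachine\My"

# 2. Export it, then re-import it into the machine's Trusted Root Authority
#    so browsers will treat it as trusted
$cerPath = "C:\SSL\dev.local.cer"
Export-Certificate -Cert $cert -FilePath $cerPath | Out-Null
Import-Certificate -FilePath $cerPath -CertStoreLocation "Cert:\LocalMachine\Root" | Out-Null

# 3. Add the 443/SSL binding to the site (if it exists) and attach the cert
Import-Module WebAdministration
if (Get-Website -Name "MySite") {
    New-WebBinding -Name "MySite" -Protocol https -Port 443
    (Get-WebBinding -Name "MySite" -Protocol https).AddSslCertificate($cert.Thumbprint, "My")
}
```

Run it from an elevated prompt, since both the LocalMachine cert stores and IIS configuration require admin rights.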

Re-Add Cert to Trusted Root Authority

Before Win10, Microsoft implemented a background task that periodically checks the certs installed in your Machine Trusted Root Authority and removes any that are self-signed. So, this script re-installs them.

It will look through the shared SSL folder created in the previous script and add any certs back to the local Machine Trusted Root Authority that are missing.
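A minimal sketch of that re-import loop, assuming the shared SSL folder from the previous script is C:\SSL (the path is a placeholder):

```powershell
# Re-add any exported certs that are missing from the machine's Trusted Root Authority
$sslFolder = "C:\SSL"
$rootThumbprints = (Get-ChildItem "Cert:\LocalMachine\Root").Thumbprint

Get-ChildItem -Path $sslFolder -Filter *.cer | ForEach-Object {
    # Load the exported cert so we can compare thumbprints
    $cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2 $_.FullName
    if ($rootThumbprints -notcontains $cert.Thumbprint) {
        # The background task removed it; put it back
        Import-Certificate -FilePath $_.FullName -CertStoreLocation "Cert:\LocalMachine\Root" | Out-Null
    }
}
```

Schedule it (e.g. as a daily task) and the certs quietly come back whenever Windows cleans them out.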

Back in 2015, we started using Win2012 R2 servers and within a day of Production usage we started seeing Out of Memory errors on the servers. Looking at the Task Manager, we could easily see that a massive amount of Kernel Memory was being used. But why?

Using some forum posts, SysInternals, and I think a Scott Hanselman blog entry, we were able to use PoolMon.exe to see that the pool tag using all the Kernel Memory was Wnf. We had no idea what it was and went down some rabbit holes before finding this forum post.

Microsoft Support would later tell us the problem had something to do with a design change to Remote Registry and how it deals with going idle, and another design change in Windows Server 2012 R2 about how it chooses which services to make idle. Anyway, the fix was easy to implement (just a real pain to find):

If you want the service to not stop when idle, you can set this registry key:

Key: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\RemoteRegistry
Name: DisableIdleStop
Type: REG_DWORD
Data: 1
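One way to apply that key from an elevated PowerShell prompt (a sketch; a restart of the Remote Registry service may be needed for it to take effect):

```powershell
# Stop the Remote Registry service from idling out (run elevated)
$key = "HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\RemoteRegistry"
New-ItemProperty -Path $key -Name "DisableIdleStop" -PropertyType DWord -Value 1 -Force | Out-Null

# Verify the value was written
(Get-ItemProperty -Path $key).DisableIdleStop
```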

So, I had a bad week. I crashed a multi-server, redundant, highly available Exchange Server setup using the IIS healthchecks of a single website in Dev and Test (not even Prod).

How did I do this? Well …

Start with a website that is only in Dev & Test; and hasn’t moved to Prod.

All of the database objects are only in Dev & Test.

Do a database refresh from Prod and overlay Dev & Test.

The database refresh takes 2 hours, but the next 17 hours are a window where the Dev & Test environments don't have their database objects available, because those objects weren't part of the refresh.

So, now you have 19 hours of a single website being unable to properly make a database call.

Why wasn’t anyone notified? Well, that’s all on me. It was the Dev & Test version of the website, and I was ignoring those error messages (those many, many error messages).

Those error messages were from ELMAH. If you use ASP.NET and don't know ELMAH, then please learn about it; it's amazing!

In this case, I was using ELMAH with WebAPI, so I was using the Elmah.Contrib.WebAPI package. I’m not singling them out as a problem, I just want to spread the word that WebAPI applications need to use this package to get error reporting.

Finally, you have the IIS WebFarm Healthcheck system.

The IIS WebFarm healthcheck system is meant to help a WebFarm route requests to healthy application servers behind a proxy. If a single server is having a problem, then requests are no longer routed to it and only the healthy servers are sent requests to process. It’s a really good idea.

Unfortunately, … (You know what? … I’ll get back to this below)

Our proxy servers have around 215 web app pools.

The way IIS healthchecks are implemented, every one of those web app pools will run the healthchecks for every web farm. So, this one single application gets 215 healthchecks every 30 seconds (the default healthcheck interval).

Well … each failed healthcheck triggered an ELMAH error email (the first email type), and those filled up my inbox. The "Inbox is Full" responses (the second email type) went back to the sender … which was a fake email address I used for the website (it's never supposed to be responded to).

Unfortunately, that fake email address has the same domain as my account (@place.com), which sent all the responses back to the same Exchange server.

Those “Inbox is Full” error messages then triggered Exchange to send back messages that said “This email address doesn’t exist” (the third email type).

I’m not exactly sure how this happened, but there were a number of retry attempts on the first email type, which again re-triggered the second and third email types. I call the retries the fourth email type.

Once all of the error messages are factored into the equation, the 1.5 million healthcheck emails generated 4.5 million healthcheck and SMTP error emails.

Way before we hit the 4.5 million mark, our Exchange server filled up …

Its database

The disk on the actual Exchange servers

So, I don’t really understand Exchange too well. I’m trying to understand this diagram a little better. One thing that continues to puzzle me is why the Exchange server sent out error emails to “itself”. (My email address is my.name@place.com and the ELMAH emails were from some.website@place.com … so the error emails were sent to @place.com, which that Exchange server owns). Or does it …

The Client Access Server queued the error email and determined which Exchange server to process it with (let’s say Exchange1)

Exchange1 found that the mailbox was full and, per SMTP, it needed to send an “Inbox is full” error message. Exchange1 looked up the MX record for where to send it and found that it needed to send it to the Email Firewall. It sent it …

The Email Firewall then found that some.website@place.com wasn’t an actual address and maybe sent it to Exchange2 for processing?

Exchange2 found it was a fake address and sent back a “This address doesn’t exist email”, which went back to the Email Firewall.

The Email Firewall forwarded the email or dropped it?

And, somewhere in all this mess, the emails that couldn’t be delivered to my real address my.name@place.com because my “Inbox was full” got put into a retry queue … in case my inbox cleared up. And, this helped generate more “Inbox is full” and “This address doesn’t exist” emails.

Sidenote: I said above, “One thing that continues to puzzle me is why the Exchange server sent out error emails to ‘itself’.”

I kinda get it. Exchange does an MX lookup for @place.com and finds the Email Firewall as the IP address, which isn’t itself. But …

Shouldn’t Exchange know that it owns @place.com? Why does it need to send the error email?

So … the biggest problem in this whole equation is me. I knew that IIS had this healthcheck problem beforehand. And, I had even created a support ticket with Microsoft to get it fixed (which they say has been escalated to the Product Group … but nothing has happened for months).

I knew of the problem, I implemented ELMAH, and I completely forgot that the database refresh would wipe out the db objects which the applications would need.

Of course, we/I’ve now gone about implementing fixes, but I want to dig into this IIS Healthcheck issue a little more. Here’s how it works.

IIS has a feature called ARR (Application Request Routing)

It’s used all the time in Azure. You may have set up a Web App, which requires an “App Service”. The App Service is actually a proxy server that sits in front of your Web App. The proxy server uses ARR to route the requests to your Web App. But, in Azure they literally create a single proxy server for your single web application server. If you want to scale up and “move the slider”, more application servers are created behind the proxy. BUT, in Azure, the number of Web Apps that can sit behind an App Service/Proxy Server is very limited (less than 5). <rant>Nowhere in the IIS documentation do they tell you to limit yourself to 5 applications; and the “/Build conference” videos from the IIS team make you believe that IIS is meant to handle hundreds of websites.</rant>

We use ARR to route requests for all our custom made websites (~215) to the application servers behind our proxy.

ARR uses webfarms to determine where to route requests. The purpose of the webfarms is to have multiple backend Application Servers, which handle load balancing.

The webfarms have a Healthcheck feature, which allows the web farms to check if the application servers behind the proxy are Healthy. If one of the application servers isn’t healthy then it’s taken out of the pool until it’s healthy again.
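For reference, the healthcheck settings live on the webfarm in applicationHost.config, roughly like this (the farm name, server addresses, and URL are made up, and the attribute names are from my reading of the ARR schema, so double-check them against your own config):

```xml
<webFarm name="MyAppFarm" enabled="true">
  <server address="appserver1" enabled="true" />
  <server address="appserver2" enabled="true" />
  <applicationRequestRouting>
    <!-- Probe each server every 30 seconds; a server is healthy
         only if the response body matches "Healthy" -->
    <healthCheck url="http://MyAppFarm/healthcheck"
                 interval="00:00:30"
                 timeout="00:00:05"
                 responseMatch="Healthy" />
  </applicationRequestRouting>
</webFarm>
```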

I really like this feature and it makes a lot of sense.

The BIG PROBLEM with this setup is that the WEBFARMS AREN’T DIRECTLY LINKED TO APPLICATION POOLS.

So, every application pool that runs on the frontend proxy server, loads the entire list of webfarms into memory.

If any of those webfarms happens to have a healthcheck url, then that application pool will consider itself the responsible party to check that healthcheck url.

So, if a healthcheck url has a healthcheck interval of 30 seconds …

And a proxy server has 215 application pools on it; then that is 215 healthchecks every 30 seconds.

I think the design of the Healthcheck feature is great. But, the IMPLEMENTATION is flawed. HEALTHCHECKS ARE NOT DESIGNED THE WAY THEY ARE IMPLEMENTED.

Of course I’ve worked on other ways to prevent this problem in the future. But, IIS NEEDS TO FIX THE WAY HEALTHCHECKS ARE IMPLEMENTED.

I get bothered when people complain without a solution, so here’s the solution I propose:

Create a new xmlnode in the <webfarm> section of applicationHost.config which directly links webfarms to application pools.

Example (sorry, I’m having a lot of problems getting code snippets to work in this version of my LiveWriter)
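Here's roughly what I have in mind. The <applicationPools> node is hypothetical (it doesn't exist in today's schema), but it would tell ARR that only the listed application pool loads this webfarm and runs its healthchecks, instead of all 215 pools doing so:

```xml
<webFarm name="MyAppFarm" enabled="true">
  <server address="appserver1" enabled="true" />
  <!-- Hypothetical node: only these app pools own this webfarm's healthchecks -->
  <applicationPools>
    <add name="MyAppFarm_Pool" />
  </applicationPools>
  <applicationRequestRouting>
    <healthCheck url="http://MyAppFarm/healthcheck" interval="00:00:30" />
  </applicationRequestRouting>
</webFarm>
```

With a link like that, one healthcheck every 30 seconds would replace 215 of them.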

I had a long-held belief that health checks should just be pings. “Is the website up?” And for years, that was right. Not anymore.

Recently, a developer asked me if he should use health checks to ensure that the Entity Framework cache stays in memory. It took me a while to disassociate health checks from pings, but he was right. YES, you should use health checks to ensure the health of your site.

You should use health checks to do this:

Ensure your site is up and running (ping)

Ensure all cached values are available and, if possible, at the latest value.

Ensure Entity Framework’s cache is hit before your first user

EF is a total resource hog and a complete slowdown on first hit

Same thing for WCF

Cache any application specific values needed before first hit

Health checks should not be pings. They should check the entire health of the site and its responsiveness: the cache, the database connectivity, and everything else that makes a website work. It’s a “health check”, not a ping.

I’m very new to all this technology, so please take this with a grain of salt. The reason I’m writing it is that I couldn’t find another guide with an end-to-end setup of Tyk in Docker on Windows 10.

Tyk is an API Gateway product that can be used to help manage a centralized location for many services/microservices. It is a product built on top of the nginx web server. And, nginx is really only supported as a “server” product on *nix-based systems; their Windows build is considered a beta.

So, there are already some good guides for each of the next steps; I’m just gonna pull them all together and add one extra piece at the end.

Install Docker

There are a couple ways to get around the limitation of nginx only being “production ready on *nix”, but I chose to try out Tyk on Docker. Docker is the multi-platform container host that has created a lot of buzz within the cloud space. But, it also seems pretty awesome for setting up small containers on your local machine too.

In Step 2. Get the quick start compose files, you’ll need to git clone the files to a folder under your C:\Users\XXXX folder. For me, Docker had a permissions restriction that only allowed containers to mount volumes from folders under my user folder. (So, that could be interesting if you run a container on a server under a service account.)

The silver lining about this set of containers is that they only need to use config files from your local drive. So, it’s not like your C:\Users folder is going to store a database.

In Step 4. Bootstrap your dashboard and portal, if you have bash available to you, I would suggest using it to run ./setup.sh. I haven’t installed the Win10 Anniversary Update, Git Bash, or Cygwin, so I didn’t have bash available to run setup.sh.

However, I do feel somewhat comfortable in PowerShell, and the setup.sh script didn’t look too long. Below is the PowerShell conversion, which should be saved in the same directory as setup.sh; run .\setup.ps1 from the PowerShell ISE with the arguments that you want.

After that, I had a running Tyk API Gateway.

Other Thoughts

Since this was all new technology, I ran into a lot of errors and read through a lot of issue/forum posts. Which makes me think this might not be the best idea for a production setup. If you’re able to run Linux servers within your production environment, I would strongly suggest that.

Because I made so many mistakes, I got used to these three commands, which really helped recreate the environment whenever I messed things up. I hope this helps.

ERROR: Get-WebSite : Could not load file or assembly 'Microsoft.IIS.PowerShell.Framework' or one of its dependencies. The system cannot find the file specified.

It’s a really tricky error because it’s inconsistent. But, there is a workaround that will prevent the error from giving you too much trouble. From the community troubleshooting on this, the problem seems to occur on the first call that uses the WebAdministration module. If you wrap that call in a try/catch, then subsequent calls will work correctly.
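A sketch of that workaround (the site name is a placeholder):

```powershell
Import-Module WebAdministration

# The first call that touches WebAdministration sometimes fails with the
# 'Microsoft.IIS.PowerShell.Framework' load error; swallow it once so the
# module finishes initializing, then subsequent calls work.
try {
    Get-Website | Out-Null
} catch {
    # Expected occasional first-call failure; safe to ignore.
}

# This call now succeeds reliably
$site = Get-Website -Name "MySite"
```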