Security Hang-Ups

Posted by Dave Engberg on 23 Sep 2011

Scenario: In the last week or two, lots of people noticed sporadic errors when they tried to synchronize with Evernote or access our web site. The errors would disappear if they manually forced another sync or reload. The web site worked fine after that initial hiccup.

Debugging: The symptoms pointed to a problem with establishing new HTTPS connections, since subsequent requests (over keep-alive connections) worked fine. We were able to reproduce the problem by just hitting our site with ‘curl’ a few times:

curl -v -i -s https://www.evernote.com/robots.txt

The majority of requests worked fine, but some percentage would fail with an “SSL protocol error”:

This failure deep within the SSL handshake was particularly confusing. It seemed like our HTTPS server was dying in the middle of the SSL negotiation.
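A reproduction like the curl one above can be turned into a rough failure-rate measurement. The following is a minimal Python sketch, not the tooling we actually used: the names `tls_handshake_ok` and `failure_rate` are hypothetical, and it assumes that a failed or interrupted handshake surfaces as an `OSError` (which includes `ssl.SSLError`).

```python
import socket
import ssl
from typing import Callable


def tls_handshake_ok(host: str, port: int = 443, timeout: float = 10.0) -> bool:
    """Open a brand-new TCP connection and complete one TLS handshake.

    Returns True if the handshake succeeds, False on any connection or
    TLS error. No HTTP request is sent; only the handshake is exercised.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:  # ssl.SSLError and socket timeouts are OSError subclasses
        return False


def failure_rate(attempt: Callable[[], bool], trials: int) -> float:
    """Run `attempt` `trials` times and return the fraction that failed."""
    failures = sum(1 for _ in range(trials) if not attempt())
    return failures / trials
```

For example, `failure_rate(lambda: tls_handshake_ok("www.evernote.com"), 200)` would report the fraction of 200 fresh handshakes that failed. Running it at different times of day is one way to see whether the rate tracks traffic.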

As I mentioned in our Architectural Overview, we offload our SSL processing onto a pair of A10 AX 2500 load balancers that have performed well since we installed them in January. But this error made us worried that there may be some sort of deep cryptographic error within that hardware.

So we wasted a couple of days trying to fix the problem by rebooting (and cold booting) the boxes on the theory that this would “shake loose” the cryptographic errors. This caused the problem to go away for a while, but then it would be back again the next morning during our peak traffic hour (6am-7am Pacific). Our theories about the possible causes got more and more complicated, and our proposals for fixes got more and more baroque (and expensive).

Finally, we realized that the failure rate seemed to be the highest during our peak traffic times, and that the problem might actually just be a simpler capacity issue. That is, maybe we’d grown enough to exceed the capacity of this hardware. Unfortunately, we didn’t see anything in the UI for our balancer that indicated we were at any sort of capacity limit.

The CPU usage was low, and the “SSL Conns/sec” was around 14% of the rated SSL CPS for this hardware. We contacted our support representatives from A10 and they scheduled a call with several of their experts to help track down the problem or get us new hardware if needed.

They told us immediately that we were hitting a limit. It wasn’t the “new connections per second” limit, though, but the “total open SSL connections” limit of 250,000.

We were only receiving 1105 new SSL connections per second, and we were only processing 2500 HTTP requests per second over those connections, but we were holding a very large pool of idle connections. This was due to an “idle connection timeout” parameter of “10 minutes,” which meant we’d keep the HTTPS socket open for 600 seconds after the last response to the client before we’d close it.
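A back-of-the-envelope check shows why a modest connection rate can still exhaust the open-connection limit. The sketch below applies Little's law (average number in the system = arrival rate × average time in the system) to the figures above. It rests on a simplifying assumption that is not from our measurements: every connection stays open for roughly the idle timeout. In practice many clients close sooner, so treat the results as rough upper estimates. The function names are hypothetical, and the 120-second case is included only for comparison.

```python
# Rough capacity estimate using Little's law: L = arrival_rate * time_in_system.
# Assumption (illustrative only): each connection stays open for about the
# idle timeout. Real connections may close sooner or stay busy longer.

OPEN_CONNECTION_LIMIT = 250_000  # "total open SSL connections" limit
NEW_CONNS_PER_SEC = 1105         # new SSL connections per second


def estimated_open_connections(new_conns_per_sec: float, lifetime_sec: float) -> float:
    """Steady-state number of open connections for a given average lifetime."""
    return new_conns_per_sec * lifetime_sec


def break_even_lifetime(limit: float, new_conns_per_sec: float) -> float:
    """Average lifetime in seconds at which the open-connection limit is reached."""
    return limit / new_conns_per_sec


for timeout_sec in (600, 120):
    estimate = estimated_open_connections(NEW_CONNS_PER_SEC, timeout_sec)
    status = "over" if estimate > OPEN_CONNECTION_LIMIT else "under"
    print(f"{timeout_sec}s timeout: ~{estimate:,.0f} open connections ({status} the limit)")

break_even = break_even_lifetime(OPEN_CONNECTION_LIMIT, NEW_CONNS_PER_SEC)
print(f"break-even average lifetime: ~{break_even:.0f}s")
```

Under that assumption, a 600-second lifetime implies about 663,000 open connections, far past the 250,000 limit, and the limit is reached once the average lifetime exceeds roughly 226 seconds.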

In retrospect, this timeout setting was a bit excessive. Empirical testing with openssl shows that this is several times longer than the idle SSL connection timeouts used by other big Internet web services.
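One way to run this kind of comparison is to open a connection, send a single keep-alive request, and time how long the server waits before hanging up. The sketch below does that with Python's standard `ssl` module rather than the openssl command line we used, so it is an illustration of the idea and not our actual test. The function names are hypothetical, and it assumes the server answers a `HEAD /` request and then closes the connection itself once its idle timeout expires.

```python
import socket
import ssl
import time
from typing import Optional


def time_until_close(sock: socket.socket, max_wait: float) -> Optional[float]:
    """Seconds until the peer closes `sock`.

    Returns None if a single read waits longer than `max_wait` seconds
    without receiving data or a close.
    """
    start = time.monotonic()
    sock.settimeout(max_wait)  # applies to each individual read
    try:
        while sock.recv(4096):
            pass  # discard any response bytes; an empty read means "closed"
    except socket.timeout:
        return None
    except OSError:
        pass  # a reset or an unclean TLS shutdown also counts as a close
    return time.monotonic() - start


def measure_idle_timeout(host: str, port: int = 443,
                         max_wait: float = 900.0) -> Optional[float]:
    """Send one keep-alive request over TLS, then time the server's hang-up.

    The result includes the short time spent receiving the response.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10.0) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            request = (f"HEAD / HTTP/1.1\r\nHost: {host}\r\n"
                       "Connection: keep-alive\r\n\r\n")
            tls.sendall(request.encode("ascii"))
            return time_until_close(tls, max_wait)
```

Calling `measure_idle_timeout("www.example.com")` returns the observed idle time in seconds, or `None` if the server keeps the connection open past `max_wait`.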

Now, we’ve lowered the idle connection timeout to 2 minutes. This means that we’re closing idle connections 5x faster, and the number of open connections has dropped dramatically as a result. As I type this:

The moral of the story? Sometimes, it’s easy to overlook the simple explanation for a problem when the symptoms are unusual. Our apologies to folks who were inconvenienced by the sync errors while we sorted out these problems.

Thank you for sharing this experience. I am very glad that you managed to get to the bottom of the problem. Also, by explaining what went wrong and how you resolved the problem, you deserved a huge confidence boost in my eyes.

Glad this has been fixed. As a heavy user of the service I was seeing a lot of “Communication Errors” and “Sync Failed”. Looks like you also have a lot of room to scale up when needed with the A10 devices.

One good thing that came out of this. I now have a much better understanding of how your sync model works.