Performance Optimization Options – Example: Wikileaks


I usually write my articles here in German and will continue to do so in future entries. For this one I decided to switch to English, because the topic may be important or interesting to developers all over the world.

This article has no political agenda, but the example I use currently has huge political potential and is very much in the public eye these days, which makes it a good choice for a small "lesson" on performance optimization.

Let's start with how the performance of wikileaks.org is measured:
I have set up measurement agents that run on real users' PCs spread across Germany (http://www.gomezpeerzone.com/; as an employee of the company I can do so for testing and training purposes). Every hour, 10 of these agents, randomly selected from the region "Germany" with a downstream connection faster than 512 kbit/s, measure the performance and availability of wikileaks.org using a Firefox 3.6 browser. That adds up to approx. 240 tests per day on wikileaks.org.

The graph above shows the performance, or more precisely the response time, over the past 5 days, during which the network described above performed 1290 tests on the page. The load time is clearly inconsistent: while it is often below 2 seconds, it rises (unfortunately often enough) to 8 seconds and above. (Every dot represents 10 measurements, i.e. one hour.)

The average load time across the 1290 tests is pretty good at 3.3 seconds. The availability, however, is only 95%. A test counts as unavailable here when the page could not be loaded completely within 60 seconds, so every 20th call of the web page ran into an issue.
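The two numbers above can be derived from raw test results roughly as follows. This is a minimal sketch with made-up sample data; the names `Measurement` and `summarize` are hypothetical, and averaging only over successful tests is my assumption, not necessarily how the measurement network calculates it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

TIMEOUT_SECONDS = 60.0  # a test fails if the page is not complete within this time


@dataclass
class Measurement:
    load_time: Optional[float]  # seconds; None if the page never finished loading


def is_successful(result: Measurement) -> bool:
    """A test counts as available only if the full page loaded within the timeout."""
    return result.load_time is not None and result.load_time <= TIMEOUT_SECONDS


def summarize(results: Sequence[Measurement]) -> Tuple[float, float]:
    """Return (average load time of successful tests, availability in percent)."""
    successful = [r.load_time for r in results if is_successful(r)]
    availability = 100.0 * len(successful) / len(results)
    # Assumption: failed tests are excluded from the average.
    average = sum(successful) / len(successful) if successful else float("nan")
    return average, availability


# Made-up sample: 19 good tests and 1 failure, i.e. every 20th call fails.
sample = [Measurement(2.0)] * 10 + [Measurement(4.0)] * 9 + [Measurement(None)]
average, availability = summarize(sample)
print(f"average {average:.1f} s, availability {availability:.0f}%")
```

With one failure in 20 tests the availability comes out at 95%, which is the "every 20th call" reading used above.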

Drilling down into the individual network components of each request shows where the performance bottleneck lies:

(TCP component chart for all tests performed on wikileaks.org)

Explanation for this chart:
– DNS Lookup Time: Time for the client's DNS server to translate the host name into an IP address to connect to
– Initial Connection Time: Time for the client to complete the TCP handshake (SYN, SYN/ACK, ACK) with the server, or more simply: to get a socket on the server through which it can request data
– SSL Encryption: Time to establish the encryption based on an SSL certificate (does not apply in this example)
– 1st Byte Time: Time it takes to deliver the first byte of what the client requested
– Content Download Time: Time it takes to receive all content (text, binaries) from the server
– Total Time: Time to complete all requests, i.e. the time to download the complete page with all objects included
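To make these components more tangible, here is a rough sketch of how they could be timed for a single object with plain Python sockets. It is not how the measurement agents work (they drive a real browser and load the whole page); the function name `measure_components` is hypothetical, and the request is a plain HTTP 1.0 request without SSL.

```python
import socket
import time


def measure_components(host: str, port: int = 80, path: str = "/",
                       timeout: float = 60.0) -> dict:
    """Time the phases of one plain HTTP 1.0 request (no SSL), in seconds."""
    t0 = time.perf_counter()
    # DNS lookup: translate the host name into an IP address.
    family, socktype, proto, _, address = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t_dns = time.perf_counter()

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        # Initial connection: the TCP handshake that yields a socket on the server.
        sock.connect(address)
        t_connect = time.perf_counter()

        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))

        # 1st byte: wait until the first chunk of the response arrives.
        chunk = sock.recv(4096)
        t_first_byte = time.perf_counter()

        # Content download: read until the server closes the connection.
        size = 0
        while chunk:
            size += len(chunk)
            chunk = sock.recv(4096)
        t_done = time.perf_counter()
    finally:
        sock.close()

    return {
        "dns": t_dns - t0,
        "connect": t_connect - t_dns,
        "first_byte": t_first_byte - t_connect,
        "download": t_done - t_first_byte,
        "total": t_done - t0,
        "bytes": size,
    }


# Example (needs network access):
# print(measure_components("wikileaks.org"))
```

If the server has no free socket, the time piles up in the "connect" value, or `connect` raises an error once the timeout is reached.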

The component chart shows that the main bottleneck is the connect time (yellow line). It seems that the agents often have to wait to get a socket on the server before they can place a request. Usually the browser retries a pending connection request at certain intervals: after 3, 9 and 21 seconds. If no socket can be obtained after 21 seconds either, the connection is not requested again; we call this "timing out". Possible reasons are (I will not go into too much detail or list every possible connection symptom):
– Server refused the connection
– Server reset the connection
– Socket connection times out
– Socket connection aborted
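The 3, 9 and 21 second marks are what you get from a retry timer that starts at 3 seconds and doubles after each attempt (3, 6, 12). The doubling is my assumption to explain the numbers, not something the measurements show. The sketch below reproduces these marks and maps Python's built-in socket exceptions to the four symptoms listed above; the function names are hypothetical.

```python
import socket
from itertools import accumulate


def retry_marks(first_wait: float = 3.0, attempts: int = 3) -> list:
    """Cumulative seconds at which a doubling retry timer fires."""
    waits = [first_wait * 2 ** i for i in range(attempts)]  # 3, 6, 12
    return list(accumulate(waits))


def classify(error: OSError) -> str:
    """Map a Python socket exception to one of the four symptoms."""
    if isinstance(error, ConnectionRefusedError):
        return "server refused the connection"
    if isinstance(error, ConnectionResetError):
        return "server reset the connection"
    if isinstance(error, ConnectionAbortedError):
        return "socket connection aborted"
    if isinstance(error, (socket.timeout, TimeoutError)):
        return "socket connection timed out"
    return "other connection error"


print(retry_marks())
print(classify(ConnectionRefusedError()))
```

A monitoring script could wrap its connection attempt in `try ... except OSError as error` and count the results of `classify(error)` to build an error chart like the one below.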

(Error chart for those tests which failed)

So the charts above tell us clearly that the page seems to have a connectivity issue. On September 4th we also see a peak in the DNS time (in the component chart), but so far this does not seem to be a constant problem.
Connectivity issues have mainly 3 causes (there are many more, but let's look only at the MAIN ones):
– A large amount of packet loss between client and server
– The server or firewall does not allow the specific client to connect
– The server cannot handle more requests because it is fully loaded and all sockets are occupied.

We can rule out packet loss, because all measurements come from a healthy and stable internet infrastructure, and the issue does not show up as a consistent line either.
In this wikileaks example it is most likely that the sockets are blocked or taken by other requests, whether due to a sustained hack attempt or because of heavy usage by ordinary visitors. Since wikileaks is currently all over the press and a group of people has announced that it wants to bring the page down, either could be the reason for this issue.

To avoid connectivity issues there is only one thing to consider:
How can we ourselves reduce the number of connections to the specific IP address? The options are:
– Using a third party to serve parts of the content (e.g. a CDN, cloud storage or mirrors)
– Reducing the number of connections that have to be established for each single page.

While the first solution is neither cheap nor easy to apply, the second one might be quick to adopt. For that we need to know how to reduce the number of connections in this case. So let's take action and look at what is requested from the server. The good thing: the measurement network shows us, for each single test, what the client requested.

(Waterfall chart for wikileaks.org main page)

We see that only a few objects are requested on the main page. With so few objects it is quite a challenge to reduce the number of connections any further. But it is possible, and it should be considered as a way to at least reduce the impact of the issue.

The first visible finding: the objects are delivered via HTTP 1.0, which here does not keep connections alive. The number of objects and the number of connections are therefore equal.
HTTP 1.0 is rarely an advantage, but here it might be a good choice. So it makes sense not to use HTTP 1.1 with persistent connections (where more than one object can be downloaded over one established connection). The reason: a single user cannot block many connections on the server (imagine one user with a bad internet connection downloading 100 files over 6 connections: those 6 connections are blocked for a long time). With HTTP 1.0 everyone gets a chance to find a "free" socket. For pages with a large number of objects, however, it should be reviewed carefully whether switching to HTTP 1.1 makes more sense, keeping in mind that HTTP 1.1 can cause issues especially in peak hours with slow users.
So let's simply say: non-persistent connections are, in general, a good choice for wikileaks from the performance perspective.
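The trade-off can be put into numbers with a very simplified model. It assumes that every object takes the same time to download and it ignores idle keep-alive timeouts; the function names, the default of 6 parallel connections and the 2 seconds per object are illustrative assumptions, not measured values.

```python
import math


def connections_per_page(objects: int, persistent: bool, max_parallel: int = 6) -> int:
    """Number of TCP connections one page view opens on the server."""
    if not persistent:
        # HTTP 1.0 without keep-alive: one short-lived connection per object.
        return objects
    # HTTP 1.1 with persistent connections: objects share a few long-lived ones.
    return min(objects, max_parallel)


def longest_hold(objects: int, seconds_per_object: float, persistent: bool,
                 max_parallel: int = 6) -> float:
    """Longest time a single socket stays occupied by one user, in seconds."""
    if not persistent:
        # The socket is released again after every single object.
        return seconds_per_object
    # Each persistent connection carries its share of the objects back to back.
    return math.ceil(objects / max_parallel) * seconds_per_object


# The slow user from the text: 100 files, assumed 2 seconds per file.
print(connections_per_page(100, persistent=True), longest_hold(100, 2.0, persistent=True))
print(connections_per_page(100, persistent=False), longest_hold(100, 2.0, persistent=False))
```

In this model the persistent variant opens far fewer connections (6 instead of 100) but keeps each of them occupied for 34 seconds in a row, while the HTTP 1.0 variant frees every socket after 2 seconds, so other visitors can grab one in between.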

Let's take a deeper look. We see one JavaScript file and one CSS file, and each of them forces the client to establish a connection to the server. Together they are just a bit over 7 KB in size. Inlining them into the HTML source (10 KB) might be a good option. We would reduce the number of connections per page view by 2 or, more impressively, by nearly 30%. (With nearly 30% fewer connections per page view, the same number of sockets can serve roughly 40% more people, and hack attempts need correspondingly more power to bring down the page.)
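A minimal sketch of what such an inlining step could look like, for example as part of a build or publishing script. The file names, the sample page and the number of images are made up (four images were chosen so that removing 2 of 7 connections matches the "nearly 30%"), and rewriting HTML with regular expressions is only good enough for simple, well-formed pages like this one.

```python
import re


def inline_assets(html: str, assets: dict) -> str:
    """Replace <link href="*.css"> and <script src="*.js"> tags with inline content.

    `assets` maps each referenced file name to its text content.
    """

    def inline_css(match):
        href = match.group(1)
        if href not in assets:
            return match.group(0)  # leave unknown references untouched
        return f"<style>\n{assets[href]}\n</style>"

    def inline_js(match):
        src = match.group(1)
        if src not in assets:
            return match.group(0)
        return f"<script>\n{assets[src]}\n</script>"

    html = re.sub(r'<link[^>]*href="([^"]+\.css)"[^>]*>', inline_css, html)
    html = re.sub(r'<script[^>]*src="([^"]+\.js)"[^>]*>\s*</script>', inline_js, html)
    return html


def count_connections(html: str) -> int:
    """Connections per page view under HTTP 1.0: the HTML plus each external object."""
    external = re.findall(r'<(?:link|script|img)[^>]*(?:href|src)="[^"]+"', html)
    return 1 + len(external)


# Hypothetical page: one stylesheet, one script, four images.
page = (
    '<html><head>'
    '<link rel="stylesheet" href="style.css">'
    '<script src="site.js"></script>'
    '</head><body>'
    '<img src="a.png"><img src="b.png"><img src="c.png"><img src="d.png">'
    '</body></html>'
)
assets = {"style.css": "body { margin: 0; }", "site.js": "var ready = true;"}

inlined = inline_assets(page, assets)
before, after = count_connections(page), count_connections(inlined)
print(before, after, f"{100 * (before - after) / before:.0f}% fewer connections")
```

For this made-up page the count drops from 7 to 5 connections per page view, a reduction of about 29%.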

The risk of doing so is nearly zero, because the component chart shows us that bandwidth is not an issue.

The other side of the coin: inlined files cannot be cached, so the 7 KB will be delivered with every request.
The same goes for the images: they can be "inlined" too (look at some of the older entries to see how inlining works; it does not work in IE browsers). But OK, let's keep the pictures externally referenced. Maybe one day some good guy will offer to host the binaries for Wikileaks.
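For completeness, a short sketch of image inlining, assuming that "inlining" here means embedding the image as a base64-encoded data: URI directly in the src attribute. The function names are hypothetical.

```python
import base64


def to_data_uri(data: bytes, mime_type: str = "image/png") -> str:
    """Encode binary image data as a data: URI for use in an src attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def img_tag(data: bytes, alt: str, mime_type: str = "image/png") -> str:
    """Build an <img> tag that carries the image inline instead of referencing a file."""
    return f'<img alt="{alt}" src="{to_data_uri(data, mime_type)}">'


# In practice the bytes would come from a file, e.g. open("logo.png", "rb").read().
print(img_tag(b"abc", alt="placeholder"))
```

Keep in mind that base64 encoding makes the data about a third larger, and that inlined images, like the inlined CSS and JavaScript, are transferred again with every page view.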
In times of trouble it is also worth considering whether it makes sense to deliver an icon (.ico) for the browser at all. Dropping it would save one more connection per page request.

What have we learned? Every coin has two sides, so there are always trade-offs to consider. But honestly: would I prefer to look nice or would I prefer to be available? I would take the second choice.