Overview

Timing attacks against web applications: Are they still practical?

Timing attacks have long been underestimated in the field of web applications. The prevailing view is that, due to the delicate nature of their exploitation and the fact that network jitter introduces a large amount of noise, they are not practical. However, our research demonstrates that under certain scenarios they are effective and can produce better results than a standard brute-force attack.

1.0. Introduction

We must first acknowledge that the exploited application was open source, so we could easily tailor our attack to it. The following prerequisites were in place:

We had access to the source code

A pass-the-hash vulnerability was present

A weak hash function was used (and without a salt)

There was no WAF/IDS/IPS in place

There was no reverse proxy, CDN, caching

There are several "types" of timing attacks; in this particular case we are going to look into the most common form.

What happens here is that PHP compares the strings one character at a time (via the !== operator). This is how the is_identical_function() in C works. If we look into the source code of PHP, we can see that the comparison is done via memcmp():

The memcmp() function is what opens up the possibility for a timing attack, since the operation it performs is not constant-time: it compares one character at a time and returns false as soon as a character does not match. The source of memcmp() is the following:
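To illustrate the leak, here is a minimal Python sketch contrasting an early-exit comparison (the behavior memcmp() exhibits) with a constant-time alternative; the function names are ours, not PHP's:

```python
import hmac

def early_exit_equals(a: bytes, b: bytes) -> bool:
    # Mimics memcmp()-style comparison: returns as soon as a byte differs,
    # so running time grows with the length of the matching prefix.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_equals(a: bytes, b: bytes) -> bool:
    # hmac.compare_digest examines every byte regardless of mismatches,
    # so its running time does not depend on where the inputs differ.
    return hmac.compare_digest(a, b)
```

The timing difference between a guess that matches the first N characters and one that matches N+1 is exactly what the attack measures.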

1.1. General scenario

The main idea was to separate the sources of data and their retrieval. Several sources could introduce problematic noise if not filtered out. First we had to take care of the RTT (round-trip time), which turned out to be one of the trickiest measurements.

All in all, the scheme looked like so:

1.2. Our plan

Based on previous research in this area, we felt there was a lot of room for improvement. This is mainly because previous researchers have focused on a generic approach rather than targeting a particular web application and obtaining additional information from its specifics, as well as the specifics of the environment hosting it. Our early attempts at a general noise-filtering solution did not yield successful results under practical circumstances, that is, a reliable means of identifying spikes and working around the various obstacles that can distort observations to the point where the collected data is useless. The main aspects we focused on are:

Greatly improving data collection over the network;

Targeting a particular, remotely hosted web application;

Finding an actually meaningful way of analyzing data;

When picking a target, we aimed for something widely used, since that would likely imply decent security is in place. We settled on a particular application and its latest stable release as of August 2018. As surprising as it may seem to some, the application is not as secure as we originally considered it to be, as we will demonstrate shortly.

If we were to lay the whole process out on a timeline, it would look something like this:

What didn't work?

So a few things we didn't manage to make use of were:

Application profiling using a profiler (xdebug, xhprof etc)

Measuring the RTT via TCP timestamps

Making observations from user-space

What worked?

Timing different PHP functions used by the application

Averaging out the web server's performance (per version)

The main idea was to use information about the application to our benefit. Since we had access to the source code, we knew exactly which operations were executed for a given request. To estimate how much time each request took to process, it was of utmost importance to collect data on how long the application normally takes to process a particular input. We could then subtract this time from the total time it took the server to return a response.
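That subtraction step can be sketched as follows; the locally profiled processing time is averaged over many runs (all values are illustrative):

```python
from statistics import mean

def isolate_noise(total_times_us, local_times_us):
    """Subtract the locally profiled processing time (averaged over many
    runs) from each over-the-wire measurement, in microseconds. What
    remains is network transit time plus residual noise."""
    baseline = mean(local_times_us)
    return [t - baseline for t in total_times_us]
```

For example, with over-the-wire samples of 100 and 110 us and local samples averaging 50 us, the residuals are 50 and 60 us of network time plus noise.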

1.3. Improvements on data collection and analysis

Rather than only taking the round-trip delay into consideration, we made use of several other metrics that improved overall noise reduction and helped us obtain more reliable results. The key metrics we used are:

Application profiling: Profiling of the application's workflow and estimation of the total execution time of the authentication mechanism(s) (e.g. with xdebug) for different PHP versions. This is necessary as different PHP versions vary in performance; for example, PHP 7 performs significantly better overall than PHP 5. Used for identifying spikes.

Function timing: Calculating the execution time of the PHP functions utilized by the application's authentication mechanism. Done at run-time rather than via unit tests so that the measurements are as realistic as possible.

Delta calculation: Remote estimation of the time difference between a successful and a failed (recovery) attempt. The most important metric for spike identification and data filtration. Used together with Kalman filtering as the main means of sorting the collected data and enumerating results. Used during runtime.

Calculating the RTT had to be done via packet capturing, which unfortunately is not easy to automate. Observations must not be made from user-space, however, as that would also include the time it takes the kernel to receive the data, process it and hand it over to user-space.

It is also worth mentioning that using TCP timestamps did not yield any reliable results, since their granularity ranges from 1 ms to 100 ms while we are interested in microseconds. Tracing system calls (e.g. with strace(1)) was also not reliable, since you would have to trace and identify each and every call as you go. Last but not least, different protocols can be routed differently: ICMP may take a different path than TCP, so measuring with one protocol can produce results that are irrelevant to the other.

Another important part was the RTTM (round-trip time measurement). We tested several ways of calculating the RTT, but unfortunately almost all of them are estimations derived from a handful of factors rather than actual measurements. The results of several tools are presented below.

An improved data collection technique is to use the time difference between the SYN and ACK packets. This is presented below, where we can see the difference in data variance between similar observations (in terms of time, duration and number of samples). The graphs also illustrate a second example where a Kalman filter was used to smooth the data and produce the final data set to work with.

Here is a sample packet captured that has been trimmed to the most relevant lines (with tshark):

In this example we get the time at which the GET request was sent, the ACK packet and the first web server's response (the 301 redirect). This can help us find the web server's and the application's response time. The time difference between the ACK packet and the first response from the web server is the application's execution time.
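Assuming the timestamps are exported via tshark's frame.time_epoch field, the delta boils down to a simple subtraction (the timestamp values below are illustrative, not from our capture):

```python
def app_execution_time(ack_time_s: float, first_response_time_s: float) -> float:
    """Per the measurement scheme above: the gap between the server's ACK of
    the request and the first response bytes approximates the application's
    processing time, since network transit largely cancels out of this
    interval."""
    return first_response_time_s - ack_time_s

# Hypothetical epoch timestamps as printed by
# `tshark -T fields -e frame.time_epoch` (illustrative values only):
ack = 1534867200.000120
resp = 1534867200.004870
delta = app_execution_time(ack, resp)  # roughly 4.75 ms of processing time
```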

We should also mention that we rely on Kalman filtering to remove the noise before we start working with the data.

Kalman filtering, also known as linear quadratic estimation (LQE), is an algorithm that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe.
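A minimal one-dimensional Kalman filter along these lines looks like the sketch below; q and r are tuning assumptions for illustration, not values from our setup:

```python
def kalman_smooth(samples, q=1e-5, r=1e-2):
    """One-dimensional Kalman filter: treats the timing series as noisy
    observations of a slowly drifting true value. q is the process
    variance, r the measurement variance (both tuning parameters)."""
    x = samples[0]  # state estimate
    p = 1.0         # estimate variance
    out = []
    for z in samples:
        p += q                # predict: uncertainty grows by q
        k = p / (p + r)       # Kalman gain
        x += k * (z - x)      # update: move estimate toward measurement
        p *= (1 - k)          # uncertainty shrinks after the update
        out.append(x)
    return out
```

Each output value is a weighted blend of the previous estimate and the new measurement, which damps isolated spikes while still tracking genuine shifts in the timing signal.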

Spike identification

Another common thing to acknowledge is the identification of spikes. They can ruin the entire data collection process and any further analysis. Several approaches can be considered for identifying spikes:

Mirroring the target environment: Copying the remote environment locally so as to make observations on the execution time of the web server per version, along with other robust measurements. This helps identify potential spikes caused by the web server itself, as we gain statistical data on how much time it takes (on average) for Apache to process a request of a particular size.

Function timing: Estimating how long it takes (again, on average) for a particular function or snippet of code to run provides us with a measurement we can later subtract from the total execution time.

Execution timing per environment: We also experimented with different setups, and they did turn out to yield different results; we took that into account as well.

Calculating the median absolute deviation (MAD): The standard deviation and the median absolute deviation were also useful metrics. MAD in particular is a more robust way of characterizing a 'typical' value than a plain mean, as it is far less sensitive to outliers.
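The MAD computation and a simple MAD-based spike test can be sketched as follows; the 3-MAD threshold is a common rule of thumb, not a value taken from our measurements:

```python
from statistics import median

def mad(samples):
    """Median absolute deviation: a robust measure of spread that a single
    spike barely moves, unlike the standard deviation."""
    m = median(samples)
    return median(abs(x - m) for x in samples)

def is_spike(x, samples, threshold=3.0):
    # Flag a sample whose distance from the median exceeds `threshold`
    # MADs (an illustrative cutoff).
    m = mad(samples)
    return m > 0 and abs(x - median(samples)) > threshold * m
```

For timings like [1, 2, 3, 4, 100], the MAD is 1, so the 100 is flagged as a spike while the rest pass.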

The target? More specifically

Our target is a pass-the-hash vulnerability in a PHP application. Being able to provide a hash instead of a plaintext password means that we do not need to crack the hash in order to authenticate to the system. But first, let's obtain the hash. Since we have access to the source code, here is how the verification of the password field is done:

The idea here is to time how long isValidHash() takes to process an input and how long verifyPassword() takes to do the same. The test inputs must be identical for each sample of data provided to both. First we time the functions locally under different versions of PHP, then we time them over the network. In the end, we subtract the local timing from the total time obtained from the packet capture and are left with the noise.
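The per-position recovery this enables can be sketched as follows. This is a deliberately simplified, noiseless model: measure() stands in for the whole measure-and-filter pipeline described above, and all names here are ours, not the application's:

```python
from statistics import median

HEX = "0123456789abcdef"

def recover_hash(measure, length=40, samples_per_guess=10):
    """Per-position recovery: for each position, try all 16 hex digits and
    keep the one whose median measured time is largest. A longer matching
    prefix makes the early-exit comparison run longer, so the correct
    digit stands out. `measure` is any callable returning a timing for a
    candidate prefix."""
    known = ""
    for _ in range(length):
        best = max(HEX, key=lambda c: median(
            measure(known + c) for _ in range(samples_per_guess)))
        known += best
    return known
```

In practice each measure() call wraps thousands of filtered network samples rather than one clean reading, which is where the per-character request counts below come from.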

Some rough statistics for the attack:

On average we need around 14 000 requests per candidate character. The hash is hexadecimal, so it uses a character set of 16 elements (0-9a-f), and its length is 40 characters. This means roughly ~224 000 requests per position, or ~8 960 000 requests for a full hash recovery.

Reducing the search space is also of great importance, as it significantly reduces the number of requests, and that is something we should strive for throughout. We could, for example, look up the hash in a rainbow table for partial matches instead of recovering the entire hash. In that sense, the following applies:

- First 10 characters are expected to be recovered with 2 240 000 requests;

- First 13 characters are expected to be recovered with 2 912 000 requests;

- First 20 characters are expected to be recovered with 4 480 000 requests;
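The figures above follow directly from the per-character sample count:

```python
REQUESTS_PER_CHAR = 14_000        # average samples needed per candidate character
CHARSET = 16                      # hex digits 0-9a-f
HASH_LEN = 40                     # length of the hash in characters

per_position = REQUESTS_PER_CHAR * CHARSET   # 224 000 requests per position
full_recovery = per_position * HASH_LEN      # 8 960 000 for the full hash
first_10 = per_position * 10                 # 2 240 000
first_13 = per_position * 13                 # 2 912 000
first_20 = per_position * 20                 # 4 480 000
```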

Solution to timing attacks

Do not rely solely on network jitter to induce noise and hinder a timing attack; that does not prevent timing attacks, it only makes them harder to execute. The most effective way to combat timing attacks is to use constant-time algorithms for security-critical operations such as authentication and cryptography. Enforcing a rate-limiting policy is also a good idea and renders timing attacks nearly useless in practice, though still theoretically possible.
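As an illustration of the rate-limiting point, here is a minimal sliding-window limiter sketch; class name, thresholds and the client key are all illustrative choices, not taken from any particular framework:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window rate limiter: cap authentication attempts per client.
    Even a generous cap puts a ~9 million request timing attack far out of
    practical reach."""

    def __init__(self, max_attempts, window_s):
        self.max_attempts = max_attempts
        self.window_s = window_s
        self.hits = defaultdict(deque)  # client id -> attempt timestamps

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client_id]
        # Drop attempts that have aged out of the window.
        while q and now - q[0] > self.window_s:
            q.popleft()
        if len(q) >= self.max_attempts:
            return False
        q.append(now)
        return True
```

A limiter like this is a complement to constant-time comparison, not a replacement: the side channel still exists, it just becomes far too slow to exploit.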