A frame from a video demonstration showing BREACH in the process of extracting a 32-character security token from an HTTPS-encrypted Web page. Credit: Prado, Harris, and Gluck

The HTTPS cryptographic scheme, which protects millions of websites, is susceptible to a new attack that allows hackers to pluck e-mail addresses and certain types of security credentials out of encrypted pages, often in as little as 30 seconds.

The technique, scheduled to be demonstrated Thursday at the Black Hat security conference in Las Vegas, decodes encrypted data that online banks and e-commerce sites send in responses that are protected by the widely used transport layer security (TLS) and secure sockets layer (SSL) protocols. The attack can extract specific pieces of data, such as social security numbers, e-mail addresses, certain types of security tokens, and password-reset links. It works against all versions of TLS and SSL regardless of the encryption algorithm or cipher that's used.

It requires that the attacker have the ability to passively monitor the traffic traveling between the end user and website. The attack also requires the attacker to force the victim to visit a malicious link. This can be done by injecting an iframe tag in a website the victim normally visits or, alternatively, by tricking the victim into viewing an e-mail with hidden images that automatically download and generate HTTP requests. The malicious link causes the victim's computer to make multiple requests to the HTTPS server that's being targeted. These requests are used to make "probing guesses" that will be explained shortly.

"We're not decrypting the entire channel, but only extracting the secrets we care about," Yoel Gluck, one of three researchers who developed the attack, told Ars. "It's a very targeted attack. We just need to find one corner [of a website response] that has the token or password change and go after that page to extract the secret. In general, any secret that's relevant [and] located in the body, whether it be on a webpage or an Ajax response, we have the ability to extract that secret in under 30 seconds, typically."

It's the latest attack to chip away at the HTTPS encryption scheme, which forms the cornerstone of virtually all security involving the Web, e-mail, and other Internet services. It joins a pantheon of other hacks introduced over the past few years that bear names such as CRIME, BEAST, Lucky 13, and SSLStrip. While none of the attacks have completely undermined the security afforded by HTTPS, they highlight the fragility of the two-decade-old SSL and TLS protocols. The latest attack has been dubbed BREACH, short for Browser Reconnaissance and Exfiltration via Adaptive Compression of Hypertext.

As its name suggests, BREACH works by targeting the data compression that just about every website uses to conserve bandwidth. Based on the standard Deflate algorithm, HTTP compression works by eliminating repetitions in strings of text. Rather than repeating "abcd" four times in a chunk of data, for instance, compression will store the string "abcd" only once and replace the remaining three instances with space-saving "pointers" back to the first. By reducing the number of bytes sent over a connection, compression can significantly reduce the time required for a message to be received. In general, the more repetitions of identical strings found in a data stream, the more potential there is for compression to reduce the overall size.
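
The effect is easy to see with a few lines of Python using the standard zlib module, which implements Deflate; the inputs here are purely illustrative:

```python
import os
import zlib

repetitive = b"abcd" * 1000      # the same 4-byte pattern repeated 1,000 times
random_data = os.urandom(4000)   # 4,000 incompressible random bytes

print(len(zlib.compress(repetitive)))   # a few dozen bytes: the repeats collapse into back-references
print(len(zlib.compress(random_data)))  # roughly 4,000 bytes: nothing to deduplicate
```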

Using what's known as an oracle technique, attackers can use compression to gain crucial clues about the contents of an encrypted message. That's because many forms of encryption—including those found in HTTPS—do little or nothing to stop attackers from seeing the size of the encrypted payload. Compression oracle techniques are particularly effective at ferreting out small chunks of text in the encrypted data stream.

BREACH plucks out targeted text strings from an encrypted response by guessing specific characters and including them in probe requests sent back to the targeted Web service. The attack then compares the byte length of the response containing the guess to that of the original response. When the guess contains the precise combination of characters found in the original response, Deflate replaces the repeated string with a short back-reference rather than storing it again, so the resulting encrypted message is generally smaller than those produced by incorrect guesses.

How an oracle attack works

The first thing an attacker using BREACH might do to retrieve an encrypted e-mail address is guess the @ sign and Internet domain immediately to its right. If guesses such as "@arstechnica.com" and "@dangoodin.com" result in encrypted messages that are larger than the request/response pair without this payload, the attacker knows those addresses aren't included in the targeted response body. Conversely, if compressing "@example.com" against the encrypted address results in no length increase, the attacker will have a high degree of confidence that the string is part of the address he or she is trying to extract. From there, attackers can guess the string to the left of the @ sign character by character.

Assuming the encrypted address was johndoe@example.com, guesses of a@example.com, b@example.com, c@example.com, and d@example.com would cause the encrypted message to grow. But when the attacker guesses e@example.com, it would result in no appreciable increase, since that string is included in the targeted message. The attacker would then repeat the same process to recover the remainder of the e-mail address, character by character, moving right to left.
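
A minimal Python sketch of that comparison, with zlib standing in for the server's Deflate step and a hypothetical response body; the attacker only ever sees sizes, never plaintext:

```python
import zlib

# Hypothetical response body containing the secret address; the attacker's
# guess is reflected into the page (e.g. via an echoed query parameter).
page = b"<html>Account holder: johndoe@example.com</html>"

def observed_size(guess: bytes) -> int:
    """Compressed size of the response with the guess reflected in it."""
    return len(zlib.compress(page + guess))

for guess in (b"a@example.com", b"b@example.com", b"e@example.com"):
    print(guess.decode(), observed_size(guess), "bytes")

# The correct guess ("e@example.com") extends a string already in the page,
# so it typically compresses a byte or so smaller than the wrong guesses.
# At this toy scale, Deflate's byte rounding can hide the difference, which
# is why the real attack relies on repeated, carefully padded probes.
```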

The technique can be used to extract other types of encrypted text included in Web responses. If the site being targeted sends special tokens designed to prevent so-called cross-site request forgery attacks, the credential will almost always follow the same format—such as "request_token=" followed by a long text string like "bb63e4ba67e24d6b81ed425c5a95b7a2"—each time it's sent. The compression oracle attack can be used to guess this secret string.

An attacker would begin by adding the text "request_token=a" to the page being targeted and sending it in a probe request to the Web server. Since the size of the encrypted payload grows, it would be obvious this guess is wrong. By contrast, adding "request_token=b" to the page wouldn't result in any appreciable increase in length, giving the attacker a strong clue that the first character following the equal sign is b. The attacker would use the same technique to guess each remaining character, one at a time, moving left to right.
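
Extending that idea into a loop gives the character-by-character search the researchers describe. This toy version (hypothetical page and token, zlib in place of the server's gzip/Deflate step) shows the shape of the attack only; at this scale byte rounding causes ties that a real attack resolves with extra, padded probes:

```python
import zlib

# Hypothetical response body containing the anti-CSRF token the attacker wants.
page = b"<body>request_token=bb63e4ba67e24d6b81ed425c5a95b7a2 ...rest of page...</body>"

def oracle(reflected: bytes) -> int:
    """Compressed length the attacker observes when its input is reflected."""
    return len(zlib.compress(page + reflected))

alphabet = b"0123456789abcdef"   # the token is hexadecimal in this example
known = b"request_token="        # fixed prefix that precedes the secret
recovered = b""

for _ in range(32):              # the token is 32 hex characters long
    # The character that really follows the known prefix extends an existing
    # back-reference, so its reflection should compress no larger than any other.
    best = min(alphabet, key=lambda c: oracle(known + recovered + bytes([c])))
    recovered += bytes([best])

print(recovered.decode())        # with luck: bb63e4ba67e24d6b81ed425c5a95b7a2
```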

Most attacks that use the BREACH technique can be completed by making only a "few thousand" requests to the targeted Web service, in about 30 seconds with optimal network conditions and small secrets, and in minutes to an hour for more advanced secrets.

BREACH, which was devised by Gluck along with researchers Neal Harris and Angelo Prado, builds off the breakthrough CRIME attack researchers Juliano Rizzo and Thai Duong demonstrated last September. Short for Compression Ratio Info-leak Made Easy, CRIME also exploited the compression in encrypted Web requests to ferret out the plaintext of authentication cookies used to access private user accounts. The research resulted in the suspension of TLS compression and the temporary disabling of the open networking protocol known as SPDY until it could be modified. BREACH, by contrast, targets the much more widely used HTTP compression that virtually all websites rely on when sending responses to end users, and it works only against data sent in those responses.

"If you go to the Wikipedia page or any of the specialized security pages, they will tell you that CRIME is mitigated as of today and is no longer an interesting attack and nobody cares about it," Prado said. "So we are bringing it back and making it work better, faster in a different context."

The good news concerning BREACH is that it works only against certain types of data included in Web responses and then only when an attacker has succeeded in forcing the victim to visit a malicious link. Still, anytime an attacker can extract sensitive data shielded by one of the world's most widely used encryption schemes it's a big deal, particularly as concerns rise about NSA surveillance programs. Making matters more unsettling, there are no easy ways to mitigate the damage BREACH can do. Unlike TLS compression and SPDY, HTTP compression is an essential technology that can't be replaced or discarded without inflicting considerable pain on both website operators and end users.

At their Black Hat demo, the researchers will release a collection of tools that will help developers assess how vulnerable their applications and online services are to BREACH attacks. Most mitigations will be application-specific. In other cases, the attacks may give rise to new "best practices" advice on how to avoid including certain types of sensitive data in encrypted Web responses. Most websites already list only the last four digits of a customer's credit card number; BREACH may force websites to truncate other sensitive strings as well.

"We expect that it could be leveraged in particular situations, maybe with an intelligence agency, or maybe an individual actor or a malicious crime organization might use this in a targeted scenario," Prado said. "Any malware writer today has the ability to do something like this if they have not been doing it already."

Article updated to correct statement that SPDY was suspended. It was temporarily disabled until it could be modified.

Promoted Comments

I could be under thinking, but seems like a simple fix - a few randomly generated characters in an HTML comment at the bottom of the page would randomize the size, making it much more difficult to infer any useful information.

Using padding would only increase the number of messages required to extract the secret, because you can use statistics to compensate for the random variation. By injecting the same string multiple times, the attacker can find out whether the average length after injection deviates from the expected average and thus determine whether the message length has changed. Increasing the proportion of padding to message only increases the size of the sample required to determine if a change has occurred. However, the average number of messages required to perform such an attack should be fairly easy to calculate, so you could set a dynamic padding:message ratio calculated to push the attack time to hopefully infeasible levels.

However it might be possible that doing this would increase the page size so much that you might as well disable the HTTP compression altogether.
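
A rough simulation of that statistical argument (all sizes and counts hypothetical): even if the server appends a random amount of padding to every response, averaging over enough probe requests still exposes the single byte a correct guess saves.

```python
import random
import statistics

PADDING = 64             # server appends 0-63 random padding bytes (hypothetical)
WRONG_GUESS_SIZE = 1000  # compressed size when the probe's guess is wrong
RIGHT_GUESS_SIZE = 999   # a correct guess compresses one byte smaller

def observed(true_size: int) -> int:
    """What the attacker sees on the wire for one probe."""
    return true_size + random.randrange(PADDING)

N = 50_000  # defeating the padding just means sending more probes
wrong = statistics.fmean(observed(WRONG_GUESS_SIZE) for _ in range(N))
right = statistics.fmean(observed(RIGHT_GUESS_SIZE) for _ in range(N))

print(f"estimated difference: {wrong - right:.2f}")  # usually lands near 1.0
```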

This is another form of the "increasing key size/password length just increases the average number of guesses required to recover the desired secret" argument. Making things computationally infeasible is a fundamental tool of security. And, since we're talking about an online attack here, increasing the number of required guesses also increases the odds of detection.

Exactly, that's why I suggested a padding ratio that requires an infeasible number of messages. But my main point is that if you have to add so much data to the page that you increase the page size significantly, it might just be easier to disable compression on dynamic assets, since the random padding bytes that you add will be almost totally incompressible.

Additionally, whilst such online attacks should be easy to detect in theory, practical problems and technical challenges mean that some sites, especially smaller ones, will not have such capabilities in the near future.

Most web server operators and companies are just putting everything under SSL, which is ridiculous. An article like this on Ars does not need to be protected. Sensitive data should go over SSL connections; public data should not, for the simple reason that you will be able to serve 100 times more traffic without it. I just hate it when I see SSL locks on blogs and other public pages that make no sense to protect in the first place and have no user interaction either, not to mention when they display mixed insecure elements on the same page....

An argument for encrypting non-sensitive pages (such as the pages of a bank's website listing their branch offices and business hours) is that it makes it harder to do simple man-in-the-middle replacement of page content. If those pages were left unencrypted, the bad guy would not need to trick the user into visiting a malicious URL to accomplish the attack on the secure part of the site; they could just inject it invisibly into the non-secure pages.

Even aside from security concerns, there are disreputable ISPs that redirect all their customers' web traffic through proxy servers that replace the ads on the sites visited with their own ads (or insert additional ads), thus diverting the ad revenue stream from the site owner to the ISP. Using SSL also makes this harder for the ISP to do and is a way of ensuring your site is delivered to the end user exactly as you intend it.

If the site operator has the compute capacity for running everything through SSL, what concern is it of yours?

If guesses such as "@arstechnica.com" and "@dangoodin.com" result in encrypted messages that are larger than the request/response pair without this payload, the attacker knows those addresses aren't included in the targeted response body.

Of course they would never try that because ars doesn't support https in the first place =/

This isn't another man-in-the-middle exploit, it's more like a man-sat-next-to-you attack. The attacker needs to inject an iframe somewhere and observe your encrypted traffic, so being on the same LAN is ideal.

Sounds to me that the actual secret data (particularly passwords) should not be compressed. Maybe a variation of deflate with escape tokens inserted by the compressor. Not sure how one would specify this to a compression routine not part of the same program as the generator of the contents. It's a layering violation for certain.

Wouldn't an easy way be to disable content-encoding for the specific content one wishes to have secure? Yes, there's an increase in bandwidth, but if security is paramount then that's a cost that may need to be taken.

Either I'm not understanding this attack at all, or it is very limited in application. It not only requires access to the traffic (so don't use open WiFi or WiFi using WEP or a widely known WPA key) for sensitive transactions, but also requires that the server include the supplied data, verbatim, in its response. And it also requires that there be no rate limiting or other countermeasures.

Off the top of my head, it seems like doing compression on headers and data (at least in the case of CSRF protection cookies) separately would mitigate this, as would simply using uncompressed headers, although that last one might open up other avenues of attacking the encryption directly due to increased repetitiveness in the message.

I could be under thinking, but seems like a simple fix - a few randomly generated characters in an HTML comment at the bottom of the page would randomize the size, making it much more difficult to infer any useful information.

Wouldn't pages which used that trick only vary in size within a predictable range though? Say a possible variation between 50 and 55KB, and if the hacker sees a greater variation than that then he knows... whatever variation in the page size tells him (not a coder and haven't re-read this article to actually understand the attack!)

Yes, but if the padding size varies randomly in a 5KB range, and the secret you're trying to detect is only a few bytes long, then the random variation is much larger than the systematic variation. The signal would be completely swamped by the noise.

In practice I would guess that even varying the size by a random amount between, say, 0 and 512 bytes would be enough to confuse the attacker.

BREACH plucks out targeted text strings from an encrypted response by guessing specific characters and including them in probe requests sent back to the targeted Web service.

Would a "simple" solution be to replace all of those common credentials (e.g.: "request_token") with long, nonsensical strings which are not repeated within the document - or across websites? If the atacker has to guess at common phrases, making it more difficult to guess goes a long way.

It requires that the attacker have the ability to passively monitor the traffic traveling between the end user and website. The attack also requires the attacker to force the victim to visit a malicious link. This can be done by injecting an iframe tag in a website the victim normally visits or, alternatively, by tricking the victim into viewing an e-mail with hidden images that automatically download and generate HTTP requests. The malicious link causes the victim's computer to make multiple requests to the HTTPS server that's being targeted. These requests are used to make "probing guesses" that will be explained shortly.

Meh.

So there is still no direct attack on HTTPS then. It's just another variation of a phishing scheme that requires end-user action.

I could be under thinking, but seems like a simple fix - a few randomly generated characters in an HTML comment at the bottom of the page would randomize the size, making it much more difficult to infer any useful information.

I think you're sort of half-way to the 'fix' - rather than 'randomly generated' characters, what you need is to put all the *wrong* guesses at the end of the file.

I.e. if the 'secret' on the page is an email address, aa@bb.com, then at the end of your page you need to put "@a.com @b.com @c.com ... @z.com @aa.com @ab.com @ac.com ... @zz.com a@a.com b@a.com c@a.com ... z@a.com" and so on, so whatever 'guesses' the bad guy makes, the string he is guessing is already in the document.

Of course you don't need to actually cover every possible guess. You just need to put enough 'fake' secrets at the end of the document to reduce his confidence in what he's guessed. This may vary on the usage.

If the 'secret' is something that he gets one shot at actually using, e.g. some sort of one-time generated token that he can try to use to open a door but if he enters the wrong one it's permanently locked, then you don't need to put many 'false positives' in there for the bad guy to not have enough confidence that his guess is the right one to be able to make use of it. If it's something where even if he's got lots of equally possibly-valid secrets he can just try them all one after t'other, then you're going to need a lot of fake secrets to take him back to the point where he'd just be better off trying to brute-force guess the secret instead of trying to decrypt your page anyway.

I believe that fortunately there is a maximum length of string that Deflate will replace with a back-reference (258 bytes?), so you can work out how much 'fake context' you need to put around your 'fake secrets' (to ensure that the bad guy can't just find the 'real' secret by knowing that, say, the real secret always has a "</a>" tag at the end that you didn't put on the end of your fakes). So you should, theoretically, be able to cobble something together this way that definitely makes the bad guy's life difficult.
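
A rough sketch of that idea (the page markup and token format here are hypothetical): generate decoy tokens in the same format as the real one and tuck them into an HTML comment, so many of an attacker's guesses match strings that lead nowhere. As the comment above notes, how much confidence this actually destroys depends on how many decoys share each guessed prefix.

```python
import secrets

def add_decoy_tokens(page_html: str, n_decoys: int = 64) -> str:
    """Append decoy tokens, formatted like the real one, inside an HTML comment."""
    decoys = " ".join(
        "request_token=" + secrets.token_hex(16) for _ in range(n_decoys)
    )
    return page_html + "\n<!-- " + decoys + " -->"

# The real token stays where it is; the decoys exist only to pollute the
# compression oracle's view of the page.
page = "<body>request_token=bb63e4ba67e24d6b81ed425c5a95b7a2</body>"
print(add_decoy_tokens(page)[:160])
```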

I was just wondering who is really vulnerable to this attack and found that Google does not use gzip on their encrypted.google.com page, nor does my bank. However, Bank of America, as an example, does use compression on their online banking sign-in page.

HTTP compression is an essential technology that can't be replaced or discarded without inflicting considerable pain on both website operators and end users.

Really? Last I checked, most web servers have an option to disable that... Apache even has it disabled by default and you have to manually enable it. I assume most other web servers, application servers, and load balancers work the same way.

Yet another reason why the 'https everywhere' zealots are wrongheaded.

If you only use HTTPS for transactions that actually need it, then disabling compression on HTTPS traffic will have negligible impact. If on the other hand you insist on the 'all the world's HTTPS' point of view, then yeah it's not an option. (Particularly given the impact that HTTPS has already had on the bandwidth bill by eliminating caching.)

Not all web browsers or web servers support RFC 2817, which would let you use StartTLS to initiate a secure TLS session over an existing connection. Without that, if you switch between HTTP and HTTPS most browsers will issue a warning that you are changing to a non-secure connection (which isn't wrong: if my bank only used HTTPS for the login and HTTP for the rest, I would want to be warned, since that's information I'd want to have encrypted).

The other thing to keep in mind is that you really should have an IPS/IDS in front of your webservers to protect against these kinds of attacks. These kinds of attacks require a lot of guesses to gain any useful information, and that kind of traffic would be easily detected by an IPS/IDS system.

Disabling HTTP compression not only adds to bandwidth costs, it also slows down the loading of web pages considerably. Text-based files like HTML, JavaScript, and CSS can all be 'minified' to reduce their size without compression, but proper gzip or deflate compression still beats it by miles.

Plus, when you hit a certain size with regards to people visiting your website, anything that slows down the loading of webpages only impacts the usability of the site for everyone else. The longer it takes to load a site for one person, the more it'll affect other users trying to request the site from the server. HTTP compression is a big, easy win for speeding things up.

Randomize response size by including pseudo-random text strings of random length in both directions that don't get viewed by real end users (but which can be validated by the server if necessary). That should at least slow the attack down. Also, apply a rate-limiting scheme, develop strong session/account controls, and send HTTP error code 429 when necessary... EDIT: Perhaps better, disable compression for user-authentication-related HTTP requests?

Of course, one way of preventing this attack is to employ a padding scheme which ensures that the response is a fixed length regardless of content. This should completely mitigate such attacks. However, there may be significant downsides, particularly for pages whose size varies widely.
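
One hypothetical way to realize that (a sketch, not a vetted mitigation): round every response up to a fixed-size bucket, carrying the filler somewhere the client ignores, so that any two responses falling in the same bucket are indistinguishable by length.

```python
import zlib

BUCKET = 4096  # hypothetical bucket size in bytes

def bucketed_response(body: bytes) -> tuple[dict, bytes]:
    compressed = zlib.compress(body)      # the "deflate" content coding
    filler = (-len(compressed)) % BUCKET  # bytes needed to reach the next bucket
    headers = {
        "Content-Encoding": "deflate",
        # Throwaway header whose only job is to absorb the size difference
        # (ignoring the few bytes of header framing).
        "X-Length-Padding": "x" * filler,
    }
    return headers, compressed
```

The cost is exactly the one raised above: every small response is inflated to the bucket boundary, which may end up more expensive than simply turning compression off.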

I mean most sensible sites will suspend your login after a number of unsuccessful attempts. Can't this be handled the same way?

I was thinking the same thing. It looks like this attack has a very specific fingerprint that can't easily be disguised – at least to human eyes.

Wouldn't it be possible to have a process that uses some regex magic to monitor the request log of the last few minutes to catch this kind of pattern?

Then again, the overhead of this kind of monitoring might be so great that turning off compression could be more efficient. (Or it may not – regexes never were my strong suit, so I'm not sure how hard it'd be to implement.)

Edit: Simple rate limiting may actually be the better approach – unless someone develops a distributed version of this attack that uses coordinated bots.

So it uses the pattern created by the data compression itself to find the repeated pattern points. Oddly enough, I presumed a secure connection avoided this. I accept that nothing's 100%, but whoever discovered this way of reducing data usage for secure connections must have considered this shortcoming.

This isn't another man-in-the-middle exploit, it's more like a man-sat-next-to-you attack. The attacker needs to inject an iframe somewhere and observe your encrypted traffic, so being on the same LAN is ideal.

If all pages and links are HTTPS, one should not be able to inject an iframe short of compromising one of the sites you're trying to connect to.

Isn't one of the downsides of HTTPS that it's more resource intensive?*

*I think there's another when it comes to virtual domains, but I'm not sure.

It not only requires access to the traffic (so don't use open WiFi or WiFi using WEP or a widely known WPA key) for sensitive transactions...

Given the demonstrated ability of the Chinese to redirect traffic from outside the Great Firewall through to where they can inspect it ("oops, we accidentally misconfigured a DNS server"), I'd think it doesn't require wireless snooping.