8 Answers

I could not find any evidence pointing towards homomorphic encryption. What I could find were different combinations of deterministic and format preserving encryption. There is probably also a variant that preserves order, but I couldn't find any material depicting it.

This post is based on material published on the CipherCloud website at CipherCloud Cloud Security Learning Center - Featured Content. CipherCloud notes that the actual product has additional encryption features that offer stronger security and don't suffer from the listed weaknesses (see CipherCloud response section).

The first column shows the policy which determines how different fields should be encrypted. For most fields this is "AES Crypto Encryption"; for some fields, such as phone numbers, it's a specialized form of encryption, such as "Telephone Number Encryption".

The second column shows the plaintext ("As seen by authorized users").

The third column shows the ciphertext ("As seen by Unauthorized Users").

Analysis: When a field consists of several parts (say, first name and last name), they split it into those parts and encrypt them individually.

Some important fields, such as "annual revenue", aren't encrypted at all, probably because they need to use them in calculations.

The encryption for names and addresses preserved the length of the individual words and the position of spaces. There are special tags between words which seem to mark encrypted parts and separate tokens.

Preserving length strongly hints at deterministic encryption, as does support for search. In theory it'd be possible to use different nonces for different records, but I consider that unlikely since it'd be hard to keep a search index in such a case.

Analysis: Same length encryption was already observed in Material 1. Presumably preserved functions are search and ordering. Probably it's possible to choose which functions to preserve on a per field basis. Material 3 will demonstrate a form of encryption that allows search, but does not appear to preserve ordering.

Description: This shows a typical message board which is part of the yammer web-application. There are two open windows showing the same thread. One shows the encrypted messages (as the yammer.com server sees it), the other the plaintext form as it is seen when accessed through CipherCloud. The message bodies are encrypted, the usernames and post times are not. The encrypted messages are a mix of Asian and ASCII characters.

Analysis:

There is a clear correspondence between plaintext words and ciphertext tokens. Each token starts with zqx1, ends with 0j1xqz and corresponds to one word. Punctuation marks are not encrypted at all.

Words that occur multiple times in the plaintext (for example meet, to, want) appear as identical tokens in the ciphertext.

The words new and will are even more interesting. They occur both in lowercase and uppercase form. The encrypted form of New might look like zqx1賓翡祀徠鈞祁記勤机琦芸稶70j1xqz [1] and the encrypted form of new like zqx1賓翡祀徠鈞祁記勤机[linebreak]20j1xqz [2].

So I assume the first part of the token encodes the word in a case insensitive form and the second part supplies the casing information. Sid made the same observation in his post. This form of encryption allows the case insensitive search they demoed in the video, where a search for apac turned up a message containing APAC since the beginning of the word is the same, regardless of case. This certainly looks like a 1:1 mapping between lower case words and the first part of the token.
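
To make the observed behaviour concrete, here is a minimal sketch of a scheme with the same visible properties. It is not CipherCloud's algorithm: HMAC-SHA256 stands in for whatever keyed function they use, and the key, the helper names and the bitmask used for the casing information are all my assumptions. Only the zqx1 / 0j1xqz wrappers come from the demo.

```python
import hashlib
import hmac

KEY = b"demo-key-not-a-real-secret"  # hypothetical key
PREFIX, SUFFIX = "zqx1", "0j1xqz"    # wrappers observed in the demo


def word_part(word: str) -> str:
    """Deterministic, case-insensitive part of the token (16 bytes as hex)."""
    mac = hmac.new(KEY, word.lower().encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:32]


def case_part(word: str) -> str:
    """Casing information, stored as a bitmask of uppercase positions (assumed format)."""
    mask = sum(1 << i for i, ch in enumerate(word) if ch.isupper())
    return format(mask, "x")


def tokenize_word(word: str) -> str:
    """Build one token: wrapper, word part, case part, wrapper."""
    return f"{PREFIX}{word_part(word)}.{case_part(word)}{SUFFIX}"


def matches(query: str, token: str) -> bool:
    """Case-insensitive search: compare only the deterministic word part."""
    return token.startswith(PREFIX + word_part(query) + ".")
```

With this sketch, New and new share the first part of the token and differ only in the trailing case part, and a search for apac matches the token stored for APAC.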

The first part of the token seems to have a constant size of 9 Asian characters. Given the large number of such characters, that would be enough to encode 128 bits, or one AES block. So one possible implementation would be to use AES in ECB mode, or in CBC mode with a constant IV, encrypting a single block. But this paragraph is pure speculation that can't be proven from the limited observations we made.
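
The capacity estimate can be checked with a few lines. I'm assuming here that the characters are drawn from the CJK Unified Ideographs block U+4E00..U+9FFF; which code points the product really uses is unknown, so treat the alphabet size as a guess.

```python
import math

# Assumption: tokens draw from the CJK Unified Ideographs block U+4E00..U+9FFF.
ALPHABET_SIZE = 0x9FFF - 0x4E00 + 1  # 20,992 code points in that range
BLOCK_BITS = 128                     # size of one AES block

# Information carried by one character, and characters needed for one block.
bits_per_char = math.log2(ALPHABET_SIZE)
chars_needed = math.ceil(BLOCK_BITS / bits_per_char)
```

Under that assumption each character carries a bit more than 14 bits, so 8 characters fall short of 128 bits and 9 are just enough, which fits the observed token size.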

An inherent weakness in such a scheme is that it doesn't offer semantic security. If the same word gets encrypted in different places, an attacker can see that the same word was used in both places. If he can figure out one of them, he automatically knows the other too.

The attacker could also employ some kind of word level frequency analysis on tokens. Once part of the ciphertext has been recovered, the known words serve as context for further guesses. Once a certain knowledge threshold has been exceeded this should allow recovery of most words. So it's essential to prevent an attacker from learning any (ciphertext, plaintext) pairs, even for harmless messages. Quoting public material or storing drafts which later will be published is also quite dangerous.
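
A first pass of such a frequency analysis could look like the sketch below. The word ranking is a made-up stand-in for a real English frequency table and `guess_mapping` is a hypothetical helper; a real attack would refine these guesses using context.

```python
from collections import Counter

# Hypothetical ranking of common English words, most frequent first.
COMMON_WORDS = ["the", "to", "and", "of", "a"]


def guess_mapping(tokens: list[str], ranking: list[str]) -> dict[str, str]:
    """Pair the most frequent tokens with the most frequent words."""
    by_frequency = [tok for tok, _ in Counter(tokens).most_common()]
    return dict(zip(by_frequency, ranking))
```

Every correct guess is immediately valid for all other occurrences of that token, which is why a single leaked (ciphertext, plaintext) pair is so costly.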

A similar non deterministic scheme

The above scheme is essentially a word-level substitution cipher. Classical substitution ciphers usually use a single letter (or a small fixed number of letters) as a substitution unit. Modern block ciphers such as AES trivially allow using whole words as the substitution unit, which was difficult to do without a computer.

Those early substitution ciphers had similar weaknesses to the above scheme (1:1 mapping between letters, deterministic encryption and frequency analysis). Of course with short/letter units these weaknesses were much more pronounced than with long/word units.

One technique used to reduce the impact of these weaknesses is having multiple possible substitutions for each plaintext unit and randomly choosing one during encryption. This technique is called Homophonic substitution. This technique can obviously be applied to word level substitution ciphers as well.

When a word has n possible ciphertexts, one could still do searches by triggering n separate searches over the encrypted data and merging the results. This would turn above 1:1 mapping into a 1:n mapping, make the encryption non deterministic and, depending on the chosen probability distribution, frequency analysis might be less effective as well. Of course such a combined search would leak some information to the server about which ciphertext units might correspond to the same plaintext unit.
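
A minimal sketch of that 1:n idea, purely as an illustration: HMAC-SHA256 again stands in for the real cipher, and the key, the number of variants and the function names are my assumptions.

```python
import hashlib
import hmac
import secrets

KEY = b"demo-key-not-a-real-secret"  # hypothetical key
N_VARIANTS = 4                       # hypothetical number of homophones per word


def variant(word: str, index: int) -> str:
    """The index-th of the n possible tokens for a word."""
    msg = f"{index}:{word.lower()}".encode("utf-8")
    return hmac.new(KEY, msg, hashlib.sha256).hexdigest()[:32]


def encrypt_word(word: str) -> str:
    """Pick one of the n homophones at random for each occurrence."""
    return variant(word, secrets.randbelow(N_VARIANTS))


def search_terms(word: str) -> list[str]:
    """All n tokens the proxy has to query for; the results are merged afterwards."""
    return [variant(word, i) for i in range(N_VARIANTS)]
```

The list returned by `search_terms` is exactly the information the server learns during a search: those n tokens belong to the same plaintext word.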

There is no evidence that CipherCloud uses a homophonic word substitution cipher. I only mentioned this scheme because it's the simplest change to the above scheme that doesn't contradict the claims in their DMCA notice. CipherCloud claims not to suffer from the above weaknesses, thanks to "patent pending mechanisms". I certainly hope that their mechanisms are better than simply switching to a homophonic word substitution cipher.

CipherCloud's claims in their DMCA notice:

As a counterpoint to this analysis, here are some claims made in their DMCA notice:

Such false, misleading and defaming statements include the following sample:

(2) "If the same string gets encrypted in different places, an attacker can see that the same string was used in both places." [statement from an old version of this post]
* Again, CipherCloud's product is NOT deterministic.

(7) "Basically they end up with a 1:1 mapping of lower case words." [Sid's post, referring to the message board discussed in this post]
* The statement is patently false. Sid implies that what was perceived from a public demo is CiperCloud's product offering. [emphasis mine] CipherCloud does not incorporate 1:1 mapping.

(10 sample statements were included in the DMCA notice, I picked those most relevant to this post. I inserted notes linking quoted statements to their original posts)

Conclusion

The observed encryption has significant weaknesses, most of them inherent to a scheme that wants to encrypt data, while enabling the original application to perform operations such as search and sorting on the encrypted data without changing that application. There might be some advanced techniques (homomorphic encryption and the likes) that avoid these weaknesses, but at least the software demoed in the video does not use them.

The only way I see to reconcile the statements in their DMCA notice with the observed encryption properties is that the actual CipherCloud product is significantly different (and better) than what is shown in their promotional material.

If they don't want to be judged based on the information they published about their product, perhaps they should update their material to match their actual product more closely.

I wanted to provide some clarity to the question of whether CipherCloud uses homomorphic encryption. The answer is NO. Homomorphic encryption is far from ready for practical usage due to performance and lack of capabilities.

The cited CipherCloud product demo in the board threads was focused on highlighting our reverse-proxy concept for cloud information protection to organizations using cloud applications. Some of the fundamental security features made available (e.g. full field encryption, randomization through IVs, etc.) were disabled because we were not comfortable sharing such IP on the internet while our patents are still pending.

[1] The actual Asian characters were different from the ones in my post, but the general form was the same.

[2] The linebreak is an artifact of the browser displaying the text. It might have hidden some characters.

[Comment omitted due to DMCA request]. You know this site is becoming popular when...
– Steve, Apr 22 '13 at 6:17

3

They in fact confirm that they do a simple word-aligned substitution-based cipher here: ciphercloud.com/tokenization-cloud-data.aspx "what is sent out to the cloud are tokens that are structurally similar to the actual data, but have no mathematical correlation. These tokens preserve operations such as searching, sorting and reporting within cloud applications." Note that preserving sort order basically means no semantic security could be attained at all.
– wizzard0, Apr 22 '13 at 9:47

3

What do the comments/claims about their cipher have to do with the DMCA? Unless someone can demonstrate that the content used was subject to copyright, DMCA claims have no merit. In fact, there is a penalty for misuse of the DMCA.
– adrenalion, Apr 22 '13 at 10:14

1

@adrenalion The letter has two parts. One part is the DMCA notice for the images. The other (non-DMCA) part is about the false, misleading and defaming statements.
– CodesInChaos, Apr 22 '13 at 10:20

I don't think they have implemented homomorphic encryption at all. They have just implemented regular AES encryption (they have a FIPS 197 certificate for their AES), but in what appears to be a very insecure way. Why would they choose to do that? Because they had no choice. Here's what I mean:

The challenge for cloud encryption providers like CipherCloud that have a lightweight architecture (no required database, small storage requirements), is that they need the back-end SaaS application (Salesforce, GMail, etc.) to be able to perform all the search and reporting functions on the data as if it were in clear text. To make this possible, you must ensure that the same string gets encrypted the same way every time. As CodesInChaos suggested in an earlier answer, this makes the solution extremely vulnerable to frequency-analysis attacks.

But SaaS encryption implementation has a much larger problem. Searching for exact matches is an easy problem to solve - just send the encrypted value for the match to the SaaS application and you're all set. But that is not the only kind of search you need to support. What happens if a user does a search for all names that begin with John? (e.g. "John*") There are (at least) two options: The first is to store a mapping to every instance of every string that begins with John* in the encryption appliance, then send all the instances of the encrypted text for every mapped string that matches John* to the SaaS application so it can perform the search. That becomes problematic if there are a lot of strings that begin with "John*" - you have to send all those matching strings to the SaaS application in order to make the search work. But imagine a search for John* + Jon* + Jame* + Smith*. You could run out of query parameters pretty easily. It's even worse when running reports.

You also need a mapping infrastructure on the encryption appliance to make this work (a database would be the enterprise-grade way to do this), but CipherCloud does not appear to require a database. Judging from their publicly available documents, this approach is therefore unlikely in their case.
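
To show why this first option scales badly, here is a rough sketch of what the appliance would have to do. Everything in it is hypothetical: the `Appliance` class, the HMAC-based `token` stand-in and the in-memory dictionary are my assumptions, not anything taken from CipherCloud's documents.

```python
import hashlib
import hmac

KEY = b"demo-key-not-a-real-secret"  # hypothetical key


def token(value: str) -> str:
    """Deterministic stand-in for the encrypted form of a whole string."""
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


class Appliance:
    """Keeps the plaintext-to-token mapping that the first option requires."""

    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}

    def store(self, value: str) -> str:
        """Remember the plaintext and return the token sent to the cloud."""
        self.mapping[value] = token(value)
        return self.mapping[value]

    def expand(self, pattern: str) -> list[str]:
        """Turn 'John*' into every token the SaaS application must be asked for."""
        prefix = pattern.rstrip("*")
        return sorted(t for v, t in self.mapping.items() if v.startswith(prefix))
```

The list returned by `expand` grows with the number of stored strings that share the prefix, which is where the query parameter limit starts to hurt.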

But they may have implemented a worse one, because the second way to address searches for "John*" is what they appear to have actually done. This method preserves string order within the ciphertext, such that John becomes XXyzzz, Johnathan is XXyzzzAAddBBaaBB, and Johnson is XXyzzzDdffsss (this is not their algorithm just a representation of the net effect.) That way, a search for John* means I only have to send "XXyzzz*" to the SaaS application in order to properly fulfill the search. But this approach greatly weakens the security of the data. This is because once I deduce that John is XXyzzz, anytime I see a string beginning with those characters I know it is some form of the name "John*" and I can really start attacking the data. CipherCloud claims to use AES, which should not have this problem, so how can they preserve this string order using AES? Well, the first thing to do is not use padding, or to use the same padding everywhere. Yikes! The second thing is to use the same IV (initialization vector, aka nonce) for all strings. Yikes again! Without padding and IV diversity, AES becomes a glorified version of XOR. Who would bet the security of their data on that? (This probably explains why they do not have, and are not even in process to obtain, FIPS 140-2 validation, which pertains to the proper implementation of an approved algorithm.)
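
The net effect can be reproduced with a toy stream cipher that reuses its nonce. This is not AES and not their algorithm: the keystream below is built from SHA-256 only because the Python standard library has no AES, and the key and nonce are made up. The flaw it illustrates is the reuse of the same keystream for every string.

```python
import hashlib

KEY = b"demo-key-not-a-real-secret"  # hypothetical key
FIXED_NONCE = b"\x00" * 8            # the flaw under discussion: it never changes


def keystream(length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a block counter."""
    out = b""
    counter = 0
    while len(out) < length:
        block = KEY + FIXED_NONCE + counter.to_bytes(4, "big")
        out += hashlib.sha256(block).digest()
        counter += 1
    return out[:length]


def encrypt(plaintext: str) -> bytes:
    """XOR the plaintext with the (always identical) keystream."""
    data = plaintext.encode("utf-8")
    return bytes(p ^ k for p, k in zip(data, keystream(len(data))))
```

Here the ciphertexts of Johnathan and Johnson both start with the ciphertext of John, and XORing two ciphertexts cancels the keystream and yields the XOR of the two plaintexts.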

More recent demonstrations of the CipherCloud solution appear to use multi-byte characters in the ciphertext, which makes the patterns harder to see by the naked eye (ok, to eyes used to parsing western character sets) but certainly no harder for a computer to crack.

I'm not sure if they are still there, but there used to be some good videos of their solutions on Vimeo and Youtube, so you can look at those and see for yourself what I'm talking about. I'm sure you can also download whitepapers from their site. I'll leave it to someone else to really dig into the available data and figure out exactly how they are doing what, but it's worth mentioning to any would-be investigators that CipherCloud also appears to be preserving certain punctuation in clear text. (I saw an instance of " I'm " encrypted in a way that preserved the apostrophe!)

As always, but doubly so when it comes to security products, Caveat Emptor! If you are looking at CipherCloud, or any SaaS encryption solution, you'd do well to ask a lot of specific questions and make sure the answers are clear and unambiguous.

My impression is that they don't use order preserving encryption everywhere, and in the case of names they simply split it in parts, which wouldn't allow prefix-matches. But the available information is so sparse that it's hard to say anything concrete.
– CodesInChaos, Aug 26 '12 at 15:33

1

Responding to one of the statements in the DMCA letter that addresses this post in particular: CipherCloud claims that the above statement that they were not pursuing FIPS 140-2 validation is "patently false." They then provide a link to the NIST FIPS 140 Modules In Process list as proof. I stand by my statement, because on the day I made it, I checked the list and CipherCloud was not on it. If they can provide proof that they were in process on Aug 26, 2012, then I will remove that statement from my post.
– adrenalion, Apr 23 '13 at 2:49

I haven't posted in a while, so long in fact that the email tied to my Stack Exchange account is no more, I forgot my StackEx password, and I had to create a new account. (I'll leave it to the reader to decide if this is the real me.)

But I did want to follow up here, because there were some unanswered questions from my last post and the follow-up posts from others. Since I wrote the above post, I had been wondering myself how this searchable encryption could actually work without being incredibly weak from a security standpoint. As it happened, I was at the RSA Security 2013 conference this week where Ciphercloud was exhibiting. In between sessions I had time to visit their booth to learn more.

They do claim to do "military grade encryption", and it does appear that they can use third-party FIPS 140-2 encryption modules. However, in the demonstration I was given, where they were encrypting data in a SalesForce setup, the encryption was definitely NOT using FIPS 140-2 or anything close. In fact, I could see on their large demo screen the exact issues I had expected to see with their encryption algorithm, plus some things that just made me shake my head.

For example, it turns out that they are indeed preserving clear-text patterns in their ciphertext. Searching for "John" is easy if it is encrypted the same way (eg "XXyyZ123") everywhere. But they also appear to individually encrypt each word within a string, such as you would see in an Account Name field. I know this because they showed their demo of a side-by-side comparison of clear text and encrypted Accounts. There were two Accounts with "United Oil & Gas" in the name. Both the encrypted names were the same. That means they are using the same key, nonce (IV), and padding for the Account Names. Since the whole point of encryption is to promote randomness in the ciphertext, this is a pretty weak, non-random implementation. Would you entrust your data to what amounts to XOR? I sure wouldn't.

But here is the part that had me shaking my head, mainly because it takes almost zero crypto cracking skills to determine the true value of the data: They appear to have issues, for reasons I cannot completely fathom, encrypting the punctuation characters in the string. In the example they showed me, "United Oil & Gas" was encrypted as something like "fgt^e3s3 SD72d & 3edf" (Note: they also have prefixes and suffixes that wrap their encrypted strings, but I have not included them because they appeared to be pretty consistent and may be there to identify the strings as encrypted, but would do nothing to protect the data.)

So, if you are looking for a customer named "United Oil & Gas", you have a pretty simple way to narrow down which records that could be - just search for the "&" character, and narrow those results to the ones where it appears between the second and third words. Then, in that list, look at the word lengths in the name, and the strings with the short second and third words are your best bet. This is in part because in "United Casualty & Life", the word "Casualty" would have longer ciphertext than the word "Oil". (Remember they are using the same padding to make this all searchable.) The bottom line is that encryption hasn't really protected the data here. Cost with no benefit.
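
The narrowing step needs no cryptanalysis at all, as this sketch shows. I'm assuming for simplicity that each encrypted word has the same length as its plaintext; the record strings in the usage note are invented, and `shape` and `candidates` are hypothetical helpers.

```python
def shape(text: str) -> tuple:
    """Word lengths, with stand-alone punctuation kept as-is."""
    return tuple(
        len(part) if any(ch.isalnum() for ch in part) else part
        for part in text.split()
    )


def candidates(target_plaintext: str, encrypted_records: list[str]) -> list[str]:
    """Keep only the records whose visible shape matches the target name."""
    wanted = shape(target_plaintext)
    return [rec for rec in encrypted_records if shape(rec) == wanted]
```

For example, `candidates("United Oil & Gas", ["xk3f9q Zp2 & r8T", "qq81mz Lp0sW2xy & Hd7e"])` keeps only the first record, because the second has the shape of a name like "United Casualty & Life".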

But it gets worse: Once you knew you had the ciphertext for each of the words “United”, “Oil”, and “Gas”, you could just search for matches for those ciphertext patterns, and you would know all the Account Names (and perhaps all the other fields as well) that had those words in them, as well as the placement of the words in the multi-word strings stored in those fields.

But then, it may be even worse: You may even be able to derive new words based on the words you have already derived. This is because for those three clear text words, you now know the ciphertext patterns for any words that begin with those strings. (Full disclosure: Here is where I am speculating a bit because the guy showing me the demo couldn't tell me the AES modes that they use. AES always works on 128-bit blocks, that is 16 bytes at a time, so CBC with a fixed IV would only preserve a shared prefix in whole-block steps. Preserving patterns one or two characters at a time would need something closer to a stream mode, such as CTR or CFB, run with the same nonce every time.) With such a mode, the same key and nonce will preserve the patterns of the strings at the beginning of words. So "United" and "Un" would share two common character patterns to start their ciphertext, and "United" and "Unit" would share the first four. So if you derive "United" you could find any word that also began with "Unit", "Un" etc.

Using the example I saw at Ciphercloud’s booth at RSA, you could use that pattern preservation to find out any other account with the word "United", or "Oil", or "Gas", as well as any words that began with the same character strings as "United", "Oil", or "Gas."

Now, I know this was a demo, and the guy showing it was probably a marketing guy with no concept of security. But this was the 2013 RSA Security show. You are going to be viewed by people like me who know a thing or two about encryption, and poke holes in the shoddy stuff. I will also say, in their defense, these shows are coordinated by their marketing department and may not have the most up to date demonstration materials. So, perhaps they could have shown a better (newer?) implementation of their product that would have satisfied me or any security professional.

But the fact remains that they did not. And at one of the largest, most influential security shows in the US, if not the world, you shouldn't put up for demonstration something so easily defeated.

@adrenalion Welcome back! Note that while moderators can't merge your accounts, it either is possible for you to do so yourself (for the two new accounts used on this post) or request help from Stack Exchange to merge the original account. Use the merge user profiles link on the help page.
– Paŭlo Ebermann♦, Mar 3 '13 at 21:29

They are not using any exotic encryption. In fact, based on data, it appears it's just 1:1 mapping (tokenization) after lowering the case on plain text data. I don't know about others but to me this pattern just stood out when I had a look at the demo video. To see it yourself, check their publicly visible demo video. Hit HD, go full screen to 2:19. You will find that identical clear text words all have identical cipher text (eg: "to"). Where clear text differ only in case (eg: "New" and "new"), the only difference in the cipher text is at the end, where the case difference appears to be encoded.

Basically that scheme ends up with a 1:1 mapping of lower case words with case encoding at the end. So it doesn't matter if they circle the galaxy, suck all the energy of a star, perform AES256 or just rotate bits. At the end of the day it's just 1:1, at lower case word level! So you can run the entire "encrypted" conversation into a statistical analyzer and based on the frequency of regular English words uncover that 1:1 mapping. If you add the logic that word level patterns exist ("The the" is extraordinarily rare vs "extra extra") i.e. Markov chains modelling - then you need even fewer copies of "encrypted data" to peel off the security.

Note that this mapping is essential to the search and sorting functionality they describe in their marketing documents. Why? Because searching, by definition, relies on a fixed relationship between the search word ("john") and the search space ("ann", "beth", "john" and "zoe"). If one can't map the "search john" to the "search space john", then one can't find it because one can't detect a search hit itself.

Personally, I would prefer traditional encryption over such a tokenization scheme, because

To me, there is no security in 1:1 mapping. Forget mapping; just being able to visually see patterns in the data (either by unaided eye or by a program) leaks informational content and is therefore a security weakness.

Framing it as secure means IT managers might make assumptions and relax policies they would otherwise not allow => introduces new holes in corporate security

If the additional overhead of buying their gateway boxes/maintenance comes with such security weaknesses then it's a waste of the organization's money and time

Vendor lock-in - Unless they offer migration tools, their box/gateway is the only one that would know these 1:1 mappings, so your official upgrade and data extraction path will necessitate performing frequency analysis (i.e. hacking) yourself.

UPDATE:

In response to this DMCA notice, StackExchange has temporarily taken out the image used in this answer (and elsewhere on this entire topic). So I've reworded the answer to still make sense in the absence of that image/link.

All information is based on publicly available data and used within guidelines of US copyright law. This includes, but is not limited to, sections 107 through 119 of the copyright law - Title 17, US code. This also includes, but is not limited to, the Fair Use Doctrine.

Although it is blatantly obvious, the above information is purely based on my knowledge of security, the state of the art in security science and in response to the question itself. They do not reflect the beliefs of my wife, kids, friends, employers, companies, anything or anyone else or even myself in any other capacity other than that of an individual with knowledge of security, for the purposes of a purely technical discussion in response to a technical question. I am not representing any other party here, other than my own personal self and my own personal, technical abilities.

I also watched the video (thanks Sid, for the link) and after looking at it, it reveals some of the other methods that Ciphercloud appears to be using to preserve search. Nothing appears to be an implementation of any sort of homomorphic encryption.

I snapped a copy of one screen after the response from John is entered and encrypted, and have attached an image below (apologies for the crude highlighting). Look at the word "meet" in John's post and then "meet-up" in the first post from Sophie. The pattern of ciphertext for the string "meet" is the same in both, which would be required if you were to perform searches by encrypting the user input of "meet" and sending the ciphertext to the cloud to actually perform the search.

I have not had time to fully explore this, but note that in "meet-up" the hyphen is preserved in the clear within the ciphertext. I suspect that this is because there is a requirement to enable search for the word "up", which basically requires setting the IV back to one of its static values like the one used when "meet" was encrypted or perhaps the one (assuming that there are any other IVs) used to encrypt other instances of "up". This is the only way to guarantee that the suffixed "up" will match the singular instance of "up".

I didn't highlight it, but you can also see that terminating punctuation such as question marks is preserved in the clear. Again, if you want to perform an exact match search for "meet", you need to strip the extra character because the ciphertext for "meet?" would be different from that of "meet", so the search would not return results that a human would expect.
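
Here is a small sketch of the kind of preprocessing that would produce what the video shows. It is my guess at the net effect, not their implementation: the regular expression, the HMAC-based `token` stand-in and the key are all assumptions.

```python
import hashlib
import hmac
import re

KEY = b"demo-key-not-a-real-secret"  # hypothetical key


def token(word: str) -> str:
    """Deterministic stand-in for the per-word encryption."""
    mac = hmac.new(KEY, word.lower().encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:16]


def protect(message: str) -> str:
    """Encrypt runs of letters and digits; leave everything else in the clear."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: token(m.group()), message)
```

With this, "meet-up?" and "meet" both contain the same token for "meet", so an exact match search works, but the hyphen and the question mark stay readable for anyone looking at the stored data.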

[image removed due to DMCA request]

But, the implication here from a security perspective is that if I am able to plainly see punctuation such as hyphens, and preservation of patterns in the ciphertext is so critical that I have to strip (and then reveal!) trailing punctuation, then an attacker is provided a head start in breaking down the encryption. If you are not promoting randomness in your ciphertext you are not encrypting. What Ciphercloud appears to be doing is not random, therefore it is not truly encryption, and certainly not homomorphic encryption.

So, the answer to the original question is that Ciphercloud is NOT doing homomorphic encryption.

I don't know how CipherCloud works. However, a related question is: How could you encrypt data in a database, in a way that allows you to achieve these goals? What are the best cryptographic techniques currently known, for that goal?

As it happens, that question has a good answer. Take a look at CryptDB, a system built by MIT researchers to encrypt all the data in your database while still allowing your application to manipulate the data. In their system, the application can execute SQL queries on the encrypted data (even though it is encrypted!) and do some limited computation on the data.

CryptDB uses a combination of techniques that have been developed by cryptographers over the past decade or two, to achieve these goals. They show that the result is practical, with good performance and ability to use it with existing systems (like phpBB). It's a brilliant system, and a significant advance for the field. Read their research paper for more on how they do it:

In summary, the techniques in CryptDB are what Ciphercloud ought to be doing. I have no clue whether Ciphercloud is actually doing that (you'd have to ask Ciphercloud that), but CryptDB represents about the state of the art in this area right now.

Those techniques have severe security limitations (though they provably meet them). For example, the comparison search features (which are needed for efficient indexes) leak half the bits of a message. cc.gatech.edu/~aboldyre/papers/operev.pdf
– imichaelmiers, Apr 20 '13 at 22:45

CipherCloud's website now clearly states, here, that CipherCloud DOES NOT use homomorphic encryption.

This also states that CipherCloud DOES NOT implement 1:1 mapping or ECB mode in any customer deployment. Other statements come close to acknowledging that CipherCloud's early demos did that, citing the wish to illustrate the functionality, features that were not yet implemented, and the need to avoid disclosing technology not yet patented.

I'm also seeing that CipherCloud now "wholeheartedly apologize" for some of the chilling effect of the DMCA takedown notice brought by its legal team.

I'm eager to see a description of the feature more precise than "recognized standard for cloud information protection"; and in particular, for the (reverse-)proxy setup in front of a database application, what kind of restriction (if any) there is on searchability of terms enciphered without 1:1 mapping.

Update:

Ciphercloud's FAQ indicates that it does word-by-word encryption to support search. From the FAQ:

How is encrypted data still searchable?

CipherCloud provides granular control over the level of encryption and search-ability for specific pieces of information. Data can be encrypted on a per-field or per-word basis with industry standard AES 256-bit encryption.

On a related note, IBM has a patent on one Homomorphic Encryption scheme: U.S. Patent #8,565,435, "Efficient implementation of fully homomorphic encryption".

Considering IBM is competing in the same space, I would bet that IBM did not license the technology to a competitor.

That does not mean CipherCloud did not invent their own homomorphic encryption scheme. But I suspect there's a greater chance of CipherCloud licensing from IBM than of CipherCloud inventing an HME or SWHE scheme. (But who knows, maybe CipherCloud has an R&D department that rivals IBM Research).