The vault stores each plaintext encrypted with an algorithm chosen so that detokenization
of a single index is computationally possible but expensive.
Let the cost of detokenization be X per entry.

An attacker who gains access to the vault should find it computationally too expensive
to detokenize the entire vault.

A variation: the attacker may have access to a limited list of (index, plaintext)
pairs that they injected into the vault before they compromised it.
Let the length of this list be Y.

What is the encryption algorithm that should be used for this use case? Can it be designed so that X is configurable (given Y=0)? Can detokenizing the entire vault be computationally hard regardless of Y, or must we assume Y = 0?

Edit:

I've modified the original question. Instead of tokenizing a 6 decimal digit plaintext, the question is now about tokenizing a 16 decimal digit credit card number.

Both a legitimate user and a very skilled attacker have access to partial credit card numbers. A partial credit card number consists of the first 6 plaintext digits, the last 4 plaintext digits, and the index received from the vault. In order to find the remaining middle 6 digits, the legitimate user would authenticate and detokenize, while the skilled attacker would have three options:

Perform a limited number of tokenizing operations without compromising the vault.

Compromise the vault and perform 50,000 tokenizing operations per plaintext.

Compromise the vault and perform one detokenization per plaintext.

Additional information:

Many credit cards share the same first 6 digits.

Whether the plaintext is the complete 16-digit number or only the middle 6-digit number is part of the design.

The index length is part of the design.

Once the vault is compromised the attacker has access to the vault storage/memory which contains everything except one-time keys that were properly erased. The same attacker has also compromised the detokenizer's private key stored elsewhere if there are such keys.

Edit 2:

Removed the requirement that the index is random. It just must not be some hash of the plaintext, to prevent brute-forcing the index without compromising the vault.

Changed the brute-force tokenization bound per plaintext from 100,000 to 50,000.

Obviously the plaintext could be encrypted with a public key, assuming that only the authorized detokenizer has the private key. Alas, if a skilled attacker has enough resources to compromise the vault (by hacking into the vault machine), the attacker has enough resources to compromise the private key (by hacking into the machine containing the private key).
Assuming the hack is discovered in a timely fashion, the cards would be canceled before the attacker could sell them on the black market.
For that purpose, the detokenization of the entire vault must be so hard that it would delay the attacker for a long time, or would require vast amounts of compute resources to make it non-economical in the first place.

Tokenization is expected to be computationally hard, so that 50,000 tokenization operations per credit card number would be non-economical or would take too long (configurable).

Detokenization is expected to cost more than 1 and less than 50,000 tokenization operations (configurable).

Given a procedure that did the above, wouldn't one attack be to go through all 1,000,000 possible plaintexts and insert them into the vault? By rule (1), if a plaintext is already in the vault, tokenizing it returns the same index. Won't this effectively detokenize the entire vault without much expense (hence violating requirement (4))?
–
poncho Feb 6 '14 at 5:52


The vault will most likely store only the CCN (16 decimal digit PAN) without the expiry date or service code. I suppose this may change in the future, but it is a reasonable assumption for now. I am updating the question from 10^6 to 10^5.
–
itaifrenkel Feb 7 '14 at 14:13

Why does the new index need to be random?
–
Ricky Demer Feb 8 '14 at 2:54

It does not have to be random; it just must not contain information from the plaintext itself. I'll update the question.
–
itaifrenkel Feb 8 '14 at 6:25


As fgrieu suggests, itaifrenkel should probably think again and see if he can change the situation. Placing the tokenizer/detokenizer inside a smart card may be feasible and may provide real security.
–
K.G. Feb 10 '14 at 20:44

2 Answers

I am assuming that the vault shall store arbitrary-length messages and associate with each message a token consisting of six decimal digits. Otherwise, as has been noted (see below), the problem is probably either impossible or trivial.

I interpret your requirements to mean that the detokenization algorithm is also available to an attacker that has gotten access to the vault, which means that the cost of detokenizing the entire vault is at most $nX$, where $n$ is the number of items stored in the vault and $X$ is the cost of detokenizing one index.

This means that the best you can do is to balance your $X$ such that $X$ is feasible, but $nX$ is economically infeasible. (Note that $n$ is very small, so it seems unlikely that the cost $X$ can be feasible for you while the cost $nX$ is infeasible for an attacker.)

(Note that if the vault's detokenization algorithm is not available to the attacker, your problem has a very simple solution using deterministic public key encryption: Encrypt the message using the detokenizer's public key. If you already have the ciphertext, you're done. Otherwise, pick a fresh index and store the index-ciphertext pair. The detokenizer simply decrypts the ciphertext. This is the best you can do unless the vault's tokenization algorithm also isn't available to the attacker, in which case everything is trivial.)

The system consists of an interactive algorithm $V$ and an algorithm $D$. The correctness requirement can then be stated as follows:

We can send $m$ to $V$, in which case $V$ responds with an integer $i$. As long as $V$ has been sent less than $10^6$ distinct messages, the following requirement is satisfied: For any $m'$ previously sent to $V$ to which $V$ replied with $i'$, we have $i=i'$ if and only if $m=m'$.

We can send $i$ to $V$, in which case $V$ responds with $z$ or $\bot$. The response satisfies the following requirement: If $V$ previously replied with $i$ when sent message $m$, then $D$ will output $m$ on input of $z$, using expected time $X$.

We can send compromise to $V$, in which case $V$ responds with any secret keys it may have and a list $(i_1, z_1), (i_2, z_2), \dots, (i_n, z_n)$. If $V$ was sent the messages $m_1, m_2, \dots, m_k$ with replies $j_1,j_2, \dots, j_k$, then $k=n$, $i_1=j_1$, ..., $i_n=j_n$, and $D(z_1) = m_1$, $D(z_2) = m_2$, ..., $D(z_n) = m_n$.
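
A toy model of the interactive algorithm $V$ sketching only the correctness requirements above (the hard-to-invert encoding $z$ is replaced by the message itself, so $D$ is the identity and the cost-$X$ property is deliberately absent):

```python
class ToyVault:
    """Toy V: equal messages get equal indices, distinct messages get
    distinct indices, and `compromise` dumps every (index, z) pair.
    Here z is simply the message, so D(z) = z; a real scheme makes
    recovering m from z cost X. There are no secret keys in this toy."""

    def __init__(self):
        self.by_message = {}   # m -> index
        self.entries = []      # index -> z

    def tokenize(self, m):
        if m in self.by_message:          # the same m must yield the same index
            return self.by_message[m]
        i = len(self.entries)
        self.by_message[m] = i
        self.entries.append(m)            # z := m in this toy
        return i

    def detokenize(self, i):
        # D is the identity here; the real D takes expected time X.
        return self.entries[i] if 0 <= i < len(self.entries) else None

    def compromise(self):
        return list(enumerate(self.entries))
```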

The security requirement is the following:

If $V$ has been sent $n$ distinct messages, of which $k$ have been sent by the attacker (or can be predicted with reasonable effort) and the remaining $n-k$ are unpredictable to the attacker, then no adversary sending compromise to $V$ can recover the remaining $n-k$ messages with expected cost significantly smaller than $(n-k)X$.

The scheme requires a large family of groups $\mathcal{G}$ such that we can efficiently sample a group $G$, a generator $g$ and a verifiably random element $y$ in the subgroup generated by $g$, such that $(G,g)$ are suitable for ElGamal encryption, and the best algorithm for computing logarithms smaller than $T$ in $G$ requires expected cost $\sqrt{T}$. (All reasonable assumptions.)

Detokenization of a stored entry then proceeds as follows:

Compute the discrete logarithm of $x$ to the base $g$. (Expected cost $X$.)

Compute $K = w y^{-r}$.

Compute $m = \mathcal{D}(K, c)$.

(Optional) Verify that $H(m) = K$.

This scheme is correct, and a five-minute analysis suggests that it also satisfies the security requirement. The only minor subtlety in the scheme is that every entry must have a fresh group, otherwise algorithms for computing many discrete logarithms in one group could allow the adversary to decrypt with expected time smaller than $(n-k)X$.

I would not actually use this scheme without a proper security analysis.

A few minor notes: The families of groups can be either based on finite fields or elliptic curves, where the latter would probably involve point counting and therefore be moderately expensive. In either case, naïve ElGamal is probably not the right scheme to use, some variant of DHIES or ECIES or something would be better.

You can make tokenization cost $T$ work as follows: Let $K' = H(m)$. Use $K'$ to generate an elliptic curve of prime order $\approx T^2$ and two verifiably random points $P$ and $Q$ on the curve. Compute the discrete logarithm $U = \log_P Q$. Then let $K = H(m,U)$. Another (simpler) option (as was pointed out in another answer) is to simply use an expensive hash function such as PBKDF2 or scrypt with appropriate parameters.
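
A sketch of the simpler option, using the standard-library scrypt binding: the tokenization work factor is tuned through the scrypt cost parameter `n` (the salt and parameter values below are illustrative assumptions):

```python
import hashlib
import time

def expensive_kdf(plaintext: bytes, salt: bytes, n: int = 2**14) -> bytes:
    # scrypt cost grows with n (the CPU/memory cost parameter), so the
    # tokenization work factor T is configured by choosing n.
    return hashlib.scrypt(plaintext, salt=salt, n=n, r=8, p=1, dklen=32)

start = time.monotonic()
K = expensive_kdf(b"4111111111111111", b"vault-unique-salt")
elapsed = time.monotonic() - start   # grows roughly linearly in n
```

The derivation is deterministic, so the same plaintext always yields the same key $K$, which is what the dedup-by-cryptogram logic relies on.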

Would it be possible to make the tokenization also computationally hard (configurable), to mitigate the fact that the plaintext is not arbitrary-length?
–
itaifrenkel Feb 7 '14 at 20:54

I think so. I've added a paragraph at the end.
–
K.G. Feb 8 '14 at 15:48

"We can send compromise to V, in which case V responds with a list (i1,z1)...". When compromised, V would also respond with any "secret" keys it needs to perform the deterministic symmetric encryption. How would that change the security analysis of this solution?
–
itaifrenkel Feb 8 '14 at 23:13

Yes, $V$ should respond with any secrets. I've fixed it. There are no long-term secrets in the solution, so it doesn't matter.
–
K.G. Feb 9 '14 at 12:52

Given the nature (Credit Card Numbers) of the 16-digit decimal numbers, they include one Luhn check digit, and it is trivial to reconstruct any unknown digit from the 15 others. With their first 6 digits and last 4 digits assumed known, the remaining 6 digits have at most as much entropy as 5 decimal digits, that is $b=5\cdot\log_2(10)\approx16.6$ bit. The following generalizes to a short plaintext (say at most 100 bytes) with an unknown portion taking $2^b$ equally likely values trivially determinable from the rest of the plaintext, and would be adaptable to $b$ bit of entropy and a publicly known distribution, like a bias towards small values.

An unavoidable limitation is that, with access to the vault's internals (even without the detokenization credentials), partial information on a plaintext in the vault, and the matching index, an attacker can repeatedly query candidate plaintexts (say, sequentially from a random starting point) into a simulation of the vault, and check if the simulation returns the index. Expected cost is that of $2^{b-1}$ tokenizations. Expected time is $2^{b-1}/n$ queries with $n$ simulations of the vault running in parallel at the same speed as the vault, for $n\ll2^b$. Thus one query to the vault must require significant work $W_q$ on average. If we are willing to wait 1 second per query to the vault, and have the vault consume 10 kW (about the design power of the NEMA 14-50 plug sometimes used for charging an electric car) during that second (bringing the cost of electricity alone to $\$0.0004$ per query at my home's rate), an attacker using the same hardware and rate could recover a plaintext every 0.58 days, at a cost of $\$20$ in electricity per plaintext; and we should fear the adversary is significantly more efficient than we are. That is not satisfactorily safe by normal cryptographic standards, but better than nothing.
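
The cost figures above can be rechecked with a short calculation; the electricity rate of roughly $\$0.15$ per kWh is an assumption:

```python
import math

b = 5 * math.log2(10)        # entropy of the unknown middle digits (~16.6 bits)
queries = 2 ** (b - 1)       # expected guesses: half the space, i.e. 10**5 / 2
seconds_per_query = 1.0      # chosen vault rate limit
kilowatts = 10.0             # vault power draw while answering a query
usd_per_kwh = 0.15           # assumed electricity rate

days = queries * seconds_per_query / 86400
usd = queries * kilowatts * seconds_per_query / 3600 * usd_per_kwh
```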

Another unavoidable limitation is that, with access to the vault's internals, the detokenization credentials, and an index, an attacker can recover the plaintext for that index by the method the vault uses for detokenization. Thus one detokenization must require significant work $W_d$ on average. Say, if we are willing to spend 100 seconds per detokenization, and have the vault consume 10 kW during that time, an attacker using the same hardware and rate can do the same and recover each plaintext at that cost; and again we should fear the adversary is significantly more efficient than we are. That is not safe by any stretch of imagination, and I thus question the rationality of considering an adversary that simultaneously holds detokenization credentials, valid indexes, and partial information on plaintexts, as the question currently does.

Thus I first describe a system that disregards an adversary with detokenization credentials, but I think matches all the requirements as currently worded (including not holding a detokenization private key outside the vault, if that still allows a passphrase unknown to the adversary as detokenization credentials). I'll then sketch how to modify that to add any feeble resistance we can have against an adversary with detokenization credentials.

At initialization:

The vault is given a passphrase $P$, which subsequently will be required only for detokenization.

The vault stretches $P$ into a 256-bit key $K$ using scrypt and constant salt $S$ unique to the vault; the parameters determining the amount of work (iterations, memory, number of threads/cores) are set for $W_d$.

The vault deterministically generates an RSA key $(N,e,d)$, using as the necessary source of random bits a CSPRNG seeded with $K$.

The vault stores the public key $(N,e)$ and zeroizes $P,K,d$ and any other intermediary result.

The vault initializes its internal variable $I=0$, and an internal database to empty (that will hold one cryptogram per plaintext stored).
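
A minimal sketch of the initialization steps above; a SHA-256 fingerprint stands in for the deterministic RSA key generation (the real scheme drives RSA keygen from a CSPRNG seeded with $K$), and the scrypt parameters are illustrative:

```python
import hashlib

def initialize_vault(passphrase: bytes, salt: bytes) -> dict:
    # Stretch P into a 256-bit key K; the scrypt parameters set the work W_d.
    K = hashlib.scrypt(passphrase, salt=salt, n=2**14, r=8, p=1, dklen=32)
    # Stand-in for deterministic RSA key generation from a CSPRNG seeded
    # with K: only a public fingerprint is kept, and K itself is discarded.
    public_part = hashlib.sha256(b"public" + K).digest()
    return {"public": public_part, "next_index": 0, "db": {}}
```

In the real scheme the vault would additionally zeroize $P$, $K$, $d$ and all intermediary results after this step.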

At query:

The vault deterministically and slowly turns the plaintext it receives into a cryptogram as follows:

The vault applies scrypt with the plaintext as password, and some constant salt $S'$ unique to the vault, yielding a 256-bit result $R$; work parameters are set for $W_q$.

The vault enciphers the plaintext into the cryptogram using RSAES-OAEP of PKCS#1v2, using as the necessary source of random bits a CSPRNG seeded with $R$.

That cryptogram is searched in the database:

If absent, it is stored in the database at index $I$; $I$ is incremented; and the former $I$ is returned as the index for the plaintext just stored.

If present, its index is returned.
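
The query flow above can be sketched as follows; a hash stands in for the RSAES-OAEP cryptogram, so only the slow deterministic derivation and the dedup-by-cryptogram logic are shown:

```python
import hashlib

class VaultQuery:
    """Sketch of the query step. The real scheme enciphers the plaintext
    with RSAES-OAEP driven by a CSPRNG seeded with R; a hash stands in
    for that cryptogram here, so only the control flow is illustrated."""

    def __init__(self, salt: bytes):
        self.salt = salt      # constant salt S' unique to the vault
        self.db = {}          # cryptogram -> index
        self.entries = []     # index -> cryptogram

    def query(self, plaintext: bytes) -> int:
        # Slow, deterministic derivation; work parameter W_q set via n.
        R = hashlib.scrypt(plaintext, salt=self.salt,
                           n=2**14, r=8, p=1, dklen=32)
        cryptogram = hashlib.sha256(b"cryptogram" + R).digest()  # stand-in
        if cryptogram in self.db:
            return self.db[cryptogram]        # same plaintext -> same index
        index = len(self.entries)             # store at I, then increment I
        self.db[cryptogram] = index
        self.entries.append(cryptogram)
        return index
```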

At detokenization:

The vault accepts the index to detokenize, and alleged passphrase $P$.

The vault stretches the alleged $P$ into alleged $K$ as during initialisation.

The vault deterministically generates the alleged RSA key $(N,e,d)$ from $K$ as during initialization.

If the alleged $(N,e)$ matches the stored $(N,e)$, then

if the index to detokenize is less than current $I$, then

the vault fetches the cryptogram at the index, deciphers it using $(N,d)$, and outputs the plaintext.

The vault zeroizes $P,K,d$ and any other intermediary result.
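
A sketch of the detokenization passphrase check above. As in the initialization sketch, a SHA-256 fingerprint stands in for regenerating and comparing the RSA key $(N,e)$; this stand-in also stores plaintexts directly, which a real vault must never do:

```python
import hashlib
import hmac

def stretch(passphrase: bytes, salt: bytes) -> bytes:
    # Same scrypt stretch as during initialization; parameters set W_d.
    return hashlib.scrypt(passphrase, salt=salt, n=2**14, r=8, p=1, dklen=32)

def detokenize(vault: dict, index: int, alleged_P: bytes, salt: bytes):
    K = stretch(alleged_P, salt)
    # Stand-in for regenerating the alleged (N, e, d) from K and comparing
    # the alleged (N, e) with the stored public key.
    alleged_public = hashlib.sha256(b"public" + K).digest()
    if not hmac.compare_digest(alleged_public, vault["public"]):
        return None                 # wrong passphrase: refuse
    if not 0 <= index < len(vault["entries"]):
        return None                 # index never issued
    # Real scheme: fetch the cryptogram at the index and decipher with (N, d).
    return vault["entries"][index]
```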

As pointed out in a comment, it is enough to use a moderate $N$ when $b$ is small, since the system can't be very safe anyway. Given the use of RSAES-OAEP, $e=3$ is safe and allows spending more effort on scrypt (but security authorities frown at $e=3$, thus we might bow and use $e=2^{16}+1$).

The system is such that it is twice as safe for each extra unknown bit of information in the partial plaintext, which is nice to have. I do not see that the Y known plaintexts, as in (5) of the question, help more than by allowing the attacker to weed out the records corresponding to these Y plaintexts.

If we really want to present some symbolic resistance to an adversary with detokenization credentials, there are options. I'll assume $W_d/W_q\ll2^b$ (in any system, the contrary would be useless against any adversary also holding indexes and partial information on plaintexts, as assumed in the question). Sketch of one possibility:

We modify query by enciphering, rather than the full plaintext, the plaintext excluding the secret portion $M$ (here of 6 decimal digits), which we replace with $M\bmod\lceil2^{b+1}\cdot W_q/W_d\rceil$ or other suitable hint giving $b+1-\log_2(W_d/W_q)$ bit of information about $M$.

We modify detokenisation to recover the full plaintext by trying the about $2\cdot W_d/W_q$ candidates that remain (in random order or at least starting from a random point to avoid timing attacks), thus with expected work about $W_d$.

We modify initialization and detokenisation to stretch $P$ into $K$ with work only a fraction of $W_d$.
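
The hint arithmetic can be illustrated with concrete (assumed) work factors; note that $2^{b+1}=2\cdot10^5$ exactly here, which the code uses to avoid floating-point rounding:

```python
import math

b = 5 * math.log2(10)      # ~16.6 bits of entropy in the middle digits
Wq, Wd = 1, 100            # assumed work factors for query and detokenization

# Hint stored instead of the secret portion M: M mod ceil(2^(b+1) * Wq / Wd),
# with 2^(b+1) = 2 * 10^5 for this choice of b.
modulus = math.ceil(2 * 10**5 * Wq / Wd)

# Brute-force recovery at detokenization tries every 6-digit M consistent
# with the hint (in practice starting from a random point).
hint = 123456 % modulus
candidates = [M for M in range(10**6) if M % modulus == hint]
```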

Many improvements seem feasible, but I lack the energy to do more than list some:

random-like indexes as in the question initially;

reducing the memory used in the vault, e.g. by using a public-key cryptosystem with shorter cryptograms than RSA with small plaintext, or perhaps radically by creeping ciphertext in the indexes;

improved security against an adversary with detokenisation credentials but without the random-like indexes, or/and partial plaintext information.

Further, if we turned around the problem and removed the assumption that the vault is insecure, replacing that with say a security-evaluated Smart Card IC with redundant CPUs, or perhaps just an off-the-shelf Java Card or programmable HSM, we could have much enhanced security without drawing kilowatts during operation or requiring too large an investment. In the simplest embodiment:

Initialization chooses an AES key at random; and initializes an 8-digit PIN.

Query accepts the plaintext as 16 bytes in ASCII; waits as long as bearable; enciphers the plaintext using AES; and outputs the 16-byte ciphertext as the index (encodable as a 22-character base-64 string).

Detokenization checks the PIN code as familiar in bank Smart Cards and SIM cards, with an error counter, zeroizing the device after three consecutive failed attempts; accepts the index; waits as much as bearable; deciphers the index; and outputs the result.
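
Since the Python standard library has no AES, the sketch below uses a small HMAC-based Feistel network as a stand-in invertible 16-byte block cipher; a real smart card would use AES exactly as described above, and the key and PAN are illustrative:

```python
import hashlib
import hmac

def _round(key: bytes, i: int, half: bytes) -> bytes:
    # Feistel round function: keyed hash of the round number and one half.
    return hmac.new(key, bytes([i]) + half, hashlib.sha256).digest()[:8]

def encipher(key: bytes, block: bytes) -> bytes:
    # 4-round Feistel network over 16-byte blocks: a stdlib stand-in for
    # the AES enciphering the smart card would actually perform.
    L, R = block[:8], block[8:]
    for i in range(4):
        L, R = R, bytes(a ^ b for a, b in zip(L, _round(key, i, R)))
    return L + R

def decipher(key: bytes, block: bytes) -> bytes:
    # Run the rounds in reverse to invert encipher.
    L, R = block[:8], block[8:]
    for i in reversed(range(4)):
        L, R = bytes(a ^ b for a, b in zip(R, _round(key, i, L))), L
    return L + R

key = b"0123456789abcdef"                    # stand-in for the random AES key
index = encipher(key, b"4111111111111111")   # the token is the ciphertext
```

The index is deterministic (the same PAN always maps to the same token) and detokenization is simply `decipher`, gated in the real device by the PIN check and error counter.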

Note: plaintext can be verified to be valid on Query and Detokenization; that can only help, by limiting the information that an adversary can get.

The PKE should be fast so that more effort can be spent on scrypt. (I would suggest $\operatorname{length}(N) = 1280$ and $e=3$.)
–
Ricky Demer Feb 7 '14 at 21:44

I updated the question to clarify that the private key should be assumed to be compromised too.
–
itaifrenkel Feb 8 '14 at 6:50