Especially in relation to stream ciphers, I frequently read about (sometimes theoretical, sometimes practical) attacks that are able to "distinguish a ciphertext from a truly random stream".

What seems logical to me is that just because a ciphertext looks random, it isn't necessarily random. Looking around, the general consensus is that "ciphertext needs to be indistinguishable from a stream of truly random bits".

This got me thinking: what exactly is "true randomness"? According to (cryptographically related) definitions I found, "true randomness is unpredictable". So far, so good... but that also marks the exact point where I've lost it.

"Unpredictable" would practically mean that we have nothing to compare the ciphertext with, because we can not predict what output "true randomness" might produce. Also, there is a (minimal) chance that "true randomness" might output the exact same series of bits as a ciphertext. Meaning: a ciphertext might read 63F1t49X43 and there's a (minimal) chance "true randomness" might produce exactly the same output 63F1t49X43. No one can tell, because "true randomness is unpredictable".

Not being able to predict any true randomness, how can we compare, distinguish, or even claim that a ciphertext is not truly random? Obviously not by comparing it with "true randomness", as that would be impossible due to "true randomness being unpredictable".
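Just to put a number on that "minimal chance" (rough arithmetic in Python, assuming a 10-character alphanumeric string like the example above):

```python
# Rough arithmetic behind the "minimal chance": the probability that a uniform
# random 10-character alphanumeric string equals one fixed string like 63F1t49X43.
alphabet_size = 62          # a-z, A-Z, 0-9
length = 10
p = alphabet_size ** -length
print(p)  # about 1.2e-18: tiny, but never zero
```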

Now, I'm pretty sure cryptography is not philosophy and — as a result — I'm absolutely sure I'm missing something obvious in relation to the cryptographic meaning of "true randomness". Guessing that the details are to be found in the cryptographic definition of "true randomness" leads to my question:

How exactly is "true randomness" defined in the realms of cryptography?

Practically, I guess you could say that I'm not really sure I correctly understand how someone can provide (cryptographically sound) proof to the claim that a series of bits is truly random when "true randomness" is considered to be unpredictable. So, if you think it's not the definition that might be confusing me here, please feel invited to set my head straight by pointing me to whatever I might be interpreting incorrectly.

EDIT

To avoid misunderstandings: when talking about "true randomness", I'm thinking along the lines of "True Random Number Generators" and not "Pseudo Random Number Generators". That's why I'm asking about "true randomness" and not pseudo-randomness.

7 Answers

Randomness is not a property of strings of bits (or characters of any sort). Rather it is a property of the process that generates those strings. However, it is convenient to conflate the string with the thing that produced the string, and thus to speak about strings being “random” or “not random”.

The string 00000, for example, is random if it was the outcome of a "random process", such as a coin being tossed five times and landing on tails five times in a row. Similarly, the string 1,2,3,4,5,6 is random if it was the outcome of rolling a die six times. Note that the random process does not need to be "fair" to be random, though processes that deviate substantially from the uniform distribution are not as useful for cryptographic purposes.

What is a “random process”? As I think about it, a random process is either an indeterministic process (if any actually exist), or a deterministic process for which the entropy of what we don't know (that is germane to the outcome) is greater than the entropy of the generated string. There is a lot we don't know (and can never know, given the Uncertainty Principle) about the state of the flipped coin – e.g. the exact position and momentum of every particle in the coin, the air around it, and the hand that flips it (all of which are germane, in that we presumably would need to know them to accurately predict the outcome of the flip).

If the string 00000 was the outcome of a number-generating algorithm, where the entropy of what was unknown and germane about the algorithm was less than 5 bits (e.g. we knew the algorithm and all but four bits of information about the seed), then the string would be “not-random”. At best it could be “pseudorandom”, meaning computationally difficult to distinguish from random.
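To make the "not-random" case concrete, here is a toy sketch (my own illustration, not a real cipher): a hypothetical PRNG with only a 4-bit seed can be distinguished from a true random source simply by brute-forcing all sixteen seeds.

```python
import hashlib
import os

def toy_prng(seed: int, nbytes: int) -> bytes:
    """Hypothetical toy PRNG: expands a 4-bit seed with SHA-256 (illustration only)."""
    out = b""
    counter = 0
    while len(out) < nbytes:
        out += hashlib.sha256(bytes([seed, counter])).digest()
        counter += 1
    return out[:nbytes]

def could_be_toy_prng(stream: bytes) -> bool:
    """Distinguisher: brute-force all 2^4 possible seeds and compare outputs."""
    return any(toy_prng(seed, len(stream)) == stream for seed in range(16))

print(could_be_toy_prng(toy_prng(11, 32)))   # True: some seed reproduces it
print(could_be_toy_prng(os.urandom(32)))     # False (overwhelmingly likely)
```

With only 4 bits of unknown seed, the entropy of what we don't know is far below the entropy of the output, so the stream is "not-random" in exactly the sense described above.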

My question was neither about randomness in general, nor about "pseudo-randomness" as produced by number-generating algorithms or PRNGs. Don't get me wrong: I appreciate the time and effort you've put into your answer describing "randomness" in general, but I would like to point out that I specifically asked about "true randomness" (as produced by hardware random number generators) and related cryptographic provability.
–
e-sushi♦Oct 5 '13 at 17:29


@e-sushi - "true randomness" = 'random' as I define it. You have some hardware RNG that generates 'true random' bits if and only if you have a hardware RNG that is 1) indeterministic, or 2) deterministic, where the entropy of the germane unknowns exceeds the entropy of the output. Hopefully the distribution of the output is also very close to uniform, so as to be cryptographically useful.
–
J.D.Oct 5 '13 at 17:40


Am I understanding correctly that "true randomness" defines itself by producing a "close to uniform distribution"? If yes, am I also correctly interpreting that a "distinguishing attack" is merely able to prove that a ciphertext is a ciphertext because its distribution is not "truly random" due to it not having a "uniform distribution" (aka "because it is biased")?
–
e-sushi♦Oct 5 '13 at 18:02


@e-sushi - a 'true random' process does not need to have a uniform prob. distribution to be 'truly random'. However if it is supposed to be used for crypto purposes then it should be uniform, or as close to uniform as you can make it. Dist. attacks don't 'prove', so much as enable the Adversary to make a guess with a high probability of being correct. Usually the attack reveals the encrypt. algo. behaving in a way highly unlikely for a truly random process (which usually means it has biased outputs).
–
J.D.Oct 5 '13 at 18:23


@e-sushi: you can take a non-uniformly distributed source and use it to make a uniform output. This is done, for example, in quantum RNGs where they take normally-distributed output and "convert" it to uniformly-distributed output.
–
ReidOct 5 '13 at 18:31
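One classic debiasing technique, the von Neumann extractor, can be sketched in a few lines of Python (an illustration of debiasing in general, not necessarily what quantum RNGs do internally):

```python
import random

def von_neumann(bits):
    """Classic debiasing: read bits in pairs; 10 -> 1, 01 -> 0, equal pairs dropped."""
    return [a for a, b in zip(bits[::2], bits[1::2]) if a != b]

random.seed(1)  # deterministic demo input
biased = [1 if random.random() < 0.8 else 0 for _ in range(10000)]  # ~80% ones

unbiased = von_neumann(biased)
print(sum(biased) / len(biased))      # heavily biased input, around 0.8
print(sum(unbiased) / len(unbiased))  # debiased output, close to 0.5
```

The trick works because, for independent flips of the same biased coin, the pairs 10 and 01 are exactly equally likely; the price is that most input bits are thrown away.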

To be concise, true randomness boils down to the selected data being causally unrelated. That is, if each piece of data is the result of no common cause, then there is no relation by which the rest of the data can be predicted or inferred. So being unpredictable is a consequence of being truly random, but it is the lack of causal relationship that is the determining factor.

I agree, this is my preferred outlook. Randomness is the deliberate information loss of any causal relationship between events. For example, deleting the location and timing of earthquake measurements, or sunspot events.
–
LateralFractalOct 6 '13 at 11:47

That is insufficient. The fact that there was a relationship there at all means it is possible to predict or infer other data. If, however, you take some data from a randomly selected sound, some from a randomly selected earthquake, etc., then your data should be random.
–
TanathOct 6 '13 at 20:31

Think of it this way: Every random event may be causal as far as we know. If so, sources we attribute as random a priori are simply streams of related events with the information regarding the causal relationship deleted. We can say "ah ha but I see a pattern" but this means we didn't delete enough of the causal information packaged with the event. One form of pre-packaged causality that is so common we rarely think about it and is also stubborn to delete (since our own action of deletion requires the selfsame causality) is Temporal Causality.
–
LateralFractalOct 6 '13 at 21:57

(cont.) We usually address the problem of temporal causality obliquely through Brownian Motion a/k/a massively parallel temporal causality - where the causality information is lost through collapsing the temporal reference frames down from N actors to 1 observer. The avalanche in a PRNG works in much the same fashion. If this isn't what you meant in your answer, would you like me to create this as a separate answer? (My earthquake/sunspot example was limited in retrospect as all inherent causality was not removed.)
–
LateralFractalOct 6 '13 at 21:57

Every event appears to be causal, and the universe appears to be deterministic, but that doesn't mean there's a proximal causal relation between any given set of events. If you select data from solar activity and data from ambient noise, chances are they're causally unrelated. Or at least far enough removed as to eliminate any ability to infer one from the other. Think chaos theory & sensitivity to initial conditions. The sources need not be random themselves if your selection removes causal relationship. A seeming pattern can mean nothing here, where if data is deleted it does mean something.
–
TanathOct 12 '13 at 16:35

Randomness is the information loss of any causal relationship between events.

The universe needn't be a clockwork universe for the assumption of pervasive causality to hold - if events are "sticky" and accrue localised causality in the same way that a molecular cloud accretes into stars and planets. The underlying cause of the speed of light might also be the prime inducer of localised causality, but I digress.

Sources we attribute as random a priori are simply streams of related events with all the information regarding their causal relationships deleted.

This is a subtle point and deserves expanding: when we see patterns in a long* stream of data, we are seeing causal information that has been preserved. The most persistent source of causality is temporal causality. Since the order in which we receive events is governed by the propagation speed (max of c) from and between events, localised non-temporal causality is often strongly preserved in temporal propagation, as the shortest path to the observer involves little if any extraneous interference.

The way that we can deliberately (or nature can incidentally) remove temporal causality is to force an observer to collapse multiple frames of reference into a single frame of reference, losing information in the process. For example, with Brownian motion, if the history of every particle is tracked then causality is preserved and all future behaviour can be predicted. If the observer can only observe one particle, causality is not preserved and the influence on that particle by other particles is essentially random.

PRNG avalanches work in much the same way; so the difference between a TRNG and a PRNG is simply that PRNGs have a tiny amount of unknown information (e.g. 1024 bits) while the natural phenomena behind TRNGs have enormous amounts of unknown information (e.g. a state size of 10^24 bits for a cup of hot coffee).

We can't store, transmit or meaningfully brute-force the massive state size of natural phenomena, so TRNGs are considered completely random; provided you choose a natural phenomenon that collapses multiple frames of reference (e.g. "each molecule in the Sun") into a single frame (e.g. "current rate of solar wind").

This is why you need a trust chain for any random number generator you haven't analysed or don't control. I can give you a page of data that looks random to you because it passes every test for uniqueness, repetition and distribution but isn't random to me since I used a PRNG and kept the seed.

* Short streams of data can very easily appear to have patterns, due to ascribing causality where none reliably exists.

The same answer without explanation:

A TRNG is a stochastic process with an extremely large unknown
internal state.

A PRNG is a stochastic process with an extremely small unknown
internal state.

TRNGs are created by picking a proven stochastic process which is
easy to observe the output of.

TRNGs measure natural phenomena, as they already contain extremely large unknowns, or superpose so much state information as to qualify as "extremely large unknowns".
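The trust-chain point above can be illustrated with a short sketch (hypothetical seed and sizes; Python's random module is a non-cryptographic PRNG, used here purely for demonstration):

```python
import random
import statistics

SECRET_SEED = 0xC0FFEE  # hypothetical seed, known only to the generator's owner

def page_of_data(seed: int, n: int = 4096) -> bytes:
    """Generate a 'page' of bytes from a seeded (deterministic) PRNG."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))

page = page_of_data(SECRET_SEED)

# To an outside observer the page looks statistically unremarkable...
print(statistics.mean(page))   # near 127.5, as uniform bytes should be

# ...but whoever holds the seed can reproduce every byte exactly.
print(page == page_of_data(SECRET_SEED))  # True
```

To you the page passes casual inspection; to the seed holder it contains only a tiny amount of unknown information, which is exactly the TRNG/PRNG distinction drawn above.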

@Tanath, if this was you, I assume you are going to comment on the down-vote.
–
LateralFractalOct 13 '13 at 1:28

Dilbert's "999999" fails the test of being incompressible. "1E6-1" is another way to express the same thing, but shorter. Because of the long run of repeated digits, randomness quality tests would generally reject this sequence. Then again, eventually a TRNG will output this sequence (if the sequence is within its output domain). So I completely agree on what appears like patterns in the output of a TRNG. Also, I like the comment about "trust" being a component of a good random number generator.
–
user4982Oct 13 '13 at 10:46
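The incompressibility point is easy to demonstrate with a general-purpose compressor (a rough sketch, not a formal randomness test; zlib stands in for a Kolmogorov-style measure of description length):

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size over original size: low means highly compressible."""
    return len(zlib.compress(data, 9)) / len(data)

print(compression_ratio(b"9" * 1000))        # tiny: "999...9" is highly compressible
print(compression_ratio(os.urandom(1000)))   # about 1.0: random bytes don't compress
```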

In the context of the original question, what you're comparing your stream cipher to is a particular probability model. In that model, each bit has probability 0.5 of being a 1, independently of the bit's position in the string and of any surrounding bits. It's the kind of source you would get if you flipped a fair coin to determine each bit in the sequence.

The reason this matters for a stream cipher: Imagine drawing a keystream from the ideal random source and XOR-ing it into the plaintext to encrypt it. This would never leak any information about the plaintext and would be unbreakable — it's a one time pad.
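That keystream-XOR construction can be sketched in a few lines (a minimal illustration; os.urandom stands in for the ideal random source, and the keystream must never be reused):

```python
import os

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

plaintext = b"attack at dawn"
keystream = os.urandom(len(plaintext))   # one fresh random byte per plaintext byte

ciphertext = xor_bytes(plaintext, keystream)
recovered = xor_bytes(ciphertext, keystream)  # XOR with the same keystream decrypts
print(recovered == plaintext)  # True
```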

Now, suppose instead we draw our keystream from a stream cipher, and suppose there is no way to distinguish this stream cipher's outputs from an ideal random sequence up to some limit in computation and observed keystream.

Now, let's imagine an attack algorithm that uses less than those limits of keystream and computation to break the stream cipher and recover some information about the plaintext. If I had such an attack algorithm, I could always use it to build an algorithm that would distinguish my stream cipher outputs from an ideal random sequence. I would just take the stream cipher output, use it to encrypt some plaintext, and then try my attack against it.

That means we know that if there is no way to distinguish the stream cipher outputs from an ideal random sequence, then there is no attack that recovers information about the plaintext from the data encrypted by the stream cipher.

How do you tell if a ciphertext has the properties of a truly random stream?

How do you tell if a stream actually is truly random?

For the first part, there are many statistical techniques. But the basic question is whether there is any detectable relationship between the plaintext and the ciphertext. If flipping any bit of the plaintext changes each bit of the ciphertext with a 50% chance, that's one good sign.
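That 50%-bit-flip check can be measured directly. In this sketch SHA-256 stands in for the primitive under test (an assumption purely for illustration: it is a hash, not a stream cipher, but the measurement is the same):

```python
import hashlib

def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions where two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

block = bytearray(b"some plaintext bytes")
before = hashlib.sha256(block).digest()

block[0] ^= 0x01  # flip a single input bit
after = hashlib.sha256(block).digest()

ratio = hamming_distance(before, after) / 256  # fraction of output bits flipped
print(ratio)  # a good primitive should land near 0.5
```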

For the second part, I believe it's impossible to tell the difference between a truly random stream and a sufficiently good pseudo-random stream without knowing the implementation details or seeing repetition. No deterministic algorithm can generate true randomness.

Randomness in the realm of Cryptology is defined by the cipher you cannot solve. The process that converts the order of the plaintext to the order of the ciphertext is unknown to you. As far as you are concerned the ciphertext has been put together without an identifiable pattern, plan, system, or connection. It is disordered. To you it is random.

If I can solve the cipher but you cannot, then either I have been lucky or else I can see patterns that you fail to see, which help me solve the cipher. In the latter case, for me the ciphertext is not random. I have an intuition that you do not possess and my observation of the cipher (and probably many other, unrelated things in this world) are different to yours.

I don't think this addresses the correct problem. It is understood by the questioner that we test a primitive by comparing it to a truly random source, which is what you are discussing. This question is about defining just what a truly random source is.
–
figlesquidgeApr 25 '14 at 9:40


I don't agree with you. A 'truly random source' no more exists than 'a truly cool day'. When I lived in the Tropics I defined a cool day quite differently to when I lived in Europe.
–
user2256790Apr 25 '14 at 11:18

Cryptography might not be philosophy, but the concept of true randomness is somewhat within that purview. It might be useful to cross post this in a philosophy forum and get their perspective.

PRNGs and TRNGs blur together if the PRNG is good. Some believe that *nix's /dev/random is pretty good. Some have passed what science has decided are exacting tests (like DIEHARD and the FIPS suites) for true randomness. So if a PRNG passes a true-randomness test, who is to say that it isn't truly random? And how do you formulate such a test?

You can't use the argument that if a computer algorithm generated the numbers they're not truly random. All digital random numbers of all types are generated by computer, be they from earthquake activity, photon counting or thermal noise. Unless you have a random number tree in your garden, or roll a lot of dice, you use a computer.

If you want an example of this and to see randomness written down, poke around the internet and download pi digits. I have 2 billion of them and there is a sequence of eight consecutive zeros in it. Is that random? Yes, actually. Mathematicians think that the digits of pi (and others like root 2 and the golden ratio) are infinite and truly random. Yet the digits were calculated on a souped-up PC in a student's dorm room. And I can write them all down as pi. Go figure randomness. I think that in extremis it is a philosophical debate.

Just a small heads-up: there is no such thing as a "true randomness test". The tools you hinted at (like DieHard) merely present a battery of statistical tests for measuring the quality of a random number generator. They are neither meant to detect, nor to prove, "true randomness". Next point: when you state "Unless you have a random number tree in your garden, or roll a lot of dice, you use a computer.", you are bluntly ignoring the fact that there are ample hardware randomness solutions out there that do not even come near the definition of "a computer." [1/2]
–
e-sushi♦Feb 28 at 6:08

[2/2] Note that I skipped the point where I would need to explain to you that “rolling dice” is not as random as you might think it is. Last but not least, you’re mentioning Pi digits and claim Pi is “truly random”. Well, looking at the BBP formula and related research, I tend to strongly disagree… especially, from a cryptographic point of view!
–
e-sushi♦Feb 28 at 6:13

[1/3] I think that there is sometimes a "true randomness test." It just depends on who defines it. Mathematicians who may disagree over how random a generator is, or lawyers? If a casino random generator repeatedly fails a battery of 'approved' randomness tests – someone's going to jail. In that case, random draws must pass a randomness test. And so must the dice on the craps table. Philosophy? [2/3] I think that you'll find that the digits of pi are random - google "are the digits of pi random?" You'll get lots of hits. [3/3] Cryptographically, Blowfish uses 8000 digits of pi.
–
Paul UszakMar 15 at 2:33

[1/3] Just to get the facts straight: test batteries are in no way able to prove a (T)RNG’s randomness, they’re only able to prove non-randomness. [2/3] Google hits are hardly an indicator for scientifically proven facts. For example: Google returns 190,000,000 results for “alien cats on pluto”. [3/3] The reason Blowfish uses 8000 digits of pi is because PI is a well-known “nothing up my sleeves” number with no obvious pattern. As Bruce states in his paper: There is nothing sacred about pi; any string of random bits … will suffice.
–
e-sushi♦Mar 15 at 5:57