I would like to start by saying I know nothing about cryptography. I was reading up on how to choose a random seed, and this link is something that I found. What I basically understood is that the seed has to be sufficiently random that guessing it would be hard.

So the question is would the hash of a Tweet, at any given time, be a good candidate for a random seed? This is mainly because the content of a Tweet can be practically anything as it's being generated by a huge percentage of the world population.

That said, I understand it is possible to game it by mass tweeting a specific string continuously from multiple accounts flooding the tweet stream with predictable seeds. So if this can be mitigated by blacklisting the bad usernames, is using tweets for seeds a viable option?

4 Answers

What you are suggesting is not a good idea for a general purpose random number generator. It could be meaningful for very specific use cases if you need a random number generator whose output can be verified independently by a third party.

Even in those cases there are other sources of entropy which are potentially more suitable. The oldest mention of this approach known to me is RFC 2777. The suggested sources of entropy listed in RFC 2777 are:

lottery winning numbers

closing price of a stock on a particular day

daily balance in the US Treasury on a specified day

the volume of trading on the New York Stock exchange on a specified day

Sporting events

Each of those looks less likely to be subject to manipulation than posts on Twitter.
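As a rough sketch of how such publicly verifiable sources could be turned into a seed: agree on the inputs in advance, then hash them together once they are published. This is not the exact procedure from RFC 2777; the function name `seed_from_public_values`, the use of SHA-256, and the sample values are all assumptions made for illustration.

```python
import hashlib

def seed_from_public_values(values):
    """Derive a 256-bit seed from an agreed, ordered list of public values."""
    h = hashlib.sha256()
    for v in values:
        data = v.encode("utf-8")
        # Length-prefix each value so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.digest()

# Hypothetical values, for illustration only (not real lottery or market data).
public_values = ["lottery:04-11-19-23-38-42", "closing-price:123.45"]
seed = seed_from_public_values(public_values)
print(seed.hex())
```

Anyone who knows the agreed inputs can recompute the same seed, which is the whole point of this use case and also why it is unsuitable for secrets.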

Reasons it's not a good general purpose approach

You'll have a cyclic dependency. Before you can retrieve posts from Twitter you'll need random numbers for a number of different purposes including:

If you use IPv4 you'll need randomness for the IPID header field.

If you use IPv6 you'll very likely need randomness for address configuration.

You need randomness to assign request IDs.

You need randomness for TCP sequence numbers.

You need randomness for SSL session setup.

Moreover the entropy of a Twitter post is hard to estimate. Some individual posts may have sufficient entropy on their own, but many will not. It's probably a safe estimate that posts have at least one bit of entropy on average, so if you were to hash together a thousand posts, you'd probably get sufficient entropy.
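To make the "hash together a thousand posts" idea concrete, here is a minimal sketch that pools posts and only releases a seed once a conservative entropy estimate is reached. The class name `PostPool`, the one-bit-per-post credit, and the sample posts are assumptions for illustration; the credit is a guess, not a measurement.

```python
import hashlib

class PostPool:
    """Pool posts into a hash and track a conservative entropy estimate."""

    def __init__(self, target_bits=256, bits_per_post=1.0):
        self._hash = hashlib.sha256()
        self._credited = 0.0
        self.target_bits = target_bits
        self.bits_per_post = bits_per_post  # assumed average, not measured

    def add(self, post):
        data = post.encode("utf-8")
        # Length-prefix each post so post boundaries are unambiguous.
        self._hash.update(len(data).to_bytes(4, "big"))
        self._hash.update(data)
        self._credited += self.bits_per_post

    def seed(self):
        if self._credited < self.target_bits:
            raise ValueError("not enough estimated entropy yet")
        return self._hash.digest()

pool = PostPool(target_bits=128)
for i in range(1000):
    pool.add(f"hypothetical post {i}")
seed = pool.seed()
```

Note that this only addresses the quantity of entropy. The inputs are still public and still open to manipulation.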

The resulting output is subject to manipulation by Twitter users. If your algorithm is known a user can compute what seed you'd calculate with different contents of their latest post and choose contents producing randomness that somehow suits that user.
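A minimal sketch of that manipulation, assuming the seed is simply the SHA-256 hash of the post text (an assumption; the function names and the "first byte is zero" goal are hypothetical). The user varies a harmless suffix until the resulting seed has the property they want.

```python
import hashlib
from itertools import count

def seed_from_post(post):
    """Assumed seeding rule: the seed is the SHA-256 hash of the post text."""
    return hashlib.sha256(post.encode("utf-8")).digest()

def grind_post(base_text, wanted):
    """Try trivial variations of a post until its seed satisfies wanted(seed)."""
    for n in count():
        candidate = f"{base_text} #{n}"
        if wanted(seed_from_post(candidate)):
            return candidate

# The user wants a seed whose first byte is zero: about 256 tries on average.
post = grind_post("Nice weather today", lambda s: s[0] == 0)
print(post, seed_from_post(post).hex())
```

Each extra bit of the seed the user wants to control roughly doubles the work, so biasing a few bits is cheap.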

The resulting output is also subject to manipulation by Twitter. Surely there will be Twitter employees with access to information that makes the manipulation already possible for any Twitter user even easier to pull off.

All of the input to the random number generator will be publicly known. That is bad for a general purpose random number generator, but can be useful in a few very specific use cases.

TCP sequence numbers should be random to prevent TCP hijacking, but SSL defends against man-in-the-middle. A simple counter to avoid reuse of the same sequence number between TCP sessions with the same IP and port pairs would be sufficient, although maybe not as resistant to a DoS? But only if you have good randomness for SSL. So maybe your bullet list should say you want (not need) randomness for TCP. Similarly, the IP fragment ID just needs to be unique, not random, for each packet that isn't a fragment of a larger packet. SSL assumes lower layers are insecure.
– Peter Cordes, Jan 6 '19 at 5:59


@PeterCordes However, the randomness for securing the connection cannot easily be removed, especially not for common services such as Twitter. Once those switch over to TLS 1.3 I'd say you'd at least need 64 bits of randomness. Maybe you could get around the issue by implementing a special version of TLS, but at least the ephemeral DH key pair generation would be affected, and the hello contains 32 bytes of randomness as well.
– Maarten Bodewes♦, Jan 6 '19 at 11:57

@MaartenBodewes: Oh yeah, there's definitely a catch-22 or chicken/egg problem here, and this answer is correct that you do need secure random numbers at some point to communicate securely over the Internet. But not at every layer, just in TLS I think. Crappy random numbers or fixed seeds or non-random sequences work for IP and TCP, especially if you're securing the data with crypto that authenticates the server to the client. But it won't work for TLS if you want TLS to actually protect you.
– Peter Cordes, Jan 6 '19 at 12:04

Why would you use public data like stocks as a seed instead of directly publishing the seed? One would need to know your code to verify your results anyway, therefore just putting the seed in there (hardcoded or as text) seems far easier and less error-prone.
– Sebb, Jan 7 '19 at 12:39

The other answers provide very good lists of reasons not to use Twitter as an entropy source. What follows is the flip side of your question:

Why would you want to?

Tweets are typically read on tablets, PCs and phones. All of those have access to hardware entropy sources that can produce oodles of truly random bits for seeding anything. The zeitgeist is that you aim for 128 or 256 bits of entropy and then seed a cryptographically secure pseudo random number generator. That will meet all of your common random number needs.
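For comparison, a minimal sketch of getting seed material from the operating system in Python. The standard-library `secrets` module draws from the OS's cryptographically secure generator, which is itself fed by hardware and system entropy sources; the 256-bit size simply follows the target mentioned above.

```python
import secrets

# 256 bits of seed material from the operating system's CSPRNG.
seed = secrets.token_bytes(32)

# For everyday needs you can skip manual seeding and draw values directly.
rng = secrets.SystemRandom()
value = rng.randrange(100)  # uniform integer in 0..99

print(seed.hex(), value)
```

No network access, third party, or public input is involved, which removes most of the problems listed in the other answers.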

I understand that there are multiple better sources for seeding a random number generator. What I wanted to understand was why one can or cannot use a user generated social media content as a source for seeding.
– aa8y, Jan 6 '19 at 5:04

@aa8y I hoped to offer more secure alternatives, within the context of your opening "I know nothing about Cryptography". Quoting from a 1996 edition IT book led me to believe that you'd be unaware of (now) common hardware like RdRand and the ability to use camera devices as TRNGs.
– Paul Uszak, Jan 6 '19 at 13:40

/dev/urandom is not a "seeding source", it's PRNG output. even /dev/random is not actually raw randomness, it's the same mechanism as urandom, but rate limited based on the estimated rate of incoming entropy from hardware. I assume the same applies to the MS Crypto API. so only the first and last list items count as real "seeding sources".
– user371366, Jan 7 '19 at 6:14

@dn3s Not a PRNG, but rather the output of a cryptographically secure hash, which takes its input from multiple hardware sources. And yes, it can be used for seeding purposes (assuming you're not cold-booting a VM, which might introduce initial seed issues due to emulation et al)
– e-sushi, Feb 6 '19 at 18:59

@e-sushi what i was trying to get at is that urandom itself is seeded. my wording was not great; it could be used as a seeding source. however i feel like "source" wouldn't be a good overall description of it since that leaves out the fact that it's a seeding "sink".
– user371366, Feb 7 '19 at 7:52

How are you going to decide which tweet to use? Randomly? This quickly leads to a chicken / egg problem.

What if the chosen tweet is one word? That would not add a lot of entropy.

What if Twitter is unavailable? Will you stop the service that relies on the entropy, or continue regardless?

How are you going to keep the chosen tweet secret? You can use TLS, but TLS requires a random number generator to operate.

How are you going to blacklist in advance? You don't know the attackers in advance, right?

What if Twitter changes its API? Would you keep running if the tweet collection agent crashes or returns bad results?

What if your government decides to block Twitter? There are plenty of governments doing that.

What if you choose a heavily retweeted tweet? How much entropy would that contain?

Having something that provides entropy is just the first step. In general you want something that is local and hard to influence and easy to understand / validate. Twitter doesn't seem to be a good option for any of those requirements.

I would take slight issue with "What if the chosen tweet is one word? That would not add a lot of entropy." as the definition of entropy needs to be considered in terms of the entire selection process, not the end result. The empty string '' returned amongst other unique possibilities each with true randomness $p = 2^{-1000}$ would be part of an entropy source 1000 bits strong. That may be a separate issue to accidentally loading a string that others will test against - i.e. the assumed generation model that an attacker might use.
– Neil Slater, Jan 5 '19 at 19:22

@NeilSlater That's a good point; however unless you're actually enumerating all possibilities in the domain I'm guessing that you'd still be left with a small seed.
– Maarten Bodewes♦, Jan 5 '19 at 19:45

The idea was to select the first tweet we see at the time we want to seed the random number generator to avoid the need for selecting the tweet at random. As far as selecting a tweet of a certain size or one which has not been highly retweeted, that should be easy, right? And for govt banning twitter, I didn't think of that, but honestly, it doesn't have to be Twitter. It can be any user generated content which is hard to predict. And I wanted to know if that is good enough. Twitter was just an example.
– aa8y, Jan 6 '19 at 4:51

Yeah, OK, I can see you want to generalize this idea. But you'd still have boot problems, availability problems, problems with the secrecy - actually most of the list. As for choosing just the latest tweet: that's so vulnerable to being influenced by an adversary that I don't even want to think about it.
– Maarten Bodewes♦, Jan 6 '19 at 11:53

Other answers have already pointed out the chicken/egg catch-22 problem of securely communicating over the Internet before you have a random number, along with other showstoppers and possible problems. But you're screwed even against a fully remote attacker who can't sniff your packets.

The OP commented:
The idea was to select the first tweet we see at the time we want to seed the random number generator to avoid the need for selecting the tweet at random. [...]

Tweets are public, and thus your pool of seeds is available to the attacker.

On average, tweet throughput is around 6000 tweets per second. An attacker who can guess your tweet-query time to within one second has a search space of about 6000 tweets. You could say that's equivalent to about 12.5 bits of entropy, vastly smaller than the hash length. Or the attacker can widen the window to 1 minute for an equivalent entropy of about 18.5 bits, still trivial to brute-force in seconds, probably limited only by the time to download all those tweets.
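The arithmetic behind those figures is just a base-2 logarithm of the number of candidate tweets in the window. A small sketch, assuming the 6000 tweets per second rate quoted above (the function name is made up for illustration):

```python
import math

TWEETS_PER_SECOND = 6000  # average rate quoted above

def search_space_bits(window_seconds, rate=TWEETS_PER_SECOND):
    """Equivalent entropy, in bits, of picking one tweet from a time window."""
    return math.log2(window_seconds * rate)

print(f"{search_space_bits(1):.2f}")   # 1-second window  -> 12.55
print(f"{search_space_bits(60):.2f}")  # 1-minute window  -> 18.46
```

Even a full day's window (86400 seconds) only gets to about 29 bits, far short of the 128 bits usually aimed for.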

If an attacker controls or knows when a seed was generated, you're screwed. The tighter a time bound they can put on it, the smaller their search space. Even worse, the attacker can simply keep widening their time window with earlier and earlier tweets if they don't find a hit in the first 1-second window they check.

Many use-cases for secure seeding of PRNGs expose the sequence to the attacker, so they can test guesses of the seed. The attacker tries each guess with the same PRNG your software uses and checks whether the resulting sequence matches what they've already seen. If it does then, with high probability, they can predict the next number they'll see.

There can be false-positive matches that lead to the same initial sequence, for multiple reasons:

They can only see (or work backwards to) rng() & 0xff (low 8 bits) or rng() % 100 (or some better way of generating a 0..99 range), not the full 32 or 64-bit random number value of each PRNG step.

The PRNG has a large hidden internal state, and multiple initial states lead to the same sequence of random numbers. (This is already necessary so that knowing one rng result doesn't uniquely determine the next.)

But by observing enough random data from the same seed, an attacker can test a seed to a very high probability.

With only 6000 possible candidates, the chances of one giving the same initial sequence you observed but actually being different is negligible.

And if you test them all over a likely window (and are right about that time window), you can detect when you've uniquely identified the one tweet that produces the sequence you're seeing, so you can potentially "lock on" quite quickly even if you don't get many bits of data per observation of the sequence.
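Here is a minimal sketch of that search, under stated assumptions: the target seeds its PRNG with the SHA-256 hash of one tweet, the attacker only observes `rng() % 100`-style values, and Python's `random.Random` stands in for whatever PRNG the real software uses. The tweet texts are placeholders, not real data.

```python
import hashlib
import random

def rng_from_tweet(tweet):
    """Assumed seeding rule: hash the tweet and seed the PRNG with the digest."""
    seed = hashlib.sha256(tweet.encode("utf-8")).digest()
    return random.Random(seed)  # stand-in for the target's PRNG

def observe(tweet, n):
    """The first n outputs an outsider would see, reduced to the range 0..99."""
    rng = rng_from_tweet(tweet)
    return [rng.randrange(100) for _ in range(n)]

def matching_candidates(candidates, observed):
    """Return every candidate tweet whose output sequence matches."""
    return [t for t in candidates if observe(t, len(observed)) == observed]

# One second's worth of (placeholder) public tweets; one of them was the seed.
candidates = [f"tweet number {i}" for i in range(6000)]
secret_tweet = candidates[4242]

observed = observe(secret_tweet, 8)  # what the attacker has seen so far
hits = matching_candidates(candidates, observed)
print(hits)
```

With eight observed values in 0..99 there are 10^16 possible sequences, so an accidental match among 6000 candidates is vanishingly unlikely and the search locks on to the single correct tweet.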

If the random number was used as an encryption key, an attacker that can detect "sane looking" plaintext can still attack this way, even if the "sane-looking" check is very weak / inclusive.

Check which (of the ~6000) tweets as seeds lead to sane-looking plaintext from the first key.

Of those few candidate tweets, check which produce sane-looking plaintext from the second key generated from the same sequence. If there were multiple different possibly-sane plaintexts from the first key, this probably rules out most of them. Repeat as necessary.

This might not be the most plausible example, but this kind of idea is applicable for other kinds of things where you don't directly see the random sequence, only a cryptographically-secure use of it. But if you have any mechanism for testing a guess by going through all the steps the target of the attack would take, you can still attack.

Or if you can trigger a re-seed at some known time, and use the service with your own known data to get (probably) some of the first random values generated with that seed, you might be able to work out the seed that it will continue to use for other users' requests.

Only 6000 tweets is a small enough search space that you can start to expand your search space in other dimensions, like allowing for the possibility that other users' requests might have slipped in between yours while you're using it as an oracle to encrypt known plaintext that lets you check. (Or some equivalent thing that lets you really check your PRNG sequence guesses.)