A Tour Through Random Ruby

This article covers various ways that you can generate random (usually pseudo-random) information with Ruby. Random information can be useful for a variety of things, in particular testing, content generation, and security. I used Ruby 2.0.0, but 1.9 should produce the same results.

Kernel#rand and Random

In the past, a random integer between min and max (inclusive) might be generated like this:

rand(max - min + 1) + min

For example, if you wanted to generate a number between 7 and 10, inclusive, you would write:

rand(4) + 7

Ruby lets you do this in a much more readable manner by passing a Range object to Kernel#rand.

>> rand(7..10)
=> 9
>> rand(1.5..2.8)
=> 1.67699693779624

Kernel#srand sets the seed for Kernel#rand. This can be used to generate a reproducible sequence of numbers. This might be handy if you are trying to isolate / reproduce a bug.
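A quick sketch of what reproducibility looks like in practice: seeding with the same value twice produces the same sequence of numbers.

```ruby
# Seeding the generator makes Kernel#rand reproducible.
srand(1234)
first_run = Array.new(5) { rand(100) }

srand(1234) # reset to the same seed
second_run = Array.new(5) { rand(100) }

first_run == second_run # => true
```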

random.hd.org (EntropyPool): claims to use a variety of sources, including local processes/files/devices, web page hits, and remote web sites.

Note: As of this writing, the RealRand homepage appears to contain examples for 1.x, where RealRand’s classes are grouped under the Random module. The newest version of the gem (2.0.0) groups the classes under the RealRand module, as in these examples.

“Secure” probably depends on who you are. SecureRandom uses the following random number generators:

OpenSSL

/dev/urandom

Win32

A glance at the code reveals that it defaults to OpenSSL::Random.random_bytes. It looks like PIDs and process clock times (at nanosecond resolution) are mixed in as entropy whenever the PID changes.
I suspect that this is enough for most things, but if you need an extra layer of protection, you could use RealRand for additional entropy. Unfortunately, SecureRandom does not have anything like a #seed method, so you will need to seed OpenSSL directly. Note: OpenSSL seeds are strings.
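A minimal sketch of both halves, assuming you have some external entropy source: ordinary SecureRandom calls, plus mixing your own string into OpenSSL's PRNG state. The seed string below is only a placeholder, not real entropy.

```ruby
require 'securerandom'
require 'openssl'

# OpenSSL seeds are strings; this placeholder stands in for bytes
# gathered from a real entropy source such as RealRand.
OpenSSL::Random.seed("placeholder entropy bytes")

token = SecureRandom.hex(16) # 32 hex characters
id    = SecureRandom.uuid    # e.g. "2d931510-d99f-494a-8c67-87feb05e1594"
```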

You can read why I used 0.0 here. According to the patch discussion, the 0.0 passed as the second argument to #random_add is the amount of estimated entropy; it was previously being overestimated, so it was changed to 0.0. However, according to the OpenSSL documentation, the second argument to RAND_add is the number of bytes to be mixed into the PRNG state, and the third argument is the estimated amount of entropy. OpenSSL::Random#random_add only takes two arguments (instead of three), but if the second argument was misinterpreted and 0 bytes of seed are being mixed in, then SecureRandom is probably worthless for anything serious until it is fixed. If you know anything about this, please leave a comment.

Now, although you could find human beings who weigh 50 kilograms (110 lbs), and some who weigh 130 kilograms (286 lbs), most are not quite that extreme, so the above result is unlikely for a truly random sample (one not composed mostly of members of McDonald’s Anonymous and professional wrestlers).

The numbers you would generally get are a little better now, but they still don’t approximate reality. You need the majority of the random numbers to fall within a narrow range, while a smaller percentage spans a much wider one.

What you need is a probability distribution.

Alas, Ruby is not strong in the math department. Most of the statistics solutions I came across were copy/paste algorithms, unmaintained libraries/bindings with little documentation, and hacks that tap into math environments like R. They also tended to assume an uncomfortably deep knowledge of statistics (okay maybe like one semester, but I still should not have to go back to college to generate random numbers based on a probability distribution).
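If all you need is a normal (Gaussian) distribution, though, you can get by without any library at all. The Box-Muller transform turns two uniform random numbers into a normally distributed one; the helper name below is my own, not a standard API.

```ruby
# Box-Muller transform: converts two uniform samples from Kernel#rand
# into one sample from a normal distribution.
def gaussian_rand(mean, stddev)
  theta = 2 * Math::PI * rand
  rho   = Math.sqrt(-2 * Math.log(1 - rand)) # 1 - rand avoids log(0)
  mean + stddev * rho * Math.cos(theta)
end

# Human-ish weights: most samples cluster near 75 kg.
weights = Array.new(10) { gaussian_rand(75, 12).round }
```

Sampling many values and averaging them should land close to the mean you asked for, which is an easy sanity check for this kind of helper.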

random-word

The random-word gem claims to use the massive WordNet dictionary for its methods. Ever had somebody accuse you of using “them big words?” Those are the kinds of words that random-word appears to produce.

Note: According to the random_data GitHub page, “zipcodes are totally random and may not be real zipcodes.”

Raingrams

The raingrams gem is probably the most interesting thing in this tutorial. It can produce random sentences or paragraphs based on provided text. For example, if you are some kind of sick, depraved, YouTube comment connoisseur, you could create a monstrosity that generates practically infinite YouTube comments, retraining the model with the worst comments as you go, scraping the depths of absurdity, until you get something like:

“no every conversation with a democrat goes like neil degrasse tyson is basically carl sagan black edition at nintendo years old when I was your age I thought greedy corporations worked like this comment has been deleted because the video has nothing to do with what this mom makes 30 dollars a day filling out richard dawkins surveys which is still a better love story than twilight.”

According to Wikipedia, “an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.”

Raingrams describes itself as a “flexible and general-purpose ngrams library written in Ruby.” It generates text by building models of text occurring in pairs, trios, and so on; there doesn’t seem to be a limit on the complexity of the model you can use, but the included model classes run from BigramModel to HexagramModel.

$ gem install raingrams

Creating and training a model is easy.

require 'raingrams'
model = Raingrams::BigramModel.new
model.train_with_text "When you are courting a nice girl an hour seems like a second. When you sit on a red-hot cinder for a second that seems like an hour. That's relativity."
model.random_sentence
=> "When you sit on a nice girl an hour."

If you include the Raingrams module, you don’t need to use it as a namespace.

include Raingrams
model = BigramModel.new

One of the really nice things about Raingrams is the ability to train it with files or web pages instead of just strings. Raingrams provides the following training methods:

Model#train_with_paragraph

Model#train_with_text

Model#train_with_file

Model#train_with_url

I was pleasantly surprised to find that #train_with_url works…pretty well! It isn’t perfect, and it can create sentences that are cut off, but writing a filter to discard broken sentences is probably easier than writing a scraper for every single site you want to train your models with.
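As a sketch of such a filter (the predicate name and heuristic are my own, not part of raingrams): keep only sentences that start with a capital letter and end with terminal punctuation.

```ruby
# A naive completeness check: capital letter at the start, terminal
# punctuation at the end. Real text will need more rules than this.
def complete_sentence?(sentence)
  !!(sentence =~ /\A[A-Z].*[.!?]\z/)
end

sentences = [
  "When you sit on a nice girl an hour.",
  "such as pipelining and 1960s no arguments but"
]
sentences.select { |s| complete_sentence?(s) }
# => ["When you sit on a nice girl an hour."]
```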

Bigram models can work with very small data sets, but they tend to produce rather incoherent results.

>> require 'raingrams'
>> include Raingrams
>> model = BigramModel.new
>> model.train_with_url "http://en.wikipedia.org/wiki/Central_processing_unit"
>> model.random_sentence
=> "One notable late CPU decodes instructions rather than others before him
such as pipelining and 1960s no arguments but still continued by eight
binary CPU register may not see 4."

Coherence to the point of almost believability seems to start with quadgrams. Unfortunately, quadgrams require quite a bit of data in order to produce “random” text.

>> model = QuadgramModel.new
>> model.train_with_url "http://en.wikipedia.org/wiki/Central_processing_unit"
>> model.random_sentence
=> "Tube computers like EDVAC tended to average eight hours between failures
whereas relay computers like the slower but earlier Harvard Mark I which was
completed before EDVAC also utilized a stored-program design using punched
paper tape rather than electronic memory."

>> model = QuadgramModel.new
>> model.train_with_url "http://www.dagonbytes.com/thelibrary/lovecraft/mountainsofmaddness.htm"
>> model.random_sentence
=> "Halfway uphill toward our goal we paused for a momentary breathing
spell and turned to look again at poor Gedney and were standing in a
kind of mute bewilderment when the sounds finally reached our
consciousness the first sounds we had heard since coming on the camp
horror but other things were equally perplexing."
>> model.random_sentence
=> "First the world s other extremity put an end to any of the monstrous
sight was indescribable for some fiendish violation of known natural
law seemed certain at the outset."

That missing apostrophe in “world s” is not a typo; it was present in the original text. You will need to watch for stuff like that.
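One cheap fix is a cleanup pass before training. The regex below (a heuristic of my own, not part of raingrams) rejoins stray possessives like “world s”; it is only a starting point.

```ruby
# Rejoin possessives whose apostrophes were stripped ("world s" -> "world's").
def fix_possessives(text)
  text.gsub(/\b([A-Za-z]+) s\b/, "\\1's")
end

fix_possessives("First the world s other extremity put an end")
# => "First the world's other extremity put an end"
```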

Conclusion

Ruby has a lot to offer when it comes to random data. What’s more, many of these libraries would be easy to modify or improve upon. If you are a newcomer to Ruby and want to get involved, this is a great opportunity.