Fake: Generating Realistic Test Data in Haskell

On a number of occasions over the years I've found myself wanting to generate realistic-looking values for Haskell data structures. Perhaps I'm writing a UI and want to fill it in with example data during development so I can see how the UI behaves with large lists. In this situation you don't want to generate a bunch of completely random Unicode characters. You want things that look plausible so you can see how it will likely look to the user with realistic word wrapping, etc. Later, when you build the backend, you actually want to populate the database with this data. Passing around DB dumps to other members of the team so they can test is a pain, so you want this stuff to be auto-generated. This saves time for your QA people, because without it they'd have to create the data manually. Even later you get to performance testing and you find yourself wanting to generate several orders of magnitude more data so you can load test the database, but you still want to use the same distribution so it continues to look reasonable in the UI and you can test UI performance at even bigger scale.

Almost every time I've been in this situation I thought about using QuickCheck's Arbitrary type class. But that never seemed quite right to me for a couple of reasons. First, Arbitrary requires that you specify functions for shrinking a value to simpler values. This was never something I needed for these purposes, so it seemed overkill to have to specify that infrastructure. EDIT: I was mistaken about this. QuickCheck gives a default implementation for shrink. Second, using Arbitrary meant that I had to depend on QuickCheck. This always seemed too heavy to me because I didn't need any of QuickCheck's property testing infrastructure. I just wanted to generate a few values and be done. For a long time these issues were never enough to overcome the activation energy needed to justify releasing a new package.

More recently I realized that the biggest reason QuickCheck wasn't appropriate is that I wanted a different probability distribution than the one QuickCheck uses. This isn't about subtle differences between, say, a normal versus an exponential distribution. It's about the bigger picture of what the probability distributions are accomplishing. QuickCheck is significantly about fuzz testing and finding corner cases where your code doesn't behave quite as expected. You want it to generate strings with things like different kinds of quotes to verify that your code escapes things properly, weird Unicode characters to check encoding issues, etc. What I wanted was something that could generate random data that looked realistic for whatever kind of realism my domain needed. These two things are complementary. You don't just want one or the other. Sometimes you need both of them at the same time. Since you can only have one instance of the Arbitrary type class for each data type, riding on top of QuickCheck wouldn't be enough. This needed a separate library. Enter the fake package.

The fake package provides a type class called Fake which is a stripped down version of QuickCheck's Arbitrary type class intended for generating realistic data. With this we also include a random value generator called FGen which eliminates confusion with QuickCheck's Gen and helps to minimize dependencies. The package does not provide predefined Fake instances for Prelude data types because it's up to your application to define what values are realistic. For example, an Int representing age probably only needs to generate values in the interval (0,120].
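To make the idea concrete, here is a minimal self-contained sketch of what an application-defined instance might look like. Only the names Fake and FGen come from the package as described above. Everything else here is an assumption made for illustration: the toy FGen is a stand-in built on a simple linear congruential generator so the example needs nothing beyond base, and the method name fake and the helpers fromRange, elements, and sample are hypothetical. Check the package documentation for the real definitions.

```haskell
-- Illustrative stand-in only: these are NOT the fake package's definitions.

-- | A toy generator: a state-passing function over an Int seed.
newtype FGen a = FGen { runFGen :: Int -> (a, Int) }

instance Functor FGen where
  fmap f (FGen g) = FGen $ \s -> let (a, s') = g s in (f a, s')

instance Applicative FGen where
  pure a = FGen $ \s -> (a, s)
  FGen gf <*> FGen ga = FGen $ \s ->
    let (f, s')  = gf s
        (a, s'') = ga s'
    in (f a, s'')

instance Monad FGen where
  FGen g >>= k = FGen $ \s -> let (a, s') = g s in runFGen (k a) s'

-- | Advance a linear congruential generator and return the new state.
nextInt :: FGen Int
nextInt = FGen $ \s ->
  let s' = (s * 1103515245 + 12345) `mod` 2147483648
  in (s', s')

-- | Pick an Int from an inclusive range (hypothetical helper).
fromRange :: (Int, Int) -> FGen Int
fromRange (lo, hi) = do
  n <- nextInt
  -- Drop the low bits, which are the weakest part of an LCG.
  pure (lo + (n `div` 65536) `mod` (hi - lo + 1))

-- | Pick one element of a non-empty list (hypothetical helper).
elements :: [a] -> FGen a
elements xs = (xs !!) <$> fromRange (0, length xs - 1)

-- | The application decides what "realistic" means for each type.
class Fake a where
  fake :: FGen a

newtype Age = Age Int deriving (Show, Eq)

-- Integer ages in the interval (0,120], i.e. 1 through 120.
instance Fake Age where
  fake = Age <$> fromRange (1, 120)

data User = User { userName :: String, userAge :: Age }
  deriving (Show, Eq)

-- Names come from a small made-up list rather than random characters.
instance Fake User where
  fake = User <$> elements ["Alice Nguyen", "Bob Okafor", "Carla Ruiz"]
              <*> fake

-- | Produce n values from a fixed seed, so results are reproducible.
sample :: Int -> Int -> FGen a -> [a]
sample seed n g = take n (go seed)
  where go s = let (a, s') = runFGen g s in a : go s'
```

Loading this in GHCi and evaluating `sample 42 3 (fake :: FGen User)` gives three users with plausible names and ages. Wrapping the Int in a newtype like Age is one way to give each meaning of Int its own notion of realism, which fits the point above that there is no single sensible Fake instance for a bare Int.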

It also gives you a number of "providers" that generate various real-world things in a realistic way. Need to generate plausible user agent strings? We've got you covered. Want to generate US addresses with cities and zip codes that are actually valid for the chosen state? Just import the Fake.Provider.Address.EN_US module. But that's not all. Fake ships with a number of other providers as well.

I tried to focus on providers that I thought would be broadly useful to a wide audience. If you are interested in a provider for something that isn't there yet, I invite more contributions! Similar packages exist in a number of other languages, some of which are credited in fake's README. If you are planning on writing a new provider for something with complex structure, you might want to look at some of those to see if something already exists that can serve as inspiration.

One area of future exploration where I would love to see activity is something building on top of fake that allows you to generate entire fake databases matching a certain schema and ensuring that foreign keys are handled properly. This problem might be able to make use of fake's full constructor coverage concept (described in more detail here) to help ensure that all the important combinations of various foreign keys are generated.
