Monday, 25 February 2013

It struck me recently that there are lots of parallels one can draw between the act of creating and describing a dataset and the act of hand knitting something. (Bear with me on this - it'll make sense, honest!)

The picture above is my scarf. I'm very fond of it. I knitted it myself, and it's warm and comfortable and goes well with a lot of my clothes.

When you're hand knitting a scarf, you take a ball of yarn, and you cast on stitches to make a row, then you keep adding rows until you run out of yarn, the scarf gets to the right length, or you get fed up with knitting.

The yarn in a ball doesn't contain any information or structure, but by the act of putting stitches into it, you're encoding something. In the case of my scarf above, it's a repeating pattern called feather and fan stitch, but it could just as easily be another pattern, or no pattern at all. If you want to get really fancy, you can encode all sorts of information into a knitted item - the most famous example of this is Madame Defarge in Dickens's "A Tale of Two Cities", knitting the names of the upper classes doomed to die at the guillotine into a scarf.

(Pushing the analogy a bit far, each stitch could represent a bit in a dataset, with a knit stitch signifying a zero and a purl stitch a one, but in this case that's not so helpful, as I've got yarn overs and knit-two-togethers as well as knit and purl stitches in there.)
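For the curious, here is a minimal sketch of that knit-equals-zero, purl-equals-one idea in Python. The function names and the choice of ASCII text are my own assumptions for illustration, and it deliberately ignores yarn overs and knit-two-togethers, so it wouldn't describe my actual scarf.

```python
def text_to_stitches(text: str) -> str:
    """Encode ASCII text as stitches: K (knit) for a 0 bit, P (purl) for a 1 bit."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    return bits.replace("0", "K").replace("1", "P")


def stitches_to_text(stitches: str) -> str:
    """Decode a string of K/P stitches back into ASCII text, 8 stitches per character."""
    bits = stitches.replace("K", "0").replace("P", "1")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("ascii")


row = text_to_stitches("Hi")
print(row)                    # KPKKPKKKKPPKPKKP
print(stitches_to_text(row))  # Hi
```

Two letters already take sixteen stitches, so Madame Defarge must have had a more compact code than this one.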

My scarf was created by a process of appending - each new row got added to the previous, like a dataset where each new measurement gets appended on to the previous one to make a time series. The scarf has a fixed number of stitches in each row, the same as a dataset where a fixed number of measurements are taken each day. This doesn't have to be the case: I've seen plenty of patterns for scarves out there with variable row lengths. It all depends on the look you want it to have, or what the knitting is supposed to be - you have variable row lengths to shape the sleeves of a jumper, for instance.

Sometimes my data got corrupted. I dropped a stitch, or miscounted the number of knit-two-togethers that I needed to do, and came out with the wrong number of stitches at the end of the row. Usually when this happens you have to pull out the stitches until you get back to the place where you can fix the mistake, and then re-knit the rows you've pulled apart. It can get a bit annoying, especially when you're ripping out perfectly good rows to fix a mistake you hadn't spotted before, which is several rows (and possibly hours of knitting time) below.
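In dataset terms, the appending and the ripping back might look something like the sketch below. The `Scarf` class, its method names and the six-stitch rows are all hypothetical, made up purely to illustrate the idea: a row with the wrong stitch count is rejected as it's appended, and a mistake spotted later means pulling out every row from that point onwards.

```python
class Scarf:
    """An append-only 'dataset' of rows, each with a fixed number of stitches."""

    def __init__(self, stitches_per_row: int):
        self.stitches_per_row = stitches_per_row
        self.rows = []

    def append_row(self, row: str) -> None:
        """Add a row, refusing it if the stitch count is wrong (e.g. a dropped stitch)."""
        if len(row) != self.stitches_per_row:
            raise ValueError(
                f"expected {self.stitches_per_row} stitches, got {len(row)}"
            )
        self.rows.append(list(row))

    def rip_back_to(self, row_index: int) -> list:
        """Pull out every row from row_index onwards and return them for re-knitting."""
        ripped = self.rows[row_index:]
        del self.rows[row_index:]
        return ripped


scarf = Scarf(stitches_per_row=6)
scarf.append_row("KKPPKK")
scarf.append_row("KPKPKP")
try:
    scarf.append_row("KKPPK")  # one stitch short: a dropped stitch
except ValueError as err:
    print(err)                 # expected 6 stitches, got 5
```

Note that the count check only catches one kind of mistake. A row with the right number of stitches in the wrong order sails straight through, which is exactly how you end up ripping out good rows hours later.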

I know for a fact that my scarf is not perfect. I've made mistakes there, and I'd feel really uncomfortable having someone scrutinise it and point out all my errors. Thankfully, no one's planning on peer-reviewing my scarf - though they would if I entered it into one of the knitting competitions you sometimes get at village fetes.

Like a dataset, I could have kept adding stitches and rows to my scarf ad infinitum, but there came a point when I actually wanted to wear it, so that meant I had to finish it off (i.e. cast off the stitches and sew in the ends). I could have used it while it was still being knitted (er... maybe as a pot holder, or a lap warmer?) but the knitting needles would have got in the way. It wouldn't have been ideal. Even if I had decided that I didn't want it to be a scarf after all, and was happy with it as a washcloth (a very sparkly one), I would still have had to cast off and finish it properly, otherwise the first time I used it, it would have pulled apart into a big tangle of yarn. The same is true for datasets - if you're going to use them, you need them to be properly finished off, with a firm definition of what pieces of data you are using, and what pieces you're not.

So, I finished my scarf/dataset, and I can now use it for the purpose for which it was intended - to keep my neck warm in a stylish yet comfortable way. Now what?

Well, I have a lot of scarves. So I need some way of identifying it, storing it, and maybe even reproducing it (when it wears out, or someone wants to make themselves one just like it). In other words, I need metadata about my scarf.

Descriptive metadata is easy. At a very basic level it's things like colour: "teal blue" and what it is: "scarf". But even with something this simple, you still need to have common language to make sure that the descriptors are understood. "Teal blue" makes perfect sense to me, but might not mean anything to someone else, who might think it looks a bit green.

Thankfully, there are other ways of describing the scarf. I can say that it's 200cm long, and 20cm wide, and that it was made from King Cole Haze Glitter DK (the type of yarn), colourway 124 - Ocean, with dyelot 67233. And all those last pieces of metadata, though too specific for general use, do describe the scarf accurately, though not completely, and make a start at providing the information needed to recreate it.

For recreating the scarf, I need all the metadata about what yarn was used, but I also need the size of the needles I knitted it on (4mm). I need the pattern that I used (18 stitch feather and fan, with a 2 stitch garter stitch border at the edges). I need the number of stitches I cast on (54) and my tension (how tightly I knit in this pattern - 28 rows and 27 stitches for a 10cm by 10cm square). You don't need any of this information to wear the scarf, but it is important to keep it if you want to recreate it!
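Written down as a record, that metadata might look like the sketch below. The `ScarfMetadata` class and its field names are my own invention for illustration; the values are the ones given above. The two helper methods show why tension matters: with it, the cast-on count gives you the width, and a target length gives you roughly how many rows to knit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScarfMetadata:
    """Everything needed to recreate the scarf (field names are illustrative)."""
    yarn: str
    colourway: str
    dyelot: str
    needle_mm: float
    pattern: str
    cast_on_stitches: int
    stitches_per_10cm: float  # tension, measured across a row
    rows_per_10cm: float      # tension, measured up the rows

    def width_cm(self) -> float:
        """Width implied by the cast-on count and the stitch tension."""
        return self.cast_on_stitches / self.stitches_per_10cm * 10

    def rows_for_length(self, length_cm: float) -> int:
        """Approximate number of rows needed to reach a given length."""
        return round(length_cm / 10 * self.rows_per_10cm)


scarf = ScarfMetadata(
    yarn="King Cole Haze Glitter DK",
    colourway="124 - Ocean",
    dyelot="67233",
    needle_mm=4.0,
    pattern="18 stitch feather and fan, 2 stitch garter stitch border",
    cast_on_stitches=54,
    stitches_per_10cm=27,
    rows_per_10cm=28,
)

print(scarf.width_cm())            # 20.0
print(scarf.rows_for_length(200))  # 560
```

Reassuringly, 54 stitches at 27 stitches per 10cm comes out at the 20cm width I measured, and a 200cm scarf at 28 rows per 10cm works out at about 560 rows.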

(As an aside, I didn't keep all the metadata about how I made the scarf and what yarn I used for it written down somewhere, which meant that when I came to write this post, I needed to work it out all over again. In other words, metadata should be collected from the start and stored somewhere safe, regardless of what it's describing!)

I then also need to make sure my scarf is stored correctly when I'm not using it, so it doesn't get lost, or (heaven forbid!) corrupted (i.e. eaten by moths or shredded by mice). I also need to be able to tell people where it's stored, so that when I ask my other half to fetch it for me, I can also tell him that it's hanging on the door of my wardrobe.

I want to be able to cite my scarf when I'm talking about it. Mostly, I just do it by saying "my teal blue feather and fan scarf", to distinguish it from the other scarves I have hanging around the place. I could get fancy and assign it a KOI (a Knitted Object Identifier) but most of my handknits are sufficiently distinct that a casual glance can tell which is which from a short description!

And finally, because I've put a lot of time and effort into making my scarf, I'd like to get credit for doing that. Which, for me anyway, is covered when someone says to me "that's a nice scarf" and I can respond with "thanks! I made it myself" and a proud smile.

There's more to this analogy than the special case of one dataset/scarf being created by one single creator, but I'm sure I've bludgeoned you with enough knitting terminology already today. I'm sure I can stretch the analogy further, but that's something for another post!

I'll leave you with a challenge: think about something you've made yourself, with your own two hands. It can be anything: a nice meal, a garden, a piece of clothing, a Lego model, a painting, a piece of furniture. Something that you made yourself and are proud of. Got something?

The way you feel about that thing is the way that dataset creators feel about their data, especially if that dataset has been created through great effort and took a lot of time. Everyone wants acknowledgement and credit for the work that they do. Data creators are no different!