Category: Data

Although there are a variety of ways to extract unstructured data from files, one tried-and-true, fast and simple approach is to use Datawatch Monarch. Years ago I used this tool when building Department of Defense digital contract reporting projects. At that time, the process to define data regions and extract unstructured data required a bit of field mapping experimentation. With the latest version of Monarch Auto Define, that process is intelligently automated today.

Microsoft released a new sample database a couple of months back: Wide World Importers. It’s quite great: not every (unnecessary feature) is included but only features you’d actually use, lots of sample scripts are provided and – most importantly – you can generate data until the current date. One small drawback: it’s quite tiny. Especially the data warehouse is really small. The biggest table, Fact.Order, has about 266,000 rows and uses around 280MB on disk. Your numbers may vary, because I have generated data until the current date (12th of August 2016) and I generated data with more random samples per day. So most likely, other versions of WideWorldImportersDW might be even smaller. That’s right. Even smaller.

260 thousand rows is nothing for a fact table. I was hoping that the data generator would allow for a bigger range of results, from “I only want a few thousand records” like it does up to “I need a reason to buy a new hard drive.” Koen helps out by giving us a script to expand the primary fact table.

Before I begin, allow me to perform the data science Airing of Grievances. This is an important part of data analysis which most people gloss over, instead jumping right into the “clean up the dirty data” phase. But no, let’s revel in its filth for just a few moments.

Despite my protestations and complaints, I think there are some reasonable conclusions. If you need to look like you’re working for a couple of hours (or at least want to play around a bit with SQL and R), this is the post for you.

Recombination. Having selected a set of high performing blouses we can now consider how they should be recombined to form a new child. While a traditional genetic algorithm would stochastically search all combinations over many market generations, we can shortcut that process by algorithmically looking for features that have been historically preferred by our target client segment.

To achieve this, we find statistical regularities between the population of blouses’ attributes (or configurations of attributes) and client feedback. For instance, we can model the relationship between attributes of our existing blouses and client feedback via:

Genetic algorithms (and Koza-style genetic programming) have long been a favorite topic of mine. Integrating GA with fashion was not something that came to mind, but is a very interesting solution.