Descriptors in numl

As some of you know, I have been working on a machine learning library for .NET called numl. The main purpose of the library is to abstract away some of the mundane issues involved in setting up a learning problem in the first place. The math in machine learning can also seem daunting (some of it is indeed daunting), so the library lets you either dig into the math or trust that these things are implemented and run correctly.

To facilitate this type of abstraction, I came to realize that the best way to bridge the gap was to use a construct that most developers have already used or at least understand: classes. The learning problem, as I understood it, was taking a set of things and trying to learn a way to predict a particular aspect of those things. The best approach, therefore, was to allow for an easy way to mark up these things (or classes) so that setting up the learning problem becomes efficient.

I settled on an attribute-based system that looks something like this:

This approach is intended to allow for quick and easy feature selection in a straightforward and intuitive manner.
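A minimal sketch of what such a marked-up class could look like. The `FeatureAttribute` and `LabelAttribute` types are stand-ins defined here so that the snippet compiles on its own; they are not numl's own attribute types. The petal properties and the `Class` label are assumptions based on the usual shape of the Iris data set.

```csharp
using System;

// Stand-in attributes for illustration only; not numl's own types.
[AttributeUsage(AttributeTargets.Property)]
public sealed class FeatureAttribute : Attribute { }

[AttributeUsage(AttributeTargets.Property)]
public sealed class LabelAttribute : Attribute { }

// A plain class whose properties are tagged either as inputs (features)
// or as the value we want to predict (label).
public class Iris
{
    [Feature] public double SepalLength { get; set; }
    [Feature] public double SepalWidth { get; set; }
    [Feature] public double PetalLength { get; set; }   // assumed property name
    [Feature] public double PetalWidth { get; set; }    // assumed property name
    [Label] public string Class { get; set; } = "";     // assumed property name
}
```

The class stays an ordinary data class; the attributes only add metadata that can be read later through reflection.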

Now for the topic at hand. Given that the attributes are simply a bridge between the common class structure and the actual machine learning algorithms, I needed to provide the actual bridging mechanism inside the library. I finally landed on calling this mechanism a Descriptor. I thought the name was fairly indicative of the job it handles: describing the learning problem in terms of features and the corresponding label (in the case of a supervised problem). In essence, it describes the machine learning problem to the mathematical side of the algorithms. Its second responsibility is to describe the outcome of the algorithms back to the original structure in terms that it can understand. The Descriptor therefore becomes the literal bridge: it describes the problem to the algorithms in terms of matrices and vectors, and it projects the result of a prediction (which happens in terms of vectors) back onto the original object in which the problem was described.

Here is the general workflow for running a supervised learning problem:

Create a descriptor (line 3)

Create a generator to build a model (lines 5-6)

Use the model for prediction (lines 9-17)

In this case we are loading a bunch of data into a collection of Iris objects. The Descriptor in this example participates in every phase of the learning and prediction process. The first area of participation is obvious: a concrete descriptor is created in line 3 based upon the Iris type. This uses simple reflection to find all of the corresponding attributes and subsequently adds the features and the label to the descriptor. The other two areas where the descriptor participates are not as obvious from the code.

Once the model is created and ready for prediction, the descriptor is used once again to convert the object to a vector representation and then to fill in the appropriate property with the model's prediction.
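The two roles described above (object to vector, prediction back to object) can be sketched with plain reflection. Everything in this snippet is hypothetical: `SketchDescriptor`, `Flower`, and the stand-in attributes are defined here only for illustration, and numl's real Descriptor is certainly more involved. The sketch also assumes that every feature is numeric and that there is exactly one label property.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Stand-in attributes for this sketch only; not numl's own types.
[AttributeUsage(AttributeTargets.Property)]
public sealed class FeatureAttribute : Attribute { }

[AttributeUsage(AttributeTargets.Property)]
public sealed class LabelAttribute : Attribute { }

public class Flower
{
    [Feature] public double SepalLength { get; set; }
    [Feature] public double SepalWidth { get; set; }
    [Label] public string Class { get; set; } = "";
}

// Hypothetical descriptor: remembers which properties are features
// and which single property is the label.
public sealed class SketchDescriptor
{
    private readonly List<PropertyInfo> _features;
    private readonly PropertyInfo _label;

    private SketchDescriptor(List<PropertyInfo> features, PropertyInfo label)
    {
        _features = features;
        _label = label;
    }

    // Reflect over T once and collect the tagged properties.
    // Features are sorted by name so the vector layout is deterministic.
    public static SketchDescriptor Create<T>()
    {
        var props = typeof(T).GetProperties();
        var features = props
            .Where(p => p.IsDefined(typeof(FeatureAttribute), false))
            .OrderBy(p => p.Name, StringComparer.Ordinal)
            .ToList();
        var label = props.Single(p => p.IsDefined(typeof(LabelAttribute), false));
        return new SketchDescriptor(features, label);
    }

    // Object -> vector: one numeric slot per feature.
    public double[] ToVector(object item) =>
        _features.Select(p => Convert.ToDouble(p.GetValue(item))).ToArray();

    // Prediction -> object: write the predicted value into the label property.
    public void SetLabel(object item, object prediction) =>
        _label.SetValue(item, prediction);
}
```

With this in place, `SketchDescriptor.Create<Flower>().ToVector(flower)` yields the two sepal measurements as a `double[]`, and `SetLabel(flower, "setosa")` writes a prediction back into `flower.Class`. Encoding of non-numeric features is left out of the sketch.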

An Alternate Approach

Creating descriptors automatically from marked-up classes proved to be a useful abstraction in most cases. As I continued to test the library, however, I noticed that an alternate pattern emerged around data that requires late binding. Data structures such as collections of dictionary objects, DataTables, and even dynamic objects like ExpandoObject make it impossible to mark up a class, because no class exists. I also noticed that creating a descriptor by hand in these cases was cumbersome (or at least I felt the experience could be improved). I therefore decided to add a fluent interface to the Descriptor in order to describe objects that would never be marked up with attributes but would still participate in the learning process:
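As a rough illustration of the idea, the sketch below declares features by name and type and then reads them from a dictionary row. `LateBoundDescriptor` is a hypothetical stand-in, not numl's Descriptor. The `New()`, `With(...)`, and `As(...)` names mirror the call chain quoted in the comments on this post, while `Predict(...)` and `ToVector(...)` are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical string-based descriptor; not numl's actual class.
public sealed class LateBoundDescriptor
{
    private readonly List<(string Name, Type Type)> _features =
        new List<(string Name, Type Type)>();

    public string LabelName { get; private set; } = "";
    public Type LabelType { get; private set; } = typeof(object);

    public static LateBoundDescriptor New() => new LateBoundDescriptor();

    // With(name) starts a feature declaration; As(type) completes it.
    public Step With(string name) => new Step(this, name, false);

    // Predict(name) starts the label declaration (invented name).
    public Step Predict(string name) => new Step(this, name, true);

    // Features are looked up by key, so no marked-up class is required.
    public double[] ToVector(IDictionary<string, object> row) =>
        _features.Select(f => Convert.ToDouble(row[f.Name])).ToArray();

    // Intermediate step of the fluent chain.
    public sealed class Step
    {
        private readonly LateBoundDescriptor _owner;
        private readonly string _name;
        private readonly bool _isLabel;

        internal Step(LateBoundDescriptor owner, string name, bool isLabel)
        {
            _owner = owner;
            _name = name;
            _isLabel = isLabel;
        }

        // Records the declared type and hands the descriptor back for chaining.
        public LateBoundDescriptor As(Type type)
        {
            if (_isLabel)
            {
                _owner.LabelName = _name;
                _owner.LabelType = type;
            }
            else
            {
                _owner._features.Add((_name, type));
            }
            return _owner;
        }
    }
}
```

A chain such as `LateBoundDescriptor.New().With("SepalLength").As(typeof(decimal)).With("SepalWidth").As(typeof(double)).Predict("Class").As(typeof(string))` then describes rows stored in a `Dictionary<string, object>` without any class being involved.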

Seth Juarez

My name is Seth Juarez. I currently reside near Redmond, Washington and am a Microsoft Evangelist for Channel 9.
I received my Bachelor's Degree in Computer Science at UNLV with a Minor in Mathematics and completed my Master's Degree at the University of Utah in the field of Computer Science. I am currently interested in Artificial Intelligence, specifically in the realm of Machine Learning, and am working on a .NET library meant to simplify the usage of common machine learning algorithms.
I've been married now for 14 years to a fabulously beautiful girl and have two beautiful daughters and two feisty sons.

Nelson Silva
October 19, 2013 @ 2:43 pm

Petr
October 24, 2013 @ 4:03 am

Hello, thank you very much for creating an open-source .NET machine learning library. I really like its API and the clear code of its internals. But numl obviously lacks documentation and usage examples.

I am pretty new to machine learning and have read all the information on numl that you provided on the numl site and your blog. But I've failed to use it for my scenario. I would very much appreciate it if you could give me some advice.

I need to implement automatic classification of documents. Each doc has [StringFeature] Title, [StringFeature] Text and [StringLabel] Topic. I have a data set for learning: several hundred docs with Topic set manually. So I tried to do the following:
var descriptor = Descriptor.Create();
var generator = new NaiveBayesGenerator(2);
generator.Descriptor = descriptor;
var model = Learner.Learn(dataForGenerator, 1, 10, generator);

But it fails at the end of the Learn() method because accuracy.MaxIndex() contains a vector of NaN values. The same happened when I tried to use the DecisionTree and Perceptron model generators.

Seth Juarez
January 16, 2014 @ 9:44 am

Tagir
January 22, 2014 @ 11:41 am

Just one idea: I was wondering if the Descriptor's fluent API could be modified to accept Expressions? That would make it more concise and allow for compile-time validation. As a result the following call
Descriptor.New()
.With("SepalLength").As(typeof(decimal))
.With("SepalWidth").As(typeof(double))

Seth Juarez
January 22, 2014 @ 2:01 pm

I think it is a fantastic idea! As I see it there are three levels to descriptors:

Strong Descriptors: features/label burned into the classes through attributes. Makes it easy to declare things but it is highly coupled to the data type.

Weak Descriptors: features/label declared dynamically (with strings as I have it). A bit more difficult to declare but completely agnostic to the data type (so long as it has the right property/type pairs). Currently this works with IEnumerables, DataTable, Dictionaries, and even anything of type dynamic (like Expando).

Medium? Descriptors: I think this is where your suggestion falls. Not directly coupled with the type itself (as in Strong Descriptors) but still dependent on the shape.

Overall I think anything to decrease friction along the surface of interaction with the lib is a fantastic idea.

Seth Juarez
February 16, 2015 @ 12:05 pm

Jon
March 5, 2015 @ 2:16 pm

First, I think this library is excellent since it simplifies ML for us beginners. I'm trying to build an application that will run a neural network on an arbitrary set of data. I do not have instances of a well-defined class (which I can tag with attributes). Instead, I have an M by K matrix: M samples, K features. I also have a vector of size M which holds the labels. It seems difficult to use the descriptor above if K isn't a compile-time constant.

Janus Knudsen
December 1, 2015 @ 11:48 am

Hello Seth
I've been looking at numl with great interest over the last week; you are doing an impressive job.
I especially like your Channel 9 content, both the machine learning material and, to my big surprise, your interview with one of the coolest js-dudes out there, the one and only Rob Eisenberg, whom I really like. I have been on the Aurelia boat since the early sandbox I guess, amazing framework.

I have one question for you and I believe you are the brightest among them all to answer it, can you feel the pressure :)