Getting Started with AutoML for ML.NET

Dr. James McCaffrey uses hands-on examples to introduce ML.NET, a library for creating machine learning prediction models, and AutoML, which automatically examines different ML algorithms, finds the best one, and creates a Visual Studio project containing the C# code that generated the best model, along with C# code that shows how to use the trained model to make a prediction.

Microsoft ML.NET is a code library for .NET Core (the multi-platform version of the .NET Framework). The ML.NET library allows you to use C# to create, train, and use three types of machine learning prediction models: multiclass classification, binary classification, and regression. Directly writing programs that use ML.NET is quite difficult. The new AutoML system has a command-line tool that automatically examines different machine learning algorithms, finds the best algorithm, and creates a Visual Studio project that has the C# code that generated the best model, along with C# code that shows how to use the trained model to make a prediction. Quite remarkable.

Figure 1. AutoML for ML.NET in Action and Directory Structure

The best way to understand what AutoML for ML.NET is and to see where this article is headed is to take a look at the screenshot in Figure 1. The goal of the demo is to create a machine learning model that predicts the job satisfaction of an employee at a hypothetical company. There is a 40-item file of training data and a 10-item file of test data. Both files look like:

Each line represents an employee. The value in the first column indicates if the employee is paid an hourly rate (True) or is paid by salary. The second column is employee age. The third column is job type. The fourth column is annual income. The fifth column is the employee's job satisfaction.

To prepare the demo, I installed (1) Visual Studio 2019, (2) the .NET Core SDK, and (3) the ML.NET CLI (command-line interface) tool. Then I issued a rather lengthy command that starts with "mlnet.exe auto-train" and is followed by seven arguments.

The caret ("hat") character is used for line continuation in a command shell. The AutoML tool analyzed the training data, used the ML.NET library to explore several variations of five different machine learning algorithms, identified the best algorithm ("FastTreeOva"), and saved this best one as a model.

To summarize, AutoML is a tool/parameter named auto-train that's part of the mlnet.exe command-line tool. The combined tools call into the ML.NET library, running in .NET Core, using the .NET Core SDK. Notice there isn't a specific "AutoML" entity. The term AutoML was used in pre-release versions of the system and still appears in much of the documentation. The system is also called AutomatedML and auto-train.

After identifying the best multiclass classification algorithm and saving the trained model, the AutoML tools created two Visual Studio projects in a subdirectory named EmpClassifier. The first project is named EmpClassifier.ConsoleApp.csproj, which has the C# code that created and trained the best model, and C# code that calls the trained model to make a prediction. The second project is named EmpClassifier.Model.csproj, which, somewhat confusingly, is essentially a duplicate of the first project and is there for backward-compatibility reasons.

Installing the AutoML for ML.NET Components

To use AutoML for ML.NET the first step is to install Visual Studio 2019 if you don't already have it on your machine. Go here and select the 2019 Community (free) edition option. This will install a Visual Studio Installer utility program that allows you to customize Visual Studio with different capabilities in modules called workloads. Because AutoML for ML.NET is based on .NET Core, you only need the ".NET Core cross-platform development" workload, but I recommend installing the traditional ".NET desktop development" workload too. Launch VS 2019 from the Windows Start menu to verify the installation. If you ever need to uninstall VS 2019, you can do so through the VS Installer program.

Next you need the .NET Core SDK. Go here and select the "Download .NET Core SDK" option. This will download a self-extracting executable named something like dotnet-sdk-2.2.1-win-x64.exe. When you run it, you will get the SDK so you can develop .NET Core programs and the runtime so you can run .NET Core programs. I installed version 2.2.1, but by the time you read this there could be a newer version available. After installing the SDK and runtime, launch a command shell and enter "dotnet --version" to verify the installation. If you ever need to uninstall the SDK, you can do so through the Control Panel Add/Remove Programs GUI interface.

The third step is to install the ML.NET CLI tool, which also includes the ML.NET library. Make sure your machine is connected to the Internet. Launch a command shell and enter "dotnet tool install -g mlnet". The -g argument stands for global. This command will magically install program mlnet.exe in the C:\Users\<user>\.dotnet\tools directory and update your system PATH environment variable to point to the program. You will also get the ML.NET library installed so you don't need to install it separately. Launch a command shell and enter "mlnet --version" to verify the installation. If you ever need to uninstall the ML.NET CLI tool, you can do so from the command line by entering "dotnet tool uninstall mlnet -g".

Preparing Data for ML.NET

The 40-item training dataset and 10-item test dataset are shown in Listing 1. The files are tab-separated and are named employees_train.tsv and employees_test.tsv. The ML.NET library also supports comma-separated files with a .csv extension, and space-separated files with a .txt extension.

Notice that Boolean values are encoded as True or False rather than the 0 or 1 encoding that is common in other libraries. Both data files have a tab-separated header line of (hourly, age, job, income, satisfac). Header lines are not required but are recommended. The dependent variable to predict can be placed in any column, but it's usual to place it as the first or last column.
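To make the file format concrete, here is a minimal sketch of reading one tab-separated data line into typed fields using plain C#, with no ML.NET dependency. The EmployeeRow and EmployeeParser names, the choice of int for age and double for income, and the sample row itself are assumptions made for illustration; the sample row is hypothetical and is not taken from the demo files.

```csharp
using System;
using System.Globalization;

// One parsed data row; field names mirror the header line.
public class EmployeeRow
{
    public bool Hourly;      // True = paid hourly, False = salaried
    public int Age;          // assumed to be a whole number
    public string Job;
    public double Income;    // assumed to be numeric
    public string Satisfac;  // the label to predict
}

public static class EmployeeParser
{
    // Splits a tab-separated line into its five typed fields.
    public static EmployeeRow ParseLine(string line)
    {
        string[] t = line.Split('\t');
        if (t.Length != 5)
            throw new FormatException("Expected 5 tab-separated fields, got " + t.Length);
        return new EmployeeRow
        {
            Hourly = bool.Parse(t[0]),  // accepts "True" or "False"
            Age = int.Parse(t[1], CultureInfo.InvariantCulture),
            Job = t[2],
            Income = double.Parse(t[3], CultureInfo.InvariantCulture),
            Satisfac = t[4]
        };
    }
}

public static class ParseDemo
{
    public static void Main()
    {
        // Hypothetical row, not taken from the demo files.
        EmployeeRow r = EmployeeParser.ParseLine("True\t34\ttech\t62000.00\tmedium");
        Console.WriteLine(r.Hourly + " " + r.Age + " " + r.Job + " " + r.Satisfac);
    }
}
```

AutoML reads the files itself, so you would not normally write a parser like this; it is shown only to illustrate how the True/False encoding and the five columns map onto typed values.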

There is no required directory structure for ML.NET systems, but I recommend using a top-level root directory named Employees that contains a subdirectory named Data where you place the training and test files.

Using AutoML for ML.NET

To invoke the AutoML system, launch a command shell and use the cd command to navigate to the root Employees directory. You can type ML.NET commands as one long line, but I prefer to use the caret continuation character. The auto-train option has 14 parameters as shown in the table in Figure 2. Of these, there are only three required parameters: task, dataset, and label-column-name or label-column-index.

Figure 2. Arguments for the Auto-train Option of ML.NET CLI

The task argument can be "multiclass-classification" when the class to predict has three or more possible values, "binary-classification" when the class to predict has exactly two possible values, or "regression" when the value to predict is numeric, such as income or age.

The dataset argument is required and is the path to the training data. You can use an absolute path or a relative path. The test-dataset argument is optional. If no test-dataset is given, the system will use the training data when computing accuracy.

The label-column-name argument is used to specify the dependent variable to predict. If your data files do not have a header, then you can use the label-column-index parameter with a 1-based integer value.

The name argument specifies the name of the subdirectory and Visual Studio projects that will be created. The name argument is optional and if omitted, you'll get a long, ugly default name that contains the word Sample.

The cache argument tells AutoML to load the entire training dataset into memory if possible. This argument is optional and if omitted, AutoML will try to determine whether to cache data or not.

The max-exploration-time argument limits the amount of time, in seconds, that AutoML is allowed to explore different machine learning algorithms and their hyperparameters. The default value is only 10 seconds, which is rarely long enough to get good results in a non-demo scenario.

Interpreting Results

For a multiclass classification problem, AutoML generates two key metrics. MicroAccuracy is the percentage of correct predictions made by the model on the test dataset. For example, in Figure 1 you can see that the FastTreeOva algorithm created a model that scored 0.6000 (60.00 percent) on the 10-item test data: 6 out of 10 predictions correct. When AutoML has enough time to run a particular algorithm more than once, it will average the results.

The MacroAccuracy metric is the average of the per-class accuracies, so every class counts equally regardless of how many items it has. MacroAccuracy is useful when a dataset is highly skewed towards one class. For example, suppose a test dataset of 200 items had 180 low-satisfaction items, 10 medium-satisfaction items and 10 high-satisfaction items. A model could just predict low-satisfaction for all items and score 180 / 200 = 0.9000 accuracy for the MicroAccuracy metric. But the per-class accuracies would be 180 / 180 = 1.0000 for low, 0 / 10 = 0.0000 for medium, and 0 / 10 = 0.0000 for high, so the MacroAccuracy would be (1.0000 + 0.0000 + 0.0000) / 3 = 0.3333. Therefore, when you see a big discrepancy between MicroAccuracy and MacroAccuracy, you should make sure the model is not simply predicting the most common class.
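The difference between the two metrics is easy to see when they are computed by hand. The following is a minimal sketch in plain C#, not ML.NET's own implementation: the AccuracyMetrics class and its method names are made up for illustration. It assumes micro-accuracy is the fraction of all items predicted correctly and macro-accuracy is the unweighted mean of the per-class accuracies, and it applies both to the skewed 200-item scenario where a model always predicts "low".

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class AccuracyMetrics
{
    // Fraction of all items predicted correctly (every item counts equally).
    public static double MicroAccuracy(IList<string> actual, IList<string> predicted)
    {
        int correct = 0;
        for (int i = 0; i < actual.Count; ++i)
            if (actual[i] == predicted[i]) ++correct;
        return (double)correct / actual.Count;
    }

    // Mean of the per-class accuracies (every class counts equally).
    public static double MacroAccuracy(IList<string> actual, IList<string> predicted)
    {
        var total = new Dictionary<string, int>();
        var correct = new Dictionary<string, int>();
        for (int i = 0; i < actual.Count; ++i)
        {
            string c = actual[i];
            if (!total.ContainsKey(c)) { total[c] = 0; correct[c] = 0; }
            total[c] += 1;
            if (predicted[i] == c) correct[c] += 1;
        }
        return total.Keys.Average(c => (double)correct[c] / total[c]);
    }
}

public static class AccuracyDemo
{
    public static void Main()
    {
        // 180 low, 10 medium, 10 high items; the model predicts "low" every time.
        List<string> actual = Enumerable.Repeat("low", 180)
            .Concat(Enumerable.Repeat("medium", 10))
            .Concat(Enumerable.Repeat("high", 10))
            .ToList();
        List<string> predicted = Enumerable.Repeat("low", 200).ToList();

        double micro = AccuracyMetrics.MicroAccuracy(actual, predicted);
        double macro = AccuracyMetrics.MacroAccuracy(actual, predicted);
        Console.WriteLine(micro.ToString("F4", CultureInfo.InvariantCulture)); // 0.9000
        Console.WriteLine(macro.ToString("F4", CultureInfo.InvariantCulture)); // 0.3333
    }
}
```

The always-predict-low model gets every low item right (a per-class accuracy of 1.0) and every medium and high item wrong (0.0 each), which is why the macro figure collapses to one third while the micro figure stays at 0.9.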

Many of the classification algorithms used by AutoML have "Ova" in their names. This stands for "one versus all". The OVA technique is something of a hack applied to a classification algorithm that is designed for binary classification so that the algorithm can be used for multiclass classification.
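The idea behind one-versus-all can be sketched in a few lines of plain C#. This is only an illustration of the general technique, not ML.NET's implementation: the OneVersusAll class is a made-up name, and the three "scorers" are hard-coded stand-ins for trained binary models. Step 1 relabels the multiclass data as "this class versus everything else" so that one binary model can be trained per class; step 2 predicts the class whose binary model reports the highest score.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OneVersusAll
{
    // Step 1: turn multiclass labels into binary labels for one target class.
    public static bool[] Relabel(IList<string> labels, string target)
    {
        return labels.Select(l => l == target).ToArray();
    }

    // Step 2: given one binary scorer per class, pick the class with the highest score.
    public static string Predict(IDictionary<string, Func<double[], double>> scorers, double[] x)
    {
        string best = null;
        double bestScore = double.NegativeInfinity;
        foreach (KeyValuePair<string, Func<double[], double>> kv in scorers)
        {
            double s = kv.Value(x);
            if (s > bestScore) { bestScore = s; best = kv.Key; }
        }
        return best;
    }
}

public static class OvaDemo
{
    public static void Main()
    {
        string[] labels = { "low", "medium", "high", "low" };
        bool[] lowVsRest = OneVersusAll.Relabel(labels, "low");
        Console.WriteLine(string.Join(",", lowVsRest.Select(f => f.ToString())));

        // Hypothetical stand-ins for three trained binary models (constant scores).
        var scorers = new Dictionary<string, Func<double[], double>>
        {
            { "low", x => 0.10 },
            { "medium", x => 0.30 },
            { "high", x => 0.80 }
        };
        Console.WriteLine(OneVersusAll.Predict(scorers, new[] { 1.0, 2.0 })); // prints "high"
    }
}
```

With three classes, three binary models are trained, one per relabeled copy of the data, which is why OVA variants can take noticeably longer to explore than a natively multiclass algorithm.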

Wrapping Up

After generating a prediction model using AutoML, the next step is to use the model to make predictions. The auto-generated Visual Studio projects have code that makes a prediction for the first data item in the training dataset. But you can modify the template code to make a prediction for a new, previously unseen item. Such code could look like:

The prediction probabilities array might have values like (0.2000, 0.0500, 0.7500) and then the predicted class would be "high" because the value at index [2] is the largest. The probabilities are ordered by how each class first appears in the training dataset. If you refer back to Listing 1 you'll notice that the first item is "low," the second item is "medium," and the third item is "high." Because of this mechanism, when using AutoML for ML.NET it's a good idea to rearrange the first few items in the training dataset so that each class to predict appears one by one in a logical order of some kind.
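The mapping from a probabilities array back to a class name can be sketched as follows. This is plain C# with no ML.NET dependency; the ClassOrder class, the label sequence, and the probability values are all hypothetical and chosen only to mirror the ordering described above. The sketch first recovers the class order from the order of first appearance in a label column, then takes the index of the largest probability.

```csharp
using System;
using System.Collections.Generic;

public static class ClassOrder
{
    // Distinct labels in the order in which they first appear.
    public static List<string> FirstAppearanceOrder(IEnumerable<string> labels)
    {
        var seen = new HashSet<string>();
        var order = new List<string>();
        foreach (string l in labels)
            if (seen.Add(l)) order.Add(l);  // Add returns false for repeats
        return order;
    }

    // Index of the largest value; ties go to the earliest index.
    public static int ArgMax(IList<float> probs)
    {
        int best = 0;
        for (int i = 1; i < probs.Count; ++i)
            if (probs[i] > probs[best]) best = i;
        return best;
    }
}

public static class ClassOrderDemo
{
    public static void Main()
    {
        // Hypothetical label column: low, medium, high appear in that order.
        string[] trainLabels = { "low", "medium", "high", "medium", "low" };
        List<string> classes = ClassOrder.FirstAppearanceOrder(trainLabels);

        // Hypothetical probabilities for one prediction.
        float[] probs = { 0.20f, 0.05f, 0.75f };
        Console.WriteLine(classes[ClassOrder.ArgMax(probs)]); // prints "high"
    }
}
```

If the training file instead began with a "high" item, index [0] would mean "high", so the same probabilities array would be read differently. Arranging the first rows in a deliberate order keeps that mapping predictable.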
