Hi, I am Alberto De Marco and I write this blog. I am mainly interested in security and ML/big data tech, but also in some other collateral topics.

As you know, I am running Machine Learning for All so that anyone who wants to can experiment with automated machine learning for free.

This time I want to share some of the code running behind the scenes: in particular, the Scala code that triggers the execution of the entire automated machine learning workflow once a data file and a target-column file have been uploaded for analysis to a temporary folder of my Azure Blob account. As I explained here, the workflow runs on top of TransmogrifAI.

Of course there is a LOT that can be improved (how many times am I rewriting the same blob configuration paths?!, proper error management, etc…), but it’s a starting point :-).
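As a rough illustration of the trigger logic (a minimal Python sketch; the file-naming convention and all function names here are my own assumptions for illustration, not the actual implementation):

```python
# Hypothetical sketch: an upload is "ready" when both the data file and its
# companion target-column file are present in the temporary blob folder, at
# which point the automated ML workflow can be started for that dataset.

def find_ready_uploads(blob_names):
    """Given the blob names in the temp container, return the dataset
    prefixes for which both <prefix>.csv and <prefix>.target exist."""
    names = set(blob_names)
    ready = []
    for name in names:
        if name.endswith(".csv"):
            prefix = name[: -len(".csv")]
            if prefix + ".target" in names:
                ready.append(prefix)
    return sorted(ready)

def poll_and_trigger(blob_names, start_workflow):
    """Kick off the workflow for every complete (data + target) upload."""
    for prefix in find_ready_uploads(blob_names):
        start_workflow(prefix + ".csv", prefix + ".target")
```

In the real setup the blob listing would come from the Azure storage SDK; the point is only that nothing runs until both halves of the upload have landed.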

Hi everyone, this time I would like to share some of the insights I gained while working on automated machine learning projects, discussing the vision, what is achievable today, and what the future may hold.

The vision, or at least my vision, is that with the proper tools and software we can, and must, empower all employees to use their business/domain knowledge to make data-driven decisions that impact company performance.

Now let’s dig into the vision statement and understand, step by step, the elements we need to have in place:

Valuable business/domain knowledge

Be able to quickly learn and master new tools/software

Be able to extract/combine/clean data from internal and external sources

Be able to transform insights into decisions

At first sight it seems very difficult to find someone with all those characteristics, and in fact it truly is; however, you can quickly overcome those difficulties with a team whose combined skills are sufficient to achieve the desired outcome.

One example can be the following :

A senior manager with deep business knowledge, a good understanding of company data, and the curiosity to experiment with new technologies

A data analyst with a solid statistical background, good experience in extracting/combining/cleaning data, and the ability to code new solutions combining different technologies

Of course this is just an example, but you get the idea: you don’t have to search for unicorns; you can build these teams from your existing workforce, adding, if necessary, some initial external help to bootstrap your specific initiative.

Let’s take another example to show how a hypothetical automated machine learning project can be run:

Opportunity statement and selection. The manager and the data analyst prioritize the use cases that produce a high impact, are achievable (data availability in terms of quantity, quality and sources), and can actually drive decisions (so not just a search for insights, but real impact on business processes).

Once the use cases are selected, a time limit has to be defined in order to cap the effort for each bundle of use cases and to test as fast as possible whether positive results can be achieved.

Data collection and aggregation/combination. At this step the business expertise has to be blended with the data available, and with data you can potentially obtain internally or externally, to produce meaningful datasets that contain your target variable and as many potential influencing factors as possible.

Treasure hunt! At this stage, using automated machine learning, you can quickly iterate over hundreds of models and different hypotheses, which usually triggers the need for more data to be added to the datasets. It’s very important to time-box this stage, otherwise you can stay here far too long without producing any results.

Treasure found! Depending on a variety of factors (availability of data, ability to model the data, luck, etc…), you have unlocked a small or big insight telling you: your target variable is influenced by A, B, C, D, etc…

Understand what you can do about it. Yes, you found that A, B, C, D… are influencing factors, but which of them are variables you can truly change, and by changing them obtain better performance?

Optionally, simulate. Once you find that you can control only A and D, for example, simulate the possible variations of the A and D values, combined with your business constraints, and find the values that can boost the performance of your process.

Execute and monitor. Apply the findings of the previous steps, measure the impact you are generating, and monitor the deviations between the expected outcomes and the real ones.
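The optional simulation step above can be sketched as a tiny grid search (the model and the business constraint below are toy stand-ins, purely for illustration):

```python
from itertools import product

# Toy sketch of the "simulate" step: enumerate the allowed values of the
# controllable factors (A and D), discard the combinations that violate a
# business constraint, and keep the one the model scores best.

def simulate_best(model, a_values, d_values, constraint):
    best, best_score = None, float("-inf")
    for a, d in product(a_values, d_values):
        if not constraint(a, d):
            continue  # skip combinations the business rules forbid
        score = model(a, d)
        if score > best_score:
            best, best_score = (a, d), score
    return best, best_score

# Invented example: performance improves with A and degrades with D,
# under the constraint that A + D must stay within a budget of 10.
model = lambda a, d: 2 * a - d
constraint = lambda a, d: a + d <= 10
best, score = simulate_best(model, range(0, 11), range(0, 11), constraint)
```

In a real project the `model` would be the trained ML model’s prediction, and the grid would come from the actual ranges the business can act on.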

At the final stage you will probably need a developer to integrate the machine learning models with your processes, plus a platform able to monitor and maintain the model lifecycle and detect data drift; the most mature automated machine learning solutions offer this inside their package.

What about the future?

Probably we will see even more automation coming:

automated suggestions of (internal & external) datasets related to the ones we are using

automated transformations based on the cardinality of our datasets

automated/semiautomated data pipeline creation/sharing in and between ml projects

automated/semi-automated industrialization (not only a model exposed as an API, but a model exposed as a complete app or embedded in existing apps)

One of my dream applications since I started working with data has been a “magic” app able to provide me, for a given dataset, the insight I was looking for.

At the same time, in my work I have observed the need for this kind of tool several times, so I decided to create a simple website http://mlforall.azurewebsites.net/ (still an alpha version) to see if I could assemble something like that.

How does it work?

Upload a csv file with the data you want to analyze

Choose the column you want to understand

Done! In a few minutes you will receive the results in your email!

You can, for example, upload the Titanic dataset, choose the Survived column as your objective, and in a few minutes get this:

So this means that sex and the price/kind of ticket were very important factors for survival. In fact, looking at the data you can see that a significant share of the people who survived were female and/or held a first-class ticket.

What about name? Well, in reality the name contains “Mrs” and “Mr” terms that are an equivalent of sex, and that’s why it is marked as important.
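You can see this for yourself with a few lines of Python: Titanic names follow the pattern “Braund, Mr. Owen Harris”, so the honorific between the comma and the period can be pulled out and compared with the Sex column (a sketch, not part of the site’s actual pipeline):

```python
import re

# Titanic passenger names encode a title ("Mr", "Mrs", "Miss", ...) that
# largely duplicates the Sex column, which is why Name ranks as important.

def extract_title(name):
    """Return the honorific between the comma and the period, or None."""
    match = re.search(r",\s*([A-Za-z]+)\.", name)
    return match.group(1) if match else None
```

Running this over the Name column and cross-tabulating against Sex shows the near-perfect overlap.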

Now I guess the question you have is: why are you doing this, and how are you sustaining its costs?

The answer to the first question is that I want to understand whether “normal” people can really benefit from tools like this.

The answer to the second question is that, for now, I am the one paying, but with tight cost control and the right architecture you can do the same for a few dollars, or even less, per day.

Let’s move on, then, to the architecture and the software stack (still a work in progress, of course):

In essence the flow is the following:

A file uploaded on the website (App Service) lands in a blob storage container

If you ask your friends how to cook chicken perfectly you will get several different answers depending on their preferences, their style of cooking, their taste, etc., but on one thing they will all agree: a badly cooked chicken will not taste good and can even be dangerous for your health.

The same applies to cloud deployments: we all have our own ideas about the best solutions, architectures, etc., but we are all able to tell when something does not seem completely right.

Be global: Data/Applications replicated across the globe with robust consistency

Be fast: Any write/update/read/query/page view should happen in milliseconds

So, in a perfect world, to avoid the pitfalls we would like to have cloud resources that auto start/stop and scale up/down according to traffic/usage within safe limits, that we pay for by the second when used, that we don’t manage at the OS level, that are accessible from anywhere, that a team of 2-3 people can easily manage, and whose underlying technology is constantly updated to the latest and greatest standards.

At the same time we can quickly see that several of these common desires are simply not achievable in this perfect world. Let’s take an example: if we want to use the latest and greatest open source software, it is often our duty to manage the OS.

Similarly, if we want to be cloud portable we cannot leverage any cloud-specific solution and we have to work at the lowest common denominator between clouds: VMs. So again we have to manage the OS, and probably maintain an army of people developing scripts and packages tested against all the different versions of VM/network/storage/accounts/etc. across the different clouds.

Still under this hypothesis, we have to script the start/stop/scale up/down logic, monitoring, etc. ourselves, and basically create our own “multi-cloud account/network/storage/etc. provider”.

Now all of this, even if it seems very hard to do, has been done completely or partially by giants like Facebook, Spotify, etc., so in theory any company can do the same, under certain conditions:

Put on the plate the same level of investment as those companies

Be able to attract and hire the same level of talented employees

Have very few specialized IT workloads that are, in the end, the main revenue stream of the company itself (so “the product sold is the software shipped”).

Hi everyone, this time I want to evaluate another automated machine learning tool called H2O Driverless AI and also compare it with DataRobot (of course, only a very lightweight comparison has been done).

The first great feature of H2O Driverless AI is that you can have it (almost) instantly: as long as you have an Amazon, Google or Azure account, you can spin up an H2O Driverless AI instance quite easily, as described here:

You can choose whether to bring your own license (you can ask for a 21-day evaluation, as I did) or pay the cost of the license inside the hourly cost of the VM at your cloud provider.

Once you have your VM up and running, my suggestion is to update it to the latest Docker image of H2O Driverless AI as described in the how-to:

sudo h2oai update

and then connect to the UI.

Once connected, you can upload your datasets directly from the UI and run ML experiments with them, choosing which column you want to predict/analyze and the metric with which you want to measure your model (AUC, etc.).

Here is one running on 4 GPUs:

Once the experiment is finished, the model interpretation page lets you understand the key influencers in your dataset for the target column you were interested in analyzing/predicting.

Since a few days ago I did some tests with DataRobot on Kaggle competitions, I tried to perform the same on H2O, and the results are…

Titanic Competition (metric: accuracy, higher is better):

DataRobot 0.79904 (Best)

H2O 0.78947

House Prices Regression (metric: RMSE, lower is better):

DataRobot 0.12566 (Best)

H2O 0.13378

As you can see, DataRobot leads on both, but the H2O results are not far behind!

Talking instead about model understanding and explainability of the results in “human” terms, I find DataRobot offers more varied and meaningful visualizations than H2O. Additionally, with DataRobot you can decide for yourself which of the many models you want to use, not only the winning one (there are cases where you want a model that is less accurate but has a higher evaluation/inference speed), while with H2O you have no choice but to use the only one surviving the automatic evaluation process.

H2O, however, is more accessible in terms of testing/trying, and it offers GPU acceleration, which is a very nice bonus, especially on large datasets.

Hi everyone, this time I want to share one of my favorite side activities: playing with my Ubiquiti home setup!

As you already know, I have my controller running on Azure, but I wanted to understand more about which kind of data is stored inside the controller; in other words, where the data we see in the controller dashboard lives.

Inspecting the binaries and reading 2-3 posts on the forums, I figured out that this data sits in a MongoDB database, and of course I wanted to look a bit inside it.

What I did is the following: I made a backup of the controller’s data using its web interface and downloaded it locally:

At this stage I downloaded the controller software for an installation on my laptop (a MacBook) and, at controller startup, I requested a restore of the backup I had just downloaded from the cloud controller.

Once the restore is done, keep the controller running; you can then use a MongoDB client like Robo 3T and connect to localhost on port 27117 (we connect to the mongod process started locally by the controller).
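If you prefer the command line to Robo 3T, the stock mongo shell can reach the same embedded instance (assuming a MongoDB client is installed locally; the database name is the controller’s own):

```shell
# Connect to the mongod process the controller starts on port 27117
mongo --port 27117

# Then, inside the shell, poke around:
#   show dbs
#   use ace_stat
#   show collections
```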

This is great! But I would like to produce some nice dashboards with a visualization tool like Tableau, Power BI or simply Excel; however, the data is in a “document” format, while I need it in a table/records format.

The solution is the MongoDB BI Connector, which is a kind of “wrapper” or “translator” between the document world and the tables/records world.

But things are never simple ;-): this connector works only with MongoDB v3.0 or higher, while the one inside the controller software is 2.6. So first we have to download a separate MongoDB server that works with it, but more importantly, upgrade the database itself to the 3.x format.

First, copy the database from the controller folder (look for a folder called db) to another location, and write down this location.

I tried and failed various times before understanding how to do it, but this is the sequence (using brew to install MongoDB on my Mac):

install mongodb 3.0 –> open the controller database in the location we copied.

uninstall 3.0 /install 3.2 –> open the controller database in the location we copied.

uninstall 3.2 /install 3.4 –> open the controller database in the location we copied.

This will bring the database to a format that works with the BI Connector.

Now with the BI Connector you can extract the schema of any document collection you like (for example the stat_daily collection of the ace_stat database) and after that spawn the wrapper process that can be used by a visualization tool:
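Those two steps can be sketched with the BI Connector’s own tools, mongodrdl and mongosqld (check the flags against your installed version; the output file name is my choice):

```shell
# 1) Extract a relational schema from a document collection
mongodrdl --host localhost --port 27117 \
          --db ace_stat --collection stat_daily \
          --out stat_daily.drdl

# 2) Spawn the SQL "wrapper" process that Tableau/Power BI can query
mongosqld --schema stat_daily.drdl --mongo-uri mongodb://localhost:27117
```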

In my case I used Tableau to create some test dashboards:

Here I see that the CPU of my gateway was a bit high during the first part of the month and then decreased significantly.

I can add other metrics, like downloaded data, etc., to understand better:

In reality, in this specific case there is already a super nice visualization offered by the controller dashboards:

So the really interesting thing here is that you can create your own reports and also discover new insights by looking at your own network data.

So what are you waiting for? Happy custom reporting on your UniFi network and device data!