In the spirit of AI, I decided to look into the various chatbot frameworks available and build a POC. I landed on Microsoft's Azure Bot Service as my preferred choice. I've had a positive experience with Azure's ML tools, and Microsoft has done a tremendous job over the past 3 years investing in cloud services. Their Bot Service also integrates with an NLP service called LUIS (Language Understanding Intelligent Service), which is also owned by Microsoft. ***Note: the demo is located at the bottom of the post***

Coding it was simple; I chose Node.js instead of C#. The template provided within the dashboard was a great start, and I only had to make a few updates to integrate with LUIS for NLP in the domain of my choice. The entire dev and deployment is 100% serverless, which I love.

More about the NLP using LUIS: its job is to take an utterance (aka a sentence) and determine the intent. That intent is used by the Bot Framework to reply with a result. In order for your chatbot to provide some functionality, you need to train a model in LUIS with these utterances. The model requires two things: Intents (verbs) and Entities (nouns).
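To make the utterance-to-intent flow concrete, here is a minimal sketch of querying LUIS for the intent of an utterance. The endpoint URL, app ID, and key below are placeholders and the exact request/response shape depends on your LUIS region and API version, so treat this as illustrative rather than the bot's actual code (the bot itself was written in Node.js, where the Bot Builder SDK handles this lookup):

```python
import requests

# Placeholder values: substitute your own LUIS region, app ID, and key.
LUIS_ENDPOINT = "https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/<your-app-id>"
LUIS_KEY = "<your-subscription-key>"

def get_intent(utterance):
    """Send an utterance to LUIS and return the top-scoring intent and entities."""
    resp = requests.get(
        LUIS_ENDPOINT,
        params={"subscription-key": LUIS_KEY, "q": utterance},
    )
    resp.raise_for_status()
    result = resp.json()
    # Field names may differ by LUIS version; .get() keeps the sketch defensive.
    return result.get("topScoringIntent", {}).get("intent"), result.get("entities", [])

# e.g. "How much is an oil change?" might map to a hypothetical "GetServiceCost" intent.
print(get_intent("How much is an oil change?"))
```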

I decided to create a fictitious chatbot for a company called “Jon’s Auto Repair”. The goal of the chatbot was to allow the customer to find out what services are offered and what they cost. Much more could be built, including scheduling services and custom Q&A.

Whenever I’m faced with a machine learning task, my goal on day 1 is to build an initial model. The model will without a doubt need to be tuned in the days or even weeks after, but it’s good to have a starting point. In the project below, I timeboxed an initial machine learning model to about 4 hours to see how far I could get and what initial results I'd see.

A peer of mine in my Master’s program mentioned that there is publicly available Medicare CMS data. I have very little knowledge of healthcare data, but I thought I’d explore the data and see if there was an aspect that could be useful in building a model to make predictions.

The data:

* 2008 claims outpatient data (used this; only 1 of 20 available samples, still about 1.1 million rows of claims data)
* 2008 beneficiary data (used this)
* 2008 claims inpatient data (did not use due to the initial time constraint)
* 2008 prescription data (did not use due to the initial time constraint)

I identified one useful piece of information to build a model on: predicting Medicare claims with ICD9 codes relating to diseases of the circulatory system (these make up about 11% of claims).
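For context, diseases of the circulatory system correspond to ICD-9 codes 390–459. A rough sketch of how claims could be flagged for that grouping is below; the filename is a placeholder and the loading code isn't from the original post, but the ICD9_DGNS_CD_1 column matches the one used as the target later on:

```python
import pandas as pd

# Placeholder filename: point this at the downloaded CMS outpatient claims sample.
claims = pd.read_csv("outpatient_claims_2008.csv")

def is_circulatory(icd9_code):
    """ICD-9 codes 390-459 cover diseases of the circulatory system."""
    try:
        return 390 <= int(str(icd9_code)[:3]) <= 459
    except ValueError:
        return False  # non-numeric (e.g. V/E codes) or missing values

claims["circulatory_flag"] = claims["ICD9_DGNS_CD_1"].apply(is_circulatory)
print(claims["circulatory_flag"].mean())  # roughly the ~11% share noted above
```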

Identify Features from Beneficiary data (just grabbed them all to start)


features = ['AGE', 'BENE_RACE_CD', 'BENE_COUNTY_CD', 'BENE_ESRD_IND',
            'BENE_HI_CVRAGE_TOT_MONS', 'BENE_SMI_CVRAGE_TOT_MONS',
            'BENE_HMO_CVRAGE_TOT_MONS', 'PLAN_CVRG_MOS_NUM', 'SP_ALZHDMTA',
            'SP_CHF', 'SP_CHRNKIDN', 'SP_CNCR', 'SP_COPD', 'SP_DEPRESSN',
            'SP_DIABETES', 'SP_ISCHMCHT', 'SP_OSTEOPRS', 'SP_RA_OA',
            'SP_STRKETIA']

# The name of the column for the output variable.
target = 'ICD9_DGNS_CD_1'

Group Target ICD9 codes from Claims data (chose Circulatory System Diseases – which is 1 of 17 ICD9 groupings)

from sklearn.cross_validation import train_test_split

x = df_joined_cleaned[features]
y = df_joined_cleaned[target]

# Divide the data into a training and a test set.
random_state = 0  # Fixed so that everybody has got the same split
test_set_fraction = 0.2
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=test_set_fraction, random_state=random_state)

print('Size of training set: {}'.format(len(x_train)))
print('Size of test set: {}'.format(len(x_test)))

from sklearn import decomposition

pca = decomposition.PCA(n_components=9)
print('original shape prior to PCA', x_train.shape)
x_train_new = pca.fit_transform(x_train)
x_test_new = pca.transform(x_test)
print('new shape after PCA', x_train_new.shape)

original shape prior to PCA (1715, 19)
new shape after PCA (1715, 9)

Of the 40 ICD9 codes representing Circulatory diseases, my model only produced predictions for 0422, 0430, and 412, which isn’t ideal, but those three codes make up 37% of my training data. Above, I plotted recall, precision, and f1 scores. I like using the f1 score as it’s really a balance of recall & precision (what portion of true positives your model is capturing and how good it is at predicting true positives). At this point, much more investigation of the data and tweaking of the models is needed to improve performance. Gaining domain knowledge in this field would certainly help too!

The data is unbalanced, and if I had just guessed code 412 for all instances, my recall would have increased, but my precision and f1 would have dropped.
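The classifier and plotting code aren't shown above, so as a hedged sketch, here is one way the per-class precision/recall/f1 comparison could be produced, including the "always guess 412" baseline from the previous paragraph. The choice of logistic regression is mine for illustration, not necessarily what the original model used; the variable names come from the code above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Fit a simple multiclass model on the PCA-reduced features.
clf = LogisticRegression()
clf.fit(x_train_new, y_train)
y_pred = clf.predict(x_test_new)
print(classification_report(y_test, y_pred))

# Baseline: always predict the most common ICD9 code in the training set (e.g. 412).
majority_code = y_train.value_counts().idxmax()
y_baseline = np.full(len(y_test), majority_code)
print(classification_report(y_test, y_baseline))
```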

This was a “quick and dirty” model-building exercise, which didn’t produce great results but is a good starting point. Rarely are you going to get great results with a limited amount of work.

Overall, there is some opportunity here, but it would take many more iterations of model tuning. I would recommend bringing in the drug prescription data source, along with a couple more years of claims data, so that health trends by patient could be leveraged.

Working with arrays in Hive is pretty slick. However, I’ve run into an issue: among the published Hive UDFs, there is no function to return the index of a value within an array when it contains the item you’re looking for. So I took it upon myself to write it. This code runs on Hive:
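The UDF itself isn't reproduced here. As a rough sketch of the behavior it implements, the logic is equivalent to the small Python function below, which returns the position of the first match, or -1 when the value isn't present (the name array_index is my own placeholder):

```python
def array_index(arr, value):
    """Return the index of the first occurrence of value in arr, or -1 if absent.

    Mirrors the behavior of the custom Hive UDF described above.
    """
    if arr is None:
        return -1
    for i, item in enumerate(arr):
        if item == value:
            return i
    return -1

# Example: find which position 'b' occupies in an array column's value.
print(array_index(['a', 'b', 'c'], 'b'))  # 1
print(array_index(['a', 'b', 'c'], 'z'))  # -1
```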

When working with complex datasets, custom code is often needed for the intended solution. However, when designing custom code, using object-oriented design practices promotes code reusability and makes it easier to update and extend functionality. In this post, I’m going to look at the Titanic data (the roughly 2k passengers, who survived, etc.), which you can download here: Titanic dataset.

This is a basic dataset, but the principles can be applied to more complex data sets. The idea is to perform different operations on each row of data depending on the passenger class (1st, 2nd, 3rd, or the ship’s crew). This could be accomplished with a bunch of “if, else” statements, but again, I’m looking for clean and reusable code here, and when working with complex data sets, that’s a much better approach.
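The original post's classes aren't shown here, but a minimal sketch of the idea might look like the following: one handler class per passenger category, registered in a lookup, so adding a new category means adding a class rather than another branch. The class and method names are my own illustration, not the original code.

```python
class PassengerHandler:
    """Base handler; subclasses customize how a row is processed."""
    def process(self, row):
        raise NotImplementedError

class FirstClassHandler(PassengerHandler):
    def process(self, row):
        return {"name": row["name"], "group": "1st class", "survived": row["survived"]}

class CrewHandler(PassengerHandler):
    def process(self, row):
        return {"name": row["name"], "group": "crew", "survived": row["survived"]}

# Map each passenger category to its handler instead of chaining if/else.
HANDLERS = {
    "1st": FirstClassHandler(),
    "crew": CrewHandler(),
    # "2nd" and "3rd" handlers would be registered the same way.
}

def process_row(row):
    handler = HANDLERS.get(row["class"])
    return handler.process(row) if handler else None

print(process_row({"class": "crew", "name": "J. Smith", "survived": 0}))
```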

One of the things I love about Hive is the ability to run Python and leverage the power of parallel processing. Below I’m going to show a stripped-down example of how to integrate a Hive statement and Python to aggregate data and prepare it for modeling. Keep in mind, you can also use Hive & Python to transform data line by line, and it’s extremely handy for data transformation.

Use case: print out an array of products sold to a particular user. Again, this is a basic example, but you can build upon it to generate the products sold for every user, then use KNN to generate clusters of users, or perhaps Association Rules to generate baskets.

Here is the Python script, which will need to be saved to a path Hadoop can access:
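The script itself was not included above, so here is a minimal sketch of what it could look like, assuming Hive streams tab-separated (user_id, product) rows to it and the rows arrive grouped by user (for example, after a DISTRIBUTE BY user_id SORT BY user_id in the Hive query); the column names are placeholders:

```python
#!/usr/bin/env python
import sys

# Reads tab-separated (user_id, product) rows from stdin and emits one row
# per user with the array of products they purchased.
current_user = None
products = []

for line in sys.stdin:
    user_id, product = line.strip().split('\t')
    if current_user is not None and user_id != current_user:
        print('{}\t{}'.format(current_user, products))
        products = []
    current_user = user_id
    products.append(product)

# Flush the final user.
if current_user is not None:
    print('{}\t{}'.format(current_user, products))
```

On the Hive side, a script like this is typically registered with ADD FILE and invoked through a SELECT TRANSFORM ... USING clause, with the DISTRIBUTE BY/SORT BY ensuring each user's rows reach the same script instance together.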

This recommender demo illustrates how a website (online music, e-commerce, news) can generate recommendations to increase engagement and conversions.

This is not production ready, merely a POC of how it works.

* user selects favorite activities
* data is passed to the server and processed in Hadoop
* user can go to the results page and select an activity to get recommendations

At this point, an auto-workflow has not been built, so there are a series of steps to create the new dataset. Here are the general steps:

1. user data feeds into a database via the website (which is used in generating recommendations)
2. data is moved to and processed in Hadoop
3. data is moved to MySQL, accessible using PHP
4. user selects an activity, and the crowd-sourced recommendations are displayed

Example: How Crowd-Sourcing Works (co-occurrence recommendations) Using Activities

A new user likes to go to Weddings, and we need to recommend other activities to them:
* Find the users in the history matrix who also enjoyed Weddings: U = {Jane, Jill}
* Identify the other activities those same users (U) enjoyed, and rank them by count
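A hedged sketch of that co-occurrence logic in Python is below. The activity history is made-up sample data, and in the demo this step actually runs in Hadoop rather than in a single script:

```python
from collections import Counter

# Made-up history matrix: user -> set of activities they enjoy.
history = {
    "Jane": {"Wedding", "Hiking", "Concerts"},
    "Jill": {"Wedding", "Concerts", "Museums"},
    "Bob":  {"Fishing", "Hiking"},
}

def recommend(activity, history, top_n=5):
    """Recommend activities co-occurring with the given one, ranked by count."""
    # Users who also enjoyed the selected activity.
    matching_users = [u for u, acts in history.items() if activity in acts]
    # Count the other activities those users enjoyed.
    counts = Counter()
    for user in matching_users:
        counts.update(history[user] - {activity})
    return counts.most_common(top_n)

print(recommend("Wedding", history))
# e.g. [('Concerts', 2), ('Hiking', 1), ('Museums', 1)]
```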

I’ve had the opportunity, within a Data Mining course in my graduate Software Engineering program, to be part of a project in which we were to create a “recommendation engine”. The dataset we used contained 1M songs, along with the play history of 380k users.

The goal was to provide a recommendation (ranked 1-10) of songs based on the current song played. We used three algorithms: Association Rules, Naive Bayes, and user-user co-occurrence. When tested, the results were mixed, with Association Rules providing the top F1 scores but also the lowest number of recommendations (a large portion of songs had fewer than 10 songs recommended). Co-occurrence was close behind with the 2nd best F1 score, and it provided the largest output of songs as well as the lowest computational requirements.
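For reference, here is a small sketch of how precision, recall, and F1 can be computed for a ranked list of recommended songs against a held-out set of songs the user actually played; the song IDs are made-up placeholders, and the project's exact evaluation setup may have differed:

```python
def recommendation_f1(recommended, actually_played):
    """Precision/recall/F1 for a list of recommended songs vs. a held-out set."""
    hits = len(set(recommended) & set(actually_played))
    precision = hits / float(len(recommended)) if recommended else 0.0
    recall = hits / float(len(actually_played)) if actually_played else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Placeholder song IDs: 10 recommendations, 4 songs in the held-out listening history.
recs = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10"]
held_out = ["s2", "s7", "s42", "s99"]
print(recommendation_f1(recs, held_out))  # (0.2, 0.5, ~0.286)
```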