Identifying Fraud from Emails using Machine Learning

Identifying Fraud from Enron Emails

Objective

Using the Enron email corpus data to extract and engineer model features, we will attempt to develop a classifier able to identify a "Person of Interest" (PoI) that may have been involved or had an impact on the fraud that occured within the Enron scandal. A list of known PoI has been hand generated from this USATODAY article by the friendly folks at Udacity, who define a PoI as individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity. We will use these PoI labels with the Enron email corpus data to develop the classifier.

Data Structure

The dataset used in this analysis was generated by Udacity and is a dictionary with each person's name in the dataset being the key to each data dictionary. The data dictionaries have the following features:

Model Development Plan

Developing the classification model will consist of 6 steps that can be iterated over until an acceptable model performance is obtained. These steps include:
1. Exploratory Data Analysis on the dataset to understand the data and investigate informative features
2. Select features likely to provide predictive power in the classification model
3. Clean or remove erroneous and outlier data from the selected features
4. Engineer/transform selected features into new features appropriate for classification modelling
5. Test different classifiers and review performance
6. Investigate new data sources that may exist to provide better model performance

Data Exploration and Cleaning

Data Overview

The dataset includes 146 unique individials having 21 features extracted from the email corpus. The amount of feature data for each individual varies. Table 1 summarizes the data count for each feature in the dataset and the count of PoI included in non-nan feature values.

Feature

Data Count

Count PoI

name

146

18

poi

146

18

total_stock_value

126

18

total_payments

125

18

email_address

111

18

restricted_stock

110

17

exercised_stock_options

102

12

salary

95

17

expenses

95

18

other

93

18

to_messages

86

14

shared_receipt_with_poi

86

14

from_messages

86

14

from_this_person_to_poi

86

14

from_poi_to_this_person

86

14

bonus

82

16

long_term_incentive

66

12

deferred_income

49

11

deferral_payments

39

5

restricted_stock_deferred

18

0

director_fees

17

0

loan_advances

4

1

There are some features in the dataset that having missing information that will be important to our usecase. Some of the features in the dataset will not be very useful in the classification model, as they do not have labelled PoI in their subset of availible data, such as restricted_stock_deferred and director_fees. loan_advances is such a small data sample that it will likely not provide statistically signifigant predictive strength. deferral_payments is borderline, but there are other feature datasets that will likely provide better predictive information. Reviewing the data for missing values, we will exclude restricted_stock_deferred, director_fees, loan_advances and deferral_payments as features from the model.

Outliers

In analyzing the histograms for each feature, a large outlier was noticed across many variables. After further investigation, this turned out to be attributed to an aggreate row of the dataset labelled "TOTAL". This was removed from the dataset.

It also appears that 'LAY KENNETH L' is a large outlier in many features. Kenneth is a PoI however, so we will keep his data for building the classifier and see if we can maintain the ability to build a generalized classifier for the other PoI, as well as for Kenneth.

Data Imputation

In reviewing the dataset, many of the email feature datasets only contain 14 of the 18 identified PoI, while most of the financial features have the full 18 or 17 PoI included. Throwing out data for 20% of our identified PoI seems unnacceptable for such a scarce dataset. Imputing the mean value for any missing email feature seems appropriate, as there is likely a reasonable average number of emails any one employee might send. We will also impute 0 for any financial feature, making the assumption that if the value is missing the employee did not recieve that form of financial compensation. A review of this data imputation may be required as the model is developed and in analyzing the results.

Feature Selection and Engineering

Selection

As was discussed above, restricted_stock_deferred, director_fees, loan_advances and deferral_payments have been ommitted as features for the classification model due to their small sample size. There also exists features that are labels instead of useful datum, such as name and email_address. These will not be used as features in the classification model. This leaves us with 15 potentially useful features to select from. Reviewing the data structure, there are two general categories of data: Finanical and Meta-Email. These overarching data categories seem well suited for Primary Component Analysis, to build a featureset that encompasses the most predictive information from the features and avoids dependant features such as total_payments and salary causing erroneous classification or overfitting. Applying PCA to the 15 features to reduce the dimensioanilty to 2 overarching categories, we can visualize the transformed feature relationships.

%matplotlibinlineimportnumpyasnpimportpandasaspdimportpicklefromggplotimport*fromfeature_formatimportfeatureFormat,targetFeatureSplitwithopen("my_dataset.pkl","r")asdata_file:my_dataset=pickle.load(data_file)withopen("my_feature_list.pkl","r")asdata_file:features_list=pickle.load(data_file)data=featureFormat(my_dataset,features_list,sort_keys=True)labels,features=targetFeatureSplit(data)#Apply PCA to dataset to allow for visualizationfromsklearn.decompositionimportPCAreduced_data=PCA(n_components=2).fit_transform(features)labels=np.array(labels).reshape(len(reduced_data[:,0]),1)reduced_data=np.hstack((reduced_data,labels))df=pd.DataFrame(reduced_data,columns=['param1','param2','poi'])pca_plt=ggplot(df,aes(x='param1',y='param2',color='poi'))+\
geom_point()printpca_pltpca_plt_zoomed=pca_plt+xlim(-2500000,5000000)+ylim(-1000000,3500000)printpca_plt_zoomed

<ggplot: (32479948)>

<ggplot: (32864214)>

Reviewing the PCA plots, there appears to be a few PoI outliers that should be easy to classify, but zooming in on the cluster of datapoints shows PoI data points are very commingled with non-PoI data points that may be difficult to adequately classify. This model may need more advanced features to develop an adequate classifier.

Feature Engineering

It is likely that additional features are required to adequately model a PoI classifier. Most of the supplied features are absolute values of a persons' financial compensation or email correspondance. It makes sense that a more relative measure of these features is appropriate to use when comparing whether someone is a PoI or not. Three additional features have been engineered for use in classifier development:
1. perc_salary - The percentage each employees' salary makes of of their total payment, defined as salary/total_payments. This will represent if the employees compensation is made up of complex additional payments or just a basic salary.
2. perc_to_poi - The percentage of employees total sent emails to a PoI. This will represent the degree an employee spends their time interacting with a PoI.
3. perc_from_poi - The percentage of employees total recieved emails from a PoI. This will represent the degree a PoI spends their time interacting with an employee.

In reviewing the plots, both the perc_to_poi vs perc_salary and the perc_from_poi vs perc_to_poi plots show good seperation of PoI from non-PoI datapoints. The perc_from_poi vs perc_salary plot shows less of a seperation, with many PoI data points closely clustered with non-PoI datum. These advanced features seem to show more promise then the simple PCA approach above, but we will test and compare the relative performance of tuned classifiers using the two approaches.

Model Development, Tuning and Evaluation

Using sklearn's Machine Learning Map as a guide, we will test the PCA and advanced feature datasets with various classifier algorithms and tuning parameters in an attempt to develop the optimal PoI classifier. Sklearn's pipeline and gridsearch modules will be helpful in performing this analysis. We will also use sklearns StandardScaler to scale out input datasets, as some of our classifier algorithms highly recommend the use of scalled variables, like the svm classifer which is not scale invariant.

Model Validation

In order to determine which model has the "best" performance, model validation is required to prove the models ability to make correct predictions on a generalized dataset. Model validation is the method of extracting a subset of a dataset, known as the test data, training a model excluding this extraction and feeding the model with the extracted dataset to test its performance in predicting the known results of the extracted dataset. This is important to allow for the quantified comparasion of different models.

To compare the performance of each tuned model, we will use Udacity's test_classifier function. This function uses Precision and Recall as the performance metrics for each model. The Precision of the classifier is its ability to correctly identify PoI without incorrectly labelling people that are not PoI as a PoI. The Recall of the classifier is its ability to correctly identify all of the PoI that should be identified. In order to perform the validation classification, the test function uses sklearns StratifiedShuffleSplit method to split the dataset into training and test data. This allows us to maximize the data points used in both the training and testing datasets by performing multiple training and validation experiments on a shuffled subset of the full dataset. Being a fairly sparse dataset with only 18 labelled examples of PoIs, it is important for the testing methodology to perform on multiple interations of training and test data to ensure accurate performance results. It is also important to include shuffling in the test data selection to ensure a random distribution of PoI and non-PoI data points in each dataset.

Reviewing the results, we see that the svm and KNeighbors classifiers have fairly good Precision, meaning they are able to properly identify PoI wihout too many incorrectly labelled PoI. The RandomForestClassiefier has poorer Precision performance. The Recall for all classifiers is rather poor, with the svm classifier having the highest Recall at 0.1255. This indicates that, although the classifiers do not miss-label many non-PoI employees, they tend to under-label correctly people that should be considered PoI. Considering the use case for this model is likely to be a filter that could provide a short-list of people to investigate further, having a high Recall is important to ensure the short-list captures people that are likely to be PoI. Given this, it is likely the above classifiers are not performant enough to provide sufficient value in this investigation. From visual inspection it was seen that the data relationships showed little classifcation potential and after exhaustively tuning these classifiers, that interpretation seems to be correct. The PCA featureset does not seem to provide enough informative characteristics to build a suitable classifier, and further feature engineering is likely required.

Although we managed to improve our best Recall score with a tuned KNeibors Classifier, the Precision performance was dramitacally reduced. At this point, it would likely be beneficial to perform anothet iteration of the model development process in order to better explore the data, develop features and investigate other classisifier algorithms. However, there is one more approach that may be useful to investigate: developing a hybrid model by combining the two input theories outlined above.

Hybrid Model

Using sklearns FeatureUninion module, we can combine the PCA reduced dimensionality dataset with the engineered feature dataset as a hybrid featureset for model development.

The Hybrid model was able to develop a classifier that satisfies the minimun target of 0.3 for both Precision and Recall with an SVM classifier. Although the model's Precision is reduced compared to previous iterations, it is better suited for correctly identifier employees that are PoI in order to develop a short-list of people to investigate further.

Conclusions and Reflection

In this analysis, we attempted to build a classification model that could predict whether someone is likely to be a Person of Interest (PoI) in the Enron scandal given their email and financial data. After exploring the data and removing outliers, we investigated two input featuresets: PCA transformed data and engineered features that provided a normalized representation of the email and financial data. Applying multiple classifier algorithms and tuning via GridSearch, the final model was developed which consisted of using a hybrid of the PCA and engineered featureset as the model inputs and a polynomial svm classifier with degree 3 and C and gamme values both 1. This final classifier was selected by reviewing the comparative performance of each models' Precision and Recall scores, with the selected model having a Precision of 0.314 and Recall of 0.333. These performance metrics were chosen as they provide a balance between the goal of producing a short-list of potential PoI candidates to flag for further investigation, while preventing the over classification of non-PoI individuals. Using other performance metrics, such as accuracy in this situation would lead to sub-optimal classifiers, as the number of correct predictions is not as useful as ensuring people that are PoI's are identified as such. This can be a common pitfall of performance measurement, and the goal of the model needs to be well defined before its performance can be measured.

Using the GridSearch functionality of sklearn allowed for the automated tuning of the classifier algorithms. This tuning allowed for the use of different parameters in the classifiers, such as the type of svm (linear, rbf or polynomial) or the number of nearest neighbors to use in the KNeighbors classifier. This tuning is important in producing an effective classifier, as different datasets will result in different patterns. A linear svm may work well with clearly split datasets, but a polynimal svm will be better suited to datasets that result in curved patterns. Without correctly tuning the classification algorithms, a sufficient model will be difficult to develop.

Although the final model meets the minimum goal of Precision and Recall >0.3, there is likely a more optimal classifier that could be developed by further exploring the dataset, engineering more intelligent features and investigating other classifier algorithms better suited to this problem. This model is also likely overfitted to this dataset and probably not directly useful in future investigations, but the proof of concept could be utilized by the investigators to generate their own model as they begin to identify PoI in their own investigation of a similiar nature to the Enron scandal. It would also be interesting to investigate if other features can be developed to help identify potential PoI, such as performing Natural Language Processing on the content of email messages to descern if any conversation patterns emerged between PoI.