
Machine Learning for Email

The code supporting this short eBook provides a brief introduction to key concepts in machine learning on text data. The first two chapters focus on the R programming language and on using it to explore data. The last two present case studies in machine learning on text data, using a corpus of email. In these chapters we introduce methods for classifying and ranking these data with classic machine learning techniques.

Chapter 2 - Data Exploration

Here we provide a brief review of several statistical concepts that are frequently used in basic data analysis and machine learning. We introduce many methods for exploring data in R, both statistically and visually. This chapter covers:

Exploration vs. confirmation

Inferring types and meanings

Numeric summaries

Means, medians and modes

Quantiles

Standard deviation and variance

Exploratory data visualization
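
As a rough illustration of the numeric summaries listed above, the following base R sketch computes each of them on a small made-up vector. The data and the helper name `stat_mode` are assumptions for illustration only and are not taken from the eBook's code.

```r
# Hypothetical sample: message sizes in kilobytes (made-up values)
sizes <- c(2, 4, 4, 5, 7, 9, 40)

mean(sizes)                                   # arithmetic mean, pulled upward by the outlier 40
median(sizes)                                 # middle value, robust to the outlier
quantile(sizes, probs = c(0.25, 0.5, 0.75))   # quartiles
var(sizes)                                    # sample variance (n - 1 denominator)
sd(sizes)                                     # standard deviation, the square root of var()

# Base R's mode() reports an object's storage type, not the statistical mode,
# so a small hypothetical helper is defined here instead.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[as.vector(counts) == max(counts)])
}
stat_mode(sizes)
```

Comparing `mean(sizes)` with `median(sizes)` shows how a single extreme value can separate the two, which is one reason to look at several summaries rather than one.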

Chapter 3 - Classification: Spam Filtering

Next, we come to our first case study in machine learning. In this chapter we introduce the naive Bayes technique for classifying text data. To do so, we use the canonical SpamAssassin email data set. This chapter covers:

Loading the text data

Cleaning and preparing for feature extraction

Text mining the data

Writing a naive Bayesian classifier

Classification and testing

Visualizing the results
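
The steps above can be sketched in a few lines of base R. This is a minimal toy version, assuming a word-count model with add-one smoothing and equal class priors; the training messages and the function names (`tokenize`, `train`, `score`, `classify`) are hypothetical and are not the implementation used in the chapter.

```r
# Toy training messages (made up; not drawn from the email corpus)
spam <- c("win cash now", "cheap cash offer")
ham  <- c("meeting at noon", "lunch offer at noon")

# Split one lowercase message into word tokens
tokenize <- function(text) strsplit(tolower(text), "[^a-z]+")[[1]]

# Estimate per-class word probabilities with add-one (Laplace) smoothing
train <- function(docs, vocab) {
  tokens <- unlist(lapply(docs, tokenize))
  counts <- table(factor(tokens, levels = vocab))
  probs <- (as.vector(counts) + 1) / (sum(counts) + length(vocab))
  setNames(probs, vocab)
}

vocab  <- unique(unlist(lapply(c(spam, ham), tokenize)))
p_spam <- train(spam, vocab)
p_ham  <- train(ham, vocab)

# Log score of a message under one class; words outside the vocabulary are ignored
score <- function(text, probs, prior = 0.5) {
  words <- tokenize(text)
  words <- words[words %in% names(probs)]
  log(prior) + sum(log(probs[words]))
}

# Pick the class with the higher log score
classify <- function(text) {
  if (score(text, p_spam) > score(text, p_ham)) "spam" else "ham"
}

classify("cash offer now")
classify("noon meeting")
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps a word unseen in one class from forcing that class's score to negative infinity.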

Chapter 4 - Ranking: Priority Inbox

Finally, we move from a binary classification task to ranking. Using the same SpamAssassin data, we construct a more complicated feature set to design our own "home brew" priority inbox ranker. This chapter covers: