Using automatic machine learning (AutoML), Comprehend lets you create custom Natural Language Processing (NLP) models using data that you already have, without the need to learn the ins and outs of ML. Based on your data set and use case, it automatically selects the right algorithm, tuning parameter, builds, and tests the resulting model.

If you already have a collection of tagged documents—support tickets, call center conversations (via Amazon Transcribe, forum posts, and so forth)— you can use them as a starting point. In this context, tagged simply means that you have examined each document and assigned a label that characterizes it in the desired way. Custom Classification needs at least 50 documents for each label, but can do an even better job if it has hundreds or thousands.

In this post I will focus on Custom Classification, and will show you how to train a model that separates clean text from text that contains profanities. Then I will show you how to use the model to classify new text.

Using Classifiers My starting point is a CSV file of training text that looks like this (I blurred all of the text; trust me that there’s plenty of profanity):

The training data must reside in an S3 object, with one label and one document per line:

Next, I navigate to the Amazon Comprehend Console and click Classification. I don’t have any existing classifiers, so I click Create classifier to make one:

I name my classifier and select a language for my documents, choose the S3 bucket where my training data resides, and then create an AWS Identity and Access Management (IAM) role that has permission to access the bucket. Then I click Create classifier to proceed:

The training process begins right away:

The status changes to Trained within minutes, and now I am ready to create an analysis job to classify some text, some of it also filled with profanity:

I put this text into another S3 bucket, click Analysis in the console, and click Create job. Then I give my job a name, choose Custom classification as the Analysis type, and select the classifier that I just built. I also point to the input bucket (with the file above), and another bucket that will receive the results, classified per my newly built classifier, and click Create job to proceed (important safety tip: if you use the same S3 bucket for the source and the destination, be sure to reference the input document by name):

The job begins right away, and also takes just minutes to complete:

The results are stored in the S3 bucket that I selected when I created the job: