Raw Text as Input

Explanation of Input Formats and Record Readers based on FileRecordReader

TextInputFormat’s record reader is based on the FileRecordReader

A record reader understands how to read a single record out of a file or directory and hand it to the pipeline as a unique record to process

Any input format (and record reader combination) that uses the FileRecordReader parent class (Image, Text, and most others) will automatically generate a label for each record

The system looks at the subdirectories of the given input directory and, for every record in a subdirectory, attaches a “label” to it: the string name of that subdirectory (in some cases a unique ID)
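The labeling rule above can be sketched in a few lines. This is only an illustration of the idea, not Canova's FileRecordReader: the function names, file names, and the throwaway directory are all hypothetical, and each file is treated as one record for simplicity.

```python
import tempfile
from pathlib import Path


def label_files_by_subdirectory(input_dir):
    """Return (label, file_path) pairs, one per file under each subdirectory."""
    labeled = []
    for subdir in sorted(p for p in Path(input_dir).iterdir() if p.is_dir()):
        for file_path in sorted(p for p in subdir.iterdir() if p.is_file()):
            # The label is simply the name of the subdirectory holding the file.
            labeled.append((subdir.name, file_path))
    return labeled


def build_demo_and_label():
    """Create a throwaway input directory (hypothetical layout) and label its files."""
    with tempfile.TemporaryDirectory() as root:
        for label, name in [("spam", "a.txt"), ("ham", "b.txt"), ("ham", "c.txt")]:
            folder = Path(root) / label
            folder.mkdir(exist_ok=True)
            (folder / name).write_text("placeholder\n", encoding="utf-8")
        # Keep only the file names so the result outlives the temp directory.
        return [(label, path.name) for label, path in label_files_by_subdirectory(root)]


demo_labels = build_demo_and_label()
print(demo_labels)
```

Sorting the directory listings is a choice made here so the sketch is deterministic; the source does not say in what order records are visited.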

Setting up Data For Canova’s TextInputFormat

In the case of the text vectorization pipeline in Canova, every line of a document is considered a unique record

We want to prepare our data so that each line represents a unique vector we want to see in our output

Each subdirectory that holds these record files represents a label in our dataset

Example: a spam/ham dataset would have two directories under the main input directory (./input/spam and ./input/ham), and that’s how we get labels into Canova

These labels will show up in the vectorized output as integers

Example Directory Structure

./input/spam
./input/ham

Each directory contains one or more files holding records of the class named by that subdirectory. The directory names themselves are not important, since the labels eventually become integers in the vectorized output.
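As a rough sketch of the layout above, the following builds ./input/spam and ./input/ham inside a temporary directory, treats every line as one record, and replaces the label names with integers. It does not use Canova. The file name messages.txt, the sample lines, the skipping of blank lines, and the alphabetical label-to-integer assignment are all assumptions made for illustration; the source does not state how Canova numbers the labels.

```python
import tempfile
from pathlib import Path


def read_line_records(input_dir):
    """Yield (text, label_name) for each non-empty line of each file."""
    for subdir in sorted(p for p in Path(input_dir).iterdir() if p.is_dir()):
        for file_path in sorted(p for p in subdir.iterdir() if p.is_file()):
            for line in file_path.read_text(encoding="utf-8").splitlines():
                if line.strip():  # assumption: blank lines are not records
                    yield line, subdir.name


def index_labels(records):
    """Replace label names with integers (assumed: alphabetical order)."""
    names = sorted({label for _, label in records})
    label_to_index = {name: i for i, name in enumerate(names)}
    indexed = [(text, label_to_index[label]) for text, label in records]
    return indexed, label_to_index


def run_demo():
    """Build a made-up spam/ham input directory and vectorize its labels."""
    with tempfile.TemporaryDirectory() as root:
        samples = {
            "spam": "win a prize now\nclaim your reward\n",
            "ham": "lunch at noon?\n",
        }
        for label, body in samples.items():
            folder = Path(root) / "input" / label
            folder.mkdir(parents=True)
            (folder / "messages.txt").write_text(body, encoding="utf-8")
        records = list(read_line_records(Path(root) / "input"))
    return index_labels(records)


indexed_records, label_to_index = run_demo()
print(label_to_index)
print(indexed_records)
```

Because only the integer survives into the output, renaming a directory changes nothing about a record except, under the alphabetical assumption used here, which integer it receives.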

CSV Records as Input

Many databases export data as CSV because it is a universal and flexible format for records. We wanted users to be able to model CSV data as quickly as possible without writing new code. This led us to a schema transform system expressed in an ARFF-like vector schema, as described below.

Input Format: LineInputFormat

org.canova.api.formats.input.impl.LineInputFormat

Record Schema coming out of input format: { string }

The label is defined in the vector schema

Input Files

Lines of CSV records in multiple files in a directory

The vector schema (described below) tells the vectorization pipeline how to parse the columns

Setting up Data For Canova’s LineInputFormat To Work with the CSV Vector Schema System

In the case of the CSV vectorization pipeline in Canova, we have the added dimension of a “vector schema”

CSV data already has a column structure, so we only need to tell the system which column is the label (example below)
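To make the idea of "telling the system which column is the label" concrete, here is a minimal sketch that parses one CSV line and separates a chosen column from the rest. It is not Canova's vector schema format or API: split_label, its label_column parameter, and the sample row are hypothetical and exist only for illustration.

```python
import csv
import io


def split_label(csv_line, label_column):
    """Parse one CSV line and separate the label column from the features.

    label_column is a zero-based, non-negative column index.
    """
    row = next(csv.reader(io.StringIO(csv_line)))
    label = row[label_column]
    # Everything except the label column is treated as a feature.
    features = row[:label_column] + row[label_column + 1:]
    return features, label


# Illustrative row: four numeric columns followed by a class name.
features, label = split_label("5.1,3.5,1.4,0.2,setosa", label_column=4)
print(features, label)
```

The record arrives as a single string, matching the { string } record schema of the line-based input format, and the column interpretation is applied afterwards.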