Data Science Stack Exchange is a question and answer site for Data science professionals, Machine Learning specialists, and those interested in learning more about the field.

So, I'm just starting to learn how a neural network can operate to recognize patterns and categorize inputs, and I've seen how an artificial neural network can parse image data and categorize images (demo with convnetjs). The key there is to downsample the image so that each pixel stimulates one input neuron of the network.

However, I'm trying to wrap my head around whether this can be done with string inputs. The use case I've got is a "recommendation engine" for movies a user has watched. Movies have lots of string data (title, plot, tags), and I could imagine "downsampling" the text down to a few key words that describe the movie. But even if I parse out the top five words that describe a movie, I think I'd need input neurons for every English word in order to compare a set of movies. I could limit the input neurons to just the words used in the set, but then could it grow/learn by adding new movies (user watches a new movie, with new words)? Most of the libraries I've seen don't allow adding new neurons after the system has been trained.

Is there a standard way to map string/word/character data to inputs into a neural network? Or is a neural network really not the right tool for the job of parsing string data like this (what's a better tool for pattern-matching in string data)?

4 Answers

Using a neural network for prediction on natural language data can be a tricky task, but there are tried and true methods for making it possible.

In the Natural Language Processing (NLP) field, text is often represented using the bag of words model. In other words, you have a vector of length n, where n is the number of words in your vocabulary, and each word corresponds to an element in the vector. To convert text to numeric data, you simply count the occurrences of each word and place that count at the vector index corresponding to the word. Wikipedia does an excellent job of describing this conversion process. Because the length of the vector is fixed, it's difficult to deal with new words that don't map to an index, but there are ways to mitigate this problem (look up feature hashing).
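As a concrete illustration, here is a minimal bag-of-words encoder in Python, plus a hashed variant in the spirit of the feature-hashing trick mentioned above (the vocabulary and function names are invented for this sketch):

```python
import zlib
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Map a document to a fixed-length vector of word counts."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

def hashed_bow(doc, n_buckets=8):
    """Feature hashing: index by hash instead of a fixed vocabulary,
    so unseen words still map somewhere without growing the vector."""
    vec = [0] * n_buckets
    for word in doc.lower().split():
        vec[zlib.crc32(word.encode()) % n_buckets] += 1
    return vec

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
bag_of_words("The cat sat on the mat", vocab)  # -> [2, 1, 1, 1, 1, 0]
```

The hashed variant trades exactness (hash collisions merge unrelated words into one bucket) for a vector size that never changes as the vocabulary grows.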

This method of representation has significant disadvantages: it does not preserve the relationship between adjacent words, and it results in very sparse vectors. Looking at n-grams helps preserve word relationships, but for now let's focus on the second problem, sparsity.
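A quick sketch of n-gram extraction (bigrams here) shows how adjacent-word information can be kept:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
ngrams(tokens, 2)
# -> [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

Counting these tuples instead of single words gives a bag-of-n-grams, at the cost of an even larger (and sparser) feature space.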

It's difficult to deal directly with these sparse vectors (many linear algebra libraries do a poor job of handling sparse inputs), so often the next step is dimensionality reduction. For that we can refer to the field of topic modeling: techniques like Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) allow the compression of these sparse vectors into dense vectors by representing a document as a combination of topics. You can fix the number of topics used, and in doing so fix the size of the output vector produced by LDA or LSA. This dimensionality reduction process drastically reduces the size of the input vector while attempting to lose a minimal amount of information.
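LDA itself is too involved to sketch here, but LSA boils down to a truncated SVD of the term-document matrix. A toy version with NumPy (the matrix values are invented for illustration):

```python
import numpy as np

# toy document-term count matrix: 3 documents, 4 vocabulary terms
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
], dtype=float)

# LSA: keep the k strongest latent dimensions ("topics") via truncated SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs_dense = U[:, :k] * s[:k]  # each document is now a dense k-dimensional vector
```

Fixing k fixes the input size of the downstream network regardless of vocabulary size.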

Finally, after all of these conversions, you can feed the outputs of the topic modeling process into the inputs of your neural network.

Incidentally, I can relate to "feature hashing" since it seems very similar to a Bloom filter, which I'm familiar with from working with cryptocurrency code. I wonder if it's more effective to have a hashing function relate an input feature to multiple index positions (Bloom-filter-style) rather than needing a second hash function to set the sign of an index...
– MidnightLightning, Aug 5 '14 at 13:44

Both the answers from @Emre and @Madison May make good points about the issue at hand. The problem is one of representing your string as a feature vector for input to the NN.

First, the problem depends on the size of the string you want to process. Long strings containing many tokens (usually words) are often called documents in this setting. There are separate methods for dealing with individual tokens/words.

There are a number of ways to represent documents. Many of them make the bag-of-words assumption. The simplest types represent the document as a vector of word counts, or term frequencies (tf). To downweight terms that appear in many documents, people usually weight each term frequency by the inverse of its document frequency (the number of documents the term shows up in), giving the familiar tf-idf representation.
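A bare-bones tf-idf computation (toy documents, smoothing omitted for brevity) might look like this:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf})
    return weighted

docs = ["the cat sat on the mat".split(), "the dog sat".split()]
vecs = tf_idf(docs)
# "the" occurs in every document, so its idf (and weight) is log(2/2) = 0
```

Terms that appear everywhere get zero weight, while rarer, more discriminative terms dominate the vector.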

Another approach is topic modeling, which learns a latent lower-dimensional representation of the data. LDA and LSI/LSA are typical choices, but it's important to remember this is unsupervised. The representation learned will not necessarily be ideal for whatever supervised learning you're doing with your NN. If you want to do topic modeling, you might also try supervised topic models.

For individual words, you can use word2vec, which leverages NNs to embed words into an arbitrary-sized space. Similarity between two word vectors in this learned space tends to correspond to semantic similarity.
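The learned vectors are typically compared with cosine similarity; here is a sketch with invented 3-d embeddings standing in for real word2vec output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# toy 3-d embeddings standing in for learned word2vec vectors (values invented)
emb = {
    "king":  [0.9, 0.1, 0.2],
    "queen": [0.85, 0.15, 0.25],
    "mat":   [0.0, 0.9, 0.1],
}
# semantically related words should score higher than unrelated ones
cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["mat"])
```

In a real trained model, the embeddings come from a library such as gensim rather than being written by hand.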

A more recent approach is paragraph vectors, which first learns a word2vec-style word model and then builds on that representation to learn a distributed representation of sets of words (documents of any size). This has shown state-of-the-art results in many applications.
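Paragraph-vector training is beyond a short sketch, but the common baseline it improves on, averaging the word vectors of a document, is easy to show (the embeddings here are invented):

```python
def doc_vector(tokens, emb):
    """Average the word vectors of the tokens present in the embedding table."""
    dims = len(next(iter(emb.values())))
    vecs = [emb[t] for t in tokens if t in emb]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dims)]

emb = {"cat": [1.0, 0.0], "mat": [0.0, 1.0]}
doc_vector(["cat", "mat"], emb)  # -> [0.5, 0.5]
```

Averaging throws away word order; paragraph vectors learn the document representation jointly with the word vectors instead, which is where the quality gains come from.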

This is not a problem about neural networks per se, but about representing textual data in machine learning. You can represent the movies, cast, and theme as categorical variables. The plot is more complicated; you'd probably want a topic model for that, but I'd leave that out until you get the hang of things. It does precisely that textual "downsampling" you mentioned.

Take a look at this tutorial to learn how to encode categorical variables for neural networks. And good luck!
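Encoding a categorical variable typically means one indicator input per category; a minimal sketch (the genre list is invented):

```python
def one_hot(value, categories):
    """Binary indicator vector: 1 in the slot matching `value`, 0 elsewhere."""
    return [1 if c == value else 0 for c in categories]

genres = ["action", "comedy", "drama"]
one_hot("comedy", genres)  # -> [0, 1, 0]
```

Each categorical feature (genre, cast member, theme tag) contributes one such block of inputs to the network.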

Worth noting that this is not explicitly a problem in all of machine learning, but only a problem when it comes to generating feature vectors, which are not ubiquitous in machine learning.
– Slater Victoroff, Jul 30 '14 at 20:12

Random forest is a good example of something for which getting a feature vector of the sort you see in neural nets is not an issue. A lot of unsupervised methods also work on raw words rather than feature vectors. Note: I didn't say there are methods that don't use features, only that there are methods which do not rely on strictly structured vectors.
– Slater Victoroff, Jul 30 '14 at 23:00

I've tried the following two approaches in a trial implementation of neural networks with text. The latter works fairly well, but with limitations.

Create a vocabulary using word2vec or NLTK/custom word tokenization and assign an index to each word. It is this index that represents the word as a number.

Challenges:

The indexes must be "normalized" using feature scaling.

If the output of the neural network has even a slight variation, the output may be an index to an unexpected word (e.g. if the expected output is 250 but the NN outputs 249 or 251, the result looks numerically close, yet 249 and 251 are indexes of completely different words). A recurrent NN that generates the output index can be leveraged here.

If new words are added to the vocabulary, then the token indexes should be re-scaled. The model trained with previously scaled values may become invalid and must be re-trained.
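The re-scaling problem described above is easy to demonstrate with a toy vocabulary:

```python
vocab = ["the", "cat", "sat", "mat"]
index = {w: i for i, w in enumerate(vocab)}

def scaled(word):
    # min-max feature scaling of the word's index into [0, 1]
    return index[word] / (len(vocab) - 1)

before = scaled("mat")        # 3 / 3 = 1.0
vocab.append("dog")           # a new word changes the scale...
index = {w: i for i, w in enumerate(vocab)}
after = scaled("mat")         # ...now 3 / 4 = 0.75: an input the trained model never saw
```

Every word's scaled value shifts when the vocabulary grows, which is exactly why the previously trained model becomes invalid.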

Use an identity matrix, e.g. for n words use an n x n matrix where each row and column represents a word. Put "1" in the cell where a word's row and column intersect and "0" everywhere else, so each word's row is its one-hot vector. (reference)

Challenges:

Every input and output value is an n x 1 vector. For a large vocabulary this is a bulky and slow computation.

If new words are added to the vocabulary, then the identity matrix (i.e. word vector) should be re-calculated. The model trained with previously calculated vectors may become invalid and must be re-trained.
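The identity-matrix (one-hot) scheme and its re-computation cost can be sketched in a few lines:

```python
def one_hot_table(vocab):
    """Row i of the n x n identity matrix is the vector for vocab[i]."""
    n = len(vocab)
    return {w: [1 if j == i else 0 for j in range(n)] for i, w in enumerate(vocab)}

vecs = one_hot_table(["cat", "dog", "mat"])
# adding a single word forces every vector to be rebuilt one element longer
vecs = one_hot_table(["cat", "dog", "mat", "sat"])
```

Since every vector's length equals the vocabulary size, one new word changes the dimensionality of all inputs, and the trained network must be retrained.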