
What are all the possible ways to represent keywords in a machine learning model?

The two I am aware of are:

- one-hot encoding, using a static index (sketched below).
- vector representation, using an embedding layer.
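For the first of these, a minimal sketch of what a static-index one-hot encoder could look like on the client side; the vocabulary and keywords here are made up purely for illustration:

```typescript
// Toy one-hot encoder over a fixed, predefined vocabulary (illustrative words only).
const vocabulary = ["price", "discount", "shipping", "review"];
const index = new Map<string, number>();
vocabulary.forEach((word, i) => index.set(word, i));

function oneHot(word: string): number[] {
  // Start with an all-zero vector of vocabulary size.
  const vec: number[] = new Array(vocabulary.length).fill(0);
  const i = index.get(word.toLowerCase());
  if (i !== undefined) vec[i] = 1; // out-of-vocabulary words stay all-zero
  return vec;
}

console.log(oneHot("discount")); // [0, 1, 0, 0]
```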

We have a specific problem where we are doing client-side (browser) ML and need to convert text data into something the model can consume without sending it over to the server.

EDIT (clarification from the comments):
The text data is extracted from the page on which our script loads; we then want to run a model locally in the browser, ideally using text-based features from the page. We are minimising, wherever possible, the data that needs to be sent to the server.

In terms of the model, that is not defined at this stage; this question is primarily concerned with the representation of text-based features.

Please clarify the client-server setting and what kind of ML is used. Also it's important to know if the keywords are extracted from the text or predefined.
– Erwan, Aug 1 '19 at 17:22

For plain sentences, have you tried word embeddings like Word2Vec or GloVe? Additionally there are new models for converting English words to vectors using transformers like BERT and ELMo which are context-based (each vector for each word differs depending on the context of the sentence). Maybe check it out?
– IronEdward, Aug 2 '19 at 3:28

@IronEdward Yeah, I know all these approaches, but they need huge models stored on servers to convert words -> vectors.
– dendog, Aug 2 '19 at 13:15

@dendog thanks. But are you going to train a model or predict based on an existing model on the client side? (or both?) That changes whether you can have a predefined vocabulary or not. Anyway, in general a word is a categorical variable, so the smallest possible representation is an index in a predefined array, like one-hot encoding.
– Erwan, Aug 2 '19 at 14:15

1 Answer

Since (word-based) one-hot encoding and real-valued vector representations are already mentioned in the question, I would only add the n-gram representation, especially the character-based n-gram representation.

For word-based n-gram representations you consider not individual words but their ordered combinations in the text, and use one-hot encoding for those combinations. E.g. for n=2 you might end up with the bigrams ["John likes", "likes to", "to watch", "watch movies"], and each of them would be assigned to some dimension using a static index.
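A minimal sketch of extracting word bigrams, using the example sentence above:

```typescript
// Build word n-grams by sliding a window of n tokens over the token sequence.
function wordNGrams(tokens: string[], n: number): string[] {
  const grams: string[] = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.push(tokens.slice(i, i + n).join(" "));
  }
  return grams;
}

const tokens = "John likes to watch movies".split(" ");
console.log(wordNGrams(tokens, 2));
// ["John likes", "likes to", "to watch", "watch movies"]
```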

This also works with characters, so you can represent the word "encoding" e.g. with these 3-grams: ["enc", "nco", "cod", "odi", "din", "ing"]. The one-hot encodings of the n-grams are typically added, so multiple occurrences of the same n-gram are recognizable in the resulting bag-of-n-grams representation. This kind of representation is especially useful for languages with rich morphology and/or compound words. In a one-hot representation each single word form would be encoded in its own dimension, whereas a character n-gram approach helps preserve similarity between different forms. An example in English would be the similarity between "encode", "encoded" and "encoding", which is preserved this way. Similar techniques are also used by word embedding algorithms that consider subword information, such as FastText.
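A sketch of the character 3-gram and bag-of-n-grams idea; the tiny vocabulary used to build the index is only for illustration:

```typescript
// Character n-grams: slide a window of n characters over a word.
function charNGrams(word: string, n: number): string[] {
  const grams: string[] = [];
  for (let i = 0; i + n <= word.length; i++) grams.push(word.slice(i, i + n));
  return grams;
}

// Static index over all 3-grams seen in a (made-up) vocabulary.
const vocab = ["encode", "encoded", "encoding"];
const allGrams = Array.from(new Set(vocab.flatMap((w) => charNGrams(w, 3))));
const gramIndex = new Map<string, number>();
allGrams.forEach((g, i) => gramIndex.set(g, i));

// Bag-of-n-grams: the one-hot vectors of a word's 3-grams are summed,
// so a repeated n-gram shows up as a count greater than 1.
function bagOfNGrams(word: string): number[] {
  const vec: number[] = new Array(gramIndex.size).fill(0);
  for (const g of charNGrams(word, 3)) {
    const i = gramIndex.get(g);
    if (i !== undefined) vec[i] += 1;
  }
  return vec;
}

console.log(charNGrams("encoding", 3)); // ["enc", "nco", "cod", "odi", "din", "ing"]
console.log(bagOfNGrams("encoded"));    // overlaps heavily with the vectors for "encode" and "encoding"
```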

Also, although it's not directly an encoding, depending on your use case and language it might be worth looking at preprocessing options like lemmatization and stemming, where you reduce different word forms to their base form. This would also affect the choice of representation; e.g. the word-based one-hot encoding might make more sense if you choose to use these preprocessing techniques.
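As a very crude, purely illustrative stemming sketch (a real stemmer such as Porter, or a proper lemmatizer, would be far more careful):

```typescript
// Crude suffix stripping (illustration only, not a real stemming algorithm):
// remove a few common English suffixes so different word forms collapse together.
const suffixes = ["ing", "ed", "es", "s", "e"];

function crudeStem(word: string): string {
  const w = word.toLowerCase();
  for (const s of suffixes) {
    // Only strip if enough of the word remains to stay recognizable.
    if (w.endsWith(s) && w.length - s.length >= 3) {
      return w.slice(0, w.length - s.length);
    }
  }
  return w;
}

console.log(["encode", "encoded", "encoding"].map(crudeStem));
// ["encod", "encod", "encod"]: all three forms map to the same token,
// which could then be one-hot encoded as a single dimension.
```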