NK Labs: Grouping Conversations - Fun with Sentence Vectors

In the first NK Labs blog post, we introduced DARPA’s Data-Driven Discovery of Models (D3M) program, which aims to develop tools that automate steps of the data science workflow, with the end goal of enabling users with no expertise in data science or machine learning to use these pipelines to solve challenging, useful problems.

As a key participant in this program, New Knowledge has contributed to D3M by driving advances in several technical areas, including natural language processing (NLP), big-data summarization, time-series forecasting, clustering and classification, neural networks, deep learning, transfer learning, graph analytics, and more. These advances are packaged as primitives, i.e. building blocks, which a meta-learning algorithm can combine with primitives submitted by other D3M performers to yield optimal machine learning pipelines.

We created a primitive for unsupervised learning of sentence embeddings by implementing a Python wrapper for Sent2Vec [1], which was originally written in C++. Here’s a deep dive into what that means, why it is useful, and how it’s used to fight disinformation at New Knowledge.

What are Sentence Embeddings?

Most machine learning algorithms can’t process strings or raw text directly; they require a numerical representation to perform most tasks. With that in mind, let’s look at word embeddings: the idea that the meaning of a word can be captured in a vector, such that the semantic similarity between two words correlates with the cosine of the angle between their vectors (their cosine similarity). For example, the cosine similarity between USA and burgers might be 0.8, whereas it might be only 0.3 for USA and bratwurst. Similarly, a sentence embedding uses a fixed-dimensional vector to represent an entire sentence or small paragraph.
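To make the cosine-similarity idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up purely for illustration; real word embeddings typically have hundreds of dimensions learned from data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "word vectors", invented for this example.
usa = [0.9, 0.3, 0.1]
burgers = [0.8, 0.5, 0.2]
bratwurst = [0.1, 0.2, 0.9]

print(cosine_similarity(usa, burgers))    # high: closely related concepts
print(cosine_similarity(usa, bratwurst))  # lower: less related
```

Identical directions score 1.0 and orthogonal directions score 0.0, which is why the measure works regardless of vector magnitude.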

Figure 1. Semantic similarity is the measure of the degree to which two pieces of text carry the same meaning. (Source)

Why Sent2Vec?

Although there are many effective models for word embeddings, it is still challenging to produce useful semantic representations for sentences or small paragraphs. It is even more difficult to learn these representations in an unsupervised way, which would greatly benefit many machine learning methods. There is an emerging trend in text understanding toward building extremely powerful and complicated models (RNNs, LSTMs, Neural Turing Machine architectures, etc.), but their increased complexity makes them slow to train. In recent years, however, simple (and scalable!) algorithms based on averaged word vectors have been shown to outperform these complex models on sentence-embedding tasks, with the additional advantage of easily processing large datasets.
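To show what "averaged word vectors" means in practice, here is a toy sketch: a sentence is embedded as the mean of its word vectors. The lookup table is hand-made for illustration; a real system would use trained embeddings (and, as Sent2Vec does, subword n-gram features for out-of-vocabulary words).

```python
def embed_sentence(sentence, word_vectors):
    """Embed a sentence as the average of its word vectors.

    Words missing from the vocabulary are simply skipped here; a real
    model would fall back on subword (n-gram) features instead.
    """
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy 3-dimensional word vectors, made up for this example.
word_vectors = {
    "dogs":   [0.9, 0.1, 0.0],
    "bark":   [0.8, 0.2, 0.1],
    "loudly": [0.7, 0.3, 0.2],
}

print(embed_sentence("Dogs bark loudly", word_vectors))  # ≈ [0.8, 0.2, 0.1]
```

Despite its simplicity, this kind of composition scales linearly with corpus size, which is exactly the advantage the paragraph above describes.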

Last year, Sent2Vec was introduced in the paper Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features [2]. The library delivers numerical representations of sentences or short texts, which can be used as input to any machine learning algorithm. In addition to word vectors, the model uses n-gram embeddings and simultaneously trains the composition and the embedding vectors themselves. It can be thought of as an unsupervised version of Facebook’s fastText [3], designed to help build scalable solutions for text representation, and as an extension of word2vec (the model used to produce word embeddings) to sentences. The algorithm outperforms state-of-the-art unsupervised models on most benchmark tasks and beats supervised models on numerous tasks as well. The source code and pre-trained models are publicly available on GitHub [4].

How does New Knowledge use this primitive?

We use much of the research done for the D3M program internally at New Knowledge to fight disinformation and to tag potential bad actors who may influence conversations and conduct information warfare. For this specific primitive, we have found that clustering social media text in Sent2Vec-generated vector spaces is much more useful for understanding content similarity than traditional topic models such as Latent Dirichlet Allocation (LDA) [5] and Latent Semantic Indexing (LSI) [6]. An example of text clustering on the Harry Potter books is shown in Figure 2. We are actively improving this primitive, along with others, and finding more internal use cases to do our part in defending public discourse.
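As an illustration of clustering in an embedding space (a toy sketch, not New Knowledge’s production pipeline), the snippet below runs a greedy single pass over hand-made "sentence vectors": each vector joins the first cluster whose centroid it is sufficiently similar to, otherwise it starts a new cluster. The vectors, example texts, and the 0.9 threshold are all invented for illustration; real Sent2Vec embeddings have hundreds of dimensions.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_cluster(vectors, threshold=0.9):
    """Assign each vector to the first cluster whose centroid is similar enough."""
    clusters = []  # each cluster is a list of its member vectors
    labels = []
    for v in vectors:
        for idx, cluster in enumerate(clusters):
            dim = len(v)
            centroid = [sum(u[i] for u in cluster) / len(cluster) for i in range(dim)]
            if cos_sim(v, centroid) >= threshold:
                cluster.append(v)
                labels.append(idx)
                break
        else:
            clusters.append([v])
            labels.append(len(clusters) - 1)
    return labels

# Toy "sentence embeddings": two about one topic, one about another.
embeddings = [
    [0.9, 0.1, 0.0],    # e.g. "voting machines were hacked"
    [0.85, 0.15, 0.05], # e.g. "the election was rigged"
    [0.0, 0.1, 0.95],   # e.g. "great recipe for apple pie"
]
print(greedy_cluster(embeddings))  # → [0, 0, 1]
```

The first two vectors point in nearly the same direction and land in one cluster, while the third starts its own; with real embeddings, a standard algorithm such as k-means would play the same role at scale.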

To learn more about the research mentioned in this article, please see the references below. New Knowledge’s Python wrapper for the Sent2Vec model can be found at [1]. To read the paper in which Sent2Vec was proposed, please go to [2]. If you’re interested in Facebook’s fastText, check out [3]. The original source code and pre-trained models for Sent2Vec can be found at [4]. Read more about traditional topic models at [5, 6].