A Tool for Embedding Strings in Vector Spaces

A tool for embedding strings

Sally is a small tool for mapping a set of strings to a set of
vectors. This mapping is called an embedding, and it makes
techniques from machine learning and data mining applicable to
string data. Sally works with several types of string data, such
as text documents, DNA sequences or log files, and can read
strings from directories, archives and text files.

Sally implements a standard technique for mapping strings to a
vector space, often referred to as the vector space model or
bag-of-words model. Each string is characterized by a set of
features, and each feature is associated with one dimension of
the vector space. Sally supports the following types of features:
bytes, words, n-grams of bytes and n-grams of words.
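
As a rough illustration of these feature types, the following Python sketch extracts n-grams of bytes and n-grams of words from a string. It is not Sally's code: the function names `byte_ngrams` and `word_ngrams` are hypothetical, and splitting words on whitespace is an assumption made for brevity. With n = 1 the functions return plain bytes and plain words.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> list:
    """Return all overlapping n-grams of bytes in `data` (n=1 gives single bytes)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def word_ngrams(text: str, n: int) -> list:
    """Return all overlapping n-grams of words in `text`.

    Words are split on whitespace, which is an assumption of this sketch.
    """
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    # Each distinct feature corresponds to one dimension of the vector space.
    print(Counter(byte_ngrams(b"abcab", 2)))
    print(Counter(word_ngrams("the quick brown fox", 2)))
```

Counting the extracted features, as done with `Counter` here, yields the sparse representation in which only features that occur in the string are stored.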

Sally proceeds by counting the occurrences of the specified features
in each string and generating a sparse vector of count
values. Alternatively, binary or TF-IDF values can be computed and
stored in the vectors. Sally then normalizes the vector, for example
using the L1 or L2 norm, and outputs it in a specified format, such
as plain text, LibSVM or Matlab format.
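
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Sally's implementation: all function names are hypothetical, the TF-IDF weighting shown (count times log(N / document frequency)) is one common variant that may differ from the one Sally uses, and the explicit `index` dictionary mapping features to dimensions is an assumption of the sketch. The output line follows the LibSVM convention of a label followed by `index:value` pairs in ascending index order.

```python
import math
from collections import Counter


def count_vector(features):
    """Sparse vector of count values: feature -> number of occurrences."""
    return dict(Counter(features))


def binarize(vec):
    """Binary embedding: 1 for every feature that occurs at least once."""
    return {f: 1.0 for f in vec}


def tfidf(vectors):
    """Weight counts by log(N / df), one common TF-IDF variant (assumed here)."""
    n = len(vectors)
    df = Counter(f for vec in vectors for f in vec)
    return [{f: c * math.log(n / df[f]) for f, c in vec.items()} for vec in vectors]


def normalize(vec, norm="l2"):
    """Scale the vector to unit L1 or L2 norm; zero vectors are left unchanged."""
    if norm == "l1":
        total = sum(abs(v) for v in vec.values())
    else:
        total = math.sqrt(sum(v * v for v in vec.values()))
    if total == 0:
        return dict(vec)
    return {f: v / total for f, v in vec.items()}


def to_libsvm(vec, index, label=0):
    """Format a sparse vector as one LibSVM line; `index` maps feature -> dimension."""
    items = sorted((index[f], v) for f, v in vec.items())
    return " ".join([str(label)] + [f"{i}:{v:g}" for i, v in items])


if __name__ == "__main__":
    vec = normalize(count_vector(["a", "b", "a"]), norm="l1")
    print(to_libsvm(vec, {"a": 1, "b": 2}, label=1))
```

Storing only non-zero entries keeps the vectors small even when the feature space is very large, which is typically the case for n-grams.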

Sally has many applications, for example in natural language
processing, bioinformatics, information retrieval and computer
security. To illustrate its use, we provide several examples,
including text categorization, finding genes in DNA and analysing
similarities of languages. All examples come with data sets and
instructions.