Software

Disclaimer: The software on this page is provided on an "as is" basis for research purposes. No additional support is offered, and neither the authors nor their institutions are liable under any circumstances.

Acknowledgments

We gratefully acknowledge the funding from the following agencies/programs that made the research possible:

Arabesque: Distributed Graph Mining Framework

Distributed data processing platforms such as MapReduce and Pregel (Giraph, GraphLab, GraphX) have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms, such as computing PageRank, shortest paths (SSP), and connected components (CC).
However, these platforms are not a good match for graph mining problems, such as finding frequent subgraphs (FSM) or cliques in a labeled graph. Given an input graph, these problems require exploring a very large number of subgraphs (embeddings) and finding patterns that match some "interestingness" criteria specified by the user. The number of embeddings that the system needs to explore grows exponentially with the size of the embeddings.

Arabesque: Think like an Embedding paradigm
Our system Arabesque is the first distributed data processing platform focused on solving these graph mining problems. Arabesque automates the process of exploring a very large number of embeddings, and defines a high-level computational model (filter-process) that simplifies the development of scalable graph mining algorithms. In a nutshell, Arabesque explores embeddings and passes them to user-defined code, which must simply decide whether an embedding is of interest and should be further extended (the filter function) and whether outputs should be computed for the embeddings that pass the filter (the process function). This lets the end user think about the problem in a natural way, Think Like an Embedding (TLE), where the embedding is a first-class citizen, rather than forcing the problem into other paradigms. Under the hood, Arabesque runs on the familiar and well-supported Apache Giraph system in a Hadoop cluster, using Giraph simply as a generic data processing platform that provides a BSP execution flow.
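To make the filter-process model concrete, here is a minimal, hypothetical sketch in Python (Arabesque's real API is Java/Giraph; the function names and exploration loop below are illustrative assumptions, not the actual interface). The driver extends embeddings vertex by vertex, the filter decides which embeddings survive, and the process function emits output; the example instantiates the two callbacks for clique mining.

```python
# Toy sketch of a filter-process exploration loop in the spirit of
# Arabesque (NOT the real API): embeddings are vertex sets, extended
# one neighboring vertex at a time.
from itertools import combinations

def explore(graph, filter_fn, process_fn, max_size):
    """Breadth-first embedding exploration with user-defined
    filter (prune) and process (output) callbacks."""
    results = []
    frontier = [frozenset([v]) for v in graph]
    seen = set(frontier)
    while frontier:
        next_frontier = []
        for emb in frontier:
            if not filter_fn(graph, emb):
                continue  # pruned: no extensions explored
            results.extend(process_fn(emb))
            if len(emb) >= max_size:
                continue
            neighbors = set().union(*(graph[v] for v in emb)) - emb
            for u in neighbors:
                ext = emb | {u}
                if ext not in seen:
                    seen.add(ext)
                    next_frontier.append(ext)
        frontier = next_frontier
    return results

# Clique mining: filter keeps only embeddings whose vertices are
# pairwise adjacent; process outputs cliques of size >= 3.
def is_clique(graph, emb):
    return all(u in graph[v] for u, v in combinations(emb, 2))

def output_clique(emb):
    return [tuple(sorted(emb))] if len(emb) >= 3 else []

graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(sorted(set(explore(graph, is_clique, output_clique, 3))))  # [(1, 2, 3)]
```

Because non-clique embeddings fail the filter, none of their extensions are ever generated, which is what makes pruning via the filter function effective.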

Elegant but above all Efficient
Arabesque is unique in that, despite its simple and elegant API and its distributed implementation, it is extremely efficient. Indeed, in our SOSP 2015 paper we show that a single-threaded Arabesque delivers performance comparable to specialized centralized implementations. This is possible because the user-defined functions are not the computationally expensive parts of graph mining algorithms. The expensive parts, which Arabesque handles, deal mostly with graph exploration, isomorphism computations, and canonicality checks, all inherent to graph mining algorithms, and can thus be handled as efficiently as in a centralized implementation.

Data Mining Template Library (DMTL)

DMTL is an open-source, high-performance, generic data mining toolkit, written in C++.
It provides a collection of generic algorithms and data structures for mining increasingly
complex and informative pattern types, such as itemsets, sequences, trees, and graphs.

DMTL
utilizes a generic data mining approach, where all aspects of mining are controlled via a set
of properties. The kind of pattern to be mined, the kind of mining approach to use, and the
kind of data types and formats to mine over are all specified as a list of properties. This
provides tremendous flexibility to customize the toolkit for various applications.

DMTL implements all algorithms following the vertical data mining approach. For itemsets, the implementation
follows the Eclat approach (without diffsets). For sequences it follows the Spade approach.
For trees it implements the SLEUTH framework, which allows one to mine embedded/induced and
ordered/unordered trees. Finally, the graph mining framework uses a novel vertical approach.

Itemset Mining and Association Rules

This section contains the code for mining all frequent itemsets, all closed itemsets, and all maximal frequent itemsets.
It also includes the code for constructing the concept lattice (or iceberg lattice) and generating non-redundant association rules.

Frequent Itemsets (Eclat)

Eclat uses the original vertical tidset approach for mining all frequent itemsets
(1997-eclat), combined with the diffsets improvement (2003-diffsets).
The code also contains the maxeclat, clique, and maxclique approaches mentioned in
(2000-eclat:tkde).
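The vertical tidset idea is easy to see in a few lines. Below is a toy Python sketch (not the DMTL/Eclat C++ code): each item maps to the set of transaction ids containing it, support is the tidset size, and longer itemsets are obtained by intersecting tidsets; the diffset relation support(PX) = support(P) - |t(P) - t(X)| is checked at the end. All data here is made up for illustration.

```python
# Sketch of vertical (tidset-based) frequent itemset mining in the
# spirit of Eclat; toy data, not the DMTL implementation.

def eclat(prefix, items, minsup, out):
    """items: list of (item, tidset) pairs sharing `prefix`."""
    while items:
        item, tids = items.pop(0)
        if len(tids) < minsup:
            continue
        itemset = prefix + [item]
        out[frozenset(itemset)] = len(tids)
        # Conditional database: intersect tidsets with remaining items.
        suffix = [(j, tids & t) for j, t in items]
        eclat(itemset, suffix, minsup, out)

# Vertical database: item -> set of transaction ids containing it.
vertical = {
    'A': {1, 2, 3, 5},
    'B': {2, 3, 4, 5},
    'C': {1, 2, 3},
}
out = {}
eclat([], sorted(vertical.items()), minsup=2, out=out)
print(out[frozenset('AB')], out[frozenset('ABC')])  # 3 2

# Diffset idea: d(AB) = t(A) - t(B), and
# support(AB) = support(A) - |d(AB)|, avoiding long tidsets.
d_AB = vertical['A'] - vertical['B']
assert len(vertical['A']) - len(d_AB) == out[frozenset('AB')]
```

Diffsets shrink as itemsets grow (the opposite of tidsets), which is why they are the key memory optimization in practice.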

Closed Itemsets (Charm, Charm-L) and Non-redundant Rule Generation

Charm mines all the frequent closed itemsets as described in (2002-charm).
Charm-L adds the ability to construct the entire frequent concept lattice (also called the iceberg lattice), that is,
it adds the links
between all sub/super-concepts (or closed itemsets) (2005-charm:tkde).
This ability is used to mine the non-redundant association rules (2004-nonredundant:dmkd).
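The defining property Charm exploits is that a closed itemset equals its closure: the intersection of all transactions that contain it. The brute-force Python sketch below (illustrative only, not the Charm algorithm, which avoids enumerating all subsets) makes that definition executable on a toy database.

```python
# Illustrative sketch: enumerate all frequent itemsets and keep those
# equal to their closure. Charm itself prunes this search drastically.
from itertools import combinations

transactions = [
    {'A', 'B', 'C'},
    {'A', 'B'},
    {'B', 'C'},
]
items = sorted(set().union(*transactions))

def closure(itemset):
    """Intersection of all transactions containing `itemset`."""
    covering = [frozenset(t) for t in transactions if itemset <= t]
    return frozenset.intersection(*covering)

closed = set()
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = frozenset(combo)
        if any(s <= t for t in transactions) and closure(s) == s:
            closed.add(s)

print(sorted(map(sorted, closed)))
# {A} is not closed: both transactions containing A also contain B,
# so closure({A}) = {A, B}.
```

The concept lattice that Charm-L builds is exactly the sub/superset ordering among these closed sets.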

Frequent Boolean Expressions (BLOSOM)

The BLOSOM framework allows one to mine arbitrary frequent boolean expressions, including AND clauses (itemsets), OR clauses, and CNF/DNF expressions. It focuses on mining the minimal boolean expressions.

Sequence Constraints (cSpade)

cSpade mines constrained frequent sequences. The constraints include length or width limits on the
sequences, minimum or maximum gap constraints on consecutive
sequence elements, a time window on allowable
sequences, item constraints, and finding
sequences predictive of one or more classes. The class-specific sequences can be used for sequence classification, as
described in (2000-featuremine:is).
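As a concrete illustration of one constraint type, the Python sketch below (a toy check, not cSpade's vertical id-list machinery) tests whether a pattern occurs as a subsequence of an input sequence with at most max_gap skipped positions between consecutive matched elements.

```python
# Toy sketch of a maximum-gap constraint on sequence occurrences,
# in the spirit of cSpade's gap constraints; illustrative only.

def occurs_with_gap(seq, pattern, max_gap):
    """True if `pattern` occurs in `seq` with at most `max_gap`
    skipped positions between consecutive matched elements."""
    def match(start, pi):
        if pi == len(pattern):
            return True
        # First element may start anywhere; later ones must fall
        # within max_gap positions of the previous match.
        limit = len(seq) if pi == 0 else min(len(seq), start + max_gap + 1)
        for j in range(start, limit):
            if seq[j] == pattern[pi] and match(j + 1, pi + 1):
                return True
        return False
    return match(0, 0)

seq = ['a', 'x', 'b', 'x', 'x', 'c']
print(occurs_with_gap(seq, ['a', 'b', 'c'], max_gap=2))  # True (gaps 1, 2)
print(occurs_with_gap(seq, ['a', 'b', 'c'], max_gap=1))  # False (gap b->c is 2)
```

Minimum-gap, window, and length constraints fit the same shape: each simply restricts which candidate occurrences are counted toward support.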

Tree Mining

TreeMiner and TreeMiner-D

TreeMiner uses the vertical scope-list representation to mine frequent subtrees (2005-treeminer:tkde). TreeMiner counts all embeddings, whereas TreeMiner-D counts only distinct occurrences, which can be more appropriate for some datasets.

Download

TreeMiner code: vtreeminer is the TreeMiner code, whereas htreeminer is the PatternMatcher code as mentioned in the paper below.

Graph Mining & Indexing

Representative Orthogonal Graph Mining (Origami)

Origami uses random walks over the graph partial order to mine a representative sample of the maximal frequent subgraph patterns. It then selects only the orthogonal representative patterns via a local clique-finding method.

Graph Pattern Sampling (Output Space Sampling)

Whereas Origami mines a sample of maximal graph patterns, it does not provide any uniformity guarantee. MUSK (2009-musk) proposes a Markov chain Monte Carlo (MCMC) based approach to guarantee a uniform sample of all maximal patterns. The MCMC approach was further extended in (2009-graphsampling) to mine a sample of all frequent patterns, to mine support-biased patterns, and to mine a sample of discriminative patterns.
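The core MCMC trick can be shown on a toy pattern graph (the four "patterns" and their neighborhood structure below are invented for illustration; this is not the MUSK implementation). A naive random walk visits high-degree patterns more often; a Metropolis-Hastings acceptance ratio of min(1, deg(u)/deg(v)) corrects this bias so the stationary distribution is uniform, which the code verifies analytically from the transition matrix.

```python
# Sketch of degree-corrected random walk for uniform pattern sampling,
# in the spirit of MCMC-based output space sampling; toy graph only.
import random

neighbors = {
    'p1': ['p2'],
    'p2': ['p1', 'p3', 'p4'],
    'p3': ['p2', 'p4'],
    'p4': ['p2', 'p3'],
}

def step(u, rng):
    """One Metropolis-Hastings step targeting the uniform distribution."""
    v = rng.choice(neighbors[u])
    if rng.random() < min(1.0, len(neighbors[u]) / len(neighbors[v])):
        return v
    return u  # proposal rejected: stay put

# Run a short walk just to exercise the sampler.
rng = random.Random(7)
state = 'p1'
for _ in range(100):
    state = step(state, rng)

# Verify uniformity analytically: build the transition matrix P and
# check that the uniform distribution pi satisfies pi P = pi.
nodes = sorted(neighbors)
P = {u: {w: 0.0 for w in nodes} for u in nodes}
for u in nodes:
    du = len(neighbors[u])
    for v in neighbors[u]:
        a = min(1.0, du / len(neighbors[v]))
        P[u][v] += a / du          # propose v, accept
        P[u][u] += (1 - a) / du    # propose v, reject
pi = 1 / len(nodes)
for w in nodes:
    assert abs(sum(pi * P[u][w] for u in nodes) - pi) < 1e-12
```

In the real setting the "neighbors" of a pattern are its valid edge extensions and contractions in the pattern lattice, so degrees are computed on the fly during the walk.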

Clustering

This section contains code for mining categorical subspace clusters and shape-based clusters, for clustering
based on a lower bound on similarity, and for a new outlier-based approach to initial cluster seed selection.

Arbitrary Shape Clustering (Sparcl)

Sparcl finds shape-based clusters using a two-step approach: in the first step it selects a relatively large number of candidate centroids (via ROBIN) to find seed clusters with the K-means algorithm, and in the second step it uses a novel similarity kernel to merge the initial seed clusters into the final arbitrarily shaped clusters.

Microarray Gene Expression Clustering (TriCluster and MicroCluster)

TriCluster is the first tri-clustering algorithm for microarray expression clustering. It builds upon the new MicroCluster bi-clustering approach. TriCluster first mines all the bi-clusters across the gene-sample slices, and then extends these into tri-clusters across time or space (depending on the third dimension). It can find both scaling and shifting patterns.

Biological Sequence Analysis

Hidden Markov Models: Variable Order HMM with Duration (VOGUE)

VOGUE is a variable order, gapped HMM with duration. It uses sequence mining to extract frequent patterns in the data, and then uses these patterns to build a variable order HMM with explicit duration on the gap states, for sequence modeling and classification. VOGUE has been applied to model protein sequences, as well as a number of other sequence datasets, including weblogs.

Genome Scale Indexing: Disk-based Suffix Trees (Trellis and Trellis+)

Trellis is a disk-based suffix tree indexing method (with suffix links) that is capable of indexing the entire human genome on a commodity PC with limited memory. Trellis+ extends Trellis by removing further memory limitations through a novel in-memory guide suffix tree.

Structured Motif Search and Mining (sMotif and exMotif)

sMotif and exMotif are two complementary tools for searching and extracting/mining structured sequence motifs in DNA sequences. A structured motif consists of simple motifs separated by different gap lengths. A simple motif may be an exact pattern or a position weight matrix (profile). Given a template structured motif (pattern or profile), sMotif finds all matches in a given set of sequences. exMotif, on the other hand, mines novel motifs satisfying minimal conditions on the gaps and frequency.
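For the search direction, a structured motif template with exact simple motifs and bounded gaps can be expressed directly as a regular expression. The Python sketch below is a toy illustration of that idea (sMotif itself also handles profiles/position weight matrices, which a plain regex cannot); the motifs, gap bounds, and sequence are invented examples.

```python
# Toy structured-motif search: exact simple motifs separated by
# bounded-length gaps over the DNA alphabet, compiled to one regex.
import re

def structured_motif_regex(motifs, gaps):
    """motifs: list of simple motifs; gaps: list of (min, max) gap
    lengths between consecutive motifs (len(gaps) == len(motifs) - 1)."""
    parts = [re.escape(motifs[0])]
    for motif, (lo, hi) in zip(motifs[1:], gaps):
        parts.append(f'[ACGT]{{{lo},{hi}}}' + re.escape(motif))
    return re.compile(''.join(parts))

# Template: TATA, then a gap of 2-4 bases, then GC.
pattern = structured_motif_regex(['TATA', 'GC'], gaps=[(2, 4)])
seq = 'CCTATAAGGGCTT'
m = pattern.search(seq)
print(m.group(0))  # TATAAGGGC (gap AGG of length 3)
```

Mining novel structured motifs (the exMotif direction) is the harder inverse problem: instead of matching a given template, it enumerates candidate simple motifs and gap combinations that meet the frequency constraints.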

Protein Docking and Partial Shape Matching (ContextShapes)

ContextShapes performs rigid-body protein docking. It uses a novel context shapes data structure to represent local surface regions/shapes on the protein. All critical points on both the receptor and the ligand are represented via context shapes, and the best docking is found via pairwise matching.

Protein Indexing (PSIST)

PSIST uses suffix trees to index protein 3D structure. It first converts the 3D structure into a structure-feature sequence over a new structural alphabet, which is then used to index protein structures. The PSIST index makes it very fast to query for a matching structural fragment.