Overview

The SHOGUN machine learning toolbox focuses on large-scale kernel methods
and especially on Support Vector Machines (SVMs). It comes with a generic
interface for kernel machines and features 15 different SVM implementations
that all access features in a unified way, either via a general kernel
framework or, in the case of linear SVMs, via so-called "DotFeatures", i.e.,
features providing a minimal set of operations (such as the dot product).

Features

SHOGUN includes the LINADD accelerations for string kernels and the COFFIN
framework for on-demand computation of features for the contained linear
SVMs. In addition, it contains more advanced Multiple Kernel Learning,
Multi-Task Learning and Structured Output learning algorithms, as well as
other linear methods. SHOGUN digests input feature objects of essentially any
known type, e.g., dense, sparse or variable-length (string) features of any
type char/byte/word/int/long int/float/double/long double.
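As a brief illustration of this flexibility, here is a sketch via the
python_modular interface (class names follow the 2.x modular API; the toy
data is made up):

    import numpy
    from shogun.Features import RealFeatures, StringCharFeatures, DNA

    # dense 64-bit float features; examples are stored column-wise
    dense_feats = RealFeatures(numpy.array([[1.0, 2.0, 3.0],
                                            [4.0, 5.0, 6.0]]))

    # variable-length string (char) features over the DNA alphabet
    string_feats = StringCharFeatures(["ACGT", "ACGGTTA"], DNA)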

The toolbox provides efficient implementations of 35 different kernels, among
them the

Linear,

Polynomial,

Gaussian and

Sigmoid Kernel

and also provides a number of recent string kernels like the

Locality Improved,

Fisher,

TOP,

Spectrum,

Weighted Degree Kernel (with shifts).

For the latter, the efficient LINADD optimizations are implemented. SHOGUN
also offers the freedom of working with custom pre-computed kernels. One of
its key features is the combined kernel, which can be constructed as a
weighted linear combination of a number of sub-kernels, each of which need
not work on the same domain; an optimal sub-kernel weighting can be learned
using Multiple Kernel Learning (see the sketch after the following list).
Currently, one-class, two-class and multi-class SVM classification as well as
regression problems are supported. SHOGUN also implements a number of linear
methods like

Linear Discriminant Analysis (LDA),

Linear Programming Machine (LPM),

Perceptrons, and algorithms to train Hidden Markov Models.
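As a hedged sketch of the combined kernel with MKL-learned sub-kernel
weights via the python_modular interface (class and method names follow the
2.x modular API; the toy data X, y is made up):

    import numpy
    from shogun.Features import RealFeatures, CombinedFeatures, BinaryLabels
    from shogun.Kernel import CombinedKernel, GaussianKernel, PolyKernel
    from shogun.Classifier import MKLClassification

    X = numpy.random.randn(2, 20)            # toy data: 20 examples
    y = numpy.sign(numpy.random.randn(20))   # toy binary labels (+1/-1)

    # one feature object per sub-kernel (here both use the same data)
    feats = CombinedFeatures()
    feats.append_feature_obj(RealFeatures(X))
    feats.append_feature_obj(RealFeatures(X))

    # weighted combination of a Gaussian and a polynomial sub-kernel
    kernel = CombinedKernel()
    kernel.append_kernel(GaussianKernel(10, 1.0))
    kernel.append_kernel(PolyKernel(10, 2))
    kernel.init(feats, feats)

    # learn the sub-kernel weights with 1-norm MKL
    mkl = MKLClassification()
    mkl.set_mkl_norm(1)
    mkl.set_C(1.0, 1.0)
    mkl.set_kernel(kernel)
    mkl.set_labels(BinaryLabels(y))
    mkl.train()
    print(kernel.get_subkernel_weights())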

The input feature objects can be read from plain ASCII files (tab-separated
values for dense matrices; LibSVM/SVMLight format for sparse matrices), from
an efficient native binary format, and from an HDF5-based format, supporting

dense

sparse or

strings of various types

that can often be converted into one another. Chains of preprocessors (e.g.,
subtracting the mean) can be attached to each feature object, allowing for
on-the-fly pre-processing, as sketched below.
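For instance, a mean-subtracting preprocessor might be attached as follows (a
sketch via the python_modular interface; in older releases the Preprocessor
module was named PreProc):

    import numpy
    from shogun.Features import RealFeatures
    from shogun.Preprocessor import PruneVarSubMean

    feats = RealFeatures(numpy.random.randn(3, 10))

    # subtract the mean (and drop zero-variance dimensions) on the fly
    preproc = PruneVarSubMean()
    preproc.init(feats)
    feats.add_preprocessor(preproc)
    feats.apply_preprocessor()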

Structure and Interfaces

SHOGUN's core is implemented in C++ and provided as a library, libshogun, to
be readily usable by C++ application developers. Its common interface
functions are encapsulated in libshogunui, such that only minimal code (like
setting or getting a double matrix to/from the target language) is necessary.
This allowed us to easily create interfaces to Matlab(tm), R, Octave and
Python. (Note that both modular object-oriented and static interfaces are
provided: r, octave, matlab, python, python_modular, r_modular,
octave_modular, cmdline, libshogun.)
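For example, passing a double matrix to SHOGUN and fetching it back via the
python_modular interface amounts to roughly:

    import numpy
    from shogun.Features import RealFeatures

    X = numpy.array([[1.0, 2.0], [3.0, 4.0]])
    feats = RealFeatures(X)          # hand a double matrix to SHOGUN
    Y = feats.get_feature_matrix()   # ...and get it back as a numpy array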

Application

We have successfully applied SHOGUN to several problems from computational
biology, such as Super Family classification, Splice Site Prediction,
Interpreting the SVM Classifier, Splice Form Prediction, Alternative Splicing
and Promoter Prediction. Some of them come with no fewer than 10 million
training examples, others with 7 billion test examples.

Documentation

We use Doxygen for both user and developer documentation, which may be read
online. More than 600 documented examples for the interfaces python_modular,
octave_modular, r_modular, static python, static matlab and octave, static r,
static command line and the C++ libshogun developer interface can be found in
the documentation.

This release features 8 successful Google Summer of Code projects and is the result of an incredible effort by our students. All projects come with very cool IPython notebooks that contain background, code examples and visualizations. These can be found on our webpage!

Features

In addition, the following features have been added:

Added a method to importance-sample the (true) marginal likelihood of a
Gaussian Process using a posterior approximation.

Added a new class for classical probability distributions that can be
sampled and whose log-pdf can be evaluated. Added the multivariate Gaussian
with various numerical flavours.

The cross-validation framework now works with Gaussian Processes.

Added nu-SVR to the LibSVR class.

Model selection is now supported for parameters of sub-kernels of combined
kernels in the MKL context. Thanks to Evangelos Anagnostopoulos.

Probability outputs for multi-class SVMs are now supported using various
heuristics. Thanks to Shell Xu Hu.

Added an "equals" method to all Shogun objects that recursively
compares all registered parameters with those of another instance --
up to a specified accuracy.

Added a "clone" method to all Shogun objects that creates a deep copy

Multiclass LDA. Thanks to Kevin Hughes.

Added a new datatype, complex128_t, for complex numbers. Math functions,
support for SGVector/Matrix and SGSparseVector/Matrix, and serialization
with ASCII and XML files were added. [Soumyajit De]
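A minimal sketch of the new "clone" and "equals" methods from Python
(assuming the modular interface; the module name modshogun is the 3.0-era
convention and may differ in other releases):

    from modshogun import GaussianKernel

    k1 = GaussianKernel(10, 2.0)
    k2 = k1.clone()         # deep copy of all registered parameters
    print(k1.equals(k2))    # True: parameters match up to default accuracy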

This release also contains several enhancements, cleanups and bugfixes:

Features

The linear-time MMD two-sample test now works on streaming features, which
allows performing tests on infinite amounts of data. A block size may be
specified for fast processing. The features below were also added.
By Heiko Strathmann.

It is now possible to ask streaming features to produce an instance of
streamed features that is stored in memory and returned as a CFeatures*
object of the corresponding type. See
CStreamingFeatures::get_streamed_features().

New concept of artificial data generator classes based on streaming
features. The first implemented instances are CMeanShiftDataGenerator and
CGaussianBlobsDataGenerator. Use these to obtain non-streaming data if
desired.

A collection of kernel selection methods for MMD-based kernel two-sample
tests, including optimal kernel choice for single and combined kernels for
the linear-time MMD. This finishes the kernel MMD framework and also comes
with new, more illustrative examples and tests. By Heiko Strathmann.

Alpha version of a Perl modular interface developed by Christian Montanari.

New framework for unit tests based on googletest and googlemock by Viktor
Gal. A (growing) number of unit tests now ensures basic functionality of our
framework. Since the examples no longer have to take on this role, they
should become more illustrative in the future.

Changed the core of the dimensionality reduction algorithms to use the
Tapkee library.

Bugfixes

Fix for shallow copy of the Gaussian kernel by Matt Aasted.

Fixed a bug when using StringFeatures along with kernel machines in
cross-validation, which caused an assertion error. Thanks to Eric (yoo)!

Fix for 3-class training of MulticlassLibSVM, reported by Arya Iranmehr;
the fix was suggested by Oksana Bayda.

New data-locking concept by Heiko Strathmann, which allows telling machines
that data is not going to change during training/testing until unlocked.
KernelMachines now make use of this by not recomputing the kernel matrix in
cross-validation.

Cross-validation for KernelMachines is now parallelized.

Cross-validation is now possible with custom kernels.

Features may now have arbitrarily many index subsets (of subsets (of subsets (...))).

Support for all data types from python_modular: dense, scipy-sparse
csc_sparse matrices and strings of type bool, char, (u)int{8,16,32,64},
float{32,64,96}. In addition, individual vectors/strings can now be obtained
and even changed. See examples/python_modular/features_*.py for examples.

AUC maximization now works with arbitrary kernel SVMs.

Documentation updates, many examples have been polished.

Slightly sped up the Oligo kernel.

Bugfixes:

Fix reading strings from a directory (f.load_from_directory()).

Update copyright to 2009.

Cleanup and API Changes:

Remove {Char,Short,Word,Int,Real}Features and only ever use the templated SimpleFeatures.

Split up examples in examples/python_modular to separate files.

Now use s.set_features(strs) instead of s.set_string_features(strs) to set string features.

The meaning of the width parameter for the Oligo kernel has changed, and the
OligoKernel has been renamed to OligoStringKernel.

This release contains several cleanups, feature enhancements and bugfixes:

Features:

configure now detects libshogun/ui installed in /usr/(local/)lib if libshogun/ui dirs are removed.

Improved documentation (and path and doxygen fixes).

Tutorial on how to develop with libshogun and extend SHOGUN.

Added the elwms (eierlegendewollmilchsau) interface: a chimera that, in one
file, interfaces to python, octave, r and matlab. It provides the
run_{octave,python,r} command to run code in {octave,python,r} from within
octave, r, matlab and python, transparently making variables available to
the target interface and avoiding file I/O.

Implement AttributeFeatures for (attr, value) pairs, trees, etc.

Bugfixes:

Fix a crasher occurring with the combined kernel and multiple threads.

configure now allows building of modular interfaces only.

n-dimensional arrays now work in octave.

Cleanup and API Changes:

The Custom Kernel no longer requires features nor initialization, not even
when used in a CombinedKernel (the combined kernel will skip over custom
kernels on init).

Implement DotFeatures and CombinedDotFeatures. DotFeatures need to provide a
dot product and similar operations (hence the name). This enables training of
linear methods with mixed data types (sparse, dense and others, even the
newly implemented string-based SpecFeatures and WDFeatures).

MKL no longer requires CPLEX.

Add q-norm MKL support based on an internal Newton implementation.

Add 1-norm MKL support based on GLPK.

Add multiclass MKL support based on GLPK and the GMNP SVM solver.

Implement Tensor Product Pair Kernel (TPPK).

Support compilation on the iPhone :)

Add an option to set WDS kernel position weights.

Build static libshogun.a for libshogun target.

The testsuite can now also test the modular R interface; added a test for
the OligoKernel.

Comments

You say, "Some of them come with no less than 10 million training examples, others with 7 billion test examples." I'm not sure what this means. I have problems with mixed symbolic/numeric attributes and the training example sets don't fit in memory. Does SHOGUN require that training examples fit in memory?

Soeren Sonnenburg (on January 14, 2011, 18:12:01)

Shogun does not necessarily require examples to be in memory (if you use any
of the FileFeatures). However, most algorithms within shogun are batch-type,
so using the non-in-memory FileFeatures would probably be very slow.

This does not matter for doing predictions, of course, even though the 7
billion test examples above referred to predicting gene starts on the whole
human genome (in memory ~3.5GB; a context window of 1200nt was shifted around
in that string).

In addition, one can compute features (or the feature space) on the fly,
potentially saving lots of memory.

Not sure how big your problem is, but I guess this is better discussed on
the shogun mailing list.

Yuri Hoffmann (on September 14, 2013, 17:12:16)

Cannot use the Java interface in Cygwin (already reported on GitHub) nor in Debian.