Working with time series is difficult due to the high dimensionality of the data, erroneous or extraneous data, and large datasets. At the core of time series data analytics are (a) a time series representation and (b) a similarity measure to compare two time series. Common similarity measures for time series are Dynamic Time Warping (DTW) and the Euclidean Distance (ED). However, these are decades old and do not meet today’s requirements. The over-dependence of research on the UCR time series classification benchmark has led to two pitfalls: studies (a) focus mostly on accuracy and (b) assume pre-processed datasets. Similarity measures should have additional properties: (a) alignment-free structural similarity, (b) noise-robustness, and (c) scalability.
This repository contains a symbolic time series representation (SFA) and two time series models (BOSS and BOSSVS) for alignment-free, noise-robust and scalable time series data analytics.
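
For readers unfamiliar with the two baseline measures named above, the following is a minimal sketch of ED and DTW in plain Python. It is not code from this repository and does not implement SFA, BOSS, or BOSSVS; the function names and the squared point-wise cost are assumptions made for illustration.

```python
import math


def euclidean_distance(a, b):
    """Point-wise distance; both series must have the same length."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW with a squared point-wise cost."""
    inf = float("inf")
    # prev[j] holds the cheapest warping cost of the previous row.
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = (x - y) ** 2
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[len(b)])


# The same bump, shifted by one time step.
a = [0, 1, 2, 1, 0, 0]
b = [0, 0, 1, 2, 1, 0]
print(euclidean_distance(a, b))  # penalizes the shift
print(dtw_distance(a, b))        # warps the shift away
```

The quadratic table in `dtw_distance` illustrates the scalability concern: every pair of points is compared, which becomes costly for long series and large datasets.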

Schäfer, P. and Leser, U. (2017).
Fast and Accurate Time Series Classification with WEASEL.
Int. Conf. on Information and Knowledge Management (CIKM). Singapore

PIEJoin is a trie-based index and algorithm for parallel set containment joins (SCJ). We provide source code, executables, implementations of four other SCJ algorithms, and data sets as used for evaluation in the paper.
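
To make the problem concrete: a set containment join takes two collections of sets, R and S, and returns every pair (r, s) with r contained in s. The sketch below is a simple inverted-index baseline written for illustration only. It is not PIEJoin, it is not parallel, and it is not one of the implementations shipped with the download; all names in it are hypothetical.

```python
from collections import defaultdict


def set_containment_join(r_sets, s_sets):
    """Return (i, j) pairs such that r_sets[i] is a subset of s_sets[j]."""
    # Inverted index: element -> ids of all sets in S that contain it.
    index = defaultdict(set)
    for j, s in enumerate(s_sets):
        for element in s:
            index[element].add(j)

    all_s = set(range(len(s_sets)))
    result = []
    for i, r in enumerate(r_sets):
        # Start with every set in S; the empty set is contained in all of them.
        candidates = set(all_s)
        # Intersect the shortest posting lists first to shrink candidates early.
        for element in sorted(r, key=lambda e: len(index.get(e, ()))):
            candidates &= index.get(element, set())
            if not candidates:
                break
        result.extend((i, j) for j in sorted(candidates))
    return result


r_sets = [{1, 2}, {3}, set()]
s_sets = [{1, 2, 3}, {2, 3}, {1, 2}]
print(set_containment_join(r_sets, s_sets))
```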

SOFA is a novel and extensible optimizer for UDF-heavy data flows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite templates, which use these properties to find a much larger number of semantically equivalent plan rewrites than possible with traditional techniques. A salient feature of SOFA is extensibility: We arrange user-defined operators and their properties into a subsumption hierarchy, which considerably eases integration and optimization of new operators. Our system is made for big data sets to be analyzed in a distributed setting, and we use several third-party tools for providing domain-specific analysis. On this page, we provide a download and instructions that can be used to repeat the experiments described in the paper.

Scientific workflows are complex objects, and their comparison entails a number of distinct steps, from comparing atomic elements to comparing the workflows as a whole. Various studies have implemented methods for scientific workflow comparison and reached often contradictory conclusions about which algorithms work best. Comparing these results is cumbersome, as the original studies mixed different approaches for different steps and used different evaluation data and metrics. We contribute to the field (i) by dissecting each previous approach into an explicitly defined and comparable set of subtasks, (ii) by comparing in isolation the different approaches taken at each step of scientific workflow comparison, reporting on a number of unexpected findings, (iii) by investigating how these can best be combined into aggregated measures, and (iv) by making available a gold standard of over 2000 similarity ratings contributed by 15 workflow experts on a corpus of almost 1500 workflows, together with re-implementations of all methods we evaluated.

FRESCO (Framework for REferential Sequence COmpression) is a general open-source framework to compress large amounts of biological sequence data. FRESCO incorporates several techniques to increase compression ratios beyond the state of the art: 1) selecting a good reference sequence and 2) rewriting a reference sequence to allow for better compression. In addition, FRESCO further boosts compression ratios by applying referential compression to already referentially compressed files (so-called second-order compression). This technique allows for compression ratios far beyond the state of the art, for instance, 4000:1 and higher for human genomes. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
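
The basic idea of referential compression can be sketched as follows: a sequence is stored as a list of pointers into a reference sequence plus the few characters that differ. This is a generic, greedy illustration and not FRESCO's actual algorithm or file format; reference selection, reference rewriting, and second-order compression are not shown, and the function names and the k-mer length `k` are assumptions.

```python
def build_kmer_index(reference, k):
    """Map every substring of length k to its start positions in the reference."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(target, reference, k=4):
    """Greedily encode target as reference matches and single literals."""
    index = build_kmer_index(reference, k)
    ops, pos = [], 0
    while pos < len(target):
        best_start, best_len = -1, 0
        for start in index.get(target[pos:pos + k], []):
            length = k
            # Extend the seed match as far as both sequences agree.
            while (pos + length < len(target)
                   and start + length < len(reference)
                   and target[pos + length] == reference[start + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len >= k:
            ops.append(("match", best_start, best_len))
            pos += best_len
        else:
            ops.append(("literal", target[pos]))
            pos += 1
    return ops


def decompress(ops, reference):
    """Rebuild the target from the operations and the same reference."""
    parts = []
    for op in ops:
        if op[0] == "match":
            _, start, length = op
            parts.append(reference[start:start + length])
        else:
            parts.append(op[1])
    return "".join(parts)


reference = "ACGTACGTTTGACCA"
target = "ACGTACGATTGACCA"  # differs from the reference in one base
ops = compress(target, reference)
print(ops)
print(decompress(ops, reference) == target)
```

The more similar the target is to the reference, the fewer and longer the matches become, which is why the choice of reference matters for the compression ratio.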

The OmixAnalyzer is a web-based solution for integrated data management and analysis within large biomedical projects. A demo version of the software is available at the link above. It stores various types of processed microarray data (human, mouse, Affymetrix, Exon Chips, Agilent, etc.) and provides easy-to-use methods for quality control, clustering, and functional analysis of selected datasets.

The OmixAnalyzer was developed in the DFG-funded Collaborative Research Project (Sonderforschungsbereich / Transregio) TRR-54: Growth and Survival, Plasticity and Cellular Interactivity of Lymphatic Malignancies.

This competition addressed approximate string matching, an important problem for database research and related fields. It has many applications, such as duplicate detection, information extraction, and error-tolerant keyword search.
Participants of this workshop competed for the most efficient implementation of scalable approximate string matching techniques. The competition comprised two tracks: Similarity string search and similarity string join. The purpose was to get a clearer picture of the state-of-the-art in string matching by comparing algorithms using the same hardware and the same (large) data sets.
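
The two tracks can be illustrated with a naive baseline. The sketch below assumes edit (Levenshtein) distance with a threshold `k` as the similarity measure, which is an assumption made here for illustration, and it uses only a simple length filter. It is not a competition entry and is far from the scalable techniques the participants submitted; all function names are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete from s
                            curr[j - 1] + 1,            # insert into s
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


def within(a, b, k):
    # Length filter: strings whose lengths differ by more than k cannot match.
    return abs(len(a) - len(b)) <= k and edit_distance(a, b) <= k


def similarity_search(query, strings, k):
    """Search track: all strings within edit distance k of the query."""
    return [s for s in strings if within(query, s, k)]


def similarity_join(strings, k):
    """Join track: all pairs of strings within edit distance k of each other."""
    return [(a, b)
            for i, a in enumerate(strings)
            for b in strings[i + 1:]
            if within(a, b, k)]


cities = ["Berlin", "Berlim", "Bern", "Genoa"]
print(similarity_search("Berlin", cities, 1))
print(similarity_join(cities, 2))
```

Both functions compare strings pairwise, so the join is quadratic in the number of strings; avoiding exactly this cost on large data sets is what makes the problem hard.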

Results were presented at a workshop held in conjunction with EDBT/ICDT 2013, March 22, 2013, Genoa, Italy.

ChemSpot is a named entity recognition tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. It currently achieves an F1 measure of 74.2% on the SCAI corpus.

This collection contains resources derived from the text mining experiments for the CellFinder project, which aims to establish a central stem cell data repository by utilizing and interlinking existing public databases on defined areas of human pluripotent stem cell research.

These resources were developed within the DFG-funded research project CellFinder: A Cell Data Repository.

GeneView is a web-based retrieval system for annotated biomedical texts. The system has indexed all PubMed abstracts plus the "data mining" subset of PMC (~200,000 full texts). All texts are tagged for occurrences of gene names (using GNAT) and mutations (using MutationFinder). Papers can be searched using the usual keyword search options, but results can also be ranked by abstract content in terms of annotated entities.

GeneView was developed by the BMBF-funded collaborative research project ColoNet.