Wick, C., Reul, C., Puppe, F.: Comparison of OCR Accuracy on Early Printed Books using the Open Source Engines Calamari and OCRopus. JLCL: Special Issue on Automatic Text and Layout Recognition (2019).

This paper proposes a combination of a convolutional and an LSTM network to improve the accuracy of OCR on early printed books. While the default approach of line-based OCR is to use a single LSTM layer, as provided by the well-established OCR software OCRopus (OCRopy), we utilize a CNN and pooling layer combination in advance of an LSTM layer, as implemented by the novel OCR software Calamari. Since historical prints often require book-specific models trained on manually labeled ground truth (GT), the goal is to maximize the recognition accuracy of a trained model while keeping the required manual effort to a minimum. We show that the deep model significantly outperforms the shallow LSTM network when using both many and only a few training examples, even though the deep network has a larger number of trainable parameters. The error rate is thereby reduced by up to 55%, yielding character error rates (CER) of 1% and below for 1,000 lines of training. To further improve the results, we apply a confidence voting mechanism to achieve CERs below 0.5%. A simple data augmentation scheme and the usage of pretrained models reduce the CER by up to a further 62% when only little training data is available. Thus, we require only 100 lines of GT to reach an average CER of 1.2%. On a CPU, the runtime of the deep model for training and for predicting a book is very similar to that of the shallow network. However, the usage of a GPU, as supported by Calamari, reduces the prediction time by a factor of at least four and the training time by a factor of more than six.
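
A minimal sketch of the kind of CNN/pooling/LSTM line recognizer described above, assuming TensorFlow/Keras; the layer counts, filter sizes, and the hypothetical NUM_CLASSES are illustrative choices, not Calamari's actual defaults, and a real setup would train this with the CTC loss (tf.nn.ctc_loss) against the line transcriptions:

```python
# Minimal sketch of a CNN/pooling + LSTM line recognizer in TensorFlow/Keras.
import tensorflow as tf

NUM_CLASSES = 100  # hypothetical alphabet size + 1 for the CTC blank label

def build_line_recognizer(height=48):
    # A grayscale text line image of fixed height and variable width.
    inputs = tf.keras.Input(shape=(height, None, 1))
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.MaxPool2D((2, 2))(x)
    x = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPool2D((2, 2))(x)
    # Treat each horizontal position as one time step for the LSTM.
    x = tf.keras.layers.Permute((2, 1, 3))(x)
    x = tf.keras.layers.Reshape((-1, (height // 4) * 128))(x)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(200, return_sequences=True))(x)
    # Per-time-step character probabilities, decoded greedily or by beam search.
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_line_recognizer()
model.summary()
```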

When converting historical lexica into electronic form, the goal is not only to obtain a high-quality OCR result for the text but also to perform a precise automatic recognition of typographical attributes in order to capture the logical structure. For that purpose, we present a method that enables a fine-grained typography classification by training an open source OCR engine on both traditional OCR and typography recognition, and we show how to map the obtained typography information to the OCR-recognized text output. As a test case, we used a German dictionary (Sander's "Wörterbuch der Deutschen Sprache") from the 19th century, which exhibits a particularly complex semantic use of typography. Despite the very challenging material, we achieved a character error rate below 0.4% and a typography recognition that assigns the correct label to close to 99% of the words. In contrast to many existing methods, our novel approach works with real historical data and can deal with frequent typography changes even within lines.

Springmann, U., Reul, C., Dipper, S., Baiter, J.: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. JLCL: Special Issue on Automatic Text and Layout Recognition (2019).

In this paper we describe a dataset of German and Latin ground truth (GT) for historical OCR in the form of printed text line images paired with their transcriptions. This dataset, called GT4HistOCR, consists of 313,173 line pairs covering a wide range of printing dates, from 15th century incunabula to 19th century books printed in Fraktur types, and is openly available under a CC-BY 4.0 license. The special form of GT as line image/transcription pairs makes it directly usable for training state-of-the-art recognition models for OCR software employing recurrent neural networks with LSTM architecture, such as Tesseract 4 or OCRopus. We also provide some pretrained OCRopus models for subcorpora of our dataset, yielding character accuracy rates between 95% (early printings) and 98% (19th century Fraktur printings) on unseen test cases, as well as a Perl script to harmonize GT produced by different transcription rules, and give hints on how to construct GT for OCR purposes, whose requirements may differ from those of linguistically motivated transcriptions.

We combine three methods which significantly improve the OCR accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross fold training on a single set of ground truth data (line images and their transcriptions) with a single OCR engine (OCRopus) produces a committee whose members then vote for the best outcome by also taking the top-N alternatives and their intrinsic confidence values into account. (3) Following the principle of maximal disagreement we select additional training lines which the voters disagree most on, expecting them to offer the highest information gain for a subsequent training (active learning). Evaluations on six early printed books yielded the following results: On average the combination of pretraining and voting improved the character accuracy by 46% when training five folds starting from the same mixed model. This number rose to 53% when using different models for pretraining, underlining the importance of diverse voters. Incorporating active learning improved the obtained results by another 16% on average (evaluated on three of the six books). Overall, the proposed methods lead to an average error rate of 2.5% when training on only 60 lines. Using a substantial ground truth pool of 1,000 lines brought the error rate down even further to less than 1% on average.
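
To make the voting step concrete, here is a small illustrative sketch of confidence-based voting across fold models; the character-by-character alignment of the fold outputs is assumed to have happened already, and the voters, characters, and confidence values are invented for the example:

```python
# Illustrative sketch of confidence-based voting across fold models.
from collections import defaultdict

def vote_position(alternatives_per_voter):
    """alternatives_per_voter: list (one entry per fold model) of
    [(char, confidence), ...] top-N lists for one aligned position."""
    scores = defaultdict(float)
    for alternatives in alternatives_per_voter:
        for char, conf in alternatives:
            scores[char] += conf
    return max(scores, key=scores.get)

# Three of five hypothetical voters misread a long s ("ſ") as "f", but with
# low confidence, so the summed score still favors the correct "ſ".
position = [
    [("f", 0.55), ("ſ", 0.45)],
    [("ſ", 0.90), ("f", 0.10)],
    [("f", 0.51), ("ſ", 0.49)],
    [("ſ", 0.85), ("f", 0.15)],
    [("f", 0.52), ("ſ", 0.48)],
]
print(vote_position(position))  # "ſ": summed 3.17 vs. "f": 1.83
```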

Frequently, port scans are early indicators of more serious attacks. Unfortunately, the detection of slow port scans in company networks is challenging due to the massive amount of network data. This paper proposes an innovative approach for preprocessing flow-based data which is specifically tailored to the detection of slow port scans. The preprocessing chain generates new objects based on flow-based data aggregated over time windows, while taking domain knowledge as well as additional knowledge about the network structure into account. The computed objects are used as input for the further analysis. Based on these objects, we propose two different approaches for the detection of slow port scans. One approach is unsupervised and uses sequential hypothesis testing, whereas the other approach is supervised and uses classification algorithms. We compare both approaches with existing port scan detection algorithms on the flow-based CIDDS-001 data set. Experiments indicate that the proposed approaches achieve better detection rates and raise fewer false alarms than similar algorithms.

Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs) which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches for generating flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high quality data.
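
As a hedged illustration of what such a preprocessing step might look like (the paper evaluates three variants; this shows only a simple binary/normalization scheme, not the authors' exact encodings):

```python
# Sketch: turning categorical flow attributes into continuous inputs for a GAN.
import ipaddress

def ip_to_bits(ip: str):
    """Encode an IPv4 address as 32 values in {0.0, 1.0}."""
    n = int(ipaddress.IPv4Address(ip))
    return [float((n >> (31 - i)) & 1) for i in range(32)]

def port_to_unit_interval(port: int):
    """Scale a port number into [0, 1]."""
    return port / 65535.0

features = ip_to_bits("192.168.100.5") + [port_to_unit_interval(443)]
print(len(features), features[:8])  # 33 continuous values in total
```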

In this paper we describe our post-evaluation results for SemEval-2018 Task 7 on classification of semantic relations in scientific literature for clean (subtask 1.1) and noisy data (subtask 1.2). Due to space limitations we publish an extended version of Hettinger et al. (2018) including further technical details and changes made to the preprocessing step in the post-evaluation phase. Due to these changes Classification of Relations using Embeddings (ClaiRE) achieved an improved F1 score of 75.11% for the first subtask and 81.44% for the second.

The k-Nearest Neighbor (kNN) classification approach is conceptually simple, yet widely applied, since it often performs well in practical applications. However, using a global constant k does not always provide an optimal solution, e.g., for datasets with an irregular density distribution of data points. This paper proposes an adaptive kNN classifier where k is chosen dynamically for each instance (point) to be classified, such that the expected accuracy of classification is maximized. We define the expected accuracy as the accuracy of a set of structurally similar observations. An arbitrary similarity function can be used to find these observations. We introduce and evaluate different similarity functions. For the evaluation, we use five different classification tasks based on geo-spatial data. Each classification task consists of (tens of) thousands of items. We demonstrate that the presented expected accuracy measures can be a good estimator for kNN performance, and that the proposed adaptive kNN classifier outperforms common kNN and previously introduced adaptive kNN algorithms. Also, we show that the range of considered k can be significantly reduced to speed up the algorithm without negative influence on classification accuracy.
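
The per-instance selection loop might look roughly like the following sketch, assuming scikit-learn; the candidate k values, the neighbourhood size m, and plain Euclidean nearest neighbours as the "similarity function" are illustrative assumptions, and the expected accuracy is estimated without a leave-one-out correction:

```python
# Minimal sketch of adaptive per-instance k selection with scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

def adaptive_knn_predict(X_train, y_train, x_query,
                         k_candidates=(1, 3, 5, 9), m=25):
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    # Structurally similar observations: the m training points nearest the query.
    nn = NearestNeighbors(n_neighbors=m).fit(X_train)
    _, idx = nn.kneighbors([x_query])
    similar = idx[0]
    # Expected accuracy of each candidate k on the similar observations.
    best_k = max(
        k_candidates,
        key=lambda k: KNeighborsClassifier(n_neighbors=k)
        .fit(X_train, y_train)
        .score(X_train[similar], y_train[similar]),
    )
    clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    return clf.predict([x_query])[0]
```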

Optical Character Recognition (OCR) on contemporary and historical data is still in the focus of many researchers. Especially historical prints require book-specific trained OCR models to achieve applicable results (Springmann and Lüdeling, 2016; Reul et al., 2017a). To reduce the human effort for manually annotating ground truth (GT), various techniques such as voting and pretraining have been shown to be very efficient (Reul et al., 2018a; Reul et al., 2018b). Calamari is a new open source OCR line recognition software that uses state-of-the-art Deep Neural Networks (DNNs) implemented in TensorFlow and provides native support for techniques such as pretraining and voting. The customizable network architectures, constructed of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers, are trained by the so-called Connectionist Temporal Classification (CTC) algorithm of Graves et al. (2006). Optional usage of a GPU drastically reduces the computation times for both training and prediction. We use two different datasets to compare the performance of Calamari to OCRopy, OCRopus3, and Tesseract 4. Calamari reaches a Character Error Rate (CER) of 0.11% on the UW3 dataset written in modern English and 0.18% on the DTA19 dataset written in German Fraktur, which considerably outperforms the results of the existing engines.

This chapter first provides an outline of the current results in the domains of: (a) quality-of-service (QoS) / quality-of-experience (QoE) control and management (CaM) for real-time multimedia services that is supported by software-defined networking (SDN), and (b) big data analytics and methods that are used for QoS/QoE CaM. Then, three specific use case scenarios with respect to video streaming services are presented, so as to illustrate the expected benefits of incorporating big data analytics into SDN-based CaM for the purposes of improving or optimizing QoS/QoE. In the end, we describe our vision and a high-level view of an SDN-based architecture for QoS/QoE CaM that is enriched with big data analytics' functional blocks and summarize corresponding challenges.

With the growth of the Social Web, a variety of new web-based services arose and changed the way users interact with the internet and consume information. One central phenomenon was and is tagging, which allows users to manage, organize, and access information in social systems. Tagging helps to manage all kinds of resources, making their access much easier. The first type of social tagging systems were social bookmarking systems, i.e., platforms for storing and sharing bookmarks on the web rather than just in the browser. Meanwhile, (hash-)tagging is central in many other Social Media systems such as social networking sites and micro-blogging platforms. To allow for efficient information access, special algorithms have been developed to guide the user, to search for information, and to rank the content based on tagging information contributed by the users.

In this paper we introduce a method that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books. The method uses a combination of cross fold training and confidence-based voting. After allocating the available ground truth into different subsets, several training processes are performed, each resulting in a specific OCR model. The OCR text generated by these models is then voted on to determine the final output, taking the recognized characters, their alternatives, and the confidence values assigned to each character into consideration. Experiments on seven early printed books show that the proposed method outperforms the standard approach considerably, reducing the amount of errors by up to 50% and more.

In this paper we evaluate Optical Character Recognition (OCR) of 19th century Fraktur scripts without book-specific training using mixed models, i.e. models trained to recognize a variety of fonts and typesets from previously unseen sources. We describe the training process leading to strong mixed OCR models and compare them to freely available models of the popular open source engines OCRopus and Tesseract, as well as to the commercial state-of-the-art system ABBYY. For evaluation, we use a varied collection of unseen data from books, journals, and a dictionary from the 19th century. The experiments show that training mixed models with real data is superior to training with synthetic data, and that the novel OCR engine Calamari considerably outperforms the other engines, on average reducing ABBYY's character error rate (CER) by over 70% and resulting in an average CER below 1%.

We introduce a dynamically adapting tuning scheme for microtonal tuning of musical instruments, allowing the performer to play music in just intonation in any key. Unlike previous methods, which are based on a procedural analysis of the chordal structure, the suggested tuning scheme continually solves a system of linear equations without making explicit decisions. In complex situations, where not all intervals of a chord can be tuned according to just frequency ratios, the method automatically yields a tempered compromise. We outline the implementation of the algorithm in an open-source software project that we have provided in order to demonstrate the feasibility of the tuning method.
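
A tiny worked example of the underlying idea, assuming NumPy: each sounding interval contributes one linear equation, and solving the (generally overdetermined) system in the least-squares sense yields the tempered compromise mentioned above. The triad and the just interval sizes in cents are illustrative, not taken from the paper:

```python
# Toy example: tuning a C major triad (C, E, G) as a least-squares problem.
# Each interval yields an equation pitch[j] - pitch[i] = just size in cents;
# the slight inconsistency (386.3 + 315.6 != 702.0) forces a compromise.
import numpy as np

notes = ["C", "E", "G"]
equations = [   # (i, j, just interval size in cents)
    (0, 1, 386.3),  # major third C-E
    (0, 2, 702.0),  # perfect fifth C-G
    (1, 2, 315.6),  # minor third E-G
]
A = np.zeros((len(equations) + 1, len(notes)))
b = np.zeros(len(equations) + 1)
for row, (i, j, cents) in enumerate(equations):
    A[row, i], A[row, j], b[row] = -1.0, 1.0, cents
A[-1, 0] = 1.0  # anchor C at 0 cents to remove the free global offset
offsets, *_ = np.linalg.lstsq(A, b, rcond=None)
print(dict(zip(notes, offsets.round(2))))
# ≈ {'C': 0.0, 'E': 386.33, 'G': 701.97} -- the 0.1-cent conflict is spread out
```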

Convolutional neural networks (CNNs) have become popular especially in computer vision in the last few years because they achieve outstanding performance on different tasks, such as image classification. We propose a nine-layer CNN for leaf identification using the well-known Flavia and Foliage datasets. Usually the supervised learning of deep CNNs requires huge datasets for training. However, the datasets used here contain only a few examples per plant species. Therefore, we apply data augmentation and transfer learning to prevent our network from overfitting. The trained CNNs achieve recognition rates above 99% on the Flavia and Foliage datasets, and slightly outperform current methods for leaf classification.
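
A hedged sketch of the two countermeasures named above, assuming TensorFlow/Keras; the paper trains its own nine-layer CNN, so the ImageNet-pretrained MobileNetV2 backbone here is only a stand-in to illustrate transfer learning combined with simple data augmentation on a small leaf dataset:

```python
# Sketch: data augmentation + transfer learning against overfitting.
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # reuse pretrained features; train only the new head
model = tf.keras.Sequential([
    augment,
    base,
    tf.keras.layers.Dense(32, activation="softmax"),  # 32 species in Flavia
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```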

The aim of this pilot study was to analyze the off-training physical activity (PA) profile of national elite German U23 rowers during 31 days of their preparation period. The hours spent in each PA category (i.e. sedentary: <1.5 MET; light physical activity: 1.5–3 MET; moderate physical activity: 3–6 MET; and vigorous physical activity: >6 MET) were calculated for every valid day (i.e. >480 min of wear time). The off-training PA during 21 weekdays and 10 weekend days of the final 11-wk preparation period was assessed by a wrist-worn multisensory device (Microsoft Band II (MSBII)). A total of 11 rowers provided valid data (i.e. >480 min/day) for 11.6 weekdays and 4.8 weekend days during the 31-day observation period. The average sedentary time was 11.63±1.25 hours per day during the week and 12.49±1.10 hours per day on the weekend, with a tendency to be higher on the weekend compared to weekdays (p = 0.06; d = 0.73). The average time in light, moderate, and vigorous PA was 1.27±1.15, 0.76±0.37, and 0.51±0.44 hours per weekday and 0.67±0.43, 0.59±0.37, and 0.53±0.32 hours per weekend day. Light physical activity was higher during weekdays compared to the weekend (p = 0.04; d = 0.69). Based on our pilot study of eleven national elite rowers we conclude that rowers display a considerable sedentary off-training behavior of more than 11.5 hours/day.
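
For clarity, the MET cut-offs above translate into a simple bucketing rule; the following helper is purely illustrative:

```python
# Purely illustrative helper making the MET cut-offs explicit.
def pa_category(met: float) -> str:
    if met < 1.5:
        return "sedentary"
    if met < 3.0:
        return "light"
    if met < 6.0:
        return "moderate"
    return "vigorous"

print([pa_category(m) for m in (1.2, 2.0, 4.5, 8.0)])
# ['sedentary', 'light', 'moderate', 'vigorous']
```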

Recently, Recurrent Neural Networks (RNNs) have been applied to the task of session-based recommendation. These approaches use RNNs to predict the next item in a user session based on the previously visited items. While some approaches consider additional item properties, we argue that item dwell time can be used as an implicit measure of user interest to improve session-based item recommendations. We propose an extension to existing RNN approaches that captures user dwell time in addition to the visited items and show that recommendation performance can be improved. Additionally, we investigate the usefulness of a single validation split for model selection in the case of minor improvements and find that in our case the best model is not selected and a fold-like study with different validation sets is necessary to ensure the selection of the best model.
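
One simple way to fold dwell time into a session sequence before feeding it to an RNN is to repeat each item proportionally to its dwell time; this is an illustrative mechanism with invented parameters, not necessarily the extension evaluated in the paper:

```python
# Sketch: expand a session by dwell time before feeding it to an RNN.
def expand_by_dwell(session, unit_seconds=30, max_repeats=5):
    """session: list of (item_id, dwell_seconds) pairs."""
    expanded = []
    for item, dwell in session:
        repeats = min(max_repeats, max(1, round(dwell / unit_seconds)))
        expanded.extend([item] * repeats)
    return expanded

print(expand_by_dwell([("a", 10), ("b", 95), ("c", 40)]))
# ['a', 'b', 'b', 'b', 'c'] -- "b" held attention longest
```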

Sentiment Analysis is a Natural Language Processing task that is relevant in a number of contexts, including the analysis of literature. We report on ongoing research towards enabling, for the first time, sentence-level Sentiment Analysis in the domain of German novels. We create a labelled dataset from sentences extracted from German novels and, by adapting existing sentiment classifiers, reach promising F1-scores of 0.67 for binary polarity classification.

Publicly available labelled data sets are necessary for evaluating anomaly-based Intrusion Detection Systems (IDS). However, existing data sets are often not up-to-date or not yet published because of privacy concerns. This paper identifies requirements for good data sets and proposes an approach for their generation. The key idea is to use a test environment and emulate realistic user behaviour with parameterised scripts on the clients. Comprehensive logging mechanisms provide additional information which may be used for a better understanding of the inner dynamics of an IDS. Finally, the proposed approach is used to generate the flow-based CIDDS-002 data set.

Sequential data can be found in many settings, e.g., as sequences of visited websites or as location sequences of travellers. To improve the understanding of the underlying mechanisms that generate such sequences, the HypTrails approach provides a novel data analysis method. Based on first-order Markov chain models and Bayesian hypothesis testing, it allows for comparing a set of hypotheses, i.e., beliefs about transitions between states, with respect to their plausibility considering observed data. HypTrails has been successfully employed to study phenomena in the online and the offline world. In this talk, we want to give an introduction to HypTrails and showcase selected real-world applications on urban mobility and reading behavior on Wikipedia.

Company data are a valuable asset and must be protected against unauthorized access and manipulation. In this contribution, we report on our ongoing work that aims to support IT security experts with identifying novel or obfuscated attacks in company networks, irrespective of their origin inside or outside the company network. A new toolset for anomaly-based network intrusion detection is proposed. This toolset uses flow-based data which can be easily retrieved by central network components. We study the challenges of analysing flow-based data streams using data mining algorithms and build an appropriate approach step by step. In contrast to previous work, we collect flow-based data for each host over a certain time window, include the knowledge of domain experts and analyse the data from three different views. We argue that incorporating expert knowledge and previous flows allows us to create more meaningful attributes for subsequent analysis methods. This way, we try to detect novel attacks while simultaneously limiting the number of false positives.

A method is presented that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books when only small amounts of diplomatic transcriptions are available. This is achieved by building from already existing models during training instead of starting from scratch. To overcome the discrepancies between the character set of the pretrained model and the additional ground truth, the OCRopus code is adapted to allow for alphabet expansion or reduction: characters can now be flexibly added to or removed from the pretrained alphabet when an existing model is loaded. For our experiments we use a self-trained mixed model on early Latin prints and the two standard OCRopus models on modern English and German Fraktur texts. The evaluation on seven early printed books showed that training from the Latin mixed model reduces the average amount of errors by 43% and 26%, compared to training from scratch with 60 and 150 lines of ground truth, respectively. Furthermore, it is shown that even building from mixed models trained on standard data unrelated to the newly added training and test data can lead to significantly improved recognition results.

Polarized (POL) training intensity distribution (TID) emphasizes high-volume low-intensity exercise in zone (Z)1 (< first lactate threshold) with a greater proportion of high-intensity Z3 (> second lactate threshold) compared to Z2 (between first and second lactate threshold). In highly trained rowers there is a lack of prospective controlled evidence whether POL is superior to pyramidal (PYR; i.e. greater volume in Z1 vs. Z2 vs. Z3) TID. The aim of the study was to compare the effect of POL vs. PYR TID in rowers during an 11-wk preparation period. Fourteen national elite male rowers participated (age: 20 ± 2 years, maximal oxygen uptake (V̇O2max): 66±5 mL/min/kg). The sample was split into PYR and POL by varying the percentage spent in Z2 and Z3 while Z1 was clamped to ~93% and matched for total and rowing volume. Actual TIDs were based on time within heart rate zones (Z1 and Z2) and duration of Z3 intervals. The main outcome variables were average power in a 2000 m ergometer test (P2000m), power associated with 4 mmol/L [blood lactate] (P4[BLa]), and V̇O2max. To quantify the level of polarization, we calculated a Polarization-Index as log(%Z1 × %Z3 / %Z2). PYR and POL did not significantly differ regarding rowing or total volume, but POL had a higher percentage of Z3 intensities (6±3% vs. 2±1%; p < .005) while Z2 was lower (1±1% vs. 3±2%; p < .05) and Z1 was similar (94±3% vs. 93±2%; p = .37). Consequently, the Polarization-Index was significantly higher in POL (3.0±0.7 a.u. vs. 1.9±0.4 a.u.; p < .01). P2000m did not significantly change with either PYR (1.5±1.7%, p = .06) or POL (1.5±2.6%, p = .26). V̇O2max did not change (1.7±5.6%, p = .52 and 0.6±2.6%, p = .67), and a small increase in P4[BLa] was observed in PYR only (1.9±4.8%, p = .37 vs. -0.5±4.1%, p = .77). Changes from pre to post were not significantly different between groups in any performance measure. POL did not prove to be superior to PYR, possibly due to the high and very similar percentage of Z1 in this study.
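
Plugging the reported group means into the Polarization-Index gives a quick sanity check; the sketch below assumes a base-10 logarithm, and the small deviation from the reported 1.9 and 3.0 a.u. stems from using the rounded zone percentages given in the text:

```python
# Worked check of the Polarization-Index from the rounded group means.
import math

def polarization_index(z1, z2, z3):
    """z1, z2, z3: percentages of training time in the three zones."""
    return math.log10(z1 * z3 / z2)

print(round(polarization_index(93, 3, 2), 2))  # PYR: 1.79
print(round(polarization_index(94, 1, 6), 2))  # POL: 2.75
```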

For the popular task of tag recommendation, various (complex) approaches have been proposed. Recently however, research has focused on heuristics with low computational effort and particularly, a time-aware heuristic, called BLL, has been shown to compare well to various state-of-the-art methods. Here, we follow up on these results by presenting another time-aware approach leveraging user interaction data in an easily interpretable, on-the-fly computable approach that can successfully be combined with BLL. We investigate the influence of time as a parameter in that approach, and we demonstrate the effectiveness of the proposed method using two datasets from the popular public social tagging system BibSonomy.

Sequential traces of user data are frequently observed online and offline, e.g., as sequences of visited websites or as sequences of locations captured by GPS. However, understanding the factors explaining the production of sequence data is a challenging task, especially since the data generation is often not homogeneous. For example, navigation behavior might change in different phases of browsing a website, or movement behavior may vary between groups of users. In this work, we tackle this task and propose MixedTrails, a Bayesian approach for comparing the plausibility of hypotheses regarding the generative processes of heterogeneous sequence data. Each hypothesis is derived from existing literature, theory, or intuition and represents a belief about transition probabilities between a set of states that can vary between groups of observed transitions. For example, when trying to understand human movement in a city and given some data, a hypothesis assuming tourists to be more likely to move towards points of interest than locals can be shown to be more plausible than a hypothesis assuming the opposite. Our approach incorporates such hypotheses as Bayesian priors in a generative mixed transition Markov chain model, and compares their plausibility utilizing Bayes factors. We discuss analytical and approximate inference methods for calculating the marginal likelihoods for Bayes factors, give guidance on interpreting the results, and illustrate our approach with several experiments on synthetic and empirical data from Wikipedia and Flickr. Thus, this work enables a novel kind of analysis for studying sequential data in many application areas.

When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music playlists. Understanding the factors that drive the production of these trails can be useful, for example, for improving underlying network structures, predicting user clicks, or enhancing recommendations. In this work, we present a method called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our method utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to calculate the evidence of the data under them. For eliciting Dirichlet priors from hypotheses, we present an adaption of the so-called (trial) roulette method, and to compare the relative plausibility of hypotheses, we employ Bayes factors. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains including website navigation, business reviews, and online music playlists. Our work expands the repertoire of methods available for studying human trails.
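
The computational core of HypTrails is the evidence of the observed transitions under a Dirichlet prior that encodes a hypothesis; here is a compact sketch, assuming NumPy/SciPy, with a toy two-state example and invented pseudo-counts (the log Bayes factor is simply the difference of the two log evidences):

```python
# Sketch: log evidence of transition counts under a Dirichlet prior.
import numpy as np
from scipy.special import gammaln

def log_evidence(counts, alpha):
    """counts, alpha: (S, S) arrays of transition counts and Dirichlet
    pseudo-counts (the elicited hypothesis), one row per source state."""
    ev = 0.0
    for n, a in zip(counts, alpha):
        ev += gammaln(a.sum()) - gammaln((n + a).sum())
        ev += (gammaln(n + a) - gammaln(a)).sum()
    return ev

counts = np.array([[8, 2], [1, 9]])  # observed transitions
h_stay = np.eye(2) * 5 + 1           # hypothesis: users stay in their state
h_uniform = np.full((2, 2), 3.5)     # hypothesis: everything equally likely
print(log_evidence(counts, h_stay) - log_evidence(counts, h_uniform))
# positive: the "stay" hypothesis explains the trails better
```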

A semi-automatic open-source tool for layout analysis on early printed books is presented. LAREX uses a rule-based connected components approach which is very fast, easily comprehensible for the user, and allows an intuitive manual correction if necessary. The PageXML format is used to support integration into existing OCR workflows. Evaluations showed that LAREX provides an efficient and flexible way to segment pages of early printed books.

Assessing the degree of semantic relatedness between words is an important task with a variety of semantic applications, such as ontology learning for the Semantic Web, semantic search or query expansion. To accomplish this in an automated fashion, many relatedness measures have been proposed. However, most of these metrics only encode information contained in the underlying corpus and thus do not directly model human intuition. To solve this, we propose to utilize a metric learning approach to improve existing semantic relatedness measures by learning from additional information, such as explicit human feedback. For this, we argue to use word embeddings instead of traditional high-dimensional vector representations in order to leverage their semantic density and to reduce computational cost. We rigorously test our approach on several domains including tagging data as well as publicly available embeddings based on Wikipedia texts and navigation. Human feedback about semantic relatedness for learning and evaluation is extracted from publicly available datasets such as MEN or WS-353. We find that our method can significantly improve semantic relatedness measures by learning from additional information, such as explicit human feedback. For tagging data, we are the first to generate and study embeddings. Our results are of special interest for ontology and recommendation engineers, but also for any other researchers and practitioners of Semantic Web techniques.

This paper provides the first thorough documentation of a high-quality digitization process applied to an early printed book from the incunabulum period (1450–1500). The entire OCR-related workflow, including preprocessing, layout analysis, and text recognition, is illustrated in detail using the example of ‘Der Heiligen Leben’, printed in Nuremberg in 1488. For each step the required time expenditure was recorded. The character recognition yielded excellent results both on character (97.57%) and word (92.19%) level. Furthermore, a comparison of a highly automated (LAREX) and a manual (Aletheia) method for layout analysis was performed. By considerably automating the segmentation, the required human effort was reduced significantly from over 100 hours to less than six hours, resulting in only a slight drop in OCR accuracy. Realistic estimates for the human effort necessary for full text extraction from incunabula can be derived from this study. The printed pages of the complete work, together with the OCR results, are available online, ready to be inspected and downloaded.

We propose a high-performance fully convolutional neural network (FCN) for historical handwritten document segmentation that is designed to process a single page in one step. The advantage of this model, besides its speed, is its ability to directly learn from raw pixels instead of using preprocessing steps, e.g., feature computation or superpixel generation. We show that this network yields better results than existing methods on different public data sets. For the evaluation of this model we introduce a novel metric that is independent of ambiguous ground truth, called Foreground Pixel Accuracy (FgPA). This pixel-based measure only counts foreground pixels in the binarized page; any background pixel is omitted. The major advantage of this metric is that it enables researchers to compare different segmentation methods on their ability to successfully segment text or pictures, and not on their ability to learn and possibly overfit the peculiarities of an ambiguous hand-made ground truth segmentation.
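
A small sketch of the FgPA idea, assuming NumPy; the toy arrays are invented, and a real evaluation would of course run over whole binarized pages:

```python
# Sketch: Foreground Pixel Accuracy -- score only the ink pixels.
import numpy as np

def fgpa(prediction, ground_truth, binarized):
    """All inputs are (H, W) arrays; `binarized` is truthy where a pixel is ink."""
    fg = binarized.astype(bool)
    return (prediction[fg] == ground_truth[fg]).mean()

pred = np.array([[1, 0], [2, 2]])
gt   = np.array([[1, 1], [2, 0]])
ink  = np.array([[1, 0], [1, 1]])
print(fgpa(pred, gt, ink))  # 2 of 3 foreground pixels agree -> 0.666...
```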

In recent years clinical data warehouses (CDW) have become more and more popular to support scientific work in the medical domain. Despite the tool support for many subtasks it is still a laborious task to establish a CDW in an existing clinical data environment. We present a workflow which can be taken as a blueprint for newly established CDW projects and the implementation of this blueprint at the University Clinic Würzburg.

In social tagging systems, like Mendeley, CiteULike, and BibSonomy, users can post, tag, visit, or export scholarly publications. In this paper, we compare citations with metrics derived from users’ activities (altmetrics) in the popular social bookmarking system BibSonomy. Our analysis, using a corpus of more than 250,000 publications published before 2010, reveals that overall, citations and altmetrics in BibSonomy are mildly correlated. Furthermore, grouping publications by user-generated tags results in topic-homogeneous subsets that exhibit higher correlations with citations than the full corpus. We find that posts, exports, and visits of publications are correlated with citations and even bear predictive power over future impact. Machine learning classifiers predict whether the number of citations that a publication receives in a year exceeds the median number of citations in that year, based on the usage counts of the preceding year. In that setup, a Random Forest predictor outperforms the baseline on average by seven percentage points.

Identifying plot structure in novels is a valuable step towards automatic processing of literary corpora. We present an approach to classify novels as either having a happy ending or not. To achieve this, we use features based on different sentiment lexica as input for an SVM classifier, which yields an average F1-score of about 73%.

Social tagging systems have established themselves as a quick and easy way to organize information by annotating resources with tags. In recent work, user behavior in social tagging systems was studied, that is, how users assign tags, and consume content. However, it is still unclear how users make use of the navigation options they are given. Understanding their behavior and differences in behavior of different user groups is an important step towards assessing the effectiveness of a navigational concept and of improving it to better suit the users’ needs. In this work, we investigate navigation trails in the popular scholarly social tagging system BibSonomy from six years of log data. We discuss dynamic browsing behavior of the general user population and show that different navigational subgroups exhibit different navigational traits. Furthermore, we provide strong evidence that the semantic nature of the underlying folksonomy is an essential factor for explaining navigation.

We present a new method for detecting interpretable subgroups with exceptional transition behavior in sequential data. Identifying such patterns has many potential applications, e.g., for studying human mobility or analyzing the behavior of internet users. To tackle this task, we employ exceptional model mining, which is a general approach for identifying interpretable data subsets that exhibit unusual interactions between a set of target attributes with respect to a certain model class. Although exceptional model mining provides a well-suited framework for our problem, previously investigated model classes cannot capture transition behavior. To that end, we introduce first-order Markov chains as a novel model class for exceptional model mining and present a new interestingness measure that quantifies the exceptionality of transition subgroups. The measure compares the distance between the Markov transition matrix of a subgroup and the respective matrix of the entire data with the distance of random dataset samples. In addition, our method can be adapted to find subgroups that match or contradict given transition hypotheses. We demonstrate that our method is consistently able to recover subgroups with exceptional transition models from synthetic data and illustrate its potential in two application examples. Our work is relevant for researchers and practitioners interested in detecting exceptional transition behavior in sequential data.
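
A hedged sketch of the interestingness idea: measure how far the subgroup's first-order transition matrix deviates from the dataset's, normalized by the deviations of equally sized random samples. The L1 distance and the z-score normalization are illustrative stand-ins, not necessarily the paper's exact quality function:

```python
# Sketch: exceptionality of a subgroup's transition behavior.
import numpy as np

def transition_matrix(transitions, n_states):
    m = np.zeros((n_states, n_states))
    for i, j in transitions:
        m[i, j] += 1
    rows = m.sum(axis=1, keepdims=True)
    # Rows without observations fall back to a uniform distribution.
    return np.divide(m, rows, out=np.full_like(m, 1.0 / n_states),
                     where=rows > 0)

def exceptionality(subgroup, dataset, n_states, n_samples=200, rng=None):
    rng = rng or np.random.default_rng(0)
    d_all = transition_matrix(dataset, n_states)
    d_sub = np.abs(transition_matrix(subgroup, n_states) - d_all).sum()
    sample_dists = []
    for _ in range(n_samples):
        idx = rng.choice(len(dataset), size=len(subgroup), replace=False)
        sample = [dataset[i] for i in idx]
        sample_dists.append(np.abs(transition_matrix(sample, n_states) - d_all).sum())
    return (d_sub - np.mean(sample_dists)) / (np.std(sample_dists) + 1e-12)

data = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0), (1, 1), (0, 0), (1, 1)]
sub  = [(0, 1), (1, 0), (0, 1)]  # hypothetical subgroup with unusual switching
print(exceptionality(sub, data, n_states=2))  # clearly above random samples
```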

Semantic relatedness between words has been successfully extracted from navigation on Wikipedia pages. However, the navigational data used in the corresponding works are sparse and expected to be biased since they have been collected in the context of games. In this paper, we address this limitation and explore whether semantic relatedness can also be extracted from unconstrained navigation. To this end, we first highlight structural differences between unconstrained navigation and game data. Then, we adapt a state-of-the-art approach to extract semantic relatedness on Wikipedia paths. We apply this approach to transitions derived from two unconstrained navigation datasets as well as transitions from WikiGame and compare the results based on two common gold standards. We confirm expected structural differences when comparing unconstrained navigation with the paths collected by WikiGame. In line with this result, the mentioned state-of-the-art approach for semantic extraction on navigation data does not yield good results for unconstrained navigation. Yet, we are able to derive a relatedness measure that performs well on both unconstrained navigation data and game data. Overall, we show that unconstrained navigation data on Wikipedia is suited for extracting semantics.

Semantic relatedness between words has been extracted from a variety of sources. In this ongoing work, we explore and compare several options for determining if semantic relatedness can be extracted from navigation structures in Wikipedia. In that direction, we first investigate the potential of representation learning techniques such as DeepWalk in comparison to previously applied methods based on counting co-occurrences. Since both methods are based on (random) paths in the network, we also study different approaches to generate paths from the Wikipedia link structure. For this task, we do not only consider the link structure of Wikipedia, but also actual navigation behavior of users. Finally, we analyze if semantics can also be extracted from smaller subsets of the Wikipedia link network. As a result we find that representation learning techniques mostly outperform the investigated co-occurrence counting methods on the Wikipedia network. However, we find that this is not the case for paths sampled from human navigation behavior.

Doerfel, S., Zoller, D., Singer, P., Niebler, T., Hotho, A., Strohmaier, M.: What Users Actually do in a Social Tagging System: A Study of User Behavior in BibSonomy. ACM Transactions on the Web. 10, 14:1–14:32 (2016).

Social tagging systems have established themselves as an important part of today's web and have attracted the interest of our research community in a variety of investigations. Over time, several aspects of social tagging systems have been discussed, and assumptions have emerged on which our community builds its work. Yet, testing such assumptions has been difficult due to the absence of suitable usage data in the past. In this work, we thoroughly investigate and evaluate four aspects of tagging systems, covering social interaction, retrieval of posted resources, the importance of the three different types of entities (users, resources, and tags), as well as connections between these entities' popularity in posted and in requested content. For that purpose, we examine live server log data gathered from the real-world, public social tagging system BibSonomy. Our empirical results paint a mixed picture about the four aspects. While for some, typical assumptions hold to a certain extent, other aspects need to be viewed in a very critical light. Our observations have implications for the understanding of social tagging systems, and the way they are used on the web. We make the dataset used in this work available to other researchers.

With regard to a computational representation of literary plot, this paper looks at the use of sentiment analysis for happy ending detection in German novels. Its focus lies on the investigation of previously proposed sentiment features in order to gain insight about the relevance of specific features on the one hand and the implications of their performance on the other hand. Therefore, we study various partitionings of novels, considering the highly variable concept of "ending". We also show that our approach, even though still rather simple, can potentially lead to substantial findings relevant to literary studies.

Subgroup discovery is a key data mining method that aims at identifying descriptions of subsets of the data that show an interesting distribution with respect to a pre-defined target concept. For practical applications the integration of numerical data is crucial. Therefore, a wide variety of interestingness measures has been proposed in literature that use a numerical attribute as the target concept. However, efficient mining in this setting is still an open issue. In this paper, we present novel techniques for fast exhaustive subgroup discovery with a numerical target concept. We initially survey previously proposed measures in this setting. Then, we explore options for pruning the search space using optimistic estimate bounds. Specifically, we introduce novel bounds in closed form and ordering-based bounds as a new technique to derive estimates for several types of interestingness measures with no previously known bounds. In addition, we investigate efficient data structures, namely adapted FP-trees and bitset-based data representations, and discuss their interdependencies to interestingness measures and pruning schemes. The presented techniques are incorporated into two novel algorithms. Finally, the benefits of the proposed pruning bounds and algorithms are assessed and compared in an extensive experimental evaluation on 24 publicly available datasets. The novel algorithms reduce runtimes consistently by more than one order of magnitude.
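
To illustrate the pruning principle for one concrete measure: for the mean-shift ("impact") quality q(S) = |S| * (mean(S) - mean(data)), the sum of positive deviations within S bounds the quality of every refinement of S, so branches whose bound falls below the current top-k threshold can be discarded. A minimal sketch; this particular bound is one known closed-form example, not the paper's full repertoire:

```python
# Sketch: an optimistic estimate bound for the mean-shift ("impact") measure.
def impact(values, global_mean):
    return len(values) * (sum(values) / len(values) - global_mean)

def optimistic_estimate(values, global_mean):
    # Best conceivable refinement keeps exactly the above-average instances.
    return sum(v - global_mean for v in values if v > global_mean)

data = [4, 9, 1, 7, 3]
mu = sum(data) / len(data)                # 4.8
subgroup = [9, 1, 7]
print(impact(subgroup, mu))               # 3 * (17/3 - 4.8) = 2.6
print(optimistic_estimate(subgroup, mu))  # (9-4.8) + (7-4.8) = 6.4
# If the current top-k threshold exceeded 6.4, no refinement of this
# subgroup could qualify, and the whole branch could be pruned.
```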

In this work, feature extraction techniques for leaf classification are evaluated in a cross-dataset scenario. First, a leaf identification system consisting of six feature classes is described and tested on five established publicly available datasets by using standard evaluation procedures within the datasets. Afterwards, the performance of the developed system is evaluated in the much more challenging scenario of cross-dataset evaluation. Finally, a new dataset is introduced, as well as a web service which allows users to identify leaves both photographed on paper and while still attached to the tree. While the results obtained during classification within a dataset come close to the state of the art, the classification accuracy in cross-dataset evaluation is significantly worse. However, by adjusting the system and taking the top five predictions into consideration, very good results of up to 98% are achieved. It is shown that this difference is due to the ineffectiveness of certain feature classes as well as the increased severity of the task, as leaves that grew under different environmental influences can differ significantly not only in colour, but also in shape.

Doerfel, S., Zoller, D., Singer, P., Niebler, T., Hotho, A., Strohmaier, M.: What Users Actually do in a Social Tagging System: A Study of User Behavior in BibSonomy. ACM Transactions on the Web. 10, 14:1--14:32 (2016).

Social tagging systems have established themselves as an important part of today’s web and have attracted the interest of our research community in a variety of investigations. Over time, several aspects of social tagging systems have been discussed, and assumptions have emerged on which our community builds its work. Yet, testing such assumptions has been difficult due to the absence of suitable usage data in the past. In this work, we thoroughly investigate and evaluate four aspects of tagging systems: social interaction, retrieval of posted resources, the importance of the three different types of entities (users, resources, and tags), and the connections between these entities’ popularity in posted and in requested content. For that purpose, we examine live server log data gathered from the real-world, public social tagging system BibSonomy. Our empirical results paint a mixed picture of the four aspects. While some typical assumptions hold to a certain extent, other aspects need to be viewed in a very critical light. Our observations have implications for the understanding of social tagging systems and the way they are used on the web. We make the dataset used in this work available to other researchers.

Athletes adapt their training daily to optimize performance, as well as avoid fatigue, overtraining and other undesirable effects on their health. To optimize training load, each athlete must take his/her own personal objective and subjective characteristics into consideration and an increasing number of wearable technologies (wearables) provide convenient monitoring of various parameters. Accordingly, it is important to help athletes decide which parameters are of primary interest and which wearables can monitor these parameters most effectively. Here, we discuss the wearable technologies available for non-invasive monitoring of various parameters concerning an athlete's training and health. On the basis of these considerations, we suggest directions for future development. Furthermore, we propose that a combination of several wearables is most effective for accessing all relevant parameters, disturbing the athlete as little as possible, and optimizing performance and promoting health.

In social tagging systems, like Mendeley, CiteULike, and BibSonomy, users can post, tag, visit, or export scholarly publications. In this paper, we compare citations with metrics derived from users’ activities (altmetrics) in the popular social bookmarking system BibSonomy. Our analysis, using a corpus of more than 250,000 publications published before 2010, reveals that overall, citations and altmetrics in BibSonomy are mildly correlated. Furthermore, grouping publications by user-generated tags results in topic-homogeneous subsets that exhibit higher correlations with citations than the full corpus. We find that posts, exports, and visits of publications are correlated with citations and even bear predictive power over future impact. Machine learning classifiers predict whether the number of citations that a publication receives in a year exceeds the median number of citations in that year, based on the usage counts of the preceding year. In that setup, a Random Forest predictor outperforms the baseline on average by seven percentage points.
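
A condensed sketch of the prediction setup described above: binary labels mark publications whose citations exceed the yearly median, and a Random Forest is trained on the preceding year's usage counts (the concrete feature columns and numbers here are made up for illustration).

    # Sketch: predict above-median citation counts from prior-year usage.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Rows: publications; columns: posts, exports, visits in year t (assumed features).
    X = np.array([[12, 3, 240],
                  [ 1, 0,  15],
                  [ 7, 2,  98]])
    citations_next_year = np.array([30, 2, 11])
    # Label: does a publication exceed the median citation count of its year?
    y = (citations_next_year > np.median(citations_next_year)).astype(int)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    # clf.predict(new_usage_counts) then yields the above/below-median label.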

To protect human health and the environment, the European Union implemented the REACH regulation for chemical substances. REACH is an acronym for Registration, Evaluation, Authorization, and Restriction of Chemicals. Under REACH, the authorities have the task of assessing chemical substances, especially those that might pose a risk to human health or the environment. The work under REACH is scientifically, technically, and procedurally a complex and knowledge-intensive task that is jointly performed by the European Chemicals Agency and member state authorities in Europe. The assessment of substances under REACH conducted in the German Environment Agency is supported by the knowledge-based system KnowSEC, which is used for screening, documentation, and decision support when working on chemical substances. The software KnowSEC integrates advanced semantic technologies and strong problem-solving methods. It allows for collaborative work on substances in the context of the European REACH regulation. We discuss the applied methods and process models and report on experiences with the implementation and use of the system.

This chapter describes and evaluates the design and implementation of a new fully autonomous quadrocopter, which is capable of self-reliant search, counting, and localization of a predefined object on the ground inside a room. In a preliminary calibration scan, the parameters of the object are defined; as an example object, a red ball is used, and the scan determines the colour and radius of the ball. The implementation and principles of the object recognition and search are described in detail. After determining the scanning parameters, the autonomous search can be executed. This is done autonomously by the quadrocopter, which uses inertial, infrared, ultrasonic, and pressure sensors as well as an optical flow sensor to determine and control its orientation and position in 6 DOF (degrees of freedom). Furthermore, the quadrocopter can be equipped with sensors for obstacle detection and collision avoidance such as ultrasonic, infrared, PMD (photo mixing device), and SV (stereo vision) cameras. A camera attached to the quadrocopter and directed at the ground is used to find the target objects and to determine their positions during the autonomous flight. Hence, objects that fulfil the scanning parameters can be found in different positions. Based on the quadrocopter's own known position and the position of the object in the camera image, the position of each detected object can be determined; thus, repeated detections of the same object can be excluded. Consequently, objects can be counted and localized autonomously. The position of the object is transferred to the ground station and compared with the true position to evaluate the system. Two different search situations and two different strategies, breadth-first search (BFS) and depth-first search (DFS), are investigated and their results are compared. The evaluation shows the potential, constraints, and drawbacks of this approach, as well as the effect of the search strategy and the most important parameters and indicators such as field of view, masking area, and minimal object distance, as well as accuracy, performance, and completeness of the search. The entire system is composed of low-cost components and constructed from scratch. Its integration into the innovative real-time operating system Rodos, developed by the German Aerospace Centre, is described in detail. Rodos has been developed for embedded systems such as satellites and comparable aerospace systems.
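
The two coverage strategies can be illustrated by the waypoint orders they produce when the room is modeled as a grid of scan positions; the grid size, start cell, and 4-neighborhood below are assumptions of this sketch, not the chapter's actual flight planner.

    # Sketch: BFS vs. DFS waypoint orders over a grid of scan positions.
    from collections import deque

    W, H, START = 4, 3, (0, 0)  # assumed room grid and start cell

    def neighbors(c):
        x, y = c
        return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < W and 0 <= y + dy < H]

    def traverse(start, depth_first):
        frontier, seen, order = deque([start]), set(), []
        while frontier:
            c = frontier.pop() if depth_first else frontier.popleft()
            if c in seen:
                continue
            seen.add(c)
            order.append(c)
            frontier.extend(n for n in neighbors(c) if n not in seen)
        return order

    print("BFS:", traverse(START, depth_first=False))
    print("DFS:", traverse(START, depth_first=True))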

We assessed the prevalence of moderately severe or severe mitral regurgitation (MR) justifying edge-to-edge mitral valve (MV) repair (MitraClip(®)) in patients attending the University Hospital Wuerzburg, a tertiary care centre located in Wuerzburg, Germany. Transcatheter edge-to-edge MV repair of advanced MR is a non-surgical treatment option in inoperable and high-risk patients. It is unknown how many patients are potentially eligible for MitraClip(®), since several anatomical prerequisites of the MV apparatus have to be met for optimal treatment results. Using a novel clinical data warehouse, we searched for all patients aged ≥18 years seen at our Department of Internal Medicine from 01/2008 to 01/2012 with moderately severe or severe MR. The current status of their treatment regime and their eligibility for MitraClip(®) was assessed and re-evaluated according to current guidelines and echocardiographic criteria. The search of electronic medical records amongst 43,690 patients employed an innovative, validated text extraction method and identified 331 patients with moderately severe or severe MR who had undergone echocardiographic assessment at our institution. Of these, 125 (38 %) received MV surgery and 206 (62 %) medical therapy only. Most patients not undergoing surgery had secondary MR (70 %). After evaluation of the medical and echocardiographic data of the medically treated patients (n = 206), 81 (39 %) were potential candidates for MitraClip(®) therapy, and 90 (44 %) died during the median follow-up time of 23 months. In summary, we detected a large fraction of patients with moderately severe or severe MR who did not undergo surgery. These medically treated patients had a poor prognosis, and about 40 % of them were potential candidates for MitraClip(®) therapy.

In this study an expectation-driven approach is proposed to extract data stored as pixel structures in medical ultrasound images. Prior knowledge about certain properties like the position of the text and its background and foreground grayscale values is utilized. Several open source Java libraries are used to pre-process the image and extract the textual information. The results are presented in an Excel table together with the outcome of several consistency checks. After manually correcting potential errors, the outcome is automatically stored in the main database. The proposed system yielded excellent results, reaching an accuracy of 99.94% and reducing the necessary human effort to a minimum.
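
The expectation-driven pre-processing can be pictured as cropping the known text region and binarizing it with the expected foreground gray value before character extraction; the region coordinates, gray values, and file names below are hypothetical.

    # Sketch: crop the expected text field and binarize with known gray values.
    import numpy as np
    from PIL import Image

    REGION = (10, 10, 300, 40)       # (left, top, right, bottom), assumed position
    FOREGROUND, TOLERANCE = 255, 10  # expected text gray value and tolerance

    img = np.array(Image.open("ultrasound.png").convert("L"))
    left, top, right, bottom = REGION
    crop = img[top:bottom, left:right]
    # Keep only pixels close to the expected foreground value.
    binary = (np.abs(crop.astype(int) - FOREGROUND) <= TOLERANCE).astype(np.uint8) * 255
    Image.fromarray(binary).save("text_field.png")  # input for the extraction step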

Today’s system developers and operators face the challenge of creating software systems that make efficient use of dynamically allocated resources under highly variable and dynamic load profiles, while at the same time delivering reliable performance. Benchmarking of systems under these constraints is difficult, as state-of-the-art benchmarking frameworks provide only limited support for emulating such dynamic and highly variable load profiles for the creation of realistic workload scenarios. Industrial benchmarks typically confine themselves to workloads with constant or stepwise increasing loads. Alternatively, they support replaying of recorded load traces. Statistical load intensity descriptions also do not sufficiently capture concrete pattern load profile variations over time. To address these issues, we present the Descartes Load Intensity Model (DLIM). DLIM provides a modeling formalism for describing load intensity variations over time. A DLIM instance can be used as a compact representation of a recorded load intensity trace, providing a powerful tool for benchmarking and performance analysis. As manually obtaining DLIM instances can be time consuming, we present three different automated extraction methods, which also help to enable autonomous system analysis for self-adaptive systems. Model expressiveness is validated using the presented extraction methods. Extracted DLIM instances exhibit a median modeling error of 12.4% on average over nine different real-world traces covering between two weeks and seven months. Additionally, extraction methods perform orders of magnitude faster than existing time series decomposition approaches.
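
To give a flavor of such a model, the sketch below composes a seasonal pattern, a long-term trend, and a burst into a time-dependent arrival rate; this illustrates the general idea of describing load intensity as a function of time and is not the actual DLIM formalism.

    # Sketch: load intensity as a composition of seasonal, trend, and burst parts.
    import math

    def seasonal(t, period=24.0, base=100.0, amplitude=60.0):
        """Daily oscillation of the request rate (all constants assumed)."""
        return base + amplitude * math.sin(2 * math.pi * t / period)

    def trend(t, slope=0.5):
        return slope * t  # slow long-term growth

    def burst(t, start=50.0, duration=4.0, height=200.0):
        return height if start <= t < start + duration else 0.0

    def load_intensity(t):
        return max(0.0, seasonal(t) + trend(t) + burst(t))

    # Arrival rates (requests per time unit) for one week of hourly steps,
    # e.g. as input for a load generator:
    rates = [load_intensity(t) for t in range(168)]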

Scholarly success is traditionally measured in terms of citations to publications. With the advent of publication management and digital libraries on the web, scholarly usage data has become a target of investigation, and new impact metrics computed on such usage data have been proposed, so-called altmetrics. In scholarly social bookmarking systems, scientists collect and manage publication metadata and thus reveal their interest in these publications. In this work, we investigate connections between usage metrics and citations, and find posts, exports, and page views of publications to be correlated to citations.

The issue of sustainability is at the top of the political and societal agenda, being considered of extreme importance and urgency. Human individual action impacts the environment both locally (e.g., local air/water quality, noise disturbance) and globally (e.g., climate change, resource use). Urban environments represent a crucial example, with an increasing realization that the most effective way of producing a change is involving the citizens themselves in monitoring campaigns (a citizen science bottom-up approach). This is possible by developing novel technologies and IT infrastructures enabling large citizen participation. Here, in the wider framework of one of the first such projects, we show results from an international competition where citizens were involved in mobile air pollution monitoring using low cost sensing devices, combined with a web-based game to monitor perceived levels of pollution. Measures of shift in perceptions over the course of the campaign are provided, together with insights into participatory patterns emerging from this study. Interesting effects related to inertia and to direct involvement in measurement activities rather than indirect information exposure are also highlighted, indicating that direct involvement can enhance learning and environmental awareness. In the future, this could result in better adoption of policies towards decreasing pollution.

Understanding the way people move through urban areas represents an important problem that has implications for a range of societal challenges such as city planning, public transportation, or crime analysis. In this paper, we present an interactive visualization tool called VizTrails for exploring and understanding such human movement. It features visualizations that show aggregated statistics of trails for geographic areas that correspond to grid cells on a map, e.g., the number of users passing through a cell or the cells commonly visited next. Amongst other features, the system allows overlaying the map with the results of SPARQL queries in order to relate the observed trajectory statistics to their geo-spatial context, e.g., considering a city's points of interest. The system's functionality is demonstrated using trajectory examples extracted from the social photo sharing platform Flickr. Overall, VizTrails facilitates deeper insights into geo-spatial trajectory data by enabling interactive exploration of aggregated statistics and providing geo-spatial context.
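
The underlying aggregation can be sketched as binning trajectory points into grid cells and counting pass-through users and next-cell transitions; the grid resolution and the data layout are assumptions here.

    # Sketch: per-cell pass-through counts and next-cell transition counts.
    from collections import Counter, defaultdict

    CELL = 0.01  # grid resolution in degrees (assumed)

    def cell(lat, lon):
        return (int(lat / CELL), int(lon / CELL))

    # trails: user -> ordered list of (lat, lon) points, e.g. from Flickr photos.
    trails = {"user1": [(49.79, 9.93), (49.80, 9.94), (49.80, 9.95)]}

    visits = Counter()                  # users passing through each cell
    transitions = defaultdict(Counter)  # cell -> cells commonly visited next
    for user, points in trails.items():
        cells = [cell(lat, lon) for lat, lon in points]
        for c in set(cells):
            visits[c] += 1
        for a, b in zip(cells, cells[1:]):
            if a != b:
                transitions[a][b] += 1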

Online newspapers have established themselves as a crucial information source, at least partially replacing traditional media like television or print media. Like all other media, online newspapers are potentially affected by media bias, i.e., non-neutral reporting by journalists and other news producers, e.g. with respect to specific opinions or political parties. The analysis of media bias has a long tradition in political science. However, traditional techniques rely heavily on manual annotation and are thus often limited to the analysis of small sets of articles. In this paper, we investigate a dataset that covers all political and economic news from four leading German online newspapers over a timespan of four years. In order to analyze this large document set and compare the political orientation of different newspapers, we propose a variety of automatically computable measures that can indicate media bias. As a result, statistically significant differences in the reporting about specific parties can be detected between the analyzed online newspapers.
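
As one example of an automatically computable indicator in this spirit, illustrative only and not necessarily one of the paper's measures, the share of coverage each party receives can be compared across outlets.

    # Illustration: per-outlet visibility share of political parties.
    from collections import Counter

    PARTIES = ["CDU", "SPD", "Grüne", "FDP", "Linke"]

    def visibility(articles):
        """articles: list of article texts from one outlet."""
        hits = Counter(p for a in articles for p in PARTIES if p in a)
        total = sum(hits.values()) or 1
        return {p: hits[p] / total for p in PARTIES}

    # Comparing visibility(outlet_a) with visibility(outlet_b) exposes
    # systematic differences in how much coverage each party receives.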

A distance measure between objects is a key requirement for many data mining tasks like clustering, classification, or outlier detection. However, for objects characterized by categorical attributes, defining meaningful distance measures is a challenging task, since the values within such attributes have no inherent order, especially without additional domain knowledge. In this paper, we propose an unsupervised distance measure for objects with categorical attributes based on the idea that categorical attribute values are similar if they appear with similar value distributions on correlated context attributes. Thus, the distance measure is automatically derived from the given data set. We compare our new distance measure to existing categorical distance measures and evaluate it on different data sets from the UCI Machine Learning Repository. The experiments show that our distance measure is recommendable, since it achieves similar or better results in a more robust way than previous approaches.
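
A simplified sketch of the core idea: the distance between two values of an attribute is the average divergence of the value distributions they induce on context attributes (total variation distance over all other attributes here; the paper's measure additionally restricts itself to correlated contexts).

    # Sketch: context-based distance between two categorical attribute values.
    from collections import Counter

    def conditional_dist(rows, attr, value, context):
        """Estimate P(context | attr == value) from the data set."""
        matching = [r[context] for r in rows if r[attr] == value]
        if not matching:
            return {}
        return {k: c / len(matching) for k, c in Counter(matching).items()}

    def value_distance(rows, attr, v1, v2, contexts):
        d = 0.0
        for ctx in contexts:
            p = conditional_dist(rows, attr, v1, ctx)
            q = conditional_dist(rows, attr, v2, ctx)
            keys = set(p) | set(q)
            d += sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys) / 2
        return d / len(contexts)

    rows = [{"color": "red", "shape": "round"},
            {"color": "crimson", "shape": "round"},
            {"color": "blue", "shape": "square"}]
    # "red" and "crimson" co-occur with the same shapes, so they end up close:
    print(value_distance(rows, "color", "red", "crimson", contexts=["shape"]))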

To optimize the workflow on commercial crowdsourcing platforms like Amazon Mechanical Turk or Microworkers, it is important to understand how users choose their tasks. Current work usually explores the underlying processes by employing user studies based on surveys with a limited set of participants. In contrast, we formulate hypotheses based on the different findings in these studies and, instead of verifying them based on user feedback, we compare them directly on data from a commercial crowdsourcing platform. For evaluation, we use a Bayesian approach called HypTrails, which allows us to give a relative ranking of the corresponding hypotheses. The hypotheses considered are, for example, based on task categories, monetary incentives, or the semantic similarity of task descriptions. We find that, in our scenario, hypotheses based on employers as well as on task descriptions work best. Overall, we objectively compare different factors influencing users when choosing their tasks. Our approach enables crowdsourcing companies to better understand their users in order to optimize their platforms, e.g., by incorporating the gained knowledge about these factors into task recommendation systems.
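
In a nutshell, the comparison expresses each hypothesis as Dirichlet pseudo-counts over Markov-chain transitions and ranks hypotheses by the marginal likelihood of the observed transitions; the sketch below is a simplified stand-in for the actual HypTrails implementation, with made-up counts.

    # Sketch: rank transition hypotheses by Bayesian marginal likelihood.
    import numpy as np
    from scipy.special import gammaln

    def log_evidence(counts, alpha):
        """Log marginal likelihood of transition counts under row-wise
        Dirichlet priors (counts, alpha: (n_states, n_states) arrays)."""
        return np.sum(gammaln(alpha.sum(axis=1)) - gammaln((alpha + counts).sum(axis=1))
                      + np.sum(gammaln(alpha + counts) - gammaln(alpha), axis=1))

    # Observed task-to-task transitions (hypothetical counts).
    counts = np.array([[5.0, 1.0], [2.0, 8.0]])
    uniform  = np.ones((2, 2))                     # "tasks chosen at random"
    same_cat = np.array([[5.0, 1.0], [1.0, 5.0]])  # "users stay within a category"
    for name, alpha in [("uniform", uniform), ("same category", same_cat)]:
        print(name, log_evidence(counts, alpha))   # higher is more plausible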

The temporal evolution of the entanglement between two qubits evolving under random interactions is studied analytically and numerically. Two different types of randomness are investigated. Firstly, we analyze an ensemble of systems with randomly chosen but time-independent interaction Hamiltonians. Secondly, we consider the case of a temporally fluctuating Hamiltonian, where the unitary evolution can be understood as a random walk on the SU(4) group manifold. As a by-product, we compute the metric tensor and its inverse as well as the Laplace-Beltrami operator for SU(4).
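
The first scenario lends itself to a compact numerical check: draw a random time-independent two-qubit Hamiltonian (a GUE-style ensemble is assumed here), evolve the product state |00>, and track the pure-state concurrence C(t) = |<psi(t)| sigma_y (x) sigma_y |psi(t)*>|.

    # Sketch: entanglement growth under a random two-qubit Hamiltonian.
    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    H = (A + A.conj().T) / 2                      # random Hermitian Hamiltonian

    sy = np.array([[0, -1j], [1j, 0]])
    syy = np.kron(sy, sy)
    psi0 = np.array([1, 0, 0, 0], dtype=complex)  # separable initial state |00>

    for t in np.linspace(0.0, 2.0, 5):
        psi = expm(-1j * H * t) @ psi0
        concurrence = abs(psi.conj() @ syy @ psi.conj())  # 0 = separable, 1 = maximal
        print(f"t = {t:.2f}  C = {concurrence:.3f}")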

This paper presents exploratory subgroup analytics on ubiquitous data: we propose subgroup discovery and assessment approaches for obtaining interesting descriptive patterns, and provide a novel graph-based analysis approach for assessing the relations between the obtained subgroups. This exploratory visualization approach allows for the comparison of subgroups according to their relations to other subgroups and for the inclusion of further parameters, e.g., geo-spatial distribution indicators. We present and discuss analysis results utilizing real-world data given by geo-tagged noise measurements with associated subjective perceptions and a set of tags describing the semantic context.
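
Subgroup discovery in this setting typically scores candidate descriptions with a quality function; weighted relative accuracy is one common choice (our illustration, the paper does not fix a specific measure), trading off a subgroup's coverage against its lift in target share.

    # Sketch: weighted relative accuracy (WRAcc) of a candidate subgroup.
    def wracc(n_total, n_positive, n_subgroup, n_subgroup_positive):
        """coverage * (target share in subgroup - overall target share)."""
        coverage = n_subgroup / n_total
        return coverage * (n_subgroup_positive / n_subgroup - n_positive / n_total)

    # Example: 1000 noise measurements, 200 perceived as "loud"; a candidate
    # subgroup (description "evening AND near_road", hypothetical) covers 100
    # measurements, of which 60 are "loud".
    print(wracc(1000, 200, 100, 60))  # 0.1 * (0.6 - 0.2) = 0.04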

Social tagging systems have established themselves as an important part of today's web and have attracted the interest of our research community in a variety of investigations. The overall vision of our community is that, simply through interactions with the system, i.e., through tagging and sharing of resources with an uncontrolled vocabulary, users contribute to building useful semantic structures as well as resource indexes, not least because of the easy-to-use mechanics. Consequently, a variety of assumptions about social tagging systems have emerged, yet testing them has been difficult due to the absence of suitable data. In this work, we thoroughly investigate three such assumptions (e.g., is a tagging system really social?) by examining live log data gathered from the real-world, public social tagging system BibSonomy. Our empirical results indicate that while some of these assumptions hold to a certain extent, others need to be viewed in a very critical light. Our observations have implications for the design of future search and other algorithms, so that they better reflect actual user behavior.

The combination of ubiquitous and social computing is an emerging research area which integrates different but complementary methods, techniques and tools. In this paper, we focus on the Ubicon platform, its applications, and a large spectrum of analysis results. Ubicon provides an extensible framework for building and hosting applications targeting both ubiquitous and social environments. We summarize the architecture and exemplify its implementation using four real-world applications built on top of Ubicon. In addition, we discuss several scientific experiments in the context of these applications in order to give a better picture of the potential of the framework, and discuss analysis results using several real-world data sets collected utilizing Ubicon.

Social tagging systems have established themselves as an important part of today’s web and have attracted the interest of our research community in a variety of investigations. Over time, several assumptions about social tagging systems have emerged on which our community also builds its work. Yet, testing such assumptions has been difficult due to the absence of suitable usage data in the past. In this work, we investigate and evaluate four assumptions about tagging systems by examining live server log data gathered from the public social tagging system BibSonomy. Our empirical results indicate that while some of these assumptions hold to a certain extent, others need to be viewed in a very critical light.

Sensor data is objective, but when we measure our environment, the measured values are contrasted with our perception, which is always subjective. This makes interpreting sensor measurements difficult for a single person in her personal environment. In this context, the EveryAware project directly connects the concepts of objective sensor data and subjective impressions and perceptions by providing a collective sensing platform with several client applications that allow users to explicitly associate these two data types. The goal is to provide the user with personalized feedback and a characterization of the global as well as her personal environment, and to enable her to position her perceptions in this global context. In this poster, we summarize the collected data of two EveryAware applications, namely WideNoise for noise measurements and AirProbe for participatory air quality sensing. Basic insights are presented, including user activity, learning processes, and sensor-data-to-perception correlations. These results provide an outlook on how this data can further be used to understand the connection between sensor data and perceptions.