This paper presents a deep learning benchmark on a complex dataset known as KFUPM Handwritten Arabic TexT (KHATT). The KHATT dataset consists of complex patterns of handwritten Arabic text-lines. This paper contributes in three main aspects: (1) pre-processing, (2) a deep learning based approach, and (3) data augmentation. The pre-processing step includes pruning extra white space and de-skewing the skewed text-lines. We deploy a deep learning approach based on Multi-Dimensional Long Short-Term Memory (MDLSTM) networks and Connectionist Temporal Classification (CTC). The MDLSTM has the advantage of scanning the Arabic text-lines in all directions (horizontal and vertical) to cover dots, diacritics, strokes and fine inflections. Combining data augmentation with the deep learning approach improves the results, achieving 80.02% Character Recognition (CR) compared with the 75.08% baseline.

In this paper, we introduce an end-to-end Amharic text-line image recognition approach based on recurrent neural networks. Amharic is an indigenous Ethiopic script which follows a unique syllabic writing system adopted from an ancient Geez script. This script uses 34 consonant characters with the seven vowel variants of each (called basic characters) and other labialized characters derived by adding diacritical marks and/or removing parts of the basic characters. These associated diacritics on basic characters are relatively smaller in size, visually similar, and challenging to distinguish from the derived characters. Motivated by the recent success of end-to-end learning in pattern recognition, we propose a model which integrates a feature extractor, sequence learner, and transcriber in a unified module and is then trained in an end-to-end fashion. The experimental results, on a printed and synthetic benchmark Amharic Optical Character Recognition (OCR) database called ADOCR, demonstrate that the proposed model outperforms state-of-the-art methods by 6.98% and 1.05%, respectively.

In this paper, the authors propose to increase the efficiency of blockchain mining by using a population-based approach. Blockchain relies on solving difficult mathematical problems as proof-of-work within a network before blocks are added to the chain. The brute-force approach, advocated by some as the fastest algorithm for solving partial hash collisions and implemented in the Bitcoin blockchain, implies an exhaustive, sequential search. It involves incrementing the nonce (number) of the header by one, then taking a double SHA-256 hash at each instance and comparing it with a target value to check whether it is lower than that target. It consumes excessive time and power. In this paper, the authors, therefore, suggest using an inner for-loop for the population-based approach. Comparison shows that it is a slightly faster approach than brute force, with an average speed advantage of about 1.67% or 3,420 iterations per second, performing better 73% of the time. Also, we observed that the more particles are deployed, the better the performance, up to a pivotal point. Furthermore, a recommendation on taming the excessive use of power by networks, like Bitcoin’s, by using penalty by consensus is suggested.
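
The brute-force baseline described in this abstract can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `header_prefix` is a hypothetical stand-in for a real block header, the nonce is appended as 4 little-endian bytes, and the digest is compared to the target as a big-endian integer, which is not the exact Bitcoin encoding. The population-based variant proposed by the authors is not reproduced here.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """Apply SHA-256 twice, as in the proof-of-work check described above."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine_brute_force(header_prefix: bytes, target: int, max_nonce: int = 2**32):
    """Try nonces 0, 1, 2, ... until the double hash falls below the target.

    Returns (nonce, digest) on success, or None if no nonce below
    max_nonce satisfies the target.
    """
    for nonce in range(max_nonce):
        # Simplified header: fixed prefix plus a 4-byte nonce (assumption).
        candidate = header_prefix + nonce.to_bytes(4, "little")
        digest = double_sha256(candidate)
        if int.from_bytes(digest, "big") < target:
            return nonce, digest
    return None


if __name__ == "__main__":
    # An easy target: the first byte of the digest must be zero.
    easy_target = 1 << 248
    found = mine_brute_force(b"demo-header", easy_target)
    print(found)
```

Lowering `target` makes the search exponentially longer, which is the time and power cost the abstract refers to.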

Automatic analysis of scanned historical documents comprises a wide range of image analysis tasks, which are often challenging for machine learning due to a lack of human-annotated learning samples. With the advent of deep neural networks, a promising way to cope with the lack of training data is to pre-train models on images from a different domain and then fine-tune them on historical documents. In the current research, a typical example of such cross-domain transfer learning is the use of neural networks that have been pre-trained on the ImageNet database for object recognition. It remains a mostly open question whether or not this pre-training helps to analyse historical documents, which have fundamentally different image properties when compared with ImageNet. In this paper, we present a comprehensive empirical survey on the effect of ImageNet pre-training for diverse historical document analysis tasks, including character recognition, style classification, manuscript dating, semantic segmentation, and content-based retrieval. While we obtain mixed results for semantic segmentation at pixel-level, we observe a clear trend across different network architectures that ImageNet pre-training has a positive effect on classification as well as content-based retrieval.

This paper introduces a dataset for an exotic, but very interesting script, Amharic. Amharic follows a unique syllabic writing system which uses 33 consonant characters, each with 7 vowel variants. Some labialized characters are derived by adding diacritical marks to consonants and/or removing parts of them. These associated diacritics on consonant characters are relatively smaller in size, which makes the derived (vowel and labialized) characters challenging to distinguish. In this paper we tackle the problem of Amharic text-line image recognition. We propose a recurrent neural network based method to recognize Amharic text-line images. The proposed method uses Long Short Term Memory (LSTM) networks together with CTC (Connectionist Temporal Classification). Furthermore, in order to overcome the lack of annotated data, we introduce a new dataset that contains 337,332 Amharic text-line images, which is made freely available at http://www.dfki.uni-kl.de/~belay/. The performance of the proposed Amharic OCR model is tested on both printed and synthetically generated datasets, and promising results are obtained.
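
To make the role of CTC more concrete, the following minimal sketch shows greedy (best-path) CTC decoding: repeated per-frame labels are collapsed and blank labels are removed. The blank index of 0 and the example label sequence are assumptions for illustration; the abstract does not state which decoding strategy the authors use.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeats, then drop blanks (best-path decoding)."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


if __name__ == "__main__":
    # Hypothetical per-frame argmax labels from an LSTM over a text-line image.
    frames = [0, 3, 3, 0, 3, 5, 5, 0]
    print(ctc_greedy_decode(frames))
```

The blank between the two runs of label 3 is what allows the same character to be emitted twice in a row.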

We propose a novel approach towards adversarial attacks on neural networks (NN), focusing on tampering with the data used for training instead of generating attacks on trained models. Our network-agnostic method creates a backdoor during training which can be exploited at test time to force a neural network to exhibit abnormal behaviour. We demonstrate on two widely used datasets (CIFAR-10 and SVHN) that a universal modification of just one pixel per image for all the images of a class in the training set is enough to corrupt the training procedure of several state-of-the-art deep neural networks, causing the networks to misclassify any images to which the modification is applied. Our aim is to bring to the attention of the machine learning community the possibility that even learning-based methods that are personally trained on public datasets can be subject to attacks by a skillful adversary.

In this paper we present an approach for the PAN 2019 Author Profiling challenge. The task here is to detect Twitter bots and also to classify the gender of human Twitter users as male or female, based on a hundred select tweets from their profile. Focusing on feature engineering, we explore the semantic categories present in tweets. We combine these semantic features with part of speech tags and other stylistic features – e.g. character floodings and the use of capital letters – for our eventual feature set. We have experimented with different machine learning techniques, including ensemble techniques, and found AdaBoost to be the most successful (attaining an F1-score of 0.99 on the development set). Using this technique, we achieved an accuracy score of 89.17% for English language tweets in the bot detection subtask.

Offline signature verification is a challenging pattern recognition task where a writer model is inferred using only a small number of genuine signatures. A combination of complementary writer models can make it more difficult for an attacker to deceive the verification system. In this work, we propose to combine a recent structural approach based on graph edit distance with a statistical approach based on deep triplet networks. The combination of the structural and statistical models achieves significant improvements in performance on four publicly available benchmark datasets, highlighting their complementary perspectives.

This essay discusses current research efforts in conversational systems from the philosophy of science point of view and evaluates some conversational systems research activities from the standpoint of naturalism philosophical theory. Conversational systems or chatbots have advanced over the decades and now have become mainstream applications. They are software that users can communicate with, using natural language. Particular attention is given to the Alime Chat conversational system, already in industrial use, and the related research. The competitive nature of systems in production is a result of different researchers and developers trying to produce new conversational systems that can outperform previous or state-of-the-art systems. Different factors affect the quality of the conversational systems produced, and how one system is assessed as being better than another is a function of objectivity and of the relevant experimental results. This essay examines the research practices from, among others, Longino’s view on objectivity and Popper’s stand on falsification. Furthermore, the need for qualitative and large datasets is emphasized. This is in addition to the importance of the peer-review process in scientific publishing, as a means of developing, validating, or rejecting theories, claims, or methodologies in the research community. In conclusion, open data and open scientific discussion fora should become more prominent over the mere publication-focused trend.

A pivotal question in Automatic Speech Recognition (ASR) is the robustness of the trained models. In this study, we investigate the combination of two methods commonly applied to increase the robustness of ASR systems. On the one hand, inspired by auditory experiments and signal processing considerations, multi-band processing has been used for decades to improve the noise robustness of speech recognition. On the other hand, dropout is a commonly used regularization technique to prevent overfitting by keeping the model from becoming over-reliant on a small set of neurons. We hypothesize that the careful combination of the two approaches would lead to increased robustness, by preventing the resulting model from over-relying on any given band. To verify our hypothesis, we investigate various approaches for the combination of the two methods using the Aurora-4 corpus. The results obtained corroborate our initial assumption, and show that the proper combination of the two techniques leads to increased robustness, and to significantly lower word error rates (WERs). Furthermore, we find that the accuracy scores attained here compare favourably to those reported recently on the clean training scenario of the Aurora-4 corpus.

In this paper we propose a novel CNN based approach for Amharic character image recognition. The proposed method is designed by leveraging the structure of Amharic graphemes. Amharic characters can be decomposed into a consonant and a vowel. As a result of this consonant-vowel combination structure, Amharic characters lie within a matrix structure called 'Fidel Gebeta'. The rows and columns of 'Fidel Gebeta' correspond to a character's consonant and vowel components, respectively. The proposed method has a CNN architecture with two classifiers that detect the row/consonant and column/vowel components of a character. The two classifiers share a common feature space before they fork out at their last layers. The method achieves state-of-the-art results on a synthetically generated dataset, with an overall character recognition accuracy of 94.97%.
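
The row/column factorization can be illustrated with a small sketch that maps a flat character class to its (consonant row, vowel column) pair and scores a prediction as correct only when both components match. The column count of 7 and the row-major class numbering are assumptions made for this example; the abstract does not specify the label layout or how the two classifier outputs are combined.

```python
N_VOWELS = 7  # assumed number of columns (vowel orders) in the matrix


def to_row_col(class_index, n_cols=N_VOWELS):
    """Split a flat class index into (consonant row, vowel column)."""
    return divmod(class_index, n_cols)


def from_row_col(row, col, n_cols=N_VOWELS):
    """Recombine a (row, column) pair into a flat class index."""
    return row * n_cols + col


def character_accuracy(pred_rows, pred_cols, true_labels, n_cols=N_VOWELS):
    """A character counts as correct only if row and column are both right."""
    correct = 0
    for row, col, label in zip(pred_rows, pred_cols, true_labels):
        if from_row_col(row, col, n_cols) == label:
            correct += 1
    return correct / len(true_labels)


if __name__ == "__main__":
    print(to_row_col(9))
    print(character_accuracy([0, 1], [0, 2], [0, 10]))
```

With this layout, each classifier head only has to separate as many classes as there are rows or columns, rather than one class per character.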

This work tackles a particular image-to-image translation problem, where the goal is to transform an image from a source domain (modern printed electronic document) to a target domain (historical handwritten document). The main motivation of this task is to generate massive synthetic datasets of "historic" documents which can be used for the training of document analysis systems. By completing this task, it becomes possible to consider the generation of a tremendous amount of synthetic training data using only one single deep learning algorithm. Existing approaches for synthetic document generation rely on heuristics, or 2D and 3D geometric transformation-functions and are typically targeted at degrading the document. We tackle the problem of document synthesis and propose to train a particular form of Generative Adversarial Neural Networks, to learn a mapping function from an input image to an output image. With several experiments, we show that our algorithm generates an artificial historical document image that looks like a real historical document - for expert and non-expert eyes - by transferring the "historical style" to the classical electronic document.

In this paper, we present a large historical database of Chinese family records with the aim of developing robust systems for historical document analysis. In this direction, we propose a Historical Document Reading Challenge on Large Chinese Structured Family Records (ICDAR 2019 HDRCCHINESE). The objective of the competition is to recognize and analyze the layout, and finally detect and recognize the textlines and characters of the large historical document image dataset containing more than 10000 pages. Cascade R-CNN, CRNN, and U-Net based architectures were trained to evaluate the performances in these tasks. An error rate of 0.01 has been recorded for textline recognition (Task1), whereas a Jaccard Index of 99.54% has been recorded for layout analysis (Task2). A graph edit distance based total error ratio of 1.5% has been recorded for complete integrated textline detection and recognition (Task3).

14. Labeling, Cutting, Grouping

Alberti, Michele; Vögtlin, Lars; Pondenkandath, Vinaychandran; et al.

Document Image and Voice Analysis Group (DIVA), University of Fribourg, Switzerland.

This paper introduces a new approach to text-line extraction that integrates deep-learning based pre-classification and state-of-the-art segmentation methods. Text-line extraction in complex handwritten documents poses a significant challenge, even to the most modern computer vision algorithms. Historical manuscripts are a particularly hard class of documents as they present several forms of noise, such as degradation, bleed-through, interlinear glosses, and elaborated scripts. In this work, we propose a novel method which uses semantic segmentation at pixel level as an intermediate task, followed by a text-line extraction step. We measured the performance of our method on a recent dataset of challenging medieval manuscripts and surpassed state-of-the-art results by reducing the error by 80.7%. Furthermore, we demonstrate the effectiveness of our approach on various other datasets written in different scripts. Hence, our contribution is two-fold. First, we demonstrate that semantic pixel segmentation can be used as a strong denoising pre-processing step before performing text-line extraction. Second, we introduce a novel, simple and robust algorithm that leverages the high-quality semantic segmentation to achieve a text-line extraction performance of 99.42% line IU on a challenging dataset.

In this paper, we introduce the use of Semantic Hashing as embedding for the task of Intent Classification and achieve state-of-the-art performance on three frequently used benchmarks. Intent Classification on a small dataset is a challenging task for data-hungry state-of-the-art Deep Learning based systems. Semantic Hashing is an attempt to overcome such a challenge and learn robust text classification. Current word embedding based methods [11], [13], [14] are dependent on vocabularies. One of the major drawbacks of such methods is out-of-vocabulary terms, especially when having small training datasets and using a wider vocabulary. This is the case in Intent Classification for chatbots, where typically small datasets are extracted from internet communication. Two problems arise with the use of internet communication. First, such datasets miss a lot of terms in the vocabulary to use word embeddings efficiently. Second, users frequently make spelling errors. Typically, the models for intent classification are not trained with spelling errors and it is difficult to anticipate the ways in which users will make mistakes. Models depending on a word vocabulary will always face such issues. An ideal classifier should handle spelling errors inherently. With Semantic Hashing, we overcome these challenges and achieve state-of-the-art results on three datasets: Chatbot, Ask Ubuntu, and Web Applications [3]. Our benchmarks are available online.
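
The following sketch illustrates why sub-word features tolerate spelling errors better than a word vocabulary. It assumes that words are split into character trigrams with `#` as a boundary marker, a common sub-word hashing scheme; the abstract itself does not describe how the hashes are built, so the function names and the trigram choice are illustrative only.

```python
def char_trigrams(word):
    """Return the character trigrams of a word padded with '#' markers."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def sentence_features(sentence):
    """Collect the trigram set of every whitespace-separated token."""
    features = set()
    for word in sentence.lower().split():
        features.update(char_trigrams(word))
    return features


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    correct = sentence_features("weather")
    misspelled = sentence_features("wether")
    # A word-level vocabulary would treat these as unrelated tokens,
    # whereas the trigram sets still overlap.
    print(sorted(correct & misspelled))
    print(jaccard(correct, misspelled))
```

A classifier trained on such trigram features sees a misspelled query as a near neighbour of the correctly spelled one instead of as an out-of-vocabulary term.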

Recently, the development of depth sensing technologies such as Leap motion and Microsoft Kinect sensors has facilitated a touch-less environment for interacting with computers and mobile devices. Several studies have been carried out on air-written text recognition with the help of these devices. However, there are several countries (like India) where multiple scripts are used to write official languages. Therefore, for the development of an effective text recognition system, the script of the text has to be identified first. The task becomes more challenging when it comes to 3D handwriting, since 3D text written in air consists of a single stroke only. This paper presents a 3D script identification and recognition system for text written in three languages, namely, Hindi, English and Punjabi, using a Leap motion sensor. In the first stage, the script was identified as one of the three languages. Next, a Hidden Markov Model (HMM) was used to recognize the words. An accuracy of 96.4% was recorded in script identification, whereas accuracies of 72.99%, 73.25% and 60.5% were recorded in word recognition of Hindi, English and Punjabi scripts, respectively.

We introduce DeepDIVA: an infrastructure designed to enable quick and intuitive setup of reproducible experiments with a large range of useful analysis functionality. Reproducing scientific results can be a frustrating experience, not only in document image analysis but in machine learning in general. Using DeepDIVA, a researcher can either reproduce a given experiment or share their own experiments with others. Moreover, the framework offers a large range of functions, such as boilerplate code, keeping track of experiments, hyper-parameter optimization, and visualization of data and results. To demonstrate the effectiveness of this framework, this paper presents case studies in the area of handwritten document analysis where researchers benefit from the integrated functionality. DeepDIVA is implemented in Python and uses the deep learning framework PyTorch. It is completely open source and accessible as a Web Service through DIVAServices.

Cross-depiction is the problem of identifying the same object even when it is depicted in a variety of manners. This is a common problem in handwritten historical document image analysis, for instance when the same letter or motif is depicted in several different ways. It is a simple task for humans, yet conventional computer vision methods struggle to cope with it. In this paper we address this problem using state-of-the-art deep learning techniques on a dataset of historical watermarks containing images created with different methods of reproduction, such as hand tracing, rubbing, and radiography. To study the robustness of deep learning based approaches to the cross-depiction problem, we measure their performance on two different tasks: classification and similarity ranking. For the former we achieve a classification accuracy of 96% using deep convolutional neural networks. For the latter we have a false positive rate at 95% recall of 0.11. These results outperform state-of-the-art methods by a significant margin.

This paper introduces a very challenging dataset of historic German documents and evaluates Fully Convolutional Neural Network (FCNN) based methods to locate handwritten annotations of any kind in these documents. The handwritten annotations can appear in the form of underlines and text and are made with various writing instruments; the use of pencils, for example, makes the data more challenging. We train and evaluate various end-to-end semantic segmentation approaches and report the results. The task is to classify the pixels of documents into two classes: background and handwritten annotation. The best model achieves a mean Intersection over Union (IoU) score of 95.6% on the test documents of the presented dataset. We also present a comparison of different strategies used for data augmentation and training on our presented dataset. For evaluation, we use the Layout Analysis Evaluator for the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts.
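
For readers unfamiliar with the metric, the sketch below computes a mean Intersection over Union for a two-class pixel labelling (0 = background, 1 = handwritten annotation). It is a plain textbook formulation on nested lists, with the class encoding chosen for illustration; the evaluator tool named in the abstract may differ in details such as how empty classes or border pixels are handled.

```python
def class_iou(pred, truth, cls):
    """IoU of one class: |pred AND truth| / |pred OR truth| over all pixels."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == cls, t == cls
            intersection += in_pred and in_truth
            union += in_pred or in_truth
    # A class absent from both masks is treated as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pred, truth, classes=(0, 1)):
    """Average the per-class IoU over the given classes."""
    return sum(class_iou(pred, truth, c) for c in classes) / len(classes)


if __name__ == "__main__":
    predicted = [[0, 1],
                 [1, 1]]
    ground_truth = [[0, 1],
                    [0, 1]]
    print(mean_iou(predicted, ground_truth))
```

Averaging over both classes keeps the score from being dominated by the background, which typically covers most pixels of a document page.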

In this article we present the concept of DIVADesk, a Virtual Research Environment (VRE) for scholarly work on historical documents inspired by the shift toward working with digital facsimiles. The contribution of this article is three-fold. First, a review of existing tools and projects shows that a holistic workspace integrating the latest outcomes of computational Document Image Analysis (DIA) research is still a desideratum that can only be achieved by intensive interdisciplinary collaboration. Second, the underlying modular architecture of the digital workspace is presented. It consists of a set of services that can be combined according to individual scholars’ requirements. Furthermore, interoperability with existing frameworks and services allows the research data to be shared with other VREs. The proposed DIVADesk addresses specific research with historical documents, as this is one of the hardest cases in computational DIA. The outcomes of this paradigmatic research can be transferred to other use cases in the humanities. The third contribution of this article is a description of already existing services and user interfaces to be integrated in DIVADesk. They are part of ongoing research at the DIVA research group at the University of Fribourg, Switzerland. The labeling tool DIVADIA, for example, provides methods for layout analysis, script analysis, and text recognition of historical documents. These methods build on the concept of incremental learning and provide users with semi-automatic labeling of document parts, such as text, images, and initials. The conception and realization of DIVADesk promises research outcomes both in computer science and in the humanities. Therefore, an interdisciplinary approach and intensive collaboration between scholars in the two research fields are of crucial importance.