While the HPC community is working towards the first Exaflop computer (expected around 2020), only a few HPC applications are able to fully exploit the capabilities of Petaflop systems, even though the Petaflop milestone was reached in 2008. In this paper we argue that efforts for preparing HPC applications for Exascale should start before such systems become available. We identify challenges that need to be addressed and recommend solutions in key areas of interest, including formal modeling, static analysis and optimization, runtime analysis and optimization, and autonomic computing. Furthermore, we outline a conceptual framework for porting HPC applications to future Exascale computing systems and propose steps for its implementation.

Scientific and technological innovations have become increasingly important as we face the benefits and challenges of both globalization and a knowledge-based economy. Still, enrolment rates in STEM degrees are low in many European countries, and consequently industry lacks an adequately educated workforce. We believe that this can be mainly attributed to pedagogical issues, such as the lack of engaging hands-on activities in science and math education in middle and high schools. In this paper, we report our work in the SciChallenge European project, which aims at increasing the interest of pre-university students in STEM disciplines through its distinguishing feature: the systematic use of social media for providing and evaluating student-generated content. A social media-aware contest and platform were developed and tested in a pan-European contest that attracted >700 participants. The statistical analysis revealed that the platform and contest positively influenced participants' STEM learning and motivation, while only the gender factor for the younger study group appeared to affect the outcomes (confidence level – p<.05).

In the process of a scientific experiment a workflow is executed multiple times using various values of the parameters of activities. For real-world workflows that may contain hundreds of activities, each having several parameters, it is practically infeasible to conduct a parameter sensitivity study by simply following a "brute-force approach" (that is, experimental evaluation of all possible cases). We believe that a heuristic-guided approach makes it possible to find a near-optimal solution using a reasonable amount of resources without evaluating all possibilities. In this paper we present a novel methodology for determining the parameter significance of scientific workflows that is based on Ant Colony Optimization (ACO). We refer to our methodology, which is a customization of ACO for Parameter Significance determination, as ACO4PS. We use ACO4PS to identify (1) which parameter strongly affects the overall result of the workflow and (2) for which combination of parameter values we obtain the expected result. ACO4PS generates a list of all workflow parameters sorted by significance and can also generate a subset of significant parameters. We empirically evaluate our methodology using a real-world scientific workflow that deals with Non-Invasive Glucose Measurement.

Workflow activities in the scientific domain usually depend on a collection of parameters. These parameters determine the output of the activity, and consequently the output of the whole workflow. In the scientific domain, workflows have an exploratory nature and are used to understand a scientific phenomenon or answer scientific questions. In the process of a scientific experiment a workflow is executed multiple times using various values of the parameters of activities. It is relevant to identify (1) which parameter strongly affects the overall result of the workflow and (2) for which combination of parameter values we obtain the expected result. To address these issues, in this paper we present our methodology to estimate the significance of all scientific workflow parameters as well as to identify the most significant parameter of the workflow. The estimation of parameter significance will enable the scientist to fine-tune and optimize the results efficiently. Furthermore, we empirically validate our methodology on the Non-Invasive Glucose Measurement (NIGM) workflow and discuss our results. The NIGM workflow uses a neural network model to calculate the glucose level in patient blood. The neural network model has a set of parameters that affect the result of the workflow significantly. Unfortunately, the impact significance of these parameters is commonly unknown to the user. We present our approach for estimating and quantifying the impact significance of neural network parameters.

The Grid is evolving, and new concepts such as the Semantic Grid and the Knowledge Grid are rapidly emerging, where humans and distributed machines share, exchange, and manage data and resources intelligently. Computational scientists typically use workflows to describe and manage scientific discovery processes. However, the credibility of the obtained results in the scientific community is questionable if the computational experiment is not reproducible. We address this issue in the research reported in this paper through the development of a workflow provenance system for Grid-enabled scientific workflows. Workflow provenance collects data on workflow activities, data flow and workflow clients. Provenance information can be used to trace and test workflows and the data produced. Our approach supports reproducibility (i.e. re-enactment of a workflow by an independent user) and dataflow visualization (i.e. visualization of statistical characteristics of input/output data). We illustrate our approach on the Non-Invasive Glucose Measurement (NIGM) application.

We present a machine learning based method for noise classification using a low-power and inexpensive IoT unit. We use Mel-frequency cepstral coefficients for audio feature extraction and supervised classification algorithms (that is, support vector machine and k-nearest neighbors) for noise classification. We evaluate our approach experimentally with a dataset of about 3000 sound samples grouped in eight sound classes (such as car horn, jackhammer, or street music). We explore the parameter space of the support vector machine and k-nearest neighbors algorithms to estimate the optimal parameter values for classification of sound samples in the dataset under study. We achieve a noise classification accuracy in the range of 85% to 100%. Training and testing of our k-nearest neighbors (k = 1) implementation on a Raspberry Pi Zero W takes less than a second for a dataset with features of more than 3000 sound samples.
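
The k-nearest neighbors step with k = 1 can be illustrated with a minimal sketch. The three-dimensional feature vectors below are made-up stand-ins for Mel-frequency cepstral coefficients (the feature extraction itself is not shown), and the function name `nearest_neighbor` is hypothetical; this is not the implementation evaluated in the abstract.

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training vector closest to `query` (k = 1)."""
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical feature vectors standing in for MFCC features; values are invented.
TRAIN = [
    ((0.9, 0.1, 0.2), "car horn"),
    ((0.1, 0.8, 0.3), "jackhammer"),
    ((0.2, 0.2, 0.9), "street music"),
]

predicted = nearest_neighbor(TRAIN, (0.85, 0.15, 0.25))
```

With k = 1 there is no separate training phase beyond storing the labeled vectors, which is one plausible reason such a classifier can be cheap on a small device.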

Noise is any undesired environmental sound. A sound at the same dB level may be perceived as annoying noise or as pleasant music. Therefore, it is necessary to go beyond the state-of-the-art approaches that measure only the dB level and also identify the type of noise. In this paper, we present a machine learning based method for urban noise identification using an inexpensive IoT unit. We use Mel-frequency cepstral coefficients for audio feature extraction and supervised classification algorithms (that is, support vector machine, k-nearest neighbors, bootstrap aggregation, and random forest) for noise classification. We evaluate our approach experimentally with a data-set of about 3000 sound samples grouped in eight sound classes (such as car horn, jackhammer, or street music). We explore the parameter space of the four algorithms to estimate the optimal parameter values for classification of sound samples in the data-set under study. We achieve a noise classification accuracy in the range 88% - 94%.

Modern multicore and manycore systems enjoy the benefits of technology scaling and promise impressive performance. However, harvesting this potential is not straightforward. While multicore and manycore processors alleviate several problems that are related to single-core processors – known as memory-, power-, or instruction-level parallelism-wall – they raise the issue of programmability and programming effort. This topic focuses on novel solutions for multicore and manycore programmability and efficient programming in the context of general-purpose systems.

PEPPHER, a three-year European FP7 project, addresses efficient utilization of hybrid (heterogeneous) computer systems consisting of multicore CPUs with GPU-type accelerators. This article outlines the PEPPHER performance-aware component model, performance prediction means, runtime system, and other aspects of the project. A larger example demonstrates performance portability with the PEPPHER approach across hybrid systems with one to four GPUs.

The European FP7 project PEPPHER is addressing programmability and performance portability for current and emerging heterogeneous many-core architectures. As its main idea, the project proposes a multi-level parallel execution model comprised of potentially parallelized components existing in variants suitable for different types of cores, memory configurations, input characteristics, optimization criteria, and couples this with dynamic and static resource and architecture aware scheduling mechanisms. Crucial to PEPPHER is that components can be made performance aware, allowing for more efficient dynamic and static scheduling on the concrete, available resources. The flexibility provided in the software model, combined with a customizable, heterogeneous, memory and topology aware run-time system is key to efficiently exploiting the resources of each concrete hardware configuration. The project takes a holistic approach, relying on existing paradigms, interfaces, and languages for the parallelization of components, and develops a prototype framework, a methodology for extending the framework, and guidelines for constructing performance portable software and systems – including paths to migration of existing software – for heterogeneous many-core processors. This paper gives a high-level project overview, and presents a specific example showing how the PEPPHER component variant model and resource-aware run-time system enable performance portability of a numerical kernel.

PEPPHER takes a pluralistic and parallelization-agnostic approach to programmability and performance portability for heterogeneous many-core architectures. The PEPPHER framework is in principle language independent but focuses on supporting C++ code with PEPPHER-specific annotations as pragmas or external annotations. The framework is open and extensible; the PEPPHER methodology consists of rules for how to extend the framework for new architectures. This mainly concerns adaptivity and autotuning for algorithm libraries, the necessary hooks and extensions for the run-time system, and any supporting algorithms and data structures that it relies on. Offloading is a specific technique for programming heterogeneous platforms that can sometimes be applied with high efficiency. Offload, as developed by the PEPPHER partner Codeplay, is a particular, nonintrusive C++ extension allowing portable C++ code to support diverse heterogeneous multicore architectures in a single code base.

Many important scientific and engineering problems may be solved by combining multiple applications in the form of a Grid workflow. We consider that for the wide acceptance of Grid technology it is important that the user can express requirements on Quality of Service (QoS) at workflow specification time. However, most of the existing workflow languages lack constructs for QoS specification. In this paper we present an approach for high-level workflow specification that considers a comprehensive set of QoS requirements. Besides performance-related QoS, it includes economical, legal and security aspects. For instance, for security or legal reasons the user may express the location affinity regarding Grid resources on which certain workflow tasks may be executed. Our QoS-aware workflow system provides support for the whole workflow life cycle from specification to execution. Workflows are specified graphically, in an intuitive manner, based on a standard visual modeling language. A set of QoS-aware service-oriented components is provided for workflow planning to support automatic constraint-based service negotiation and workflow optimization. For reducing the complexity of workflow planning, we introduce a QoS-aware workflow reduction technique. We illustrate our approach with a real-world workflow for maxillofacial surgery simulation.

Commonly, at a high level of abstraction, Grid applications are specified based on the workflow paradigm. However, the majority of Grid workflow systems either do not support Quality of Service (QoS), or provide only partial QoS support for certain phases of the workflow lifecycle. In this paper we present Amadeus, which is a holistic service-oriented environment for QoS-aware Grid workflows. Amadeus considers user requirements, in terms of QoS constraints, during workflow specification, planning, and execution. Within the Amadeus environment workflows and the associated QoS constraints are specified at a high level using an intuitive graphical notation. A distinguishing feature of our system is the support of a comprehensive set of QoS requirements, which considers legal and security aspects in addition to performance and economical aspects. A set of QoS-aware service-oriented components is provided for workflow planning to support automatic constraint-based service negotiation and workflow optimization. For improving the efficiency of workflow planning we introduce a QoS-aware workflow reduction technique. Furthermore, we present our static and dynamic planning strategies for workflow execution in accordance with user-specified requirements. For each phase of the workflow lifecycle we experimentally evaluate the corresponding Amadeus components.

While modern parallel computing systems provide high-performance resources, utilizing them to the highest extent requires advanced programming expertise. Programming for parallel computing systems is much more difficult than programming for sequential systems. OpenMP is an extension of the C++ programming language that enables programmers to express parallelism using compiler directives. While OpenMP eases parallel programming by reducing the lines of code that the programmer needs to write, deciding how and when to use these compiler directives is up to the programmer. Novice programmers may make mistakes that lead to performance degradation or unexpected program behavior. Cognitive computing has shown impressive results in various domains, such as health or marketing. In this paper, we describe the use of the IBM Watson cognitive system for the education of novice parallel programmers. Using the dialogue service of IBM Watson we have developed a solution that assists the programmer in avoiding common OpenMP mistakes. To evaluate our approach we have conducted a survey with a number of novice parallel programmers at Linnaeus University, and obtained encouraging results with respect to the usefulness of our approach. (C) 2017 The Authors. Published by Elsevier B.V.

The introduction of Intel® Xeon Phi™ coprocessors opened up new possibilities in development of highly parallel applications. Even though the architecture allows developers to use familiar programming paradigms and techniques, high-level development of programs that utilize all available processors (host+coprocessors) in a system at the same time is a challenging task.

In this paper we present a new high-level parallel library construct which makes it easy to apply a function to every member of an array in parallel. In addition, it supports the dynamic distribution of work between the host CPUs and one or more coprocessors. We describe associated runtime support and use a physical simulation example to demonstrate that our library can facilitate the creation of C++ applications that benefit significantly from hybrid execution. Experimental results show that a single optimized source code is sufficient to simultaneously exploit all of the host's CPU cores and coprocessors efficiently.
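
The idea of applying a function to every array element while distributing work dynamically can be sketched as follows. This is only an illustration under stated assumptions: ordinary Python threads stand in for the host and a coprocessor, the worker names are plain labels, and `hybrid_for_each` is a hypothetical name, not the construct or API of the library described above.

```python
import queue
import threading

def hybrid_for_each(data, func, workers=("host", "coprocessor"), chunk=4):
    """Apply `func` to every element of `data` in place.

    Workers pull index ranges from a shared queue, so a faster worker
    automatically ends up processing more chunks (dynamic distribution).
    """
    tasks = queue.Queue()
    for start in range(0, len(data), chunk):
        tasks.put((start, min(start + chunk, len(data))))
    processed = {name: 0 for name in workers}

    def run(name):
        while True:
            try:
                lo, hi = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            for i in range(lo, hi):
                data[i] = func(data[i])
            processed[name] += hi - lo  # each worker updates only its own key

    threads = [threading.Thread(target=run, args=(name,)) for name in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

values = list(range(10))
counts = hybrid_for_each(values, lambda x: x * x)
```

The returned dictionary records how many elements each worker handled; on real hardware that split would reflect the relative speed of the host and the coprocessors.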

Performance engineering of parallel and distributed applications is a complex task that iterates through various phases, ranging from modeling and prediction, to performance measurement, experiment management, data collection, and bottleneck analysis. There is no evidence so far that all of these phases should/can be integrated into a single monolithic tool. Moreover, the emergence of computational Grids as a common single wide-area platform for high-performance computing raises the idea to provide tools as interacting Grid services that share resources, support interoperability among different users and tools, and, most importantly, provide omnipresent services over the Grid. We have developed the ASKALON tool set to support performance-oriented development of parallel and distributed (Grid) applications. ASKALON comprises four tools, coherently integrated into a service-oriented architecture. SCALEA is a performance instrumentation, measurement, and analysis tool of parallel and distributed applications. ZENTURIO is a general purpose experiment management tool with advanced support for multi-experiment performance analysis and parameter studies. AKSUM provides semi-automatic high-level performance bottleneck detection through a special-purpose performance property specification language. The PerformanceProphet enables the user to model and predict the performance of parallel applications at the early stages of development. In this paper we describe the overall architecture of the ASKALON tool set and outline the basic functionality of the four constituent tools. The structure of each tool is based on the composition and sharing of remote Grid services, thus enabling tool interoperability. In addition, a data repository allows the tools to share the common application performance and output data that have been derived by the individual tools. A service repository is used to store common portable Grid service implementations. 
A general-purpose Factory service is employed to create service instances on arbitrary remote Grid sites. Discovering and dynamically binding to existing remote services is achieved through registry services. The ASKALON visualization diagrams support both online and post-mortem visualization of performance and output data. We demonstrate the usefulness and effectiveness of ASKALON by applying the tools to real-world applications.

Cloud Computing is one of the most intensively developed solutions for large-scale distributed processing. Effective use of such environments, management of their high complexity and ensuring appropriate levels of Quality of Service (QoS) require advanced monitoring systems. Such monitoring systems have to support the scalability, adaptability and reliability of the Cloud. Most existing monitoring systems do not incorporate any Artificial Intelligence (AI) algorithms for supporting changes inside the task stream or the environment itself. They focus only on monitoring or on enabling the control of the system as part of a separate service. An effective monitoring system for the Cloud environment should gather information about all stages of task processing and should actively control the monitored environment. In this paper, we present a novel Multi-Agent System based Cloud Monitoring (MAS-CM) model that supports the performance and security of task gathering, scheduling and execution processes in large-scale service-oriented environments. Such models are explicitly designed to control the performance and security objectives of the environment. In our work, we focus on prevention of unauthorized task injection and modification, optimization of the scheduling process and maximization of resource usage. We evaluate the effectiveness of MAS-CM empirically using an evolutionary driven implementation of Independent Batch Scheduler and the FastFlow framework. The obtained results demonstrate the effectiveness of the proposed approach and the performance improvement.

Science education is tremendously shaping the present and future of modern societies. Thus, Europe needs all its talents to increase creativity and competitiveness. Young boys and girls in particular have to be engaged to pursue careers in Science, Technology, Engineering and Mathematics (STEM). However, statistics still show that enrolment rates in STEM-based degree programs are decreasing. In the long run, this will lead to a workforce problem in the research and development based economy as well as in the scientific sector of all EU member states. But how can we get young people more interested in STEM? The EU-funded research project SciChallenge (project.scichallenge.eu) addresses this challenge by proposing a social-media-based STEM contest for young people between 10 and 20 years of age. The contest pilot is currently running (until April 30th 2017). With its multi-level approach, SciChallenge aims at increasing the attractiveness of science education and careers among young girls and boys on a pan-European level. In the first part, the paper introduces the project and highlights the main steps of the preparation of the contest. This includes the development of the contest concept and the processual framework as well as the main steps taken to prepare the contest. It also presents the resources that are provided for the participants. The second part of the paper highlights the idea, design and implementation of the digital contest platform (www.scichallenge.eu), which serves as the core of the contest. It will present, for example, the novel submission and rating system that utilizes the power of social networking platforms such as Facebook, as well as the contest dashboards, a convenient, easy-to-use informational map for the users to observe the status of the contest and related information. Furthermore, it will show how intelligent social media syndication tools can support the awareness creation. 
The third part of the paper will provide a status update on the currently running contest pilot. It will provide a summary of the experiences that the consortium gained with this novel approach. It will also elaborate on the main obstacles the consortium was facing and present the lessons learned for a future implementation, before drawing preliminary conclusions in the final part on whether such an approach can be a way to increase the interest of young people in STEM education and careers.

PEPPHER is a 3-year EU FP7 project that develops a novel approach and framework to enhance performance portability and programmability of heterogeneous multi-core systems. Its primary target is single-node heterogeneous systems, where several CPU cores are supported by accelerators such as GPUs. This poster briefly surveys the PEPPHER framework for single-node systems, and elaborates on the prospects for leveraging the PEPPHER approach to generate performance-portable code for heterogeneous multi-node systems.

We discuss three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems. Together, these approaches also support performance portability, as currently investigated in the EU FP7 project PEPPHER. In particular, we consider (1) a library-based approach, here represented by the integration of the SkePU C++ skeleton programming library with the StarPU runtime system for dynamic scheduling and dynamic selection of suitable execution units for parallel tasks; (2) a language-based approach, here represented by the Offload-C++ high-level language extensions and Offload compiler to generate platform-specific code; and (3) a component-based approach, specifically the PEPPHER component system for annotating user-level application components with performance metadata, thereby preparing them for performance-aware composition. We discuss the strengths and weaknesses of these approaches and show how they could complement each other in an integrational programming framework for heterogeneous multicore systems.

Many modern parallel computing systems are heterogeneous at their node level. Such nodes may comprise general-purpose CPUs and accelerators (such as GPUs or the Intel Xeon Phi) that provide high performance with suitable energy-consumption characteristics. However, exploiting the available performance of heterogeneous architectures may be challenging. There are various parallel programming frameworks (such as OpenMP, OpenCL, OpenACC, and CUDA), and selecting the one that is suitable for a target context is not straightforward. In this paper, we study empirically the characteristics of OpenMP, OpenACC, OpenCL, and CUDA with respect to programming productivity, performance, and energy. To evaluate the programming productivity we use our homegrown tool CodeStat, which enables us to determine the percentage of code lines required to parallelize the code using a specific framework. We use our tools MeterPU and x-MeterPU to evaluate the energy consumption and the performance. Experiments are conducted using the industry-standard SPEC benchmark suite and the Rodinia benchmark suite for accelerated computing on heterogeneous systems that combine Intel Xeon E5 Processors with a GPU accelerator or an Intel Xeon Phi co-processor.
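
A productivity metric of the kind described, the percentage of code lines needed for parallelization, can be approximated with a few lines of code. The sketch below is an assumption-laden illustration and not the CodeStat tool: the function name `parallel_line_fraction`, the marker list, and the rule of counting only non-blank lines are all hypothetical choices.

```python
def parallel_line_fraction(source, markers=("#pragma omp",)):
    """Percentage of non-blank lines that contain one of the `markers`."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for ln in lines if any(m in ln for m in markers))
    return 100.0 * hits / len(lines)

# Small C program with one OpenMP directive, used only as sample input.
SAMPLE = """\
#include <stdio.h>
int main(void) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; ++i)
        sum += i;
    printf("%f\\n", sum);
    return 0;
}
"""

share = parallel_line_fraction(SAMPLE)  # one of nine non-blank lines is a directive
```

Frameworks such as OpenCL or CUDA spread their framework-specific code over API calls and kernels rather than directives, so a marker list for them would need to be considerably richer than the single OpenMP prefix assumed here.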

The DNA sequence analysis is a data and computationally intensive problem and therefore demands suitable parallel computing resources and algorithms. In this paper, we describe an optimized approach for DNA sequence analysis on a heterogeneous platform that is accelerated with the Intel Xeon Phi. Such platforms commonly comprise one or two general purpose host central processing units (CPUs) and one or more Xeon Phi devices. We present a parallel algorithm that shares the work of DNA sequence analysis between the host CPUs and the Xeon Phi device to reduce the overall analysis time. For automatic worksharing we use a supervised machine learning approach, which predicts the performance of DNA sequence analysis on the host and device and accordingly maps fractions of the DNA sequence to the host and device. We evaluate our approach empirically using real-world DNA segments for human and various animals on a heterogeneous platform that comprises two 12-core Intel Xeon E5 CPUs and an Intel Xeon Phi 7120P device with 61 cores.

Genetic information is increasing exponentially, doubling every 18 months. Analyzing this information within a reasonable amount of time requires parallel computing resources. While considerable research has addressed DNA analysis using GPUs, so far not much attention has been paid to the Intel Xeon Phi coprocessor. In this paper we present an algorithm for large-scale DNA analysis that exploits the thread-level and SIMD parallelism of the Intel Xeon Phi. We evaluate our approach for various numbers of cores and thread allocation affinities in the context of real-world DNA sequences of mouse, cat, dog, chicken, human and turkey. The experimental results on the Intel Xeon Phi show speed-ups of up to 10× compared to a sequential implementation running on an Intel Xeon E5 processor.

Rapid analysis of DNA sequences is important in preventing the evolution of different viruses and bacteria during an early phase, early diagnosis of genetic predispositions to certain diseases (cancer, cardiovascular diseases), and in DNA forensics. However, real-world DNA sequences may comprise several Gigabytes and the process of DNA analysis demands adequate computational resources to be completed within a reasonable time. In this paper we present a scalable approach for parallel DNA analysis that is based on Finite Automata, and which is suitable for analysing very large DNA segments. We evaluate our approach for real-world DNA segments of mouse (2.7GB), cat (2.4GB), dog (2.4GB), chicken (1GB), human (3.2GB) and turkey (0.2GB). Experimental results on a dual-socket shared-memory system with 24 physical cores show speedups of up to 17.6x. Our approach is up to 3x faster than a pattern-based parallel approach that uses the RE2 library.

Analysis of DNA sequences is a data- and computation-intensive problem, and therefore it requires suitable parallel computing resources and algorithms. In this paper, we describe our parallel algorithm for DNA sequence analysis that determines how many times a pattern appears in the DNA sequence. The algorithm is engineered for heterogeneous platforms that comprise a host with multi-core processors and one or more many-core devices. For combinatorial optimization, we use the simulated annealing algorithm. The optimization goal is to determine the number of threads, thread affinities, and DNA sequence fractions for host and device, such that the overall execution time of DNA sequence analysis is minimized. We evaluate our approach experimentally using real-world DNA sequences of various organisms on a heterogeneous platform that comprises two Intel Xeon E5 processors and an Intel Xeon Phi 7120P co-processing device. By running only about 5% of possible experiments, our optimization method finds a near-optimal system configuration for DNA sequence analysis that yields an average speedup of 1.6× and 2× compared with host-only and device-only execution, respectively.
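
The simulated annealing search over system configurations can be sketched as below. Everything specific here is an assumption made for illustration: the parameter values, the analytic cost model `exec_time` (in the work described above the cost would come from actual measurements), the cooling schedule, and the omission of thread affinity from the configuration.

```python
import math
import random

HOST_THREADS = [6, 12, 24, 48]
DEVICE_THREADS = [60, 120, 180, 240]
HOST_FRACTIONS = [i / 10 for i in range(11)]  # share of the sequence given to the host

def exec_time(cfg):
    """Made-up cost model: host and device run concurrently, the slower one dominates."""
    host_t, dev_t, frac = cfg
    host_time = 100.0 * frac / host_t
    device_time = 100.0 * (1.0 - frac) / (dev_t * 0.2)
    return max(host_time, device_time)

def neighbor(cfg, rng):
    """Change one randomly chosen parameter to another allowed value."""
    options = [HOST_THREADS, DEVICE_THREADS, HOST_FRACTIONS]
    idx = rng.randrange(3)
    new = list(cfg)
    new[idx] = rng.choice(options[idx])
    return tuple(new)

def anneal(start, steps=500, temp=1.0, cooling=0.99, seed=0):
    rng = random.Random(seed)
    current, best = start, start
    for _ in range(steps):
        cand = neighbor(current, rng)
        delta = exec_time(cand) - exec_time(current)
        # Always accept improvements; accept worse moves with probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = cand
        if exec_time(current) < exec_time(best):
            best = current
        temp *= cooling  # lower temperature makes worse moves less likely over time
    return best

start = (6, 60, 0.5)
best = anneal(start)
```

The search evaluates at most a few hundred configurations instead of the full parameter space, which is the property that makes this kind of heuristic attractive when each evaluation is a real experiment.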

We describe an approach that uses combinatorial optimization and machine learning to share the work between the host and device of heterogeneous computing systems such that the overall application execution time is minimized. We propose to use combinatorial optimization to search for the optimal system configuration in the given parameter space (such as the number of threads, thread affinity, and work distribution for the host and device). For each system configuration that is suggested by combinatorial optimization, we use machine learning to evaluate the system performance. We evaluate our approach experimentally using a heterogeneous platform that comprises two 12-core Intel Xeon E5 CPUs and an Intel Xeon Phi 7120P co-processor with 61 cores. Using our approach we are able to find a near-optimal system configuration by performing only about 5% of all possible experiments.

Big data streaming applications require utilization of heterogeneous parallel computing systems, which may comprise multiple multi-core CPUs and many-core accelerating devices such as NVIDIA GPUs and Intel Xeon Phis. Programming such systems requires advanced knowledge of several hardware architectures and device-specific programming models, including OpenMP and CUDA. In this paper, we present HSTREAM, a compiler directive-based language extension to support programming stream computing applications for heterogeneous parallel computing systems. The HSTREAM source-to-source compiler aims to increase programming productivity by enabling programmers to annotate the parallel regions for heterogeneous execution and generate target-specific code. The HSTREAM runtime automatically distributes the workload across CPUs and accelerating devices. We demonstrate the usefulness of the HSTREAM language extension with various applications from the STREAM benchmark. Experimental evaluation results show that HSTREAM can keep the same programming simplicity as OpenMP, and the generated code can deliver performance beyond what CPUs-only and GPUs-only executions can deliver.

The efficient utilization of the available resources in modern parallel computing systems requires advanced parallel programming expertise. However, parallel programming is more difficult than sequential programming. To alleviate the difficulties of parallel programming, high-level programming frameworks, such as OpenMP, have been proposed. Yet, there is evidence that novice parallel programmers make common mistakes that may lead to performance degradation or unexpected program behavior. In this paper, we present our cognitive Parallel Programming Assistant (PAPA), which aims to educate novice parallel programmers and assist them in avoiding common OpenMP mistakes. PAPA combines different IBM Watson services to provide a dialog-based interaction (through text and voice) for programmers. We use the Watson Conversation service to implement the dialog-based interaction, and the Speech-to-Text and Text-to-Speech services to enable the voice interaction. The Watson Natural Language Understanding and WordsAPI Synonyms services are used to train PAPA with OpenMP-related publications. We evaluate our approach using a user experience questionnaire with a number of novice parallel programmers at Linnaeus University.

Regular expression matching is essential for many applications, such as finding patterns in text, exploring substrings in large DNA sequences, or lexical analysis. However, sequential regular expression matching may be time-prohibitive for large problem sizes. In this paper, we describe a novel algorithm for parallel regular expression matching via deterministic finite automata. Furthermore, we present our tool PaREM, which accepts regular expressions and finite automata as input and automatically generates the corresponding code for our algorithm, which is amenable to parallel execution on shared-memory systems. We evaluate our parallel algorithm empirically by comparing it with a commonly used algorithm for sequential regular expression matching. Experiments on a dual-socket shared-memory system with 24 physical cores show speed-ups of up to 21× for 48 threads.
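A common way to parallelize matching with a deterministic finite automaton is to split the input into chunks, compute for each chunk the end state reached from every possible start state, and then chain these per-chunk mappings in order. The sketch below illustrates that general technique under stated assumptions; it is not claimed to be the exact algorithm implemented in PaREM. The automaton (it accepts strings over {a, b} that contain "ab") and all function names are hypothetical. The thread pool only shows the structure of the computation, since CPython threads do not run this pure-Python loop in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical DFA over {a, b} that accepts strings containing "ab".
STATES = (0, 1, 2)
START, ACCEPTING = 0, {2}
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
}

def run_sequential(text, state=START):
    """Classic sequential DFA run; returns the final state."""
    for ch in text:
        state = DELTA[(state, ch)]
    return state

def chunk_map(chunk):
    """Return the end state of the chunk for every possible start state."""
    return {s: run_sequential(chunk, s) for s in STATES}

def run_parallel(text, workers=4):
    """Process chunks independently, then chain the per-chunk mappings."""
    if not text:
        return START
    size = -(-len(text) // workers)  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        maps = list(pool.map(chunk_map, chunks))
    state = START
    for mapping in maps:
        state = mapping[state]
    return state

def accepts(text, workers=4):
    return run_parallel(text, workers) in ACCEPTING
```

The price of chunk independence is that each chunk is scanned once per candidate start state, so the approach pays off only when the number of states that must be tried per chunk is small relative to the number of available cores.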

Genetic information is increasing exponentially, doubling every 18 months. Analyzing this information within a reasonable amount of time requires parallel computing resources. While considerable research has addressed DNA analysis using GPUs, so far not much attention has been paid to the Intel Xeon Phi coprocessor. In this paper we present an algorithm for large-scale DNA analysis that exploits the thread-level and the SIMD parallelism of the Intel Xeon Phi coprocessor. We evaluate our approach for various numbers of cores and thread allocation affinities in the context of real-world DNA sequences of mouse, cat, dog, chicken, human and turkey. The experimental results on Intel Xeon Phi show speed-ups of up to 10× compared to a sequential implementation running on an Intel Xeon processor E5.

Heterogeneous computing systems offer high peak performance and energy efficiency, and utilizing this potential is essential to achieve extreme-scale performance. However, optimal sharing of the work among processing elements in heterogeneous systems is not straightforward. In this paper, we propose an approach that uses combinatorial optimization to search for an optimal system configuration in a given parameter space. The optimization goal is to determine the number of threads, thread affinities, and workload partitioning such that the overall execution time is minimized. For combinatorial optimization we use simulated annealing. We evaluate our approach with a DNA sequence analysis application on a heterogeneous platform that comprises two Intel Xeon E5 processors and an Intel Xeon Phi 7120P co-processor. The results demonstrate that the near-optimal system configuration determined by our simulated annealing-based algorithm improves application performance.
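The search over thread counts, affinities, and workload partitioning can be sketched with a generic simulated annealing loop. This is a minimal illustration, not the authors' implementation: the parameter values, the toy `execution_time` cost function (a stand-in for measuring the real application), and the cooling schedule are all assumptions.

```python
import math
import random

# Hypothetical configuration space; the values are illustrative only.
THREADS = [2, 6, 12, 24, 48]
AFFINITY = ["compact", "scatter", "balanced"]
HOST_FRACTION = [i / 10 for i in range(11)]  # share of the work kept on the host

def execution_time(cfg):
    """Toy stand-in for a measured run time (lower is better)."""
    threads, affinity, host = cfg
    penalty = {"compact": 1.2, "scatter": 1.0, "balanced": 1.1}[affinity]
    host_time = host * 100.0 / threads * penalty
    device_time = (1.0 - host) * 100.0 / 60.0
    return max(host_time, device_time)  # host and device work concurrently

def neighbor(cfg, rng):
    """Change one randomly chosen parameter to a random value."""
    options = [THREADS, AFFINITY, HOST_FRACTION]
    new = list(cfg)
    i = rng.randrange(3)
    new[i] = rng.choice(options[i])
    return tuple(new)

def simulated_annealing(steps=2000, t0=10.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    current = (rng.choice(THREADS), rng.choice(AFFINITY), rng.choice(HOST_FRACTION))
    cur_cost = execution_time(current)
    best, best_cost = current, cur_cost
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cost = execution_time(cand)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature decreases.
        if cost < cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
        temp *= cooling
    return best, best_cost
```

Accepting some worse configurations early on is what lets the search escape local minima; as the temperature drops, the loop behaves increasingly like a greedy local search.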

Optimized software execution on parallel computing systems demands consideration of many parameters at run-time. Determining the optimal set of parameters in a given execution context is a complex task, and to address this issue researchers have proposed different approaches that use heuristic search or machine learning. In this paper, we undertake a systematic literature review to aggregate, analyze and classify the existing software optimization methods for parallel computing systems. We review approaches that use machine learning or meta-heuristics for scheduling parallel computing systems. Additionally, we discuss challenges and future research directions. The results of this study may help to better understand the state-of-the-art techniques that use machine learning and meta-heuristics to deal with the complexity of scheduling parallel computing systems. Furthermore, it may aid in understanding the limitations of existing approaches and in identifying areas for improvement.

While modern parallel computing systems offer high performance, utilizing these powerful computing resources to the highest possible extent demands advanced knowledge of various hardware architectures and parallel programming models. Furthermore, optimized software execution on parallel computing systems demands consideration of many parameters at compile-time and run-time. Determining the optimal set of parameters in a given execution context is a complex task, and to address this issue researchers have proposed different approaches that use heuristic search or machine learning. In this paper, we undertake a systematic literature review to aggregate, analyze and classify the existing software optimization methods for parallel computing systems. We review approaches that use machine learning or meta-heuristics for software optimization at compile-time and run-time. Additionally, we discuss challenges and future research directions. The results of this study may help to better understand the state-of-the-art techniques that use machine learning and meta-heuristics to deal with the complexity of software optimization for parallel computing systems. Furthermore, it may aid in understanding the limitations of existing approaches and in identifying areas for improvement.

We present IoTutor, a cognitive computing solution for educating students in the IoT domain. We implement IoTutor as a platform-independent web-based application that is able to interact with users via text or speech using natural language. We train IoTutor with selected scientific publications relevant to IoT education. To investigate users' experience with IoTutor, we ask a group of students taking a master-level IoT course at Linnaeus University to use IoTutor for a period of two weeks. We ask the students to express their opinions with respect to the attractiveness, perspicuity, efficiency, stimulation, and novelty of IoTutor. The evaluation results show that students express an overall positive attitude towards IoTutor, with the majority of the aspects rated higher than the neutral value.

In this chapter, we describe an optimized approach for DNA sequence analysis on a heterogeneous platform that is accelerated with the Intel Xeon Phi. Such platforms commonly comprise one or two general purpose CPUs and one (or more) Xeon Phi coprocessors. Our parallel DNA sequence analysis algorithm is based on Finite Automata and finds patterns in large-scale DNA sequences. To determine the optimal worksharing (that is, DNA sequence fractions for the host and accelerating device) we propose a solution that combines combinatorial optimization and machine learning. The objective function that we aim to minimize is the execution time of the DNA sequence analysis. We use combinatorial optimization to efficiently explore the system configuration space and determine with machine learning the near-optimal system configuration for execution of the DNA sequence analysis. We evaluate our approach empirically using real-world DNA segments of various organisms. For experimentation, we use an accelerated platform that comprises two 12-core Intel Xeon E5 CPUs and an Intel Xeon Phi 7120P accelerator with 61 cores.

We report a simulation study of a smart living IoT solution for elderly people living in their own houses. Our study was conducted in the context of the BoIT project in Sweden, which investigates the use of various IoT devices for remote housing and care-giving services. We focus on a carephone device that enables users to establish a voice connection via IP with care givers or relatives. We have developed a simulation model to study the IoT solution for elderly care in the Växjö municipality in Sweden. The simulation model can be used to address various issues, such as determining the lack or excess of resources or long waiting times, and to study the system behavior when the number of alarms is increased. Simulation results indicate that a 15% increase in the arrival rate would cause unacceptably long waiting times for patients to receive care.
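The sensitivity of waiting times to the arrival rate can be illustrated with a minimal queueing simulation. The sketch assumes a first-come, first-served queue with several identical care givers and exponentially distributed inter-arrival and service times; the function name and all rates are hypothetical and are not taken from the study.

```python
import heapq
import random

def mean_wait(arrival_rate, service_rate, servers, customers=5000, seed=1):
    """Mean waiting time in a FIFO queue with several identical servers."""
    arrivals = random.Random(seed)      # stream for inter-arrival times
    services = random.Random(seed + 1)  # independent stream for service times
    free_at = [0.0] * servers           # min-heap: when each server becomes free
    now = 0.0
    total_wait = 0.0
    for _ in range(customers):
        now += arrivals.expovariate(arrival_rate)   # next alarm arrives
        start = max(now, heapq.heappop(free_at))    # earliest available server
        total_wait += start - now
        heapq.heappush(free_at, start + services.expovariate(service_rate))
    return total_wait / customers

# Hypothetical scenario: 4 care givers, each handling 0.25 alarms per minute.
baseline = mean_wait(arrival_rate=0.90, service_rate=0.25, servers=4)
increased = mean_wait(arrival_rate=0.90 * 1.15, service_rate=0.25, servers=4)
```

In this made-up scenario the baseline utilization is 90%, so a 15% increase in arrivals pushes the offered load above the service capacity and the queue keeps growing, which is why small changes in the arrival rate can have a disproportionate effect on waiting times.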

In this chapter we argue that an intelligent program development environment that proactively supports the user helps a mainstream programmer to overcome the difficulties of programming multicore computing systems. We propose a programming environment based on intelligent software agents that enables users to work at a high level of abstraction while automating low-level implementation activities. The programming environment supports program composition in a model-driven development fashion using parallel building blocks and proactively assists the user during major phases of program development and performance tuning. We highlight the potential benefits of using such a programming environment with usage scenarios. An experiment with a parallel building block on a Sun UltraSPARC T2 Plus processor shows how the system may assist the programmer in achieving performance improvements.

In this position paper we argue that an intelligent program development environment that proactively supports the user helps a mainstream programmer to overcome the difficulties of programming multi-core computing systems. We propose a programming environment based on intelligent software agents that enables users to work at a high level of abstraction while automating low-level implementation activities. The programming environment supports program composition in a model-driven development fashion using parallel building blocks and proactively assists the user during major phases of program development and performance tuning. We highlight the potential benefits of using such a programming environment with usage scenarios. An experiment with a parallel building block on a Sun UltraSPARC T2 Plus processor shows how the system may assist the programmer in achieving performance improvements.

We present a novel approach for hybrid performance modelling and prediction of large-scale parallel and distributed computing systems, which combines Mathematical Modelling (MathMod) and Discrete-Event Simulation (DES). We use MathMod to develop parameterised performance models for components of the system. Thereafter, we use DES to describe the structure of the system and the interaction among its components. As a result we obtain a high-level performance model, which combines the evaluation speed of mathematical models with the structure awareness and fidelity of the simulation model. We empirically evaluate our approach with a real-world material science program that comprises more than 15,000 lines of code.
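The division of labour between the two modelling styles can be sketched as follows: closed-form functions estimate the cost of individual components, while an event-driven loop captures how the components interact. The scenario (workers that compute in parallel and then send results to a master that receives one message at a time), the function names, and all parameter values are assumptions made for illustration and are not taken from the evaluated program.

```python
import heapq

def compute_time(elements, flops_per_element=50.0, flop_rate=1e9):
    """Analytical model of a compute block (hypothetical parameters)."""
    return elements * flops_per_element / flop_rate

def transfer_time(nbytes, latency=1e-6, bandwidth=1e9):
    """Analytical latency/bandwidth model of one message (hypothetical)."""
    return latency + nbytes / bandwidth

def simulate(workloads, result_bytes=8_000):
    """Event-driven model of the program structure; returns the makespan."""
    # One event per worker: the time at which it finishes computing.
    events = [(compute_time(n), rank) for rank, n in enumerate(workloads)]
    heapq.heapify(events)                    # event list ordered by time
    master_free = 0.0
    while events:
        ready, rank = heapq.heappop(events)  # next worker finishes computing
        start = max(ready, master_free)      # wait while the master is busy
        master_free = start + transfer_time(result_bytes)
    return master_free
```

Because component costs are evaluated as formulas rather than simulated in detail, the event list stays short; the simulation is only responsible for effects such as contention at the master, which a purely analytical model would have to approximate.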

We address the issue of the development of performance models for programs that may be executed on large-scale computing systems. The commonly used approaches apply non-standard notations for model specification and often require that the software engineer has a thorough understanding of the underlying performance modeling technique. We propose to bridge the gap between performance modeling and software engineering by incorporating UML. Our approach aims, on the one hand, to permit the graphical specification of the performance model in a human-intuitive fashion and, on the other hand, to enable machine-efficient model evaluation. The user graphically specifies the performance model using UML. Thereafter, the transformation of the performance model from the human-usable UML representation to the machine-efficient C++ representation is done automatically. We describe our methodology and illustrate it with the automatic transformation of a sample performance model. Furthermore, we demonstrate the usefulness of our approach by modeling and simulating a real-world material science program.