Presentations

Needs

In recent years the sources (photons and neutrons), beamline instrumentation and especially detectors have improved dramatically. Detectors capable of hundreds or even thousands of
images per second, with readout data rates exceeding a Gigabyte per second per detector, are enabling previously unimaginable experiments. The consequence is what is known as the data deluge. The
ESFRI roadmap projects in PaNDaaS are at the forefront of this trend: a single experimental session can already generate tens of Terabytes of data, and in the near future hundreds of Terabytes [3]. With tens
of experiments running simultaneously at each research infrastructure (RI), this poses a major challenge. The problem can be more challenging than those faced by the LHC because of the number of detectors and their
combined data rates at a single site. Interacting with such data sets is becoming a major bottleneck for scientists using the RIs because:

Data are too big to transfer – it is increasingly difficult for users to take data away because of their size

Data analysis constitutes a barrier – data reduction and analysis can be a blocking issue for new users

Experiments need feedback from online data analysis – providing feedback about the quality of raw data will considerably improve the efficiency and quality of experiments
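
The first point can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the data set size (10 Terabytes, the low end of "tens of Terabytes") and the detector rate (1 Gigabyte per second) are taken from the text above, while the link speeds are assumptions chosen for illustration and do not describe any particular user laboratory. It also assumes the link is used at its full nominal rate, which is optimistic in practice.

```python
def transfer_time_hours(size_terabytes: float, link_megabits_per_s: float) -> float:
    """Hours needed to move a data set over a link running at its full nominal rate."""
    size_bits = size_terabytes * 1e12 * 8        # decimal Terabytes -> bits
    link_bits_per_s = link_megabits_per_s * 1e6  # Megabits/s -> bits/s
    return size_bits / link_bits_per_s / 3600


def acquisition_time_hours(size_terabytes: float, detector_gigabytes_per_s: float) -> float:
    """Hours a detector needs to produce a data set of the given size."""
    seconds = size_terabytes * 1e12 / (detector_gigabytes_per_s * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    session_size_tb = 10.0  # low end of "tens of Terabytes" per session

    # Assumed link speeds, for illustration only.
    for label, mbit_per_s in [("100 Mbit/s", 100), ("1 Gbit/s", 1_000), ("10 Gbit/s", 10_000)]:
        hours = transfer_time_hours(session_size_tb, mbit_per_s)
        print(f"{label:>10}: {hours:8.1f} h to transfer {session_size_tb:.0f} TB")

    produced_in = acquisition_time_hours(session_size_tb, 1.0)
    print(f"A 1 GB/s detector produces {session_size_tb:.0f} TB in {produced_in:.1f} h")
```

Under these assumptions a detector fills 10 Terabytes in under three hours, whereas moving the same data over a 100 Mbit/s link takes more than nine days, which is why keeping the raw data at the facility and bringing the analysis to the data is attractive.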

Objectives

PaNDaaS proposes to solve the above problems by keeping raw data at the RIs and combining the expertise and resources of all participants of the consortium to solve the IT and data analysis problems.
The consortium will implement a reference architecture which will provide a cloud-like platform for data analysis. Each RI will roll out the reference implementation and install and manage its own private cloud. The private clouds will provide the resources to access the raw data at the facility and give users remote access to browse, reduce and analyse their data. The ultimate goal is to provide Data Analysis as a Service (DaaS). To fulfil this mission the RIs must expand their capacities and services for data storage and computation so that data reduction and analysis can be done during and after the experiments. Failing to do so will inevitably reduce the scientific productivity of the RIs. The challenge of “big data” cannot be addressed by a single facility and therefore requires a concerted effort and a common approach, for the benefit of the participating RIs and especially of their scientific users. The participating organisations have recognised the “big data” challenge as a high priority issue to be addressed urgently. The PaNDaaS proposal has found wide endorsement among scientists in academia and industry in the quest to keep ahead of the data deluge.

Analytical facilities are used by scientists from many different scientific domains who are not necessarily experts in IT, or in computing in general. The typical users are neither used to nor equipped for managing or processing large amounts of data. The scientific users of the RIs come from thousands of laboratories scattered over Europe and beyond, and frequently have only limited network bandwidth and limited access to IT resources. This is a key difference from High Energy Physics, where a stable scientific community has learnt how to deal with complex computing issues and has access to HPC resources and optimised private network connectivity [4]. The users at RIs are experts in their own domain, e.g. life sciences, electronics, palaeontology, environmental science etc., but are not experts in neutron or photon facilities or in the techniques employed at these facilities. This is increasingly the case as such RIs become routine tools in the European science programmes, used by a growing and broadening community. The time spent at the instrument is often simply too short to become familiar with the intricate data calibration, reduction and analysis procedures. It is the task of the in-house scientists to accompany the visiting scientists and guide them through the experiments. Unfortunately, this guidance does not in general extend to the data reduction and analysis process, because this is not formally part of the services offered by the RIs. Many in-house scientists will nevertheless try, on a best effort basis, to help the visiting scientists and thereby increase the likelihood of a successful publication. It is now commonly recognised that there is a growing publication backlog on some experiments because of the difficulties users face with data analysis. A well structured, integrated and efficient data processing environment, usable by non-experts and commonly deployed and serviced by the ESFRIs and the European Photon and Neutron RIs, is therefore essential to overcoming this backlog.

Project funding

The PaNDaaS project was submitted to the H2020 call for proposals in 2014 but was not funded. The need for DaaS in the PaN community has not gone away; on the contrary, it has grown stronger as data sets have become even bigger.
The project has therefore been adapted and is going ahead with the in-house resources of the different partners.