CERN Document Server (CDS, cds.cern.ch) is the CERN Institutional Repository, based on the Invenio open source digital repository framework. It is a heterogeneous repository containing more than 2 million records, including research publications, audiovisual material, images, and the CERN archives. Its mission is to store and preserve all the content produced at CERN and to make it easily available to any interested outlet. CDS aims to be CERN’s document hub. To achieve this, we are transforming CDS into an aggregator over specialized repositories, each with its own software stack and with features enabled according to the repository’s content. The aim is to give each content-producing community its own identity, both visually and functionally, as well as increased control over the data model and over the submission, curation, management, and dissemination of the data. This separation is made possible by the Invenio 3 framework. The first specialized repository created is CDS Videos (videos.cern.ch). It was launched in December 2017 and is the first step in the long-term project to migrate the entire CDS to the Invenio 3 framework. CDS Videos provides integrated submission, long-term archival, and dissemination of CERN video material. It offers a complete solution for the CERN video team, as well as for any department or user at CERN, to upload video productions. The CDS Videos system ingests the video material, interacts with the transcoding server to generate web and broadcaster subformats, mints DOI persistent identifiers, generates embeddable code that can be reused by any other website, and stores the master files for long-term archival. The talk will detail the software architecture of CDS Videos as well as the infrastructure needed to run such a large-scale web application.
It will present the technical solutions adopted, including the Python-based software stack (using, among others, Flask, IIIF, Elasticsearch, Celery, and RabbitMQ) and the new AngularJS-based user interface, which was designed exclusively for CDS Videos. It will also present our solution for a lossless migration of data: more than 5,000 videos from 1954 to 2017, totalling 30 TB of files, have been migrated from DFS to EOS in order to populate the CDS Videos platform. All this could be of great interest to other institutes wanting to reuse the CDS Videos open source code to create their own video platform. Last but not least, the talk will detail how the user community at CERN and beyond can take advantage of the CDS Videos platform to create and disseminate video content.
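
As an illustration of what "lossless" can mean in practice, the following minimal sketch compares checksums of a source file and its migrated copy. It is an assumption about one common verification technique, not a description of the actual CDS Videos migration tooling; the function names and file names are hypothetical, and temporary local files stand in for the DFS and EOS storage systems.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large master files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_migration(source, destination):
    """Hypothetical check: treat a copy as lossless if both checksums match."""
    return file_checksum(source) == file_checksum(destination)


def demo():
    """Compare one intact copy and one truncated copy against the source."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "master.mov"           # stand-in for a file on DFS
        good = Path(tmp) / "copy_ok.mov"         # stand-in for a good copy on EOS
        bad = Path(tmp) / "copy_truncated.mov"   # stand-in for a damaged copy
        payload = b"frame-data" * 1000
        src.write_bytes(payload)
        good.write_bytes(payload)
        bad.write_bytes(payload[:-1])            # drop the last byte
        return verify_migration(src, good), verify_migration(src, bad)


if __name__ == "__main__":
    print(demo())
```

Reading in chunks is a deliberate choice here: with individual master files that can be very large, hashing incrementally keeps memory use constant regardless of file size.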

This presentation guides listeners through all the stages of the publication life cycle at the CERN Document Server, from ingestion using one of the various tools, through curation and processing, until the data is ready to be exported to other systems. It describes the different tools that we use to curate incoming publications as well as to further improve the existing data on CDS. The second part of the talk goes through various challenges we have faced in the past and how we are going to overcome them in the new version of CDS.

The Large Hadron Collider at CERN produces 600 million collisions every second. Only 1 in a million collisions is interesting. Analyzing and filtering this amount of data requires a fast programming language. Is Python such a language? No, it’s not. Does that mean there is no place for Python in one of the largest scientific facilities in the world? Quite the contrary. Its ease of use and gentle learning curve make Python a perfect programming language for many physicists and other people without a computer science background. CERN does not only produce large amounts of data. The interesting bits of data have to be stored, analyzed, shared and published. The work of many scientists across various research facilities around the world has to be synchronized. This is the area where Python flourishes. And with CERN’s pursuit of creating and using open source software, many interesting projects were born. To facilitate the analysis of data, the ROOT framework [https://root.cern.ch/] was created. It’s a C++ framework focused on big data processing, statistical analysis, visualization and storage. It has been around for more than 20 years, but since nowadays more and more scientists have at least basic Python knowledge, the PyROOT project [https://root.cern.ch/pyroot] was born. PyROOT is a Python extension module that allows users to interact with ROOT from the Python interpreter. It combines the ease of use of Python with the powerful capabilities of the ROOT framework. All the discoveries, small and big, result in thousands of publications that have to go through the whole publication workflow. For that purpose, a digital library framework called Invenio was created [http://invenio-software.org/]. It can be used to easily build your own fully customized digital library, institutional repository, multimedia archive, or research data repository on the web.
Some examples of websites built with Invenio are https://zenodo.org/, https://cds.cern.ch/ and https://analysispreservation.cern.ch/. Another of CERN’s missions is to share knowledge, which is done through various lectures, workshops and conferences. All those events can easily be organized with the help of Indico [http://indico-software.org/]. Indico also comes with a room booking module and can be easily integrated with various collaborative tools.

We present how a research data repository manager can build custom open data solutions to ingest, describe, preserve, and disseminate open research environments, datasets and software using the Invenio digital library framework. We discuss the concrete use cases of the CERN Open Data and Zenodo services, describing the technological challenges in preparing large sets of data for the general public. We address the questions of efficient linking and sharing of large quantities of data without unnecessary duplication on the backend, the role of file transfer protocols, as well as the means to visualise data to make it more accessible and interactive for the general public. The technological challenges and the solutions discussed can be applied to any research discipline outside the domain of particle physics.
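
One way to picture "sharing without unnecessary duplication on the backend" is content-addressed storage, where identical files referenced by several records are stored only once. The sketch below is a toy, in-memory illustration of that general idea under our own assumptions; the class, method, record and file names are hypothetical, and it does not describe how Invenio, CERN Open Data or Zenodo actually store files.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store: identical bytes are kept once, keyed by SHA-256."""

    def __init__(self):
        self._blobs = {}    # checksum -> file content
        self._records = {}  # (record_id, filename) -> checksum

    def add_file(self, record_id, filename, data):
        """Attach a file to a record; store the bytes only if they are new."""
        checksum = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(checksum, data)
        self._records[(record_id, filename)] = checksum
        return checksum

    def read_file(self, record_id, filename):
        """Resolve a record's file through its checksum link."""
        return self._blobs[self._records[(record_id, filename)]]

    def stored_bytes(self):
        """Total bytes physically held, counting each distinct blob once."""
        return sum(len(blob) for blob in self._blobs.values())


# Two records link to the same dataset; the content is stored a single time.
store = ContentAddressedStore()
dataset = b"event-data" * 100
first = store.add_file("record-1", "events.root", dataset)
second = store.add_file("record-2", "events-copy.root", dataset)
```

Here both records resolve to the same checksum, so the store holds one copy of the bytes while each record keeps its own file name and link.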

CERN Document Server (CDS) is the CERN Institutional Repository, playing a key role in the storage, dissemination and archival of all research material published at CERN, as well as multimedia and some administrative documents. As CERN’s document hub, it brings together submission and publication workflows dedicated to the CERN experiments, but also to the video and photo teams, the administrative groups, and the outreach groups. In the past year, Invenio, the underlying software platform for CDS, has been undergoing major changes, transitioning from a digital library system to a digital library framework and moving to a new software stack (Invenio is now built on top of the Flask web development framework, using the Jinja2 template engine, the SQLAlchemy ORM, a JSONSchema data model, and Elasticsearch for information retrieval). In order to reflect these changes on CDS, we are launching a parallel service, CDSLabs, with the goal of offering our users a continuous view of the reshaping of CDS, as well as increasing feedback from the community during the development phase rather than after release.