This presentation guides listeners through all the stages of the publication life cycle at the CERN Document Server (CDS), from ingestion using one of the various tools, through curation and processing, until the data is ready to be exported to other systems. It describes the different tools that we use to curate incoming publications and to further improve the existing data on CDS. The second part of the talk goes through various challenges we have faced in the past and how we are going to overcome them in the new version of CDS.

The Large Hadron Collider at CERN produces 600 million collisions every second. Only 1 in a million collisions is interesting. Analyzing and filtering this amount of data requires a fast programming language. Is Python such a language? No, it’s not. Does that mean there is no place for Python in one of the largest scientific facilities in the world? Quite the contrary. Its ease of use and gentle learning curve make Python a perfect programming language for many physicists and other people without a computer science background. CERN does not only produce large amounts of data. The interesting bits of data have to be stored, analyzed, shared and published. The work of many scientists across various research facilities around the world has to be synchronized. This is the area where Python flourishes. And with CERN’s pursuit of creating and using open source software, many interesting projects were born. To facilitate the analysis of data, the ROOT framework [https://root.cern.ch/] was created. It’s a C++ framework focused on big data processing, statistical analysis, visualization and storage. It has been around for more than 20 years, but since more and more scientists nowadays have at least basic Python knowledge, the PyROOT project [https://root.cern.ch/pyroot] was born. PyROOT is a Python extension module that allows users to interact with ROOT from the Python interpreter. It combines the ease of use of Python with the powerful capabilities of the ROOT framework. All the discoveries, small and big, result in thousands of publications that have to go through the whole publication workflow. For that purpose, a digital library framework called Invenio was created [http://invenio-software.org/]. It can be used to easily build your own fully customized digital library, institutional repository, multimedia archive, or research data repository on the web.
Some examples of websites built with Invenio are https://zenodo.org/, https://cds.cern.ch/ and https://analysispreservation.cern.ch/. Another of CERN’s missions is to share knowledge, which is done through various lectures, workshops and conferences. All those events can easily be organized with the help of Indico [http://indico-software.org/]. Indico also comes with a room booking module and can be easily integrated with various collaborative tools.

We present how a research data repository manager can build custom open data solutions to ingest, describe, preserve, and disseminate open research environments, datasets and software using the Invenio digital library framework. We discuss a concrete use case of the CERN Open Data and Zenodo services, describing the technological challenges in preparing large sets of data for the general public. We address the questions of efficient linking and sharing of large quantities of data without unnecessary duplication on the backend, the role of file transfer protocols, as well as the means to visualise data to make it more accessible and interactive for the general public. The technological challenges and the solutions discussed can be applied to any research discipline outside the domain of particle physics.
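
One common way to link and share data without duplicating it on the backend is content-addressed storage, where each file is stored once under its checksum and records only hold references to it. The sketch below is purely illustrative: the `ContentStore` class, its methods, and the record identifiers are hypothetical names chosen for this example, and nothing here is taken from Invenio, CERN Open Data, or Zenodo.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, for illustration).

    Identical file contents are stored once; records link to them by checksum.
    """

    def __init__(self):
        self._blobs = {}  # checksum -> file content
        self._links = {}  # (record_id, filename) -> checksum

    def add(self, record_id, filename, data):
        """Attach a file to a record, storing the bytes only if they are new."""
        checksum = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(checksum, data)
        self._links[(record_id, filename)] = checksum
        return checksum

    def get(self, record_id, filename):
        """Return the content linked from a record under the given filename."""
        return self._blobs[self._links[(record_id, filename)]]

    def stored_bytes(self):
        """Total number of bytes actually kept on the (simulated) backend."""
        return sum(len(blob) for blob in self._blobs.values())


store = ContentStore()
payload = b"event data " * 1000

# Two different records share the same dataset file.
first = store.add("record-1", "events.dat", payload)
second = store.add("record-2", "copy-of-events.dat", payload)
```

Both records resolve to the same checksum here, so the backend holds a single copy of the payload regardless of how many records link to it. A real repository would add persistent storage, access control, and integrity checks on top of this idea.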

The CERN Document Server (CDS) is the CERN Institutional Repository, playing a key role in the storage, dissemination and archival of all research material published at CERN, as well as multimedia and some administrative documents. As CERN’s document hub, it brings together submission and publication workflows dedicated to the CERN experiments, but also to the video and photo teams, the administrative groups, and the outreach groups. In the past year, Invenio, the underlying software platform for CDS, has been undergoing major changes, transitioning from a digital library system to a digital library framework and moving to a new software stack (Invenio is now built on top of the Flask web development framework, using the Jinja2 template engine, the SQLAlchemy ORM, a JSONSchema data model, and Elasticsearch for information retrieval). To reflect these changes on CDS, we are launching a parallel service, CDSLabs, with the goal of offering our users a continuous view of the reshaping of CDS and of gathering more feedback from the community during the development phase, rather than after release.