Announcements:

A 2-month ANL internship to work on HEP software is available (fall 2017). Contact chekanov[AT]angl.gov

ProMC

Next-generation input-output file format

ProMC is a library for storing Monte Carlo event records or any structured data, including experimental data, in
a very compact binary form. The main features are:

Streams data into a binary form and dynamically writes less interesting numeric information
with reduced precision compared to more interesting records.
Such content-dependent "compression" can substantially reduce file size.

Fast. No CPU overhead due to decompression.

Self-describing data format based on a template approach to encode complicated data structures.
One can generate C++, Java and Python analysis code for reading and writing
data from the ProMC file itself.

Multiplatform. Data records can be written and read in C++, Java and Python.

Forward-compatible and backward-compatible binary format.

Metadata for each event can easily be encoded.

Random access. Events (and metadata) can be read starting at any index.

No external dependencies. The library is small and self-contained.

ProMC ("ProtocolBuffers" MC) is based on Google's
Protocol Buffers,
a language-neutral, platform-neutral and extensible mechanism for serializing structured data.
It uses "varints" to store and compress integers
using a variable number of bytes.
Smaller numbers take fewer bytes. This means that low-energy particles
(jets, clusters, cells, tracks etc.) can be represented by fewer bytes, since the values needed to describe such particles
are smaller than those of
high-energy particles or other objects.
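
The varint idea described above can be illustrated with a minimal Python sketch. It is not ProMC code: the function names encode_varint and decode_varint are hypothetical, and the sketch only covers non-negative integers, using the base-128 scheme of Protocol Buffers (7 payload bits per byte, with the top bit marking that more bytes follow).

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (illustrative helper)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single base-128 varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # continuation bit clear: value complete
            return result
        shift += 7
    raise ValueError("truncated varint")


if __name__ == "__main__":
    # Print each value next to the number of bytes its varint occupies.
    for value in (1, 127, 128, 300, 16384, 10**6):
        print(value, len(encode_varint(value)))
```

Values up to 127 fit in one byte and values up to 16383 fit in two, whereas a fixed-length 64-bit integer always occupies eight bytes regardless of its magnitude.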

ProMC is optimized for efficient storage of numeric data that
have a small signal and a large background (or "noise").
For HEP, it is optimized for storing events with a large number of soft particles ("pileup" or "noise").
Benchmarks indicate
that ProMC files use 40-50% less disk storage for events with pileup compared to I/O
with a fixed-length representation of numbers and gzip/zip compression.
The data reduction depends on the underlying energy spectrum: the low-energy part of the spectrum
is "compressed" more effectively than the high-energy part.
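
The dependence on the energy spectrum can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the ProMC implementation and not a reproduction of the benchmark: energies are assumed to be converted to integers with a hypothetical scale of 1000 units per GeV, the sample energies are made up for illustration, and the byte counts ignore field tags and any other per-record overhead.

```python
def varint_size(value: int) -> int:
    """Number of bytes a base-128 varint needs for a non-negative integer."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    # Each varint byte carries 7 payload bits; zero still needs one byte.
    return max(1, (value.bit_length() + 6) // 7)


def to_integer_units(energy_gev: float, units_per_gev: int = 1000) -> int:
    """Convert an energy to an integer (the scale factor is an assumption)."""
    return round(energy_gev * units_per_gev)


def stored_bytes(energies_gev, units_per_gev: int = 1000) -> int:
    """Total varint payload bytes for a list of energies given in GeV."""
    return sum(varint_size(to_integer_units(e, units_per_gev))
               for e in energies_gev)


# Illustrative values only: a few soft and a few hard objects.
soft = [0.2, 0.5, 1.0, 2.0]
hard = [50.0, 200.0, 1000.0]

if __name__ == "__main__":
    for label, sample in (("soft", soft), ("hard", hard)):
        fixed = 8 * len(sample)  # one 64-bit number per value
        print(label, "varint bytes:", stored_bytes(sample), "fixed bytes:", fixed)
```

In this toy model each soft value needs two bytes and each hard value needs three, compared with eight bytes per value for a fixed-length 64-bit number, so the saving per value is largest in the low-energy part of the spectrum. The chosen scale also sets the stored precision: a coarser scale gives smaller integers and fewer bytes.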