Reading this, I wonder if there is a meaningful distinction between
data compressed with a "general compressor" and simply "Opaque data".
It may be better not to have encodings that are in any way generic,
and only add specific ones as they come into existence. I would
presume that an FDSN update to miniSEED3 to assign the next code
number to "32-bit IEEE floats, Brotli compression, bla bla bla" would
be possible. In the meantime, individuals who wish to experiment with
other compression types can do so by using the opaque data code 100
and a, perhaps standardized, optional header to specify how the
opaque data is supposed to be extracted.

Don't add a code until there is a specific implementation, and that
code is tied to a single specific algorithm.

Yes, I think there is a meaningful distinction between the two. Generic compressed data that requires interpretation of a string, as described in the proposal, would either need to be controlled or we leave open the possibility of getting lots of different ones. If we control which ones are allowed then we might as well assign them encoding values, and we do not want hundreds of those. So while I can appreciate the concept expressed in the proposal of allowing lots of flexibility for future compressors, I don't think we actually want that much flexibility.

When this was added to the straw man, the intention was ultimately to have an encoding that is explicit, similar to "32-bit IEEE floats, Brotli compression". What was left for discussion is whether Brotli is the right choice or whether some other algorithm (or small number of algorithms) is/are better.

Some lengthy background to explain the motivations for adding a generic compression encoding in the straw man follows.

The main advantages are to provide a single encoding that can be used with all sample types (including floats/doubles for which we have no compression) and to leverage the extensive work done by those outside of seismology.
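As a sketch of what such an encoding looks like in practice: serialize the samples to bytes, then hand them to a general-purpose compressor. This uses zlib from the Python standard library as a stand-in, since the Brotli bindings are a third-party package; the Brotli API is analogous (compress/decompress on bytes).

```python
import struct
import zlib

def compress_samples(samples):
    """Serialize 32-bit IEEE floats (little-endian) and compress the bytes."""
    raw = struct.pack('<%df' % len(samples), *samples)
    return zlib.compress(raw)

def decompress_samples(payload):
    """Decompress and deserialize back to a list of floats."""
    raw = zlib.decompress(payload)
    return list(struct.unpack('<%df' % (len(raw) // 4), raw))

# Round trip: values chosen to be exactly representable in float32.
samples = [0.0, 1.5, -2.25, 3.125]
payload = compress_samples(samples)
assert decompress_samples(payload) == samples
```

The point of the sketch is that nothing in it depends on the sample type: swapping the `'<%df'` format string for `'<%di'` (32-bit ints) or `'<%dd'` (doubles) covers the other sample types with the same compressor.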

In my opinion the most important guidelines for a general compressor are:
1) general and usable for any sample type,
2) efficient at very small payloads (not a common scenario in the compression world),
3) broad support across languages and environments, and freely usable, and
4) a realistic possibility of integrating with existing miniSEED libraries/processors.

Obviously it also needs to be a documented standard (whether FDSN writes it or adopts it).

The reasons Brotli was raised as a potential candidate are:

1) It is designed for and efficient at small payload sizes. For example, many formats store the "dictionary" with the payload, whereas Brotli has a default, static dictionary. Even though the static dictionary is designed for text it works well on binary data.
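The small-payload problem is easy to illustrate with a stdlib compressor (zlib here, as a readily available stand-in; Brotli's built-in static dictionary exists precisely to reduce this kind of per-payload overhead):

```python
import struct
import zlib

# A short record: 50 samples of a slowly varying ramp, as 32-bit ints.
samples = [i * 20 for i in range(50)]
raw = struct.pack('<50i', *samples)
compressed = zlib.compress(raw, 9)

# Even a trivially compressible 4-byte payload grows when compressed:
# the container's fixed header and checksum dominate at this size.
tiny = zlib.compress(b'\x00' * 4)
print(len(raw), len(compressed), len(tiny))
```

At typical miniSEED record sizes the fixed per-payload cost (headers, checksums, a transmitted dictionary in some formats) is a meaningful fraction of the whole, which is why criterion 2 matters.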

2) It is a general compressor. Ints, floats, doubles, whatever sample type. We can always get more compression out of tailoring a compressor for seismological time series, but we'd probably have to invent it and support it ourselves (as with the Steim encodings).

3) There is already quite broad support in many languages.

4) It is designed to be efficiently decoded, with more of the cost going into encoding. This fits the seismological data use case well, where data is decompressed much more often than compressed.

5) There is a reference encoder and decoder from Google. This C language code is simpler and more portable than the high-performance, complicated DEFLATE compressors (gzip, lzham, etc.), which would dwarf libmseed and qlib2 in size/complexity.

Between the RFC'd format definition and the MIT-licensed reference library, Brotli is about as open as it gets and cannot be revoked. More in-depth technical evaluation is needed to ensure that Brotli's performance on seismic data is acceptable.

The KMI change proposal raises a good point about efficiency. We should be mindful of resource limitations in field recorders, etc. Then again, there would still be value in an encoding that is only used once data reaches a data center.
