Network Working Group P. Thierry
Internet-Draft Thierry Technologies
Intended status: Experimental May 8, 2018
Expires: November 9, 2018
Binary Uniform Language Kit 1.0
draft-thierry-bulk-03
Abstract
This specification describes a uniform, decentrally extensible and
efficient format for data serialization.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on November 9, 2018.
Copyright Notice
Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Thierry Expires November 9, 2018 [Page 1]
Internet-Draft BULK1 May 2018
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Rationale . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1. Definitions . . . . . . . . . . . . . . . . . . . . . 3
1.1.2. State of the art . . . . . . . . . . . . . . . . . . 4
1.2. Format overview . . . . . . . . . . . . . . . . . . . . . 5
1.3. Conventions and Terminology . . . . . . . . . . . . . . . 6
2. BULK syntax . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1. Parsing algorithm . . . . . . . . . . . . . . . . . . . . 8
2.1.1. Evaluation . . . . . . . . . . . . . . . . . . . . . 9
2.2. Forms . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1. starting marker byte . . . . . . . . . . . . . . . . 10
2.2.2. ending marker byte . . . . . . . . . . . . . . . . . 10
2.2.3. Difference between sequence and form . . . . . . . . 10
2.3. Atoms . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1. nil . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2. Array . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.3. Binary words . . . . . . . . . . . . . . . . . . . . 11
2.3.4. Reserved marker bytes . . . . . . . . . . . . . . . . 14
2.3.5. Reference . . . . . . . . . . . . . . . . . . . . . . 14
3. Standard namespaces . . . . . . . . . . . . . . . . . . . . . 15
3.1. BULK core namespace . . . . . . . . . . . . . . . . . . . 15
3.1.1. Version . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.2. true . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.3. false . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.4. Strings encoding . . . . . . . . . . . . . . . . . . 16
3.1.5. IANA registered character set . . . . . . . . . . . . 16
3.1.6. Windows code page . . . . . . . . . . . . . . . . . . 17
3.1.7. Namespaces . . . . . . . . . . . . . . . . . . . . . 17
3.1.8. Definitions . . . . . . . . . . . . . . . . . . . . . 18
3.1.9. Arithmetic . . . . . . . . . . . . . . . . . . . . . 21
3.1.10. Compact formats . . . . . . . . . . . . . . . . . . . 22
4. Extension namespaces . . . . . . . . . . . . . . . . . . . . 26
5. Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1. Profile redundancy . . . . . . . . . . . . . . . . . . . 26
5.2. Standard profile . . . . . . . . . . . . . . . . . . . . 26
6. Security Considerations . . . . . . . . . . . . . . . . . . . 27
6.1. Parsing . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.2. Forwarding . . . . . . . . . . . . . . . . . . . . . . . 27
6.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 27
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 27
8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 28
9. References . . . . . . . . . . . . . . . . . . . . . . . . . 28
9.1. Normative References . . . . . . . . . . . . . . . . . . 29
9.2. Informative references . . . . . . . . . . . . . . . . . 29
9.3. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Appendix A. Robust namespace definition . . . . . . . . . . . . 30
A.1. Selective authority . . . . . . . . . . . . . . . . . . . 30
A.2. Open authority . . . . . . . . . . . . . . . . . . . . . 30
Appendix B. Verifiable namespace bootstrap . . . . . . . . . . . 31
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 31
1. Introduction
1.1. Rationale
This specification aims at finding an original trade-off between
uniformity, generality, extensibility, decentralization, compactness
and processing speed for a data format. It is our opinion that every
widely used existing format occupies a different position than this one
in the solution space for formats, hence this new design. It is also
our opinion that most of those existing formats constitute an optimal
solution for their specific use case, either in an absolute sense, or
at least at the time of their design. But the ever-changing field of
IT now faces new challenges that call for a new approach.
In particular, whereas the previous trend for Internet and Web
standards and programming tools has been to create human-readable
syntaxes for data and protocols, the advent of technologies like
protocol buffers [protobuf], Thrift [Thrift], the various binary
serializations for JSON like Avro [Avro] or Smile [Smile], or the
binary HTTP/2 [HTTP2] seems to indicate that the time is ripe for a
generalized use of binary, reserved until now for the low-level
protocols and arbitrary data storage. The lessons about flexibility
learnt in the previous switch from binary to plain text can now be
applied to efficient binary syntaxes.
1.1.1. Definitions
By *uniformity*, we mean the property of a syntax that can be parsed
even by an application that doesn't understand the semantics of every
part of the processed data. Of course, almost all syntaxes that
feature uniformity contain a limited number of non-uniform elements.
Also, uniformity really only has value in the face of extension, as a
fixed syntax doesn't need uniformity (it only makes the
implementation simpler).
Almost all extensible syntaxes have their extensible part uniform to
a great degree. For the purpose of this specification, uniformity
has hence been evaluated on two criteria: first, the number of
non-uniform elements (and, incidentally, their diversity); second, the
fact that the uniformity of the extensible part is not a limitation
to the users (i.e. that the temptation to extend the language in a
non-uniform way is as absent as possible).
A good counter-example is found in most programming languages.
Adding a new branching construct cannot be done in a terse way
without modifying the underlying implementation. Such a construct
either cannot be defined by user code (because of evaluation rules)
or can be defined only in a terribly verbose and inconvenient way
(with lots of boilerplate code). Notable exceptions to this limitation of
programming languages are Lisp, Haskell and stack programming
languages.
On the other hand, a stack programming language is the canonical
example of a non-uniform language. Each operator takes a number of
operands from the stack. Not knowing the arity of an operator makes
it impossible to continue parsing, even when its evaluation was
optional to the final processing. In the design space, stack
programming languages completely sacrifice uniformity to achieve one
of the highest combinations of extensibility, compactness and speed of
processing.
By *generality*, we mean the ability of a syntax to lend itself to
describe any kind of data with a reasonable (or better yet, high)
level of compactness and simplicity. For example, although both
arrays and linked lists could be considered very general as they are
both able to store any kind of data, they actually are at the
respective cost of complexity (arrays need the embedding of data
structure in the data or in the processing logic) and size (in-memory
linked lists can waste as much as half or two thirds of the space for
the overhead of the data structure).
By *decentralization*, we mean the ability to extend the syntax in a
way that avoids naming collisions without the use of a central
registry. Note that the DNS, as we use it, is NOT decentralized in
this sense, but distributed, as it cannot work without its root
servers and not even without prior knowledge of their location.
1.1.2. State of the art
Uniformity, generality and extensibility are usually highly-valued
traits in format design. Programming languages obviously feature
them foremost, although their generality usually stops at what they
are supposed to express: procedures. Most of them are ill-suited to
represent arbitrary data, but notable exceptions include Lisp (where
"code is data") and JavaScript, from which a subset has been
extracted to exchange data, JSON, which has seen a tremendous success
for this purpose. JSON may lack in generality and compactness, but
its design makes its parsing really straightforward and fast. All of
them, though, lack decentralization. Some of them make it possible
to extend them in a distributed way if some discipline is followed
(for example, by naming modules after domain names), but the
discipline is not mandatory (and even with domain names, a change of
ownership makes name collisions possible).
The SGML/XML family of formats also features uniformity, generality
and extensibility and actually fares much better than programming
languages on the three fronts. XML namespaces also make XML naming
distributed and there have been attempts at making it compact (e.g.
EXI from W3C, Fast Infoset from ISO/ITU or EBML).
All the previously cited formats clearly lack compactness, although
just applying standard compression techniques would sacrifice only
very little processing time to gain huge size reductions on most of
their intended use cases.
So-called binary formats pretty much exhibit the opposite trade-offs.
Most of them are not uniform to achieve better compactness. Some are
specifically designed for a great generality, but many lack
extensibility. When they are extensible, it's never in a
decentralized way, again for reasons that have to do with
compactness. They are usually extremely fast to parse.
Actually, many binary formats are not so much formats but format
frameworks, and exclude extensibility by design. For each use case,
an IDL compiler creates a brand new format that is essentially
incompatible with all other formats created by the same compiler
(EBML specifically cites this property among its own disadvantages).
If the IDL compiler and framework are correctly designed, such a
format usually represents an optimum in compactness and speed of
processing, as the compiler can also automatically generate an ad-hoc
optimized parser.
1.2. Format overview
A BULK stream is a stream of 8-bit bytes, in big-endian order.
Parsing a BULK stream yields a sequence of expressions, which can be
either atoms or forms, which are sequences of expressions. The
syntax of forms is entirely uniform, without a single exception: a
starting byte marker, a sequence of expressions and an ending byte
marker. Among atoms, only nil (the null byte), arrays and fixed-
sized binary words have a special syntax, for efficiency purposes.
Even booleans and floating-point numbers follow the uniform syntax
that every other expression follows.
Non-uniform atoms start with a marker byte, followed by a static or
dynamic number of bytes, depending on the type.
Any other atom is a reference, which consists of a namespace marker
(in almost all cases, a single byte) followed by an identifier within
this namespace (a single byte). All in all, very little sacrifice
is made in compactness for the benefit of a very simple syntax: apart
from nil, nothing is smaller than 2 bytes, and as most forms involve
a reference followed by some content, a form is usually 4 bytes + its
content.
A namespace marker in a BULK stream is associated with a namespace
identified by some identifier guaranteed to be unique without
coordination (like a UUID or cryptographic hash), thus ensuring
decentralized extensibility. The stream can be processed even if the
application doesn't recognize the namespace. Parsing remains
possible thanks to the uniform syntax.
Combination of BULK namespaces, BULK streams and even other formats
doesn't need any content transformation to work. Here are some
examples:
o The content of a BULK stream, enclosed in sequence starting and
ending byte markers, constitutes a valid BULK expression. Thus
BULK streams can be packed or annotated within a BULK stream
without modification. Annotation use cases include adding
metadata or a cryptographic signature.
o A BULK format could specify in its syntax the place for an
expression holding metadata. Whether the specification provides
its own metadata forms or not, an application could use a BULK
serialization for MARC, TEI Header, XML or RDF for this metadata
expression. The vocabulary selected would be univocally expressed
by the namespace and every vocabulary would be parsed by the same
mechanisms.
o Whenever content must be stored as-is instead of serialized, or a
highly-optimized ad hoc serialization exists for some data,
anything can always be stored within an array. Arrays can contain
arbitrary bytes and there is no limit to their size.
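As a minimal sketch of the first point above, packing a stream only
requires adding the two marker bytes defined in Section 2.2. The helper
name wrap_stream is hypothetical and not part of this specification.

```python
FORM_START = 0x01  # "(" marker byte
FORM_END = 0x02    # ")" marker byte


def wrap_stream(stream: bytes) -> bytes:
    """Enclose the bytes of a BULK stream in a form, leaving them unchanged."""
    return bytes([FORM_START]) + stream + bytes([FORM_END])


# Three nil atoms are a valid stream; once wrapped, they are a single form.
inner = bytes([0x00, 0x00, 0x00])
wrapped = wrap_stream(inner)
```

The inner bytes are copied verbatim, which is the property the text
relies on: no escaping or re-encoding of the packed stream is needed.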
Furthermore, BULK expressions can be evaluated. Most expressions
evaluate to themselves, but some evaluate by default to the result of
a function call, making it possible to serialize data in an even more
compact form, by eliminating boilerplate data and repeated patterns.
1.3. Conventions and Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Literal numerical values are provided in decimal or hexadecimal as
appropriate. Hexadecimal literals are prefixed with "0x" to
distinguish them from decimal literals.
The text notation of the BULK stream uses mnemonics for some byte
sequences. Mnemonics are series of characters, excluding all capital
letters and white space, like "this-is-one-mnemonic" or
"what-the-%S.!?#-is-that?". They are always separated by white space.
Outside the use of mnemonics, a sequence of bytes (of one or more
bytes) can be represented by its hexadecimal value as an unsigned
integer (e.g. "0x3F" or "0x3A0B770F"). Some types in this
specification define a special syntax for their representation in the
text notation.
In the grammar, a shape is a pattern of bytes, following the rules of
the text notation for a BULK stream. Apart from mnemonics and fixed
sequences of bytes, a shape can contain:
o an arbitrary sequence of a fixed number of bytes, represented by
its size, i.e. a number of bytes in decimal immediately followed
by a B uppercase letter (e.g. "4B")
o a typed sequence of bytes, represented by the name of its type, a
capitalized word (e.g. "Foo"); this means a sequence of bytes
whose specific yield (cf. Section 2.1) has this type
o a named sequence of bytes (of zero or more bytes), represented by
a series of any character excluding '{}' between '{' and '}' (e.g.
"{quux}"); a named sequence can be typed or sized, in which case
it is immediately followed by ':' and a type or size (e.g.
"{quux}:Bar" or "{quux}:12B")
When an entire shape describes the byte sequence of an atom, it is
the normative specification for parsing it, but shapes of forms are
only normative with respect to their default evaluation and the
corresponding semantics. A reference defined with a form shape can
be used in different shapes, albeit with different semantics and
value and even when used in its default shape, a processing
application MAY give it alternative semantics (although this is not
recommended).
For example, this specification defines a way to specify a string
encoding with forms of the shape "( stringenc {enc}:Encoding )". But
the shapes "( stringenc {arg1}:Int {arg2}:Int )" or "( {arg1}:Int
stringenc {arg2}:Int )" are syntactically valid. They just have
unspecified semantics, as far as this specification is concerned.
2. BULK syntax
A BULK stream is a sequence of 8-bit bytes. Bits and bytes are in
big-endian order. The result of parsing a BULK stream is a sequence
of abstract data, called the abstract yield. BULK parsing is
deterministic but not injective: a BULK stream has only one abstract
yield, but different BULK streams can have the same abstract yield.
A processing application is not expected to actually produce the
abstract yield, but an adaptation of the abstract yield to its own
implementation, called the concrete yield. Also, some expressions in
a BULK stream may have the semantics of a transformation of the
abstract yield. A processing application MAY thus not produce or
retain the concrete yield but the result of its transformation. This
specification deals mainly with the byte sequence and the abstract
yield and occasionally provides guidelines about the concrete yield.
Of course, a processing application MAY not produce the concrete
yield at all but produce various side effects from parsing the BULK
stream.
The abstract yield is a sequence of expressions. Expressions can be
atoms or forms. Forms are sequences of expressions. If a byte
sequence is parsed as one or several expressions, this byte sequence
is said to denote these expressions.
When a sequence of bytes is named in a shape, its name can be used in
this specification to designate either the byte sequence, or the
expression or sequence of expressions it denotes. When there could
be ambiguity, this specification specifies which is designated.
2.1. Parsing algorithm
The parser operates with a context, which is a sequence of
expressions. Each time an expression is parsed, it is appended at
the end of the context. The initial context is the abstract yield.
At the beginning of a BULK stream and after having consumed the byte
sequence denoting a complete expression, the parser is at the
dispatch stage. At this stage, the next byte is a marker byte, which
tells the parser what kind of expression comes next (the marker byte
is the first byte of the sequence that denotes an expression). The
expression appended to the context after reading a byte sequence is
called the specific yield of the byte sequence.
The "0x1" and "0x2" marker bytes are special cases. When the parser
reads "0x1", it immediately appends an empty sequence to the current
context. This sequence becomes the new context. This new context
has the previous context as parent. Then the parser returns to its
dispatch stage. When the parser reads "0x2", it appends nothing to
the context, but instead the parent of the current context becomes
the new context and the parser returns to the dispatch stage. Thus
it is a parsing error to read "0x2" when the context is the abstract
yield.
The scope of an expression is the part of its context that follows
the expression.
This specification designates the context where the expressions
contained in a form are appended as the inner scope of the form. Its
parent context is designated as the outer scope of the form.
Whenever a parsing error is encountered, parsing of the BULK stream
MUST stop.
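The context handling described in this section can be sketched as
follows. This is an illustration under stated assumptions, not a
reference implementation: it handles only nil, forms and single-byte
references, the names parse and ParseError are hypothetical, and
treating an unterminated form as an error is an assumption of the
sketch.

```python
class ParseError(Exception):
    """Raised on a parsing error; parsing stops immediately."""


def parse(stream: bytes) -> list:
    """Parse a subset of BULK: nil, forms and single-byte references."""
    abstract_yield = []          # the initial context
    context = abstract_yield
    parents = []                 # stack of parent contexts
    i = 0
    while i < len(stream):
        marker = stream[i]       # dispatch stage: read one marker byte
        i += 1
        if marker == 0x00:
            context.append(None)             # nil
        elif marker == 0x01:
            new_context = []
            context.append(new_context)      # append an empty sequence...
            parents.append(context)
            context = new_context            # ...which becomes the context
        elif marker == 0x02:
            if not parents:
                raise ParseError("0x2 read in the abstract yield")
            context = parents.pop()          # back to the parent context
        elif 0x20 <= marker < 0xFF:
            if i >= len(stream):
                raise ParseError("truncated reference")
            context.append(("ref", marker, stream[i]))
            i += 1
        else:
            raise ParseError(f"marker 0x{marker:X} not handled by this sketch")
    if parents:
        # Assumption of this sketch: an unterminated form is an error.
        raise ParseError("unterminated form")
    return abstract_yield
```

For instance, the bytes "0x1 0x20 0x1 0x0 0x2 0x0" parse to a form
containing a reference and nil, followed by a top-level nil.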
2.1.1. Evaluation
A processing application MAY implement evaluation of BULK expressions
and streams. During evaluation of a BULK stream, whenever the parser
gets to the dispatch stage and the context is the abstract yield, the
last expression in the context is replaced by what it evaluates to.
(This description provides the semantics of BULK evaluation; a
processing application MAY implement evaluation with a different
algorithm as long as it provides the same semantics.)
The default evaluation rule is that an expression evaluates to
itself. A name within a namespace can have a value, which is what a
reference associated with this name evaluates to. A reference whose
marker value is associated to no namespace or whose name has no value
evaluates to itself. How self-evaluating BULK expressions are
represented in the concrete yield is application-dependent, but
future specifications MAY define a standard API to access it, similar
to the Document Object Model for XML.
The evaluation of a sequence obeys a special rule, though: if the
first expression of the sequence has type "Function", that function
is called with an argument list and the sequence evaluates to the
return value.
If the function has type "LazyFunction", the argument list is the
rest of the sequence. If the function has type "EagerFunction", the
argument list is the rest of the sequence, where each expression is
replaced by what it evaluates to. Any expression that has type
"LazyFunction" or "EagerFunction" also has type "Function".
If the result of the evaluation of a "Function" is a sequence, it is
evaluated in turn.
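The evaluation rules above can be illustrated with a small sketch. It
assumes a concrete yield where sequences are Python lists and functions
appear directly as the first element; the classes and the example
functions add, count and twice are hypothetical and only serve the
illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Function:
    call: Callable[[list], Any]


class LazyFunction(Function):
    """Receives the rest of the sequence unevaluated."""


class EagerFunction(Function):
    """Receives the rest of the sequence, each expression evaluated."""


def evaluate(expr: Any) -> Any:
    # Default rule: an expression evaluates to itself.
    if not isinstance(expr, list) or not expr:
        return expr
    head, rest = expr[0], expr[1:]
    if not isinstance(head, Function):
        return expr
    if isinstance(head, EagerFunction):
        rest = [evaluate(e) for e in rest]
    result = head.call(rest)
    # A sequence returned by a function is evaluated in turn.
    return evaluate(result) if isinstance(result, list) else result


add = EagerFunction(lambda args: sum(args))
count = LazyFunction(lambda args: len(args))
twice = LazyFunction(lambda args: [add, args[0], args[0]])
```

Here "evaluate([add, 1, [add, 2, 3]])" evaluates the inner sequence
first, whereas "count" sees its arguments unevaluated. The sequence
returned by "twice" is itself evaluated, which is how a function can
expand into repeated patterns.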
2.2. Forms
2.2.1. starting marker byte
marker "0x1"
mnemonic "("
2.2.2. ending marker byte
marker "0x2"
mnemonic ")"
2.2.3. Difference between sequence and form
There is a difference between a byte sequence denoting a sequence of
expressions among the current context and a byte sequence denoting a
form (i.e. a single expression that contains a sequence of
expressions). As an example, let's examine several forms of the
shape "( foo {seq} )".
o In the form "( foo nil nil nil )", {seq} denotes 3 expressions,
and they are three atoms in the yield.
o In the form "( foo nil )", {seq} is a single expression in the
yield, and that expression is an atom.
o In the form "( foo ( nil nil nil ) )", {seq} is also a single
expression in the yield, and that expression is a form, a sequence
in the yield.
In a shape, when a byte sequence must yield a single expression, it
has the type "Expr". So the last two examples fit the shape "( foo
{seq}:Expr )" but not the first. When a byte sequence must yield a
form, it has type "Form". Thus the shape "( foo {bar}:Form )" is
equivalent to "( foo ( {baz} ) )". Either one MAY be used.
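To make the distinction concrete, the following sketch counts the
expressions denoted by the bytes of {seq}. It is restricted to nil
atoms and forms, and the function names count_expressions and fits_expr
are hypothetical.

```python
def count_expressions(seq: bytes) -> int:
    """Count top-level expressions in a byte sequence of nil atoms and forms."""
    depth = 0
    count = 0
    for byte in seq:
        if byte == 0x01:          # "(" opens a form
            if depth == 0:
                count += 1
            depth += 1
        elif byte == 0x02:        # ")" closes the current form
            if depth == 0:
                raise ValueError("unbalanced ending marker")
            depth -= 1
        elif byte == 0x00:        # nil
            if depth == 0:
                count += 1
        else:
            raise ValueError("only nil and forms are handled in this sketch")
    if depth != 0:
        raise ValueError("unterminated form")
    return count


def fits_expr(seq: bytes) -> bool:
    """True if {seq} denotes exactly one expression, i.e. fits {seq}:Expr."""
    return count_expressions(seq) == 1
```

With this sketch, "nil nil nil" counts as three expressions and does not
fit "{seq}:Expr", while "nil" and "( nil nil nil )" each count as one.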
2.3. Atoms
2.3.1. nil
marker "0x0" (mnemonic: "nil")
shape "nil"
Apart from being a possible short marker value, the fact that the
"0x0" byte represents a valid atom means that a sequence of null
bytes is a valid part of a BULK stream, thus making the format less
fragile. In a network communication, nil atoms can be sent to keep
the channel open. They can also be used as padding at the end of a
form or between forms.
2.3.2. Array
marker "0x3" (mnemonic: "#")
shape "# Int {content}"
Arrays have a special parsing rule. After consuming the marker byte,
the parser returns to the dispatch stage and parses one expression. It
is a parser error if the parsed expression is not of type Int or if
its value cannot be recognized. This integer is not added to any
context; instead, the parser consumes as many bytes as the value of
this integer, and these bytes constitute the content of the array.
If two arrays have the shapes "# {s1} {c1}" and "# {s2} {c2}" and if
"{s1+s2}" denotes the sum of the integers "{s1}" and "{s2}", then
their concatenation is "# {s1+s2} {c1} {c2}".
In the text notation, a quoted string represents an array containing
the encoding of that string in the current encoding.
Types: "Array", "Bytes"
In a shape, the type String is synonymous with Array, but means that
the content of the array is supposed to be taken as a string.
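A possible encoder for arrays, including the concatenation rule, is
sketched below. The function names are hypothetical, and encoding the
size with the smallest unsigned word is a choice made for this sketch:
the shape only requires an Int.

```python
def encode_uint(value: int) -> bytes:
    """Encode a non-negative integer as the smallest unsigned word."""
    if value < 0:
        raise ValueError("negative values need a negative word")
    for marker, size in ((0x04, 1), (0x05, 2), (0x06, 4), (0x07, 8), (0x08, 16)):
        if value < 1 << (8 * size):
            return bytes([marker]) + value.to_bytes(size, "big")
    raise ValueError("value does not fit in a 128-bit word")


def encode_array(content: bytes) -> bytes:
    """Encode an array: marker 0x3, an Int giving the size, then the content."""
    return bytes([0x03]) + encode_uint(len(content)) + content


def concat_arrays(c1: bytes, c2: bytes) -> bytes:
    """Array whose size is the sum of both sizes, followed by both contents."""
    return encode_array(c1 + c2)
```

For example, the three bytes "abc" are encoded as "0x3 0x4 0x3" followed
by the content, and a 256-byte content needs a 16 bits word for its size.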
2.3.3. Binary words
An unsigned word can be interpreted either as a bit sequence or as an
unsigned integer in binary notation. The choice depends on the
context and the application. Actually, many processing applications
may not need to make any choice, as most programming language
implementations also conflate unsigned integers and bit sequences
to some extent.
2.3.3.1. 8 bits word
marker "0x4" (mnemonic: "w8")
shape "w8 1B"
Types: "Int", "Word", "Word8", "Bytes"
2.3.3.2. 16 bits word
marker "0x5" (mnemonic: "w16")
shape "w16 2B"
Types: "Int", "Word", "Word16", "Bytes"
2.3.3.3. 32 bits word
marker "0x6" (mnemonic: "w32")
shape "w32 4B"
Types: "Int", "Word", "Word32", "Bytes"
2.3.3.4. 64 bits word
marker "0x7" (mnemonic: "w64")
shape "w64 8B"
Types: "Int", "Word", "Word64", "Bytes"
2.3.3.5. 128 bits word
marker "0x8" (mnemonic: "w128")
shape "w128 16B"
Types: "Int", "Word", "Word128", "Bytes"
2.3.3.6. Negative integers
Note that BULK doesn't include signed words using two's complement,
because BULK's design makes them inherently wasteful. If you were to
design an ad hoc binary format that is parsed according to a schema
known in advance, like TCP/IP, and you were to include a field that
can contain either a positive or negative integer, you would need to
use one bit to indicate that integer's sign, in which case you might
as well use two's complement, whose properties are well known and
which lets you copy values directly to and from memory, etc.
But in BULK, a word used for a positive integer (otherwise known as
an unsigned integer) is already preceded by a marker byte. If BULK
included signed integers, there would never be a sense in using them
for positive integers, so a one-byte signed integer would only be
used for integers between -1 and -128. With markers for negative
integers, the one-byte word can be used for integers between -1 and
-255.
Also, BULK is a format for storage and wire transport, not in-memory
data, where two's complement is useful because it supports bitwise
arithmetic, something that isn't relevant here.
The only foreseen use of two's complement signed integers is in large
arrays of data, like raster images, sound, video or any other
temporal series, e.g. physical measures. In that use case, the one-
byte overhead for each number is obviously unacceptable and they
would be stored in an array. A surrounding form or the format's
specification would tell how to interpret the contents of that array,
in terms of size and signedness.
The semantics of each of the following words is the opposite of the
contained unsigned integer. For example, "0xA 0x1 0xFF" denotes the
number -511.
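Decoding a word atom as an integer can be sketched as follows, using
the marker bytes defined in Section 2.3.3. The names WORDS and
decode_int are hypothetical; a marker outside the table raises
KeyError in this sketch.

```python
# Marker byte -> (size in bytes, sign), for unsigned and negative words.
WORDS = {
    0x04: (1, +1), 0x05: (2, +1), 0x06: (4, +1), 0x07: (8, +1), 0x08: (16, +1),
    0x09: (1, -1), 0x0A: (2, -1), 0x0B: (4, -1), 0x0C: (8, -1), 0x0D: (16, -1),
}


def decode_int(data: bytes) -> int:
    """Decode the word atom found at the start of data as an integer."""
    size, sign = WORDS[data[0]]
    payload = data[1:1 + size]
    if len(payload) != size:
        raise ValueError("truncated word")
    # Words are big-endian; negative words hold the opposite of the value.
    return sign * int.from_bytes(payload, "big")
```

Applied to the example above, the bytes "0xA 0x1 0xFF" select a 16 bits
negative word whose unsigned content is 511.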
2.3.3.6.1. 8 bits negative word
marker "0x9" (mnemonic: "neg8")
shape "neg8 1B"
Types: "Int", "Word", "Word8", "Bytes"
2.3.3.6.2. 16 bits negative word
marker "0xA" (mnemonic: "neg16")
shape "neg16 2B"
Types: "Int", "Word", "Word16", "Bytes"
2.3.3.6.3. 32 bits negative word
marker "0xB" (mnemonic: "neg32")
shape "neg32 4B"
Types: "Int", "Word", "Word32", "Bytes"
2.3.3.6.4. 64 bits negative word
marker "0xC" (mnemonic: "neg64")
shape "neg64 8B"
Types: "Int", "Word", "Word64", "Bytes"
2.3.3.6.5. 128 bits negative word
marker "0xD" (mnemonic: "neg128")
shape "neg128 16B"
Types: "Int", "Word", "Word128", "Bytes"
2.3.4. Reserved marker bytes
Marker bytes "0xE-0x1F" are reserved for future major versions of
BULK. It is a parser error if a BULK stream with major version 1
contains such a marker byte.
2.3.5. Reference
marker "0x20-0xFF"
shape "{ns}:1B {name}:1B"
The "{ns}" byte is a value associated with a namespace. Values
"0x20-0x27" are reserved for namespaces defined by BULK
specifications. Greater values can be associated with namespaces
identified by a unique identifier.
The "{name}" byte is the name within the namespace. Vocabularies
with more than 256 names thus need to be spread across several
namespaces.
The specification of a namespace SHOULD include a mnemonic for the
namespace and for each defined name. When descriptions use several
namespaces, the mnemonic of a reference SHOULD be the concatenation
of the namespace mnemonic, ":" and the name mnemonic if there can be
an ambiguity. For example, the "fp" name in namespace "math" becomes
"math:fp".
Type: "Ref"
2.3.5.1. Special case
References have a special parsing rule, for the case where a BULK
stream needs a large number of namespaces: if the marker byte is
"0xFF", the parser continues to read bytes until it finds a byte
different from "0xFF". The sum of all the bytes read, including that
last one, each taken as an unsigned integer, is the value associated
with a namespace. For example, the reference
denoted by the bytes "0xFF 0xFF 0x8C 0x1A" is the name 26 in the
namespace associated with 650.
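The special case can be sketched as a small function. The name
parse_reference and its return convention are hypothetical; truncated
input simply raises IndexError in this sketch.

```python
def parse_reference(data: bytes, pos: int = 0):
    """Parse a reference at data[pos]; return (namespace, name, next_pos)."""
    if data[pos] < 0x20:
        raise ValueError("not a reference marker")
    ns = 0
    while data[pos] == 0xFF:      # each 0xFF byte adds 255 to the value
        ns += 0xFF
        pos += 1
    ns += data[pos]               # first byte different from 0xFF
    name = data[pos + 1]
    return ns, name, pos + 2
```

On the bytes "0xFF 0xFF 0x8C 0x1A", the sum is 255 + 255 + 140 = 650 and
the name is 26, matching the example above.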
3. Standard namespaces
Standard namespaces have a fixed marker value and are not identified
by a unique identifier.
3.1. BULK core namespace
marker "0x20" (mnemonic: "bulk")
3.1.1. Version
name "0x0" (mnemonic: "version")
shape "( version {major}:Int {minor}:Int )"
When parsing a BULK stream, a processing application MUST determine
explicitly the major and minor version of the BULK specification
that the stream obeys. This information MAY be exchanged out-of-band,
if BULK is used to exchange a number of very small messages,
where repeated headers of 8 bytes might become too big an overhead. A
processing application MUST NOT assume a default version.
If the version is expressed within a BULK stream, this form MUST be
the first in the stream. In any other place, this form has no
semantics attached to it. This specification defines BULK 1.0. When
writing a BULK stream, an application MUST denote {major} and {minor}
by the smallest byte sequence possible using unsigned words from this
specification.
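As an illustration of the smallest-encoding rule, the sketch below
builds a "version" form. The helper names are hypothetical; the byte
values come from the marker and name definitions in this document.

```python
def smallest_unsigned_word(value: int) -> bytes:
    """Encode value with the smallest unsigned word (w8 up to w128)."""
    if value < 0:
        raise ValueError("version numbers are not negative")
    for marker, size in ((0x04, 1), (0x05, 2), (0x06, 4), (0x07, 8), (0x08, 16)):
        if value < 1 << (8 * size):
            return bytes([marker]) + value.to_bytes(size, "big")
    raise ValueError("value does not fit in a 128-bit word")


def version_form(major: int, minor: int) -> bytes:
    """Bytes of "( version {major} {minor} )" in the bulk namespace (0x20)."""
    return (bytes([0x01, 0x20, 0x00])          # ( bulk:version
            + smallest_unsigned_word(major)
            + smallest_unsigned_word(minor)
            + bytes([0x02]))                   # )


header = version_form(1, 0)
```

For BULK 1.0 this yields the 8 bytes "0x1 0x20 0x0 0x4 0x1 0x4 0x0 0x2",
which is the header size mentioned above.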
An application writing a BULK stream to long-term storage (e.g. in a
file or a database record) SHOULD include a "version" form.
Two BULK versions with the same major version MUST share the same
parsing rules and the same definitions of marker bytes. Changing the
syntax or semantics of existing marker bytes or using marker bytes
in the reserved interval warrants a new major version, as does
changing the syntax or semantics of existing names in standard
namespaces.
Adding standard namespaces or adding names in existing standard
namespaces warrants a new minor version.
3.1.2. true
name "0x1" (mnemonic: "true")
shape "true"
Type: "Boolean".
3.1.3. false
name "0x2" (mnemonic: "false")
shape "false"
Type: "Boolean".
3.1.4. Strings encoding
name "0x3" (mnemonic: "stringenc")
shape "( stringenc {enc}:Encoding )"
This tells the processing application that, in the scope of this
expression, all expressions that are understood by the application as
character strings will be encoded with the encoding designated by
{enc}.
As the abstract yield doesn't contain strings but expressions that
will be used as strings by the application, it is not a parsing error
if the application doesn't recognize {enc}. In this situation, it is
a parsing error when the application actually needs to decode a byte
sequence as a string. It is not a parsing error when a processing
application only transmits a byte sequence encoding a string, if it
can accurately convey the encoding to the receiving application.
3.1.5. IANA registered character set
name "0x4" (mnemonic: "iana-charset")
shape "( iana-charset {id}:Int )"
This designates the string encoding registered among the IANA
Character Sets [IANA-Charsets] whose MIBenum is {id}.
Type: "Encoding".
3.1.6. Windows code page
name "0x5" (mnemonic: "code-page")
shape "( code-page {id}:Int )"
This designates the string encoding among Windows code pages whose
identifier is {id}.
Type: "Encoding".
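The deferred error described for "stringenc" can be sketched as
follows. This is only an illustration, not part of the format: the
class and function names are hypothetical, the MIBenum table is a
small excerpt (3, 4 and 106 are the registered values for US-ASCII,
ISO-8859-1 and UTF-8), and mapping a code page identifier to a Python
codec named "cp<id>" is an assumption about the host language.

```python
import codecs

# Partial, illustrative MIBenum table; a real application would cover
# more of the IANA Character Sets registry.
MIBENUM_TO_CODEC = {3: "ascii", 4: "latin-1", 106: "utf-8"}


class UnknownEncoding(Exception):
    """Raised only when a string actually has to be decoded."""


def codec_name(kind, ident):
    """Map an Encoding form, given as (kind, ident), to a codec name."""
    if kind == "iana-charset":
        name = MIBENUM_TO_CODEC.get(ident)
    elif kind == "code-page":
        name = f"cp{ident}"  # assumption: Python names these codecs cpNNN
    else:
        name = None
    if name is None:
        raise UnknownEncoding((kind, ident))
    try:
        return codecs.lookup(name).name
    except LookupError:
        raise UnknownEncoding((kind, ident)) from None


class EncodedString:
    """A byte sequence plus the Encoding form in scope when it was read."""

    def __init__(self, raw, kind, ident):
        # Storing an unrecognized encoding is not an error: the bytes
        # can still be forwarded together with (kind, ident).
        self.raw, self.kind, self.ident = raw, kind, ident

    def decode(self):
        # The error, if any, happens here, when decoding is needed.
        return self.raw.decode(codec_name(self.kind, self.ident))
```

An application that only forwards such a value never calls decode()
and therefore never fails on an encoding it does not know.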
3.1.7. Namespaces
3.1.7.1. Note about unique identifiers
Several objects in this specification and future BULK specifications
are identified by something of type UniqueID. This specification
doesn't define any UniqueID form on purpose, because what constitutes
a unique enough identifier varies over time and domains and because
BULK's nature makes specifying them in advance actually unnecessary
(cf. Verifiable namespace bootstrap).
Anything, including a bare array containing some identifying byte
string, could be used as a UniqueID, but we recommend enclosing any
such data in a form specifying how to interpret it. For example, a
"crypto" namespace could include a "md6" name, to use forms of shape
"( crypto:md6 Word128 )" as UniqueID.
3.1.7.2. New namespace
name "0x6" (mnemonic: "ns")
shape "( ns {marker}:Int {id}:UniqueID )"
This associates the namespace identified by {id} to the value
{marker}, within the scope of this expression.
3.1.7.3. Package
name "0x7" (mnemonic: "package")
shape "( package {id}:UniqueID {namespaces} )"
This creates a package identified by {id}. Packages are immutable:
{id} MUST be verifiable against the byte sequence {namespaces}.
{namespaces} must be a sequence of expressions of type UniqueID, each
identifying a BULK namespace.
3.1.7.4. Import
name "0x8" (mnemonic: "import")
shape "( import {base}:Int {count}:Int {id}:UniqueID )"
This associates the first {count} namespaces in the package
identified by {id} with a continuous range of values starting at
{base} within the scope of this expression.
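As an illustration of how "ns" and "import" affect the association
between marker values and namespaces, here is a minimal sketch. The
function names and the dictionary representation are hypothetical;
each function returns a new table so that the enclosing scope is left
untouched, and rejecting a {count} larger than the package is an
assumption, since this section does not specify that case.

```python
def apply_ns(markers, marker, ns_id):
    """Return a copy of `markers` where `marker` designates `ns_id`."""
    updated = dict(markers)
    updated[marker] = ns_id
    return updated


def apply_import(markers, packages, base, count, package_id):
    """Associate the first `count` namespaces of a package with the
    continuous range of marker values starting at `base`."""
    namespaces = packages[package_id]
    if count > len(namespaces):
        # Assumption: asking for more namespaces than the package
        # contains is treated as an error.
        raise ValueError("package has fewer namespaces than requested")
    updated = dict(markers)
    for offset, ns_id in enumerate(namespaces[:count]):
        updated[base + offset] = ns_id
    return updated
```

For example, importing two namespaces at base 0x28 associates them
with markers 0x28 and 0x29, whatever the package contains beyond its
first two namespaces.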
3.1.8. Definitions
To define a reference is to change the value of its name in its
namespace (as identified by its unique identifier, not the marker
value) within a certain scope.
If a BULK stream is not evaluated, the semantics of a definition are
entirely application-dependent.
When a BULK stream containing definitions for a namespace comes from
a trusted source (i.e. in configuration files of the application, or
in the communication with an agent that has been granted the relevant
authority), an application MAY give those definitions long-lasting
semantics (i.e. keep the values of the names at the end of parsing).
This is the preferred mechanism for BULK namespace definition when
the semantics of the defined expressions can be expressed completely
by BULK forms.
3.1.8.1. Simple definition
name "0x9" (mnemonic: "define")
shape "( define {ref}:Ref {value}:Expr )"
This defines the reference {ref} to the yield of {value} in the outer
scope of this form.
3.1.8.2. Named definition
name "0xA" (mnemonic: "mnemonic/def")
shape "( mnemonic/def {ref}:Ref {mnemonic}:String {doc}:Expr {value}
)"
This suggests {mnemonic} as the mnemonic of the name designated by
{ref} in its namespace. If {value} is of type Expr, this defines the
reference {ref} to {value} in the outer scope of this form.
{doc} is any expression that provides documentation for this
reference. If it has type Array, it MUST be a string. It could be
any kind of metadata or document type.
3.1.8.3. Namespace description
name "0xB" (mnemonic: "ns-mnemonic")
shape "( ns-mnemonic {ns}:Expr {mnemonic}:String {doc} )"
This suggests {mnemonic} as the mnemonic of the namespace designated
by {ns} (which can be the integer to which this namespace is
associated, a reference in this namespace or the unique identifier of
this namespace).
3.1.8.4. Verifiable namespace definition
name "0xC" (mnemonic: "verifiable-ns")
shape "( verifiable-ns {marker}:Int {id}:UniqueID {mnemonic}:Expr
{definitions} )"
This associates the namespace identified by {id} to the value
{marker}, within the outer and inner scopes of this form. Verifiable
namespaces are immutable: {id} MUST be verifiable against the byte
sequence "{mnemonic} {definitions}". Defining a reference in the
inner scope of this form also defines that reference in the outer
scope of this form.
For this verification to be meaningful, {definitions} MUST NOT
contain any reference from a namespace before it is associated in
{definitions}.
If {mnemonic} is of type String, then this suggests it as the
mnemonic of the namespace. Else it MUST be "nil".
3.1.8.5. Array concatenation
name "0x10" (mnemonic: "concat")
shape "( concat {array1} {array2} )"
Name's type EagerFunction
Form's type Array
Form's value the concatenation of {array1} and {array2}.
3.1.8.6. Substitution
3.1.8.6.1. Substitution function
name "0x11" (mnemonic: "subst")
shape "( subst {code} )"
Name's type LazyFunction
Form's type EagerFunction
Form's value A substitution function whose return value is the value
of {code}. Within {code}'s specific yield, the names "arg" and
"rest" are defined:
3.1.8.6.2. Argument
name "0x12" (mnemonic: "arg")
shape "( arg {n}:Int )"
Name's type EagerFunction
Form's type Expr
Form's value the element number {n} (starting at zero) of the
substitution function's arguments list
3.1.8.6.3. Rest of arguments list
name "0x13" (mnemonic: "rest")
shape "( rest {n}:Int )"
Name's type EagerFunction
Form's type Expr
Form's value the substitution function's arguments list without its
first {n} elements.
3.1.8.6.3.1. Examples
Here is a definition of the inverse followed by the numbers 1/2, 1/3
and 1/4:
"( define inverse ( subst ( frac 1 ( arg 0 ) ) ) ) ( inverse 2 ) (
inverse 3 ) ( inverse 4 )"
Substitution will splice multiple expressions in place:
The evaluation of "( ( subst 1 ( rest 0 ) 2 ) 3 4 )" must yield the
same as "( 1 3 4 2 )".
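The two examples can be reproduced with a small model of
substitution. This is a sketch under stated assumptions, not a
reference evaluator: forms are modelled as Python tuples whose first
element is a mnemonic string, and the function name "substitute" is
hypothetical.

```python
def substitute(code, args):
    """Expand the expressions in `code` against the arguments list.

    ("arg", n) is replaced by argument number n (starting at zero);
    ("rest", n) is spliced in place as every argument after the
    first n; other forms are expanded recursively.
    """
    out = []
    for expr in code:
        if isinstance(expr, tuple) and expr and expr[0] == "arg":
            out.append(args[expr[1]])
        elif isinstance(expr, tuple) and expr and expr[0] == "rest":
            out.extend(args[expr[1]:])  # may add zero or many items
        elif isinstance(expr, tuple):
            out.append(tuple(substitute(expr, args)))
        else:
            out.append(expr)
    return out


# ( ( subst 1 ( rest 0 ) 2 ) 3 4 )
spliced = substitute([1, ("rest", 0), 2], [3, 4])

# ( inverse 2 ) with inverse defined as ( subst ( frac 1 ( arg 0 ) ) )
half = substitute([("frac", 1, ("arg", 0))], [2])
```

Here "spliced" holds the elements 1 3 4 2 and "half" holds the single
form ( frac 1 2 ).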
3.1.9. Arithmetic
In the text notation of a BULK stream, a decimal integer represents
the smallest byte sequence that denotes this integer with atoms and
forms from this specification. For example, "( 31 256 )" is a
notation for the bytes "0x1 0x4 0x1F 0x5 0x1 0x0 0x2".
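The following sketch shows the mapping for the integers used in this
example. It relies only on the marker bytes visible in the byte
sequence above (0x1 and 0x2 delimiting the form, 0x4 and 0x5
introducing 8-bit and 16-bit words); the function names are
hypothetical and larger words are deliberately left out.

```python
FORM_START, FORM_END = 0x01, 0x02
W8, W16 = 0x04, 0x05


def encode_uint(n):
    """Encode n as the smallest unsigned word, big-endian."""
    if 0 <= n < 0x100:
        return bytes([W8, n])
    if 0x100 <= n < 0x10000:
        return bytes([W16]) + n.to_bytes(2, "big")
    raise ValueError("only 8-bit and 16-bit words are handled here")


def encode_form(*items):
    """Encode a form whose items are integers or pre-encoded bytes."""
    body = b"".join(
        encode_uint(item) if isinstance(item, int) else bytes(item)
        for item in items
    )
    return bytes([FORM_START]) + body + bytes([FORM_END])


example = encode_form(31, 256)
```

With these definitions, example.hex() is "01041f05010002", the byte
sequence given above.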
3.1.9.1. Fraction
name "0x20" (mnemonic: "frac")
shape "( frac {num}:Int {div}:Int )"
This is the number {num}/{div}.
Type: "Number".
3.1.9.2. Arbitrary precision signed integer
name "0x21" (mnemonic: "bigint")
shape "( bigint {bits}:Bytes )"
The bits contained in {bits} are the value of this integer in
two's-complement notation.
Types: "Number", "Int".
3.1.9.3. Binary floating-point number
name "0x22" (mnemonic: "binary")
shape "( binary {bits}:Bytes )"
This is a floating-point number expressed in IEEE 754-2008 binary
interchange format. If {bits} is an Array, the size of its contents
must be a multiple of 32 bits, as per IEEE 754-2008 rules. {bits}
MUST NOT have type Word8.
Types: "Number", "Float".
3.1.9.4. Decimal floating-point number
name "0x23" (mnemonic: "decimal")
shape "( decimal {bits}:Bytes )"
This is a floating-point number expressed in IEEE 754-2008 decimal
interchange format. If {bits} is an Array, the size of its contents
must be a multiple of 32 bits, as per IEEE 754-2008 rules. {bits}
MUST NOT have type Word8.
Types: "Number", "Float".
3.1.10. Compact formats
This specification and other specifications in the official BULK
suite take the option to use as their basic building block a form
with a distinguishing reference as first element (basically, they are
a binary representation of an abstract syntax tree). As noted
previously, this means that most representations weigh 4 bytes plus
their actual content, which will in turn have some overhead because
of one or several marker bytes.
But when there is a special need for compactness, BULK makes it
possible to design protocols and formats with different trade-offs,
while retaining its property of being parseable by processing
applications not knowing the protocol in its entirety.
On one end of the spectrum, a format might choose to use an array to
encapsulate an ad hoc binary format. An extreme use of this scheme
would be to use BULK just to make explicit the binary format used.
With a known profile (for example with a file extension and/or media
type for such explicitly typed BLOBs), a BULK stream that consists
solely of the version form, a reference that describes the binary
format and an array will have a total overhead of 14, 16 or 20 bytes
if the data's size is representable in 16, 32 or 64 bits.
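The figures of 14, 16 and 20 bytes can be recovered with a short
calculation. The breakdown below is an assumption made for
illustration (an 8-byte version form, a 2-byte reference, one marker
byte for the array, then the size as a word with its own marker
byte); it is chosen because it matches the totals stated above.

```python
VERSION_FORM = 8   # "( bulk:version 1 0 )" with 8-bit words
REFERENCE = 2      # namespace marker and name
ARRAY_MARKER = 1   # assumption: one byte introduces the array
WORD_MARKER = 1    # marker byte of the word holding the size


def blob_overhead(size_bits):
    """Overhead in bytes when the array size is a word of size_bits."""
    if size_bits not in (16, 32, 64):
        raise ValueError("expected 16, 32 or 64")
    return (VERSION_FORM + REFERENCE + ARRAY_MARKER
            + WORD_MARKER + size_bits // 8)


overheads = [blob_overhead(bits) for bits in (16, 32, 64)]
```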
Still, even this extreme in the design space retains the ability to
insert expressions in the BULK stream, whatever their type. Thus
metadata can be added about data that is represented in a format that
doesn't allow for metadata or for limited metadata.
Between these two extremes of compactness and uniformity, several
options are available to produce a format that leverages the BULK
parser a lot more than using a single array while being more compact
than a classical BULK format. The following forms provide a standard
way to create such formats.
A flat sequence of operators and operands is called a BULK bytecode.
Prefix bytecodes are those where operators come before operands,
postfix bytecodes are those where operators come after operands. In
the following forms, operators MUST be references (as usual with
BULK, another namespace could define other bytecode forms with
different rules).
The default semantics of a bytecode form is the result of
transforming its abstract yield into a sequence of forms that have the
usual semantics of BULK forms whose first expression is of type
"Function". When evaluating a bytecode form that doesn't provide
arities, a processing application MUST abort this transformation as
soon as it encounters a reference for which it cannot determine if it
is an operator or an operand or an operator of unknown arity. When
evaluating a bytecode form that provides arities, any reference that
is not known to be an operator MUST be determined not to be an
operator.
To transform a prefix bytecode abstract yield, a processing
application creates an alternate context. If the first expression of
the bytecode can be determined not to be an operator, it is removed
from the beginning of the bytecode and appended as an atom at the end
of the alternate context. If the first expression of the bytecode
can be determined to be an operator, it is removed from the beginning
of the bytecode along with as many next expressions as its arity and
they all are appended as a form in the alternate context. The
transformation continues until the bytecode is empty, in which case
the alternate context becomes the inner context of the bytecode form
and the transformation is complete.
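The prefix transformation can be sketched as follows, for the case
where arities are provided. This is an illustration under
assumptions: references are modelled as Python strings, forms as
tuples, and the function name and the "sgf:white/2" and "note" names
are hypothetical.

```python
def prefix_to_forms(bytecode, arities):
    """Transform a prefix bytecode into a sequence of atoms and forms.

    `arities` maps operator references to their arity; any reference
    absent from it is taken not to be an operator.
    """
    pending = list(bytecode)
    context = []  # the alternate context
    while pending:
        head = pending.pop(0)
        arity = arities.get(head) if isinstance(head, str) else None
        if arity is None:
            context.append(head)  # not an operator: kept as an atom
            continue
        if len(pending) < arity:
            raise ValueError(f"not enough operands for {head}")
        operands, pending = pending[:arity], pending[arity:]
        context.append((head, *operands))
    return context


moves = prefix_to_forms(
    ["sgf:black/2", 4, 16, "note", "sgf:white/2", 3, 3],
    {"sgf:black/2": 2, "sgf:white/2": 2},
)
```

The result holds the form ( sgf:black/2 4 16 ), the atom "note" and
the form ( sgf:white/2 3 3 ), in that order. Operands are taken as
they come and are not themselves transformed, as in the text above.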
To transform a postfix bytecode form, a processing application
creates an alternate context. If the first expression of the
bytecode can be determined not to be an operator, it is removed from
the beginning of the bytecode and appended as an atom at the end of
the alternate context. If the first expression of the bytecode can
be determined to be an operator, it is removed from the beginning of
the bytecode and as many expressions as its arity are removed from
the end of the alternate context. They all are appended as a form in
the alternate context (with the operator as first element followed by
the operands, kept in their previous order). The transformation
continues until the bytecode is empty, in which case the alternate
context becomes the inner context of the bytecode form and the
transformation is complete.
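The postfix transformation differs in that operands are taken from
the end of the alternate context, so earlier results can become
operands of later operators. A minimal sketch, for the case where
arities are provided, with the same modelling assumptions as before
(references as strings, forms as tuples) and hypothetical names
"add" and "mul":

```python
def postfix_to_forms(bytecode, arities):
    """Transform a postfix bytecode into a sequence of atoms and forms.

    `arities` maps operator references to their arity; any reference
    absent from it is taken not to be an operator.
    """
    context = []  # the alternate context
    for expr in bytecode:
        arity = arities.get(expr) if isinstance(expr, str) else None
        if arity is None:
            context.append(expr)  # not an operator: kept as an atom
            continue
        if len(context) < arity:
            raise ValueError(f"not enough operands for {expr}")
        split = len(context) - arity
        operands = context[split:]  # kept in their previous order
        del context[split:]
        context.append((expr, *operands))
    return context


tree = postfix_to_forms([1, 2, "add", 3, "mul"], {"add": 2, "mul": 2})
```

Here "tree" holds the single form ( mul ( add 1 2 ) 3 ).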
If the overhead of several marker bytes in the operands of some
operators is too much, even more compactness can be achieved by
packing together small operands. For example, instead of an operator
with two integers as its operands, one could specify an operator to
take a single word as operand and extract the integers from it (while
still retaining the ability to operate on many sizes of integers,
because it can still deduce the size of the integers by dividing the
size of the word by two).
For example, a BULK format representing player moves with a pair of
coordinates might represent a single move with the following shapes:
classical (8 bytes) "( sgf:black/2 w8 0x04 w8 0x10 )"
packed classical (7 bytes) "( sgf:black/1 w16 0x04 0x10 )"
bytecode (6 bytes) "sgf:black/2 w8 0x04 w8 0x10"
packed bytecode (5 bytes) "sgf:black/1 w16 0x04 0x10"
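Extracting the two packed coordinates can be sketched as below. The
function name is hypothetical, and reading each half as a big-endian
unsigned integer is an assumption, consistent with 256 being written
0x1 0x0 in a 16-bit word elsewhere in this specification.

```python
def unpack_pair(word):
    """Split the content of a word into two unsigned integers.

    The size of each integer is deduced by dividing the size of the
    word by two, so the same operator works with words of any size.
    """
    if not word or len(word) % 2:
        raise ValueError("word size must be a positive even number")
    half = len(word) // 2
    first = int.from_bytes(word[:half], "big")
    second = int.from_bytes(word[half:], "big")
    return first, second


move = unpack_pair(bytes([0x04, 0x10]))              # from w16 0x04 0x10
wide = unpack_pair(bytes([0x01, 0x00, 0x02, 0x00]))  # from a 32-bit word
```

Here "move" is the pair (4, 16), and "wide" is (256, 512).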
The transformation defined for the bytecode forms makes it possible
to mix literal expressions and operations represented by a sequence
of operators and operands. In the previous scenario, for example,
one might represent alternating moves by two players as a sequence of
words, lowering the weight of each move to 3 bytes when coordinates
are below 256. The difference between all these schemes and an array
is that you keep the ability to insert other forms, for example to
represent comments on the game or variants.
The cost of the bytecode format is that if it contains operators
whose arity is unknown to a processing application, the whole
sequence after the first occurrence of them is unreadable to that
processing application, whereas in the classical format, the
processing application can still process all the forms it understands
(and it requires no anticipation by the application creating the BULK
stream).
3.1.10.1. Prefix bytecode
name "0x30" (mnemonic: "prefix-bytecode")
shape "( prefix-bytecode {bytecode} )"
This is a prefix bytecode form that doesn't provide arities.
3.1.10.2. Prefix bytecode with arities
name "0x31" (mnemonic: "prefix-bytecode*")
shape "( prefix-bytecode* ( {arities} ) {bytecode} )"
This is a prefix bytecode form that provides arities.
{arities} MUST be a sequence of shapes "( {arity}:Int {refs} )".
{refs} MUST be a sequence of references. It indicates that all
references in this sequence are operators of arity {arity}.
3.1.10.3. Postfix bytecode
name "0x32" (mnemonic: "postfix-bytecode")
shape "( postfix-bytecode {bytecode} )"
This is a postfix bytecode form that doesn't provide arities.
3.1.10.4. Postfix bytecode with arities
name "0x33" (mnemonic: "postfix-bytecode*")
shape "( postfix-bytecode* ( {arities} ) {bytecode} )"
This is a postfix bytecode form that provides arities.
{arities} MUST be a sequence of shapes "( {arity}:Int {refs} )".
{refs} MUST be a sequence of references. It indicates that all
references in this sequence are operators of arity {arity}.
3.1.10.5. Arity declaration
name "0x34" (mnemonic: "arity")
shape "( arity {arity}:Int {refs} )"
{refs} MUST be a sequence of references. It indicates that all
references in this sequence are operators of arity {arity}.
3.1.10.6. Property list
name "0x35" (mnemonic: "property-list")
shape "( property-list {bytecode} )"
{bytecode} MUST be a sequence of expressions in which the first and
every odd-numbered expression is a reference that will be taken as
having arity 1.
The semantics of "( property-list foo:bar ( frac 2 3 ) foo:baz true
foo:quux "abc" )" SHOULD be the same as those of "( foo:bar ( frac 2 3
) ) ( foo:baz true ) ( foo:quux "abc" )".
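This equivalence amounts to pairing each reference with the
expression that follows it. A minimal sketch, with a hypothetical
function name, references modelled as strings, forms as tuples and
the string "abc" as a byte string:

```python
def property_list_to_forms(exprs):
    """Pair each reference with the next expression (arity 1)."""
    if len(exprs) % 2:
        raise ValueError("a property list needs an even number of items")
    return [(exprs[i], exprs[i + 1]) for i in range(0, len(exprs), 2)]


forms = property_list_to_forms(
    ["foo:bar", ("frac", 2, 3), "foo:baz", True, "foo:quux", b"abc"]
)
```

The result holds the three forms ( foo:bar ( frac 2 3 ) ),
( foo:baz true ) and ( foo:quux "abc" ).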
4. Extension namespaces
Extension namespaces are defined with a unique identifier, to be
associated to a marker value.
Because of BULK's decentralized nature, as far as a processing
application is concerned, apart from standard namespaces, there is no
difference between a namespace defined as part of the official BULK
suite and a user-defined one.
5. Profiles
A profile is a byte sequence parsed by a processing application just
after the "version" form or before the first expression if there is
no "version" form. Thus a parser SHOULD look ahead at the beginning
of a stream to see if the first three bytes are "( bulk:version".
With respect to the BULK stream, the profile is out-of-band
information, usually implicit.
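The look-ahead can be sketched as follows. The three bytes tested
are the ones given in Section 7 as identifying any BULK stream; the
function name is hypothetical, and this sketch only reads a version
form whose {major} and {minor} are both 8-bit words.

```python
VERSION_PREFIX = bytes([0x01, 0x20, 0x00])  # "( bulk:version"
W8, FORM_END = 0x04, 0x02


def split_version(stream):
    """Return (version, offset) where the profile is to be parsed
    at `offset`; version is None without a "version" form."""
    if not stream.startswith(VERSION_PREFIX):
        return None, 0
    rest = stream[3:8]
    if (len(rest) == 5 and rest[0] == W8 and rest[2] == W8
            and rest[4] == FORM_END):
        return (rest[1], rest[3]), 8
    raise ValueError("version form not made of two 8-bit words")


with_version = split_version(
    bytes([0x01, 0x20, 0x00, 0x04, 0x01, 0x04, 0x00, 0x02, 0x04, 0x07])
)
without_version = split_version(bytes([0x04, 0x07]))
```

In the first case the profile applies from offset 8, after a version
form read as (1, 0); in the second it applies from offset 0.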
A processing application doesn't need to include the profile in the
concrete yield, as long as the semantics of the abstract yield are
maintained.
The same BULK stream might be processed with different profiles.
A processing application MUST NOT deduce the profile from the content
of a BULK stream.
5.1. Profile redundancy
A processing application SHOULD only rely on the use of a profile
when it is a safe assumption that the profile is known, for example
within a communication where the protocol dictates the profile.
In particular, long-term storage of a BULK stream SHOULD preserve
profile information, for example with a media type that dictates the
profile.
Otherwise, an application writing a BULK stream to long-term
storage SHOULD include the profile after the version form. For this
reason, the expressions in a profile SHOULD have idempotent
semantics.
5.2. Standard profile
This specification defines the default profile that a processing
application MUST use when it is not using a specific profile:
"( bulk:stringenc ( bulk:iana-charset 106 ) )"
This means that the default string encoding in a BULK stream is
UTF-8.
6. Security Considerations
6.1. Parsing
Parsing a BULK stream is designed to be free of side-effects for the
processing application, apart from storing the parsed results.
Arrays in BULK carry their size, so that the application knows in
advance the size of the data to read and store, thus making it easier
to build robust code. Malicious software, however, may announce an
array with a size chosen to get an application to exhaust its
available memory. When a BULK stream has been completely received,
an array bigger than the remaining data SHOULD trigger an error.
When a BULK stream's size is not known in advance, the application
SHOULD use a growable data structure.
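Both recommendations can be sketched as follows. The names are
hypothetical and the chunk size is arbitrary; the point is that the
announced size is never used to allocate memory before the data has
actually been seen.

```python
class ArraySizeError(ValueError):
    """The announced array size exceeds the available data."""


def read_array(data, offset, size):
    """Read an array from a completely received stream.

    `offset` is the index of the first byte of the array's content.
    """
    remaining = len(data) - offset
    if size > remaining:
        raise ArraySizeError(f"array of {size} bytes, {remaining} left")
    return data[offset:offset + size]


def read_array_streaming(read, size, chunk=65536):
    """Read an array from a stream of unknown total size.

    `read(n)` returns at most n bytes, and b"" at end of stream. The
    buffer grows with the data received, not with the announced size.
    """
    buf = bytearray()
    while len(buf) < size:
        piece = read(min(chunk, size - len(buf)))
        if not piece:
            raise ArraySizeError("stream ended before the array did")
        buf += piece
    return bytes(buf)
```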
6.2. Forwarding
When a processing application forwards all or part of the data in a
BULK stream to another application, care must be taken if part of the
forwarded data was not entirely recognized, as it could be used by an
attacker to benefit from the authority the forwarding application has
on the recipient of the data.
6.3. Definitions
The architecture of a processing application SHOULD ensure that a
malicious agent cannot abuse authority given to it to define a
namespace in order to modify associations in other namespaces.
Depending on the use of data structures storing BULK expressions,
this could amount to giving an attacker a way to manipulate the
application's state. See Appendix A for an example of architecture
that is resistant to that kind of attack.
7. IANA Considerations
This specification defines a new media type, application/bulk. Here
is the information for its registration with IANA:
Type name application
Subtype name bulk
Required parameters none
Optional parameters none
Encoding considerations none, content is self-describing
Security considerations cf. Section 6
Interoperability considerations the constraint to start any BULK
stream with a version form has the side-effect that classes of
BULK streams can be identified by a sequence of bytes acting as
"magic number":
0x012000 any BULK stream
0x01200004 a BULK stream of any major version beneath 256
0x0120000401 a BULK stream of major version 1
0x0120000401040002 a BULK stream of version 1.0
Published specification this document
Applications that use this media type none so far
Fragment identifier considerations this specification defines no
semantics for addressing the data with a fragment identifier; a
future specification MAY define fragment identifier syntaxes to
address the content by byte offset or the parsed results by their
position in the yielded sequence
Additional information a future specification MAY define a naming
convention for media types based on bulk with a +bulk suffix, as
for XML with +xml
8. Acknowledgements
The original author of this specification read Erik Naggum's famous
rant about XML [1] several years before, and while forgotten as such,
it clearly was the seed that slowly bloomed into the design of BULK.
This format is dedicated to Erik.
9. References
9.1. Normative References
[IANA-Charsets]
"IANA Charset Registry (archived at):",
.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
9.2. Informative references
[Avro] Cutting, D., "Apache Avro[TM] 1.7.4 Specification",
February 2013,
.
[HTTP2] Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
Transfer Protocol version 2 (HTTP/2)", RFC 7540, May 2015.
[protobuf]
"Protocol Buffers", July 2008,
.
[Smile] Saloranta, T., "Smile Data Format", September 2010,
.
[Thrift] Slee, M., Agarwal, A., and M. Kwiatkowski, "Thrift:
Scalable Cross-Language Services Implementation", April
2007, .
9.3. URIs
[1] http://www.schnada.de/grapt/eriknaggum-xmlrant.html
Appendix A. Robust namespace definition
This constitutes a suggestion of architecture for a BULK processing
application. It has the advantage that an agent cannot modify the
values of names to which it has not specifically been given
authority. This architecture doesn't ensure this property by
checking the validity of definitions but by adhering to the Principle
Of Least Authority, thus ensuring no false positives or TOCTOU race
conditions.
For each new context (including the abstract yield when parsing
starts), the parser creates a new copy of each known namespace.
These copies are available in this context to retrieve and define
values. It implements the lexical scoping of definitions on top of
providing the robustness properties discussed here.
By default, all namespaces created in a context are discarded at the
end of this context.
Of course, an implementation of the architecture presented here can
be optimized compared to the abstract algorithm, for example by using
copy-on-demand.
Any namespace that is not a copy for its context but the object
retained by the application afterwards, gives authority to make long-
lasting definitions. Such a namespace is called lasting here.
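The abstract algorithm can be sketched as follows, without the
copy-on-demand optimization. The class and method names are
hypothetical, and namespaces are modelled as dictionaries from names
to values, keyed by unique identifier.

```python
class Context:
    """A context holding its own copy of each known namespace."""

    def __init__(self, namespaces, lasting=frozenset()):
        # A lasting namespace is the application's own object, so
        # definitions made through it are retained. Every other
        # namespace is a copy, discarded with the context.
        self.namespaces = {
            uid: names if uid in lasting else dict(names)
            for uid, names in namespaces.items()
        }

    def define(self, uid, name, value):
        self.namespaces[uid][name] = value

    def lookup(self, uid, name):
        return self.namespaces[uid][name]

    def child(self):
        # A nested context only ever receives copies.
        return Context(self.namespaces)


app = {"ns-a": {}, "ns-b": {}}
top = Context(app, lasting={"ns-a"})
top.define("ns-a", "x", 1)   # retained by the application
top.define("ns-b", "y", 2)   # discarded with the context
inner = top.child()
inner.define("ns-a", "x", 99)  # visible in the inner context only
```

After this, the application's state is ns-a with x defined to 1 and
an unchanged ns-b: no check of the definitions was needed, because
the agent never held a reference to anything else.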
A.1. Selective authority
A number of lasting namespaces are included for the abstract yield.
Their unique identifiers are agreed out-of-band. The disadvantage of
this solution is that it needs prior agreement on the definable
namespaces.
A.2. Open authority
Any "ns" form for a unique identifier unknown to the processing
application triggers the creation of a lasting namespace.
The disadvantage of this solution is that it opens a denial of
service vulnerability. If Bob is a processing application and Carol
and Dave are agents communicating with Bob with an open authority,
Dave can prevent Carol from defining a namespace if it manages to
learn the unique identifier and to start a communication with Bob
before Carol does.
If an agent uses a secure way to create unique identifiers, this
solution is both flexible and safe (the burden is not on the BULK
processing application).
Appendix B. Verifiable namespace bootstrap
If a processing application that implements one or several hashing
algorithms encounters a BULK stream with namespaces identified by
UniqueID forms defined in an unknown namespace, it would be possible
for the application to recover that namespace's definition and still
verify it, as shown in the following process.
The processing application reads a BULK stream starting with "(
bulk:version 1 0 ) ( ns w8 0x28 ( 0x28 0xC w32 0xFD 0x2A 0x34 0x02 )
( ns w8 0x29 ( 0x28 0xC w32 0x24 0xA3 0x58 0xF3 )". This means that
the namespace identified by FD2A3402 is associated with marker 40,
and a form from that namespace is used to identify itself. A second
namespace, associated with marker 41, is identified by 24A358F3 with
the same form taken from the previous namespace.
By whatever mechanism is available to acquire BULK namespaces'
definitions (which could be reading local configuration files or
making a search on the Internet), the processing application gets the
following definition for the namespace identified by FD2A3402: "(
bulk:version 1 0 ) ( bulk:verifiable-ns w8 0xF0 ( 0xF0 0xC w32 0xFD
0x2A 0x34 0x02 ) "crypto" ( bulk:mnemonic/def 0xF0 0xC "md9" ) )".
It can now try every hashing algorithm known to it and check which
one hashes the byte sequence ""crypto" ( bulk:mnemonic/def 0xF0 0xC
"md9" )" into FD2A3402. If it finds one, from now on, the processing
application has verified this namespace and can verify any other use
of that crypto:md9 reference.
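The search over known hashing algorithms can be sketched as below.
Since "md9" is only an example name, the sketch uses algorithms from
Python's standard library instead; the function name is hypothetical,
and comparing the identifier with the whole digest is an assumption
(a namespace could specify another rule, such as truncation).

```python
import hashlib

# Algorithms this hypothetical application implements.
CANDIDATES = {
    "sha1": hashlib.sha1,
    "sha256": hashlib.sha256,
    "sha512": hashlib.sha512,
}


def find_algorithm(payload, expected, candidates=CANDIDATES):
    """Return the name of the algorithm that hashes `payload` into
    `expected`, or None if no known algorithm does."""
    for name, factory in candidates.items():
        if factory(payload).digest() == expected:
            return name
    return None


# Stand-in for the bytes of '"crypto" ( bulk:mnemonic/def ... )'.
payload = b"example namespace definition bytes"
identifier = hashlib.sha256(payload).digest()
found = find_algorithm(payload, identifier)
```

Here "found" is "sha256". When the function returns None, the
namespace stays unverified and the application can fall back to
treating its forms as unknown.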
Author's Address
Pierre Thierry
Thierry Technologies
EMail: pierre@nothos.net