Summary

Allow bytes to be divided using pattern matching. This is lifted almost straight out of Erlang, with some modifications to fit Rust’s pattern matching and type system.

Motivation

Parsing binaries without macros or manual masking-and-shifting.

Guide-level explanation

A binary pattern allows you to pattern-match the contents of a byte or an array of bytes, allowing you to break up binary protocols into their constituent parts without lots of manual masking or transmutes. Binary patterns are ordered with the most significant bits in front, and they can cut across bytes to parse protocols that are not aligned.

Binary patterns are surrounded with backslashes. Within the backslashes, the parts may be surrounded by parentheses (in which case the pattern matches a single byte) or square brackets (in which case it matches a [u8; N] array).
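As a sketch of what the single-byte form would replace, here is the manual bit arithmetic one has to write today (the backslash syntax in the comment is the hypothetical one proposed above; the field names are invented for illustration):

```rust
// Hypothetical: `let \(version; 4, ihl; 4\) = byte;` on a single byte,
// or `let \[kind; 8, length; 8\] = header;` on a [u8; 2] array.
// Today, the single-byte case has to be written by hand:
fn main() {
    let byte: u8 = 0x45;
    let version = byte >> 4; // most significant 4 bits
    let ihl = byte & 0x0F;   // least significant 4 bits
    assert_eq!((version, ihl), (4, 5));
}
```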

Byte array patterns cannot match slices, only fixed-size u8 arrays. For the initial implementation, other integral types are not supported either.

In a binary pattern, the most significant part of a single number comes first, regardless of the endianness of the underlying CPU.

The sum of the bit widths of all of the parts must add up to the number of bits being matched. If the user wishes to discard bits, they may match them with _.

When using the remainder part on a byte array pattern, the sum of the number of bits matched by the other parts must be a multiple of 8. The remainder is produced as a slice.
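Today’s slice patterns already produce a remainder as a slice once whole bytes are involved; a small sketch of that existing behavior, which the proposal extends to sub-byte parts:

```rust
// Today's slice patterns already produce a remainder as a slice;
// the proposal extends the same idea to sub-byte parts.
fn main() {
    let packet: &[u8] = &[0x01, 0x02, 0xAA, 0xBB];
    if let [kind, length, rest @ ..] = packet {
        // `kind` and `length` each consumed 8 bits; `rest` is the remainder slice
        assert_eq!((*kind, *length), (0x01, 0x02));
        assert_eq!(rest, &[0xAA, 0xBB]);
    }
}
```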

When a number does not have enough bits to fill the number it is being matched into, it will be sign-extended if it’s signed or zero-extended if it’s unsigned.
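The extension rule can be checked with manual bit arithmetic: a 4-bit part holding the bits 0b1010 is 10 when zero-extended but -6 when sign-extended.

```rust
// Sketch of the extension rule using manual bit arithmetic:
// the 4-bit part 0b1010 is 10 as an unsigned field, -6 as a signed one.
fn main() {
    let bits: u8 = 0b1010; // a 4-bit part, as raw bits
    let unsigned: u32 = bits as u32; // zero-extended
    // sign-extend from bit 3: shift the field to the top, then arithmetic-shift back
    let signed: i32 = ((bits << 4) as i8 >> 4) as i32;
    assert_eq!(unsigned, 10);
    assert_eq!(signed, -6);
}
```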

(Pattern parts default to u32, so zero-extending semantics are the default.)

When matching on a byte array, the user may grab pieces of different bytes. The result should behave as if the most-significant-bit of the byte comes first (the same way bytes are written in hexadecimal or binary notation in Rust’s syntax).
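The most-significant-bit-first rule makes the cross-byte case unambiguous. A sketch of the manual equivalent today (the backslash pattern in the comment is the hypothetical syntax):

```rust
// Matching `\[a; 4, b; 8, c; 4\]` against [0xAB, 0xCD] (hypothetical syntax)
// would give a = 0xA, b = 0xBC (cutting across the byte boundary), c = 0xD.
// The manual equivalent today:
fn main() {
    let buf: [u8; 2] = [0xAB, 0xCD];
    let a = buf[0] >> 4;
    let b = (buf[0] << 4) | (buf[1] >> 4); // low nibble of byte 0, high nibble of byte 1
    let c = buf[1] & 0x0F;
    assert_eq!((a, b, c), (0xA, 0xBC, 0xD));
}
```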

A binary pattern is considered exhaustive if all of its parts are variables, and will not be considered exhaustive if it contains any constants. Using a non-integer constant as a pattern part is an error.

When using a string literal, it must be one that would normally evaluate to [u8], not one that would normally evaluate to str. The leading b is not required, even though the inner syntax is the same as a byte string (i.e., it allows you to match against “strings” that aren’t valid UTF-8).
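For comparison, today’s byte-string patterns already match raw bytes, including ones that aren’t valid UTF-8 (the PNG magic number here is just a familiar example):

```rust
// Today's byte-string patterns for comparison: `b"..."` matches raw bytes,
// including ones that aren't valid UTF-8 (0x89 here).
fn main() {
    let header: &[u8; 4] = b"\x89PNG";
    let is_png = matches!(header, b"\x89PNG");
    assert!(is_png);
}
```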

Drawbacks

It might not be zero-cost enough. Maybe it should just operate on the underlying CPU’s endianness instead of converting everything to big-endian.

I kind of pulled the syntax out of my butt. Erlang uses <<1, 2, 3>>, but that seemed like it would be ambiguous with bit shift operators and generics.

Rationale and alternatives

The big advantage here is that we don’t end up in Type System Hell. Outside of the patterns themselves, we only ever deal with ordinary machine integer types, whereas bit fields on structs end up either having to spec a u7 type, or have a “u8” that only has 7 bits of storage.

It also works on top of pattern matching, a feature that Rust already has, instead of putting a SAT solver into the type checker. Everybody who’s commented on Erlang’s binary pattern matching says they love it, so that’s a plus.

The treatment of slices is based on the already-proposed slice pattern matching. In retrospect, forcing it to work only on fixed-size arrays is too restrictive.

Prior art

We don’t support non-multiple-of-8 “bitstrings”, only “binaries”. When extracting a non-multiple-of-8 number of bits, this RFC will always extend the number.

Binary patterns are often used either to split out a particular set of bits in an infallible manner, or in a loop to parse something. An infallible pattern in this proposal operates either on a single byte or on a fixed-size buffer, with all parts variable. A pattern matched on a slice will always be fallible, and thus not allowed in a bare let statement the way Erlang allows (with dynamic panicking semantics, which aren’t safe enough for Rust).

Unresolved questions

How about binary expressions? This is probably out-of-scope, but whatever. It’s awesome!

let my_number = \(-1; 4, 2; 4\);
assert_eq!(my_number, 0xF2);
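The manual equivalent of that hypothetical expression, packing -1 into the high 4 bits and 2 into the low 4 bits:

```rust
// Manual equivalent of the hypothetical expression \(-1; 4, 2; 4\):
// pack -1 (0b1111 in 4 bits) into the high nibble and 2 into the low nibble.
fn main() {
    let my_number: u8 = (((-1i8 as u8) & 0x0F) << 4) | (2 & 0x0F);
    assert_eq!(my_number, 0xF2);
}
```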

How about numbers other than u8? (ew… endianness)

Can the match checker handle exhaustiveness properly? How would that work alongside integer ranges?

Future possibilities

I kind of pulled the syntax out of my butt. Erlang uses <<1, 2, 3>>, but that seemed like it would be ambiguous with bit shift operators and generics.

I have no feelings about this proposal, but I will point out that the current syntax gives me a lot of unfortunate LaTeX flashbacks. Also, this <<>> syntax is almost certainly kosher, since I don’t believe < is a valid starting token for a pattern… not that it aesthetically matches the rest of Rust.

If you are not willing to write (or don’t trust yourself to correctly write) a function for extracting a subset of the bits, I suggest you open a pre-RFC for a set of standard library functions instead that would work on integer types and allow the extraction of bits. A quick implementation off the top of my head:
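The implementation itself was elided above. A plausible reconstruction, based only on the later description of the function (a bits_at method computing (*self & mask) >> lo, masking first and shifting second); the trait name and exact signature here are assumptions:

```rust
// Reconstruction of the elided `bits_at` sketch; the trait name and
// signature are assumptions, the body follows the description below.
trait BitsAt {
    /// Extracts bits lo..hi (half-open, hi < 32) as a right-aligned value.
    fn bits_at(&self, lo: u32, hi: u32) -> Self;
}

impl BitsAt for u32 {
    fn bits_at(&self, lo: u32, hi: u32) -> u32 {
        // mask covers bits lo..hi of the original value
        let mask = ((1u32 << hi) - 1) & !((1u32 << lo) - 1);
        (*self & mask) >> lo
    }
}

fn main() {
    // bits 3..10 of 0b10_1101_0110 are 0b101_1010 = 90
    assert_eq!(0b10_1101_0110u32.bits_at(3, 10), 90);
}
```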

You can just define a macro that splits an integer’s bits and then pattern match on the resulting tuple.
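A hypothetical sketch of that macro approach, here splitting a byte into its two nibbles (the macro name and tuple layout are invented for the example):

```rust
// Hypothetical sketch: split a byte into (high nibble, low nibble),
// then pattern-match on the resulting tuple.
macro_rules! split_nibbles {
    ($b:expr) => {
        (($b >> 4) & 0x0F, $b & 0x0F)
    };
}

fn main() {
    let byte: u8 = 0xF2;
    let desc = match split_nibbles!(byte) {
        (0xF, lo) => format!("high nibble 0xF, low {:#x}", lo),
        _ => String::from("something else"),
    };
    assert_eq!(desc, "high nibble 0xF, low 0x2");
}
```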

If language support is to be added, then it should be about adding bit-aligned fields in structs/enum/unions and possibly anonymous structs/enums/unions, with the pattern matching being a consequence of that.

After converting foo >> 3 & 0b01111111, and accounting for the effect of shifting and ANDing, which are performed in that order, I think that’s equivalent to let \(result; 5, _; 3\) = foo.

But that’s really annoying. You’re expressing the thing you’re doing in terms of raw CPU instructions (shift right by 3, then mask off the upper bits), instead of expressing your intent. Also, does that AND operation actually do anything?

I don’t see why you are pointing me to the operator precedence reference; I know that >> has higher precedence than &; the code was completely intentional. The idiomatic way to extract bits n..k from an integer is to shift the original integer right by n places and then AND the result with the mask 2^(k−n) − 1. (I do realize that in the implementation of the bits_at function above, I performed the masking first and the shifting second, but it doesn’t really matter – that implementation is also correct, because I adjusted the value of the mask and parenthesized the bitwise AND in (*self & mask) >> lo accordingly.)

(This was in fact one of the strongest reasons why Rust had exchanged the relative precedence of shifts and bitwise logical ops – in C and derived languages, they traditionally had the opposite relative precedence, which was annoying and error-prone when one was performing bit manipulation.)

So foo >> 3 & 0x7f extracts 7 bits from foo, starting at bit #3. It was just an example, I don’t see what is wrong with it. My point was that this is such a trivial operation that it doesn’t deserve growing the core language surface with a comparatively large feature (patterns interact with a lot of other stuff).
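Checking that claim with a concrete value; since >> binds tighter than & in Rust, the expression parses as (foo >> 3) & 0x7f:

```rust
// `foo >> 3 & 0x7f` extracts 7 bits starting at bit #3:
// in Rust, `>>` binds tighter than `&`, so it parses as `(foo >> 3) & 0x7f`.
fn main() {
    let foo: u32 = 0b1010_1101_0110;
    assert_eq!(foo >> 3 & 0x7f, 0b101_1010);
    assert_eq!(foo >> 3 & 0x7f, (foo >> 3) & 0x7f);
}
```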

notriddle:

Also, does that AND operation actually do anything?

What do you mean? Of course it is doing something, it discards the bits which we are not interested in.

notriddle:

But that’s really annoying.

I beg to differ. If you have done anything related to bits before, it shouldn’t be.

notriddle:

You’re expressing the thing you’re doing in terms of raw CPU instructions (shift right by 3, then mask off the upper bits), instead of expressing your intent.

If we go that far (and why shouldn’t we – high level concepts are good), then bit subsets aren’t much better either. Bitfields are themselves a quite low-level concept. When we are operating with semantic types, we are usually not interested in the exact bit layout except on the lowest layers where the bits themselves are the only essential thing (e.g. serialization or networking).

So, when working with values that come from (or will go into) a bitfield, we shouldn’t be focusing on exactly which bits they are represented by. We should be translating the bits to flags, values, newtype wrappers, etc. – whatever the domain model requires, in general, instead of making it easier to write leaky abstractions by encouraging bit wrangling in domain model types.

For example, if I was using a networking library, I would be annoyed by an API that gave me, say, a TcpPacketHeader(u64), expecting me to extract raw bits from its pattern using bit range notation, instead of providing me with higher-level functions bearing clear semantics, such as seq_num(&self) -> u32 or is_ack(&self) -> bool.
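A hypothetical sketch of that accessor style; the bit layout below is invented for the example, not real TCP:

```rust
// Hypothetical wrapper illustrating the accessor style described above
// (the bit layout here is invented for the example, not real TCP).
struct TcpPacketHeader(u64);

impl TcpPacketHeader {
    fn seq_num(&self) -> u32 {
        (self.0 >> 32) as u32 // assumed: sequence number in the high 32 bits
    }
    fn is_ack(&self) -> bool {
        self.0 & (1 << 4) != 0 // assumed: ACK flag at bit 4
    }
}

fn main() {
    let header = TcpPacketHeader(0x0000_002A_0000_0010);
    assert_eq!(header.seq_num(), 42);
    assert!(header.is_ack());
}
```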

Please let’s not confuse the discussion of binary patterns – the subject of this thread – with the subject of bitfields, which is a concurrent discussion thread.

The proposed syntax makes it possible to pattern-match against non-contiguous fields, ignoring or saving intermediary bits, which seems to be a minimum requirement.

I personally would like to see the proposal unified with the more general topic of bitfields, so that the pattern-matching could be against defined bitfields without necessitating dropping to the representation level of the current pre-RFC. It should also be unified, to the extent possible, with the apparently-dominant bitfields crate from the Rust Embedded Development effort.