Posts tagged with 'math'

As originally shared on Google+, and as a follow up of the previous post covering OpenGL on Go QML, a new screencast was published to demonstrate the latest features introduced around OpenGL support in Go QML:

Assume uf is an unsigned integer with 64 bits that holds the IEEE-754 representation for a binary floating point number of that size.

The questions are:

1. How to tell if uf represents an integer number?

2. How to serialize the absolute value of such an integer number in the minimum number of bytes possible, using big-endian ordering and the 8th bit as a continuation flag? For example, float64(1<<70 + 3<<21) serializes as:

"\x81\x80\x80\x80\x80\x80\x80\x83\x80\x80\x00"

The background for this problem is that the current draft of the strepr specification mentions that serialization. Some languages, such as Python and Ruby, implement transparent arbitrary precision integers, and that makes implementing the specification easier.

For example, here is a simple Python interactive session that arrives at the result provided above exploring the native integer representation.

Python makes the procedure simpler because it is internally converting the float into an integer of appropriate precision via standard C functions, and then offering bit operations on the resulting value.

The suggested brain teaser can be efficiently solved using just the IEEE-754 representation, though, and it’s relatively easy because the problem is being constrained to the integer space.

Contents

Introduction

This specification defines strepr, a stable representation that enables computing hashes and cryptographic signatures out of a defined set of composite values that is commonly found across a number of languages and applications.

Although the defined representation is a serialization format, it isn’t meant to be used as a traditional one. It may not be seen entirely in memory at once, or written to disk, or sent across the network. Its role is specifically in aiding the generation of hashes and signatures for values that are serialized via other means (JSON, BSON, YAML, HTTP headers or query parameters, configuration files, etc).

The format is designed with the following principles in mind:

Understandable — The representation must be easy to understand to increase the chances of it being implemented correctly.

Portable — The defined logic works properly when the data is being transferred across different platforms and implementations, independently from the choice of protocol and serialization implementation.

Unambiguous — As a natural requirement for producing stable hashes, there is a single way to process any supported value being held in the native form of the host language.

Meaning-oriented — The stable representation holds the meaning of the data being transferred, not its type. For example, the number 7 must be represented in the same way whether it’s being held in a float64 or in an uint16.

Supported values

The following values are supported:

nil: the nil/null/none singleton

bool: the true and false singletons

string: raw sequence of bytes

integers: positive, zero, and negative integer numbers

floats: IEEE754 binary floating point numbers

list: sequence of values

map: associative value→value pairs

Representation

nil = 'z'

The nil/null/none singleton is represented by the single byte 'z' (0x7a).

bool = 't' / 'f'

The true and false singletons are represented by the bytes 't' (0x74) and 'f' (0x66), respectively.

unsigned integer = 'p' <value>

Positive and zero integers are represented by the byte 'p' (0x70) followed by the variable-length encoding of the number.

For example, the number 131 is always represented as {0x70, 0x81, 0x03}, independently from the type that holds it in the host language.

negative integer = 'n' <absolute value>

Negative integers are represented by the byte 'n' (0x6e) followed by the variable-length encoding of the absolute value of the number.

For example, the number -131 is always represented as {0x6e, 0x81, 0x03}, independently from the type that holds it in the host language.

string = 's' <num bytes> <bytes>

Strings are represented by the byte 's' (0x73) followed by the variable-length encoding of the number of bytes in the string, followed by the specified number of raw bytes. If the string holds a list of Unicode code points, the raw bytes must contain their UTF-8 encoding.

For example, the string hi is represented as {0x73, 0x02, 'h', 'i'}

Due to the complexity involved in Unicode normalization, it is not required for the implementation of this specification. Consequently, Unicode strings that if normalized would be equal may have different stable representations.

binary float = 'd' <binary64>

32-bit or 64-bit IEEE754 binary floating point numbers that are not holding integers are represented by the byte 'd' (0x64) followed by the big-endian 64-bit IEEE754 binary floating point encoding of the number.

There are two exceptions to that rule:

1. If the floating point value is holding a NaN, it must necessarily be encoded by the following sequence of bytes: {0x64, 0x7f, 0xf8, 0x00 0x00, 0x00, 0x00, 0x00, 0x00}. This ensures all NaN values have a single representation.

2. If the floating point value is holding an integer number it must instead be encoded as an unsigned or negative integer, as appropriate. Floating point values that hold integer numbers are defined as those where floor(v) == v && abs(v) != ∞.

For example, the value 1.1 is represented as {0x64, 0x3f, 0xf1, 0x99, 0x99, 0x99, 0x99, 0x99, 0x9a}, but the value 1.0 is represented as {0x70, 0x01}, and -0.0 is represented as {0x70, 0x00}.

This distinction means all supported numbers have a single representation, independently from the data type used by the host language and serialization format.

list = 'l' <num items> [<item> ...]

Lists of values are represented by the byte 'l' (0x6c), followed by the variable-length encoding of the number of pairs in the list, followed by the stable representation of each item in the list in the original order.

Associative maps of values are represented by the byte 'm' (0x6d) followed by the variable-length encoding of the number of pairs in the map, followed by an ordered sequence of the stable representation of each key and value in the map. The pairs must be sorted so that the stable representation of the keys is in ascending lexicographical order. A map must not have multiple keys with the same representation.

Variable-length encoding

Integers are variable-length encoded so that they can be represented in short space and with unbounded size. In an encoded number, the last byte holds the 7 least significant bits of the unsigned value, and zero as the eight bit. If there are remaining non-zero bits, the previous byte holds the next 7 bits, and the eight bit is set on to flag the continuation to the next byte. The process continues until there are non-zero bits remaining. The most significant bits end up in the first byte of the encoded value, which must necessarily not be 0x80.

For example, the number 128 is variable-length encoded as {0x81, 0x00}.

Reference implementation

A reference implementation is available, including a test suite which should be considered when implementing the specification.

Changes

draft1 → draft2

Enforce the use of UTF-8 for Unicode strings and explain why normalization is being left out.

Our son Otávio was born recently. Right in the first few days, we decided to keep tight control on the feeding times for a while, as it is an intense routine pretty unlike anything else, and obviously critical for the health of the baby. I imagined that it wouldn’t be hard to find an Android app that would do that in a reasonable way, and indeed there are quite a few. We went with Baby Care, as it has a polished interface and more features than we’ll ever use. The app also includes some basic statistics, but not enough for our needs. Luckily, though, it is able to export the data as a CSV file, and post-processing that file with the R language is easy, and allows extracting some fun facts about what the routine of a healthy baby can look like in the first month, as shown below.

The first thing to do is to import the raw data from the CSV file. It is a one-liner in R:

> info = read.csv("baby-care.csv", header=TRUE)

Then, this file actually comes with other events that won’t be processed now, so we’ll slice it and grab only the rows and columns of interest:

A total of 63 hours surprised me, but the mean time of around 10 minutes per feeding is within the recommendation, and the standard deviation looks reasonable. It may be more conveniently pictured as a histogram:

The mean looks good, with about two hours every day. The standard deviation looks high on a first look, but it’s actually not that bad if we take off the first few days. Looking at the graph shows why: the slope on the left-hand side, which is expected as there’s less milk and the baby has more trouble right after birth.

The chart shows a red flag, though: one day seems well below the mean. This is something to be careful about, as babies can get into a loop where they sleep too much and miss being hungry, the lack of feeding causes hypoglycemia, which causes more sleep, and it doesn’t end up well. A rule of thumb is to wake the baby up every two hours in the first few days, and at most every four hours once he stabilizes for the following weeks.

So, this was another point of interest: what are the intervals between feedings?

Seems great, with most feedings well under two hours. There's a worrying outlier, though, of more than 6 hours. Unsurprisingly, it happened over night:

> feeding$End.Time[interval > 300]
[1] 2013/01/07 00:52

It wasn't a significant issue, but we don't want that happening often while his body isn't yet ready to hold enough energy for a full night of sleep. That's the kind of reason we've been monitoring him, and is important because our bodies are eager to get full nights of sleep, which opens the door for unintended slack. As a reward for that kind of control, we've got the chance to enjoy not only his health, but also an admirable mood.

The underlying concept is very simple: spreadsheets are a way to organize text, numbers and formulas into what might be seen as a natively numeric environment: a matrix. So what would happen if we loosed some of the bolts of the numeric-oriented organization, and tried to reuse the same concepts into a more formatting-oriented environment which is naturally collaborative: a wiki.

While I do encourage you to answer this with some fantastic new online service (please provide me with an account and the best e-book reader device available once you’re rich) I had a try at answering this question myself a while ago by writing the Calc macro for Moin.

Basically, the Calc macro allows extracting values found in a wiki page into lists (think columns or rows), and applying formulas and further formatting as wanted.

I believe there’s a lot of potential on the basic concept, and the prototype, even though functional and useful, surely has a lot to evolve, so I’ve published the project in Launchpad to make contributions easier. I actually apologize for not publishing it earlier. There was hope that more features would be implemented before releasing, but now it’s clear that it won’t get many improvements from me anytime soon. If you do decide to improve it, please try to prepare patches which are mostly ready for integration, including full testing, since I can’t dedicate much time for it myself in the foreseeable future.