The OSM binary interchange file format is now part of osmosis trunk and
is fully functional. This is the last chance to make potentially
incompatible changes.
In a week or two, I will add the design doc and command line options
doc to the wiki. After that, schema changes will be much more
difficult. As such, are there any final suggestions for new header
metadata, or plans for future schema changes (such as delta/change
support)?
At Frederik Ramm's suggestion, I have added header fields for tracking
the database version, to make it easier to determine which deltas need
to be applied to bring a file up to date. I do not know whether what I
have meets the requirements. Here are the current header fields:

message HeaderBlock {
  required HeaderBBox bbox = 1;

  // Author, name, and version number of the dataset in this file (to permit
  // patches/updates to be incrementally applied).
  optional string datasetauthor = 16; // TODO: WANT THIS?
  optional string datasetname = 17; // TODO: WANT THIS?
  optional int64 version = 18; // TODO: WANT THIS?

  // Program generating this data
  optional string writingprogram = 19; // TODO: WANT THIS?
}

Are there any suggestions for fields to add to or remove from this?
This schema is more expressive than the current XML, which only
includes a 'source' field in the <bounds> tag. Should I alter my
schema to have just one field, 'source', or define a mapping between
this schema and the current XML? If I define a mapping, are there any
suggestions as to what it should be?
My format allows non-geographic metadata, such as username, version
numbers, and last-modified datestamps to be omitted when generating a
binary file. This results in significant space savings. However, when
parsing, Osmosis expects these fields to be supplied. When the
metadata is omitted, I use '-1' and 'now' as the defaults. Are there
other suggested defaults?
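
To make the defaulting question concrete, here is a minimal Python sketch (an illustration only, not the osmosis Java code). The names EntityMeta and fill_defaults are hypothetical, and it assumes that '-1' stands in for a missing version number and 'now' for a missing timestamp; other omitted fields are left out of the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class EntityMeta:
    """Hypothetical container for part of an entity's non-geographic metadata."""
    version: Optional[int] = None
    timestamp: Optional[datetime] = None


def fill_defaults(meta: EntityMeta, now: Optional[datetime] = None) -> EntityMeta:
    """Return a copy in which omitted fields are replaced by -1 and 'now'."""
    if now is None:
        now = datetime.now(timezone.utc)
    return EntityMeta(
        # Assumption: -1 marks "version unknown".
        version=meta.version if meta.version is not None else -1,
        # Assumption: the read time is used when no timestamp was stored.
        timestamp=meta.timestamp if meta.timestamp is not None else now,
    )
```

One consequence worth noting: with 'now' as the default, two reads of the same file yield different timestamps, which is one reason alternative defaults might be preferable.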
Finally, does anyone have any suggestions or comments on the schema or
the rest of the design?
-- Updates since the initial version --
The format has had a few changes for performance and size. A new
DenseNodes representation has been defined that makes files about 10%
smaller and makes the uncompressed format more useful. The current
results are summarized below.
On a current (2010/08) planet, the binary format is 8 GB, compared to
11 GB for a bzipped planet and 15 GB for a gzipped planet. On a Core i7,
my format takes 9 CPU minutes to read, and 30-40 CPU minutes to
write. On the same machine, a gzipped planet requires 61 CPU minutes
to read and 141 CPU minutes to write.
Compression can be disabled to reduce CPU usage. The resulting file is
about twice as fast to write and twice the size (i.e., about 8 times
faster to write than a gzipped planet and 15% larger). On my machine,
using more than one CPU core and buffering, I can convert a planet
from the compressed binary format to the uncompressed binary format
in about 18 minutes.
Finally, the entire planet can fit in 5.2 GB by reducing the precision
of coordinates to ~1 m and omitting non-geographic metadata.
The only major to-do items left are checkstyle cleanups and unit
tests: both coarse end-to-end tests and tests that cover all of the
corner cases.
Scott
////////////// Draft command line option docs for Wiki ///////////////
// --write-bin
Arguments:
file=<filename> Currently '-' representing stdout is not supported.
compress=deflate (default) Use deflate compression on each block
compress=none Disable compression. About twice as fast to write and
twice the size.
batchlimit=8000 Block size used when compressing. This is a
reasonable default.
granularity=100 The granularity, or precision, used to store
coordinates. The default of 100 nanodegrees is the highest precision
used by OSM, corresponding to about 1.1 cm at the equator. In the
current osmosis implementation, the granularity must be a multiple of
100. If map data is going to be exported to software that does not
need the full precision, increasing the granularity to 10000
nanodegrees can save about 10% of the file size while retaining about
1.1 m precision.
omitmetadata=false (default)
omitmetadata=true Omit non-geographic metadata on OSM entities. This
includes version number and timestamp of the last edit to the entity
as well as the user name and id of the last modifier. Omitting this
metadata can save about 15% of the file size when exporting to
software that does not need this data.
usedense=true (default) Nodes can be represented in a regular format
or a dense format. The dense format is about 15% smaller, but more
complex. To make it easier to interoperate with (future) software that
chooses to not implement the dense format, the dense format may be
disabled.
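
The precision figures quoted for the granularity option can be checked with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, the equatorial circumference is an approximate constant, and the quantize/dequantize pair assumes coordinates are stored as integer multiples of the granularity, which is a simplification rather than a description of the actual encoder.

```python
# Approximate equatorial circumference of the Earth, in metres.
EQUATOR_CIRCUMFERENCE_M = 40_075_017
METRES_PER_DEGREE = EQUATOR_CIRCUMFERENCE_M / 360.0


def granularity_to_metres(granularity_nanodeg: int) -> float:
    """Size of one granularity step at the equator, in metres."""
    return granularity_nanodeg * 1e-9 * METRES_PER_DEGREE


def quantize(coord_deg: float, granularity_nanodeg: int) -> int:
    """Assumed storage form: the coordinate as a whole number of steps."""
    return round(coord_deg * 1e9 / granularity_nanodeg)


def dequantize(units: int, granularity_nanodeg: int) -> float:
    """Recover the (rounded) coordinate in degrees from its stored form."""
    return units * granularity_nanodeg * 1e-9


if __name__ == "__main__":
    for g in (100, 10000):
        print(f"granularity={g}: {granularity_to_metres(g):.4f} m per step")
```

Larger granularity means smaller integers per coordinate, which is where the space saving comes from; the cost is that every coordinate is rounded to the nearest step.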
// --read-bin
Arguments:
file=<filename> Currently '-' representing stdin is not supported.
// Usage tips:
The default options are the safe options and work efficiently and
quickly.
Buffering can improve performance. The binary format processes data in
batches: entities are queued until a limit is reached, then that batch
is serialized and compressed. This serialization can run concurrently
with other osmosis processing. With more than one core, writing
throughput can be increased by about 60% by placing a buffer in the
processing pipeline just before writing. Similarly, a buffer placed
in the pipeline immediately after parsing can improve concurrency
when reading.
For example:
osmosis --read-bin file=XXX --b bufferCapacity=12000 ....
OR
osmosis .... --b bufferCapacity=12000 --write-bin file=XXX ...
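
The effect of the buffer can be pictured with a small producer/consumer model. The Python sketch below is only a structural illustration, not the osmosis implementation: all function names are hypothetical, entities are plain strings, and newline-joined text stands in for the real serialization. It shows an upstream stage filling a bounded buffer while the writer groups entities into batches and deflates each batch.

```python
import queue
import threading
import zlib

BATCH_LIMIT = 8000       # same value as the batchlimit default
BUFFER_CAPACITY = 12000  # same value as bufferCapacity in the commands above
_DONE = object()         # sentinel marking the end of the entity stream


def produce(entities, buf):
    """Upstream stage: push entities into the bounded buffer."""
    for entity in entities:
        buf.put(entity)  # blocks while the buffer is full
    buf.put(_DONE)


def compress_batch(batch):
    """Stand-in serialization (newline-joined text) followed by deflate."""
    raw = "\n".join(batch).encode("utf-8")
    return zlib.compress(raw)


def write_blocks(buf, batch_limit=BATCH_LIMIT):
    """Writer stage: queue entities until the limit, then emit one block."""
    blocks, batch = [], []
    while True:
        item = buf.get()
        if item is _DONE:
            break
        batch.append(item)
        if len(batch) >= batch_limit:
            blocks.append(compress_batch(batch))
            batch = []
    if batch:  # flush the final, partially filled batch
        blocks.append(compress_batch(batch))
    return blocks


def run_pipeline(entities):
    """Run the producer in its own thread so both stages overlap."""
    buf = queue.Queue(maxsize=BUFFER_CAPACITY)
    producer = threading.Thread(target=produce, args=(entities, buf))
    producer.start()
    blocks = write_blocks(buf)
    producer.join()
    return blocks
```

Without the buffer, the upstream stage would have to wait while each batch is serialized and compressed; with it, the upstream stage can keep filling the queue during that time. The sketch says nothing about the size of the speed-up, which depends on the machine and the real encoder.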
When generating data for export to other applications, I suggest
considering omitmetadata=true and granularity=10000. With both
options, a full planet (in 2010), including all nodes, ways, and tags,
fits in 5.5 GB.