On Thursday 07 September 2006 01:11, Nick Hill wrote:
> Drop the node ID, which is implied by the offset in the node file. (This
> will require generating a lookup table when processing the osm file for
> when making the way file).
The idea is to not have any processing stages which require memory
proportional to the size of the OSM database. As planet.osm has grown from
200MB to 3GB in a few months, I suggest thinking quite carefully before
"storing all nodes in a lookup table".
Node IDs will get dropped anyway when the node file is filtered for
"compacted" format (i.e. after it's been used to update the segments file,
and before it gets deployed to a rendering device), at which point each node
will be just "lat,long,flags"
> Drop the int32 id (implied by file offset) double lat1 double lat2
> double lon1 double lon2 in the segment file. Again, a lookup table will
> be required to match OSM id to offset when generating the way file.
That's an interesting idea, and it comes down to how fast you can find the
node.
From a "locality of reference" point of view, it's quite attractive for the
renderer to be able to read the segments file a few bytes at a time and have
everything it needs to render the map, without having to retrieve node
12900155 and node 32102221 from storage. From my limited understanding of
computers, I think that working sequentially through the segments file, and
keeping a small set of working data that can fit into CPU registers, might
make things faster.
The other consideration is that storing coordinates in the segment file means
you can filter it by area (again, using 12 bytes of storage in your filtering
program instead of 3GB) without having to recalculate the offsets for each
node. Again, some programs might find a speed advantage in being able to
quickly and easily filter the dataset (e.g. a device which occasionally
copies segments within 20 miles of your position from flash memory into RAM
and uses that as a working copy)
My idea for the segments file was to drop the segment ID and node IDs when
preparing the file for use by a renderer, leaving just 2 sets of coordinates,
and the attribute flags
On Thursday 07 September 2006 00:20, Jon Burgess wrote:
> I reckon a good option for the lat/lon would be a 32bit fixed point
> number. I think 2^32 data points over +-180 degrees gives a resolution
> of about 1cm across the globe (40,000 km circumference / 4 billion).
Very good idea, and something I want to implement.
On Thursday 07 September 2006 09:01, Nicola Ranaldo wrote:
> About flags, i think 32 bits are not sufficient. We should think about a
> system to extend tags. For example if the 32th is set to 1 then software
> may lookup a special file to find the other tags.
Again, very good idea, and one I hadn't thought of yet. We might need to
think carefully about how this would work (what can you use as an identifier,
that passes unchanged through the filtering and compacting stages? e.g. if
you retain ID fields in the output files for that purpose, then you have to
decide if it's better to spend that extra space on more binary attributes
instead.
A variable-length record (flag 32 means read an additional 4 bytes, similar to
the UTF-8 scheme) is an idea, but worth considering how it affects other
programs (e.g. is it acceptable to require more complexity in the filtering
programs, to deal with variable-length records)
On Thursday 07 September 2006 09:01, Nicola Ranaldo wrote:
> generating tools should sort tags
> by usage, assign the 31 most used to bits and generate a configuration file
> to get mapping.
That might be a useful feature for some applications, but since the choice of
tags isn't specified in the file format, it's up to the people programming
generators whether they want to spend memory and CPU on that task.
Personally, I was assuming that people would want to specify their 32 tags (or
64 or whatever) based on what their renderer was programmed to display. e.g.
a device for aircraft pilots might have just two tags for roads ("big" and
"small"), while a device for cars might allocate most of its 32 tags to
different types of road. A device for walkers might choose to allocate 10
different types of footpath, and lots of field/woodland boundaries but no
road information.
Similarly, the node file for use in the car device will have fuel stations and
parking lots. The walker's device will have pubs and trig points. The pilot's
device will have crop circles, aerials, and gliding schools. The choice of
tags to put in the output file is such a personal choice, so
application-dependant, that I didn't anticipate much need to auto-generate it
Of course, you can use jorg's statistics page to help you when deciding which
popular tags to select...
Other bits:
I'd be interested to hear from computer scientists about fancy algorithms for
spacially-sorting the data, doing indexes, etc. First person to mention
Hilbert space gets a nerd of the month award.
Is it worth upgrading the segments file to something that can be used in
rendering - give each segment a set of lat/long points and some extended
attributes to cover them all.
Do we need a file-header? e.g. to say what type of file (nodes file
unfiltered, nodes file stripped of its ID tags, nodes file with 12 bytes of
tag information per node, or with variable record-length...), and flags like
how much processing has been done on each file (so that the processing tools
can resume a partially-processed file), that sort of thing?
Is anyone prepared to write the "apply node coordinates to their respective
segments" program - I'm a bit distracted with the Mapyrus renderer at the
moment, but it would be very useful. Basically, it loads x MB of node data
into memory (x being set by the user depending on how good their computer
is), then goes through the segments file replacing the blank coordinates with
real ones for any that are in it's "block of nodes". (the segments file is
operated on in-situ, and doesn't change size)
The naive C-program I tried reveals that it needs to be much much quicker,
with better data-structures for holding the working-copy of nodes, and lots
of experimentation of how many nodes and segments to keep in memory between
subsequent disk accesses to optimise the speed.
Regards
OJW
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://lists.openstreetmap.org/pipermail/dev/attachments/20060907/ead161d0/attachment.pgp>