Thursday, March 31, 2016

Around the world in one hour! (revisit)

In this blog post, we revisit an earlier post about extracting data from the OpenStreetMap Planet.osm file. We still use the same extraction script in Pigeon, but we make it modular and easier to reuse by moving the common code into Pig macro definitions in a separate file. We first describe the osmx.pig file, which contains the reusable macros, and then show how to use it in your own Pig script.

osmx.pig

The osmx.pig file contains all the common code used to extract points, ways, or relations from an OSM file. It defines the following macros.

LoadOSMNodes

This macro extracts all the nodes from an OSM file. It returns a dataset that contains tuples of the following format.

| Field | Type |
|---|---|
| osm_node_id | long |
| longitude | double |
| latitude | double |
| tags | map[(chararray)] |

LoadOSMWaysWithSegments

This macro returns all ways in the file. Each way is returned as a series of line segments which connect two consecutive nodes on the way. It returns a dataset with tuples of the following format.

| Field | Type | Description |
|---|---|---|
| segment_id | long | A generated unique ID for each segment |
| id1 | long | The ID of the starting node |
| latitude1 | double | Latitude of the starting node |
| longitude1 | double | Longitude of the starting node |
| id2 | long | The ID of the ending node |
| latitude2 | double | Latitude of the ending node |
| longitude2 | double | Longitude of the ending node |
| way_id | long | The ID of the way that contains this segment |
| tags | map[(chararray)] | All the tags of the way |
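To make the segment format concrete, here is a minimal sketch in Python (not the Pig macro itself) of how one way could be split into segments between consecutive nodes. The function name `way_to_segments`, the in-memory `nodes` dictionary, and the counter-based `segment_id` are assumptions made for illustration; the post does not say how osmx.pig generates segment IDs.

```python
from itertools import count


def way_to_segments(way_id, node_ids, nodes, tags, id_gen):
    """Split one way into segments that connect consecutive nodes.

    nodes maps a node ID to a (latitude, longitude) pair.
    id_gen is any iterator that yields unique segment IDs (hypothetical scheme).
    """
    segments = []
    # Pair each node with its successor: (n0, n1), (n1, n2), ...
    for id1, id2 in zip(node_ids, node_ids[1:]):
        lat1, lon1 = nodes[id1]
        lat2, lon2 = nodes[id2]
        segments.append({
            "segment_id": next(id_gen),
            "id1": id1, "latitude1": lat1, "longitude1": lon1,
            "id2": id2, "latitude2": lat2, "longitude2": lon2,
            "way_id": way_id,
            "tags": tags,  # every segment carries the tags of its way
        })
    return segments


# Made-up sample data: one way with three nodes yields two segments.
nodes = {1: (44.97, -93.26), 2: (44.98, -93.25), 3: (44.99, -93.24)}
segments = way_to_segments(100, [1, 2, 3], nodes,
                           {"highway": "residential"}, count(1))
```

A way with n nodes produces n - 1 segments, and each of them repeats the way's ID and tags, which is why later steps can filter on `tags` without joining back to the way.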

LoadOSMWaysWithGeoms

This macro also returns all ways in the file. However, unlike LoadOSMWaysWithSegments, it returns one tuple for each way, which contains the entire geometry of the way. Each tuple is formatted as follows.

| Field | Type | Description |
|---|---|---|
| way_id | long | The ID of the way as it appears in the OSM file |
| first_node_id | long | The ID of the first node in this way |
| last_node_id | long | The ID of the last node in this way |
| geom | bytearray | The geometry of the way |
| tags | map[(chararray)] | The tags of the way as they appear in the OSM file |

LoadOSMObjects

This macro returns all objects in the OSM file. Each object is one of two kinds:

First-level relations: relations that contain only ways.

Dangling ways: ways that are not part of any relation.

The returned dataset does not contain second-level relations, that is, relations that contain other relations. The format of the returned dataset is as follows.

| Field | Type | Description |
|---|---|---|
| object_id | long | The ID of either the relation or the way |
| geom | bytearray | The geometry of the object |
| tags | map[(chararray)] | The tags of either the way or the relation as they appear in the OSM file |
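The two kinds of objects can be illustrated with a small Python sketch (again, not the Pig macro). It assumes the input relations are already restricted to first-level ones, and it leaves out geometry construction entirely. The function name `collect_objects` and the `kind` and `members` fields are hypothetical and are not part of the dataset described above.

```python
def collect_objects(ways, relations):
    """Return first-level relations plus the ways that belong to no relation.

    ways maps way ID -> tags.
    relations maps relation ID -> (list of member way IDs, tags).
    """
    objects = []
    used_way_ids = set()
    for relation_id, (member_way_ids, relation_tags) in relations.items():
        used_way_ids.update(member_way_ids)
        objects.append({
            "object_id": relation_id,
            "kind": "relation",               # illustrative field only
            "members": list(member_way_ids),  # stands in for the geometry
            "tags": relation_tags,
        })
    for way_id, way_tags in ways.items():
        # A dangling way is one that no relation refers to.
        if way_id not in used_way_ids:
            objects.append({
                "object_id": way_id,
                "kind": "way",
                "members": [way_id],
                "tags": way_tags,
            })
    return objects


# Made-up sample data: ways 10 and 11 form a relation, way 12 is dangling.
ways = {
    10: {"building": "yes"},
    11: {"building": "yes"},
    12: {"highway": "residential"},
}
relations = {500: ([10, 11], {"type": "multipolygon"})}
objects = collect_objects(ways, relations)
```

Ways 10 and 11 are reported only through relation 500, while way 12 appears as an object of its own.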

Although the code looks a little ugly, it contains only four statements. The first one extracts all the ways as segments using the LoadOSMWaysWithSegments macro. The second statement uses the tags attribute to keep only the segments that belong to the road network. The third statement removes unnecessary columns, and the fourth statement writes the output.
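The filter, project, and store steps described above can be sketched in Python as follows. This is an illustration of the logic, not the Pig script. It assumes that a segment belongs to the road network when its tags contain the OSM `highway` key and that `tags` is the column being dropped; the actual script may test more tags and keep a different set of columns.

```python
import csv
import io

ROAD_KEYS = {"highway"}  # assumption: the real script may check more tags

KEPT_COLUMNS = ("segment_id", "id1", "latitude1", "longitude1",
                "id2", "latitude2", "longitude2", "way_id")


def is_road(segment):
    """True if any road-related key appears in the segment's tags."""
    return any(key in segment["tags"] for key in ROAD_KEYS)


def extract_road_network(segments):
    """Keep road segments only and drop the tags column."""
    roads = [s for s in segments if is_road(s)]
    return [tuple(s[c] for c in KEPT_COLUMNS) for s in roads]


def store(rows, out):
    """Write the rows as comma-separated text to an open text stream."""
    csv.writer(out).writerows(rows)


# Made-up sample input: one road segment and one building outline segment.
all_segments = [
    {"segment_id": 1, "id1": 10, "latitude1": 44.97, "longitude1": -93.26,
     "id2": 11, "latitude2": 44.98, "longitude2": -93.25,
     "way_id": 100, "tags": {"highway": "residential"}},
    {"segment_id": 2, "id1": 20, "latitude1": 44.90, "longitude1": -93.20,
     "id2": 21, "latitude2": 44.91, "longitude2": -93.21,
     "way_id": 200, "tags": {"building": "yes"}},
]
road_rows = extract_road_network(all_segments)
buffer = io.StringIO()
store(road_rows, buffer)
```

Only the first segment survives the filter, and the stored row no longer carries the tags map.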

Similar to the road network, the next few lines extract and store the buildings dataset.