OSM Data Files

The file is approximately 44MB in size, and contains approximately 460 relations, 8900 ways, and 380000 nodes.

Import Type

One-time import of a subset of the dataset.

Future imports of further subsets of the data are possible, but none have been planned yet. If they occur, each will be of a similar size to the present import and will follow everything written on this page, except for the description of the particular choice of "important features" that forms the present subset.

Data Preparation

Description of the source data

The dataset "Waterbodies in South Australia" is dated 2 July 2014, and consists of a shapefile Waterbodies.shp and other associated files, approximately 200MB total size.

The shapefile contains 151432 separate features (individually tagged areas, each consisting of a polygon or multipolygon).

The value of "FEATURECOD" can be used to determine what type of waterbody each feature corresponds to. The table below shows the different types of waterbodies in the dataset, a count of each, and the corresponding value of FEATURECOD.

Data Reduction & Simplification

Of the 151432 features in the dataset, only approximately 9000 will be included in this import.
This subset is the result of the following selection criteria:

Take all features in the dataset that have a name.

Add to this, all features that correspond to a permanent lake or a reservoir.

Add to this, any feature that is connected to any feature already included.

From the list constructed so far, remove any feature that overlaps any waterbody that is already in OSM.

Also, remove any feature that is clipped by the edges of the entire region.

The first two criteria above are aimed at taking only the most important features from the dataset.
The third point above aims to make future imports of more of the dataset simpler. That is, any feature that we do not include in this import, and that also doesn't overlap with an existing feature in OSM, is guaranteed to not share any nodes that are contained in the current import.
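The selection criteria above can be sketched in plain Python as follows. This is only an illustration, not the code in process_data.py: the field names ("name", "kind", "overlaps_osm", "clipped") and the neighbours map are hypothetical stand-ins for the attribute lookups and geometric tests that would be done in QGIS. It also assumes that "connected" is applied transitively, which is what the guarantee about shared nodes seems to require.

```python
from collections import deque


def select_features(features, neighbours):
    """Return the ids of the features chosen for import.

    features:   dict id -> dict with the hypothetical keys
                "name", "kind", "overlaps_osm", "clipped"
    neighbours: dict id -> set of ids of features connected to it
    """
    # Criteria 1 and 2: named features, permanent lakes, reservoirs.
    seeds = {
        fid for fid, f in features.items()
        if f["name"] or f["kind"] in ("permanent_lake", "reservoir")
    }

    # Criterion 3: grow the set through connections until nothing new
    # is added (breadth-first search over the connection graph).
    selected = set(seeds)
    queue = deque(seeds)
    while queue:
        fid = queue.popleft()
        for other in neighbours.get(fid, ()):
            if other not in selected:
                selected.add(other)
                queue.append(other)

    # Criteria 4 and 5: drop features that overlap existing OSM water
    # or that are clipped by the edge of the region.
    return {
        fid for fid in selected
        if not features[fid]["overlaps_osm"] and not features[fid]["clipped"]
    }
```

Because the set is grown before anything is removed, a feature connected only to a dropped feature is still kept, matching the order in which the criteria are listed.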

We run the "SimplifyArea" JOSM plugin on all ways, using the default settings. This reduces the total node count by approximately 23%.

Tagging Plans

Based on the FEATURECOD of each original feature, the following tags will be added:

Data Transformation

Part 1: loading the source data and auxiliary data into a QGIS project

The source file Waterbodies.shp is converted to the WGS84 coordinate system using QGIS, and added to a new layer of a new QGIS project.
The auxiliary file Gazetteer.shp is likewise converted to WGS84 and added as a new layer to the same project.

The set of existing waterbodies in OSM in the region of interest is downloaded using JOSM, saved as .osm file, and added as a separate layer to the QGIS project. The following Overpass API script was used to do the download in JOSM, using a bounding box of "min lat: -39.5, max lat: -25.0, min lon: 127.5, max lon: 142.5" :

(Before saving this data in JOSM, all "waterway=riverbank" areas are given the extra tag "natural=water", which is a hack that ensures QGIS treats riverbanks as areas.)

An empty shapefile layer, "processed.shp", is also added to the project.

Part 2: Using a python script within QGIS to select and process the source data

Using the python console which is built in to QGIS, run the script "process_data.py". The steps performed by this script are summarised as follows:

For each feature in the source data set that has a non-empty "NAME" tag, attempt to detect whether the value of this tag is genuinely the name of this waterbody, or whether it is a spurious value. (In the source data set, sometimes all dams on a farm are given a "NAME" tag that corresponds to the name of the nearby homestead. We want to remove these names.) Compare the name of each waterbody with the names of nearby homesteads listed in the Gazetteer file. Also count the number of nearby waterbodies with the same name, and look for certain keywords in the name. If these observations indicate that the name is spurious, replace the NAME tag with the empty string.

Delete any source feature that does not have one of the following values of FEATURECOD: 3236, 4401, 4402, 4403, 4407, 4812

Find the subset of all source features that are "important". For the purposes of this import, include: features that have a non-empty name; any permanent lake; and any reservoir (feature code 3236).

Find any source feature connected to an "important" feature, and add that to the list of important features.

For each feature in the important list, detect whether any waterbody in the existing OSM database overlaps with it. If so, delete from the list.

For each item on the list that is not a wetland, check whether it overlaps a wetland in the list, and if so replace the shape of the wetland with a version that has the first item's shape subtracted. (In the original dataset, lakes, dams, and reservoirs are allowed to overlap with wetlands. In OSM, they cannot.)

For each remaining feature, add the required OSM tags, based on the FEATURECOD.

Remove any item that touches the bounding box of the source data, since it is likely clipped.

For each surviving feature, add it to the layer "processed.shp"
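The first step above (detecting spurious names) can be sketched as follows. This is a hedged illustration only, not the logic of process_data.py: the search radius, the same-name count threshold, the keyword list, the exact-match comparison against homestead names, and the "name"/"pos" record layout are all assumptions made for the example.

```python
import math

# All three values below are illustrative assumptions, not the real settings.
NEARBY_KM = 5.0                              # what counts as "nearby"
MAX_SAME_NAME = 3                            # this many nearby namesakes is suspicious
SUSPECT_KEYWORDS = ("homestead", "station")  # hypothetical keyword list


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def is_spurious(body, waterbodies, homesteads):
    """Guess whether body["name"] is a borrowed name rather than its own."""
    name = body["name"].strip().lower()
    if not name:
        return False
    # Keyword test.
    if any(keyword in name for keyword in SUSPECT_KEYWORDS):
        return True
    # Same name as a nearby homestead.
    for h in homesteads:
        if (h["name"].strip().lower() == name
                and distance_km(body["pos"], h["pos"]) <= NEARBY_KM):
            return True
    # Many nearby waterbodies sharing the name.
    same = sum(
        1 for w in waterbodies
        if w is not body
        and w["name"].strip().lower() == name
        and distance_km(body["pos"], w["pos"]) <= NEARBY_KM
    )
    return same >= MAX_SAME_NAME


def clear_spurious_names(waterbodies, homesteads):
    """Blank out spurious names in place; return how many were cleared."""
    # Decide for every waterbody first, so clearing one name does not
    # change the same-name counts seen by the others.
    flagged = [w for w in waterbodies if is_spurious(w, waterbodies, homesteads)]
    for w in flagged:
        w["name"] = ""
    return len(flagged)
```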

Part 3: Conversion to .osm; and further processing using a Matlab script

Use JOSM to convert processed.shp to processed.osm.

Use the Matlab scripts readnodes.m and process_processed.m to perform further processing. Together, the scripts do the following:

Ways that have too many nodes (more than 2000, the API limit) are split into smaller ways.

Tags are processed further. "datasa:" is prepended to some keys kept from the source. Keys that were clipped to the 10-character limit in a shapefile are fixed.

The file processed_processed.osm is written.
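The way-splitting step can be illustrated with a short Python sketch. The actual processing is done by the Matlab scripts named above; the function below is a hypothetical equivalent that only shows the idea of cutting a long list of node ids into pieces of at most 2000 nodes, with each piece starting at the node where the previous one ended.

```python
MAX_NODES = 2000  # maximum number of nodes per way, as stated above


def split_way(node_ids, max_nodes=MAX_NODES):
    """Split a way's node list into consecutive pieces of at most max_nodes.

    Each piece begins with the last node of the previous piece, so the
    pieces still join up end to end.
    """
    if max_nodes < 2:
        raise ValueError("a way needs at least two nodes")
    if len(node_ids) <= max_nodes:
        return [list(node_ids)]
    pieces = []
    start = 0
    while start < len(node_ids) - 1:
        end = min(start + max_nodes, len(node_ids))
        pieces.append(list(node_ids[start:end]))
        start = end - 1  # reuse the boundary node in the next piece
    return pieces
```

For a closed ring of 4500 node references (first and last the same), this gives pieces of 2000, 2000, and 502 nodes: two more node references than the original, because each cut duplicates one boundary node.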

Part 4: Validation using JOSM

The file processed_processed.osm is loaded into JOSM, and its contents are copied to a new data layer. The data validation tool is run. A summary of the errors and warnings generated, and the steps taken to fix them, is as follows:

"Natural duplicated nodes (24)". Fixed automatically.

"Style for inner way equals multipolygon (36)". Fixed manually, by deleting inner way in most cases. Six of the warnings were ignored, as the inner way had a different name.

"Relations with the same member (1)". Deleted relations.

"Self-intersecting ways (6)". Manual tweaks to fix these glitches.

"Overlapping water areas (86)". Manual edits where overlap is very small, else delete one of the waterbodies involved.

The SimplifyArea plugin is run on all ways, using the default settings. This reduces the total node count by approximately 21%.

The script bulk_upload.py is designed to be tolerant of interruptions. If an error occurs during upload, the script will be re-run. The script is also designed to automatically break the data into separate changesets.

The script doesn't give the option to choose tags for the changesets. I will modify the source code of bulk_upload.py to force it to add the tag "source=data.sa.gov.au" to the changesets.

In the event that the upload needs to be reverted, the JOSM reverter plugin will be used.

An upload to the sandbox server will be tried first.

Conflation

A simple conservative approach to conflation is used: no data that overlaps with an existing OSM waterbody is included in the upload.

QA

Using JOSM, hundreds of features from random locations in the region were examined closely. The features were compared to Bing imagery.
I paid particularly close attention to features that were in areas that I am familiar with.

Based on these observations I concluded that the dataset is, at least for the most part, of fairly high quality.

Data Updates

The source data doesn't appear to get updated often. (The latest version is dated July 2014).

If updated versions of the source data are made available in the future, I will perform the processing steps again, to import new features.
Features that change shape, change tags, or are deleted won't be handled automatically.

Note: a "dam" in Australian English is a small reservoir, usually on a farm.