QGIS Planet

Do you sometimes start writing an SQL query and around at line 50 you get the feeling that it might be getting out of hand? If so, it might be useful to start breaking it down into smaller chunks and wrap those up into custom functions. Never done that? Don’t despair! There’s an excellent PL/pgSQL tutorial on postgresqltutorial.com to get you started.

To get an idea of the basic structure of a PL/pgSQL function and to proof that PostGIS datatypes work just fine in this context, here’s a basic function that takes a trajectory geometry and outputs its duration, i.e. the difference between its last and first timestamp:

My end goal for this exercise was to implement a function that takes a trajectory and outputs the stops along this trajectory. Commonly, a stop is defined as a long stay within an area with a small radius. This leads us to the following definition:

Note how this function uses RETURNS TABLE to enable it to return all the stops that it finds. To add a line to the output table, we need to assign values to the sequence and geom variables and then use RETURN NEXT.

Another reason to use PL/pgSQL is that it enables us to write loops. And loops I wanted for my stop detection function! Specifically, I wanted to go through all the points in the trajectory:

FOR pt IN SELECT (ST_DumpPoints(traj)).geom LOOP
-- here comes the magic!
END LOOP;

Eventually the function should go through the trajectory and identify all segments that stay within an area with max_size diameter for at least min_duration time. To test for the area size, we can use:

While this function is not really short, it’s so much more readable than my previous attempts of doing this in pure SQL. Some of the lines for determining is_stop are not strictly necessary but they do speed up processing.

Performance still isn’t quite where I’d like it to be. I suspect that all the adding and removing points from linestring geometries is not ideal. In general, it’s quicker to find shorter stops in smaller areas than longer stop in bigger areas.

Let’s test!

Looking for a testing framework for PL/pgSQL, I found plpgunit on Github. While I did not end up using it, I did use its examples for inspiration to write a couple of tests, e.g.

Basically, each test is yet another PL/pgSQL function that doesn’t return anything (i.e. returns void) but outputs messages about the status of the test. Here I made heavy use of the PERFORM statement which executes the provided function but discards the results:

Last week, I traveled to Salzburg to attend the 30th AGIT conference and co-located English-speaking GI_Forum. Like in previous year, there were a lot of mobility and transportation research related presentations. Here are my personal highlights:

This year’s keynotes touched on a wide range of issues, from Sandeep Singhal (Google Cloud Storage) who – when I asked about the big table queries he showed – stated that they are not using a spatial index but are rather brute-forcing their way through massive data sets, to Laxmi Ramasubramanian @nycplanner (Hunter College City University of New York) who cautioned against tech arrogance and tendency to ignore expertise from other fields such as urban planning:

One issue that Laxmi particularly highlighted was the fact that many local communities are fighting excessive traffic caused by apps like Waze that suggest shortcuts through residential neighborhoods. Just because we can do something with (mobility) data, doesn’t necessarily mean that we should!

The session Spatial Perspectives on Healthy Mobility featured multiple interesting contributions, particularly by Michelle P. Fillekes who presented a framework of mobility indicators to assess daily mobility of study participants. It considers both spatial and temporal aspects of movement, as well as the movement context:

My introduction to GeoMesa talk failed to turn up any fellow Austrian GeoMesa users. So I’ll keep on looking and spreading the word. The most common question – and certainly no easy one at that – is how to determine the point where it becomes worth it to advance from regular databases to big data systems. It’s not just about the size of the data but also about how it is intended to be used. And of course, if you are one of those db admin whizzes who manages a distributed PostGIS setup in their sleep, you might be able to push the boundaries pretty far. On the other hand, if you already have some experience with the Hadoop ecosystem, getting started with tools like GeoMesa shouldn’t be too huge a step either. But that’s a topic for another day!

Since AGIT&GI_Forum are quite a big event with over 1,000 participants, it was not limited to movement data topics. You can find the first installment of English papers in GI_Forum 2018, Volume 1. As I understand it, there will be a second volume with more papers later this year.

In short: both writing trajectory queries as well as executing them is considerably faster using PostGIS trajectories (as LinestringM) rather than the commonly used point-based approach.

Here are a couple of examples to give you an impression of the differences.

Spoiler alert! Trajectory queries are up to 500 times faster than comparable point-based queries.

A quick look at indexing

In both cases, we have indexed the tracker id, geometry, and time columns to speed up query processing.

The trajectory table has 3 indexes

gist (time_range)

gist (track gist_geometry_ops_nd)

btree (tracker)

The point-based table has 4 indexes

gist (pt)

btree (trajectory_id)

btree (tracker)

btree (t)

Length

First, let’s see how to determine trajectory length for all observed moving objects (identified by a tracker id).

Using the point-based approach, we first need to ensure that the points are in the correct temporal order, create the lines, and finally sum up their length:

WITH ordered AS (
SELECT trajectory_id, tracker, t, pt
FROM geolife.trajectory_pt
ORDER BY t
), tmp AS (
SELECT trajectory_id, tracker, st_makeline(pt) traj
FROM ordered
GROUP BY trajectory_id, tracker
)
SELECT tracker, round(sum(ST_Length(traj::geography)))
FROM tmp
GROUP BY tracker
ORDER BY tracker

With trajectories, we can go right to computing lengths:

SELECT tracker, round(sum(ST_Length(track::geography)))
FROM geolife.trajectory_ext
GROUP BY tracker
ORDER BY tracker

On my test system, the trajectory query run time is 22.7 sec instead of 43.0 sec for the point-based approach:

Duration

Compared to trajectory length, duration is less complicated in the point-based approach:

WITH tmp AS (
SELECT trajectory_id, tracker, min(t) start_time, max(t) end_time
FROM geolife.trajectory_pt
GROUP BY trajectory_id, tracker
)
SELECT tracker, sum(end_time - start_time)
FROM tmp
GROUP BY tracker
ORDER BY tracker

Still, the trajectory query is less complex and much faster at 31 ms instead of 6.0 sec:

SELECT tracker, sum(upper(time_range) - lower(time_range))
FROM geolife.trajectory_ext
GROUP BY tracker
ORDER BY tracker

Temporal filter

Extracting trajectories that occurred during a certain time frame is another common use case:

Many of the topics I’ve covered in recent “Movement data in GIS” posts, have also been discussed at this year’s FOSS4G. Here’s a list of videos for you to learn more about the OGC Moving Features standard, modelling AIS data with FOSS, and more:

MarineCadastre.gov is a great source for AIS data along the US coast. Their data formats and tools though are less open. Luckily, GDAL – and therefore QGIS – can read ESRI File Geodatabases (.gdb).

MarineCadastre.gov also offer a Track Builder script that creates lines out of the broadcast points. (It can also join additional information from the vessel and voyage layers.) We could reproduce the line creation step using tools such as Processing’s Point to path but this post will show how to create PostGIS trajectories instead.

First, we have to import the points into PostGIS using either DB Manager or Processing’s Import into PostGIS tool:

Then we can create the trajectories. I’ve opted to create a materialized view:

The first part of the query creates a temporary table called ptm (short for PointM). This step adds time stamp information to each point. The second part of the query then aggregates these PointMs into trajectories of type LineStringM.

CREATE MATERIALIZED VIEW ais.trajectory AS
WITH ptm AS (
SELECT b.mmsi,
st_makepointm(
st_x(b.geom),
st_y(b.geom),
date_part('epoch', b.basedatetime)
) AS pt,
b.basedatetime t
FROM ais.broadcast b
ORDER BY mmsi, basedatetime
)
SELECT row_number() OVER () AS id,
st_makeline(ptm.pt) AS st_makeline,
ptm.mmsi,
min(ptm.t) AS min_t,
max(ptm.t) AS max_t
FROM ptm
GROUP BY ptm.mmsi
WITH DATA;

The trajectory start and end times (min_t and max_t) are optional but they can help speed up future queries.

One of the advantages of creating trajectory lines is that they render many times faster than the original points.

Of course, we end up with some artifacts at the border of the dataset extent. (Files are split by UTM zone.) Trajectories connect the last known position before the vessel left the observed area with the position of reentry. This results, for example, in vertical lines which you can see in the bottom left corner of the above screenshot.

This is a first proof of concept. It would be great to have a script that automatically fetches the datasets for a specified time frame and list of UTM zones and loads them into PostGIS for further processing. In addition, it would be great to also make use of the information in the vessel and voyage tables, thus splitting up trajectories into individual voyages.

There are multiple ways to model trajectory data. This post takes a closer look at the OGC® Moving Features Encoding Extension: Simple Comma Separated Values (CSV). This standard has been published in 2015 but I haven’t been able to find any reviews of the standard (in a GIS context or anywhere else).

The following analysis is based on the official OGC trajcectory example at http://docs.opengeospatial.org/is/14-084r2/14-084r2.html#42. The header consists of two lines: the first line provides some meta information while the second defines the CSV columns. The data model is segment based. That is, each line describes a trajectory segment with at least two coordinate pairs (or triplets for 3D trajectories). For each segment, there is a start and an end time which can be specified as absolute or relative (offset) values:

They missed the chance to use WKT notation to make the CSV easily readable by existing GIS tools.

As far as I can see, the data model requires a regular sampling interval because there is no way to store time stamps for intermediate positions along trajectory segments. (Irregular intervals can be stored using segments for each pair of consecutive locations.)

In the common GIS simple feature data model (which is point-based), the same data would look something like this:

The main issue here is that there has to be some application logic that knows how to translate from points to trajectory. For example, trajectory a changes from walking1 to walking2 at 2012-01-17T12:36:11Z but we have to decide whether to store the previous or the following state code for this individual point.

An alternative to the common simple feature model is the PostGIS trajectory data model (which is LineStringM-based). For this data model, we need to convert time stamps to numeric values, e.g. 2012-01-17T12:33:41Z is 1326803621 in Unix time. In this data model, the data looks like this:

This is very similar to the OGC data model, with the notable difference that every position is time-stamped (instead of just having segment start and end times). If one has movement data which is recorded at regular intervals, the OGC data model can be a bit more compact, but if the trajectories are sampled at irregular intervals, each point pair will have to be modeled as a separate segment.

To add the trajectory generalization scripts to your Processing toolbox, you can use the Add scripts from files tool:

It is worth noting, that Add scripts from files fails to correctly import potential help files for the scripts but that’s not an issue this time around, since I haven’t gotten around to actually write help files yet.

The scripts are used in the following order:

Extract characteristic trajectory points

Group points in space

Compute flows between cells from trajectories

The sample project contains input data, as well as output layers of the individual tools. The only required input is a layer of trajectories, where trajectories have to be LINESTRINGM (note the M!) features:

In Extract characteristic trajectory points, distance parameters are specified in meters, stop duration in seconds, and angles in degrees. The characteristic points contain start and end locations, as well as turns and stop locations:

The characteristic points are then clustered. In this tool, the distance has to be specified in layer units, which are degrees in case of the sample data.

Finally, we can compute flows between cells defined by these clusters:

Flow lines scaled by flow strength and cell centers scaled by counts

If you use these tools on your own data, I’d be happy so see what you come up with!

In this post, we use TimeManager to visualize the position of a moving object over time along a trajectory. This is another example of what is possible thanks to QGIS’ geometry generator feature. The result can look like this:

(In part 2 of this series, we already saw how a geometry generator can be used to visualize speed along a trajectory.)

The layer is added to TimeManager using t_start and t_end attributes to define the trajectory’s temporal extent.

TimeManager exposes an animation_datetime() function which returns the current animation timestamp, that is, the timestamp that is also displayed in the TimeManager dock, as well as on the map (if we don’t explicitly disable this option).

Once TimeManager is set up, we can edit the line style to add a point marker to visualize the position of the moving object at the current animation timestamp. To do that, we interpolate the position along the trajectory segments. The first geometry generator expression splits the trajectory in its segments:

The second geometry generator expression interpolates the position on the segment that contains the current TimeManager animation time:

The WHEN statement compares the trajectory segment’s start and end times to the current TimeManager animation time. Afterwards, the line_interpolate_point function is used to draw the point marker at the correct position along the segment:

In the 1st part of this series, I mentioned the Workshop on Analysis of Movement Data at the GIScience 2016 conference. Since the workshop took place in September 2016, 11 abstracts have been published (the website seems to be down currently, see the cached version) covering topics from general concepts for movement data analysis, to transport, health, and ecology specific articles. Here’s a quick overview of what researchers are currently working on:

General topics

Interpolating trajectories with gaps in the GPS signal while taking into account the context of the gap [Hwang et al., 2016]

Adding time and weather context to understand their impact on origin-destination flows [Sila-Nowicka and Fotheringham, 2016]

Finding optimal locations for multiple moving objects to meet and still arrive at their destination in time [Gao and Zeng, 2016]

Since I’m mostly working with human and vehicle movement data in outdoor settings, it is interesting to see the bigger picture of movement data analysis in GIScience. It is worth noting that the published texts are only abstracts, therefore there is not much detail about algorithms and whether the code will be available as open source.

In the previous post, I presented an approach to generalize big trajectory datasets by extracting flows between cells of a data-driven irregular grid. This generalization provides a much better overview of the flow and directionality than a simple plot of the original raw trajectory data can. The paper introducing this method also contains more advanced visualizations that show cell statistics, such as the overall count of trajectories or the generalization quality. Another bit of information that is often of interest when exploring movement data, is the time of the movement. For example, at LBS2016 last week, M. Jahnke presented an application that allows users to explore the number of taxi pickups and dropoffs at certain locations:

By adopting this approach for the generalized flow maps, we can, for example, explore which parts of the research area are busy at which time of the day. Here I have divided the day into four quarters: night from 0 to 6 (light blue), morning from 6 to 12 (orange), afternoon from 12 to 18 (red), and evening from 18 to 24 (dark blue).

The resulting visualization shows that overall, there is less movement during the night hours from midnight to 6 in the morning (light blue quarter). Sounds reasonable!

One implementation detail worth considering is which timestamp should be used for counting the number of movements. Should it be the time of the first trajectory point entering a cell, or the time when the trajectory leaves the cell, or some average value? In the current implementation, I have opted for the entry time. This means that if the tracked person spends a long time within a cell (e.g. at the work location) the trip home only adds to the evening trip count of the neighboring cell along the trajectory.

Since the time information stored in a PostGIS LinestringM feature’s m-value does not contain any time zone information, we also have to pay attention to handle any necessary offsets. For example, the GeoLife documentation states that all timestamps are provided in GMT while Beijing is in the GMT+8 time zone. This offset has to be accounted for in the analysis script, otherwise the counts per time of day will be all over the place.

Using the same approach, we could also investigate other variations, e.g. over different days of the week, seasonal variations, or the development over multiple years.

The authors do a great job at describing the concepts and algorithms, which made it relatively straightforward to implement them in QGIS Processing. So far, I’ve implemented the basic logic but the paper contains further suggestions for improvements. This was also my first pyQGIS project that makes use of the measurement value support in the new geometry engine. The time information stored in the m-values is used to detect stop points, which – together with start, end, and turning points – make up the characteristic points of a trajectory.

The following animation illustrates the current state of the implementation: First the “hairball” of trajectories is rendered. Then we extract the characteristic points and group them by proximity. The big black dots are the resulting group centroids. From there, I skipped the Voronoi cells and directly counted transitions from “nearest to centroid A” to “nearest to centroid B”.

The resulting visualization makes it possible to analyze flow strength as well as directionality. I have deliberately excluded all connections with a count below 10 transitions to reduce visual clutter. The cell size / distance between point groups – and therefore the level-of-detail – is one of the input parameters. In my example, I used a target cell size of approximately 2km. This setting results in connections which follow the major roads outside the city center very well. In the city center, where the road grid is tighter, trajectories on different roads mix and the connections are less clear.

Since trajectories in this dataset are not limited to car trips, it is expected to find additional movement that is not restricted to the road network. This is particularly noticeable in the dense area in the west where many slow trajectories – most likely from walking trips – are located. The paper also covers how to ensure that connections are limited to neighboring cells by densifying the trajectories before computing step 4.

Running the scripts for over 18,000 trajectories requires patience. It would be worth evaluating if the first three steps can be run with only a subsample of the data without impacting the results in a negative way.

One thing I’m not satisfied with yet is the way to specify the target cell size. While it’s possible to measure ellipsoidal distances in meters using QgsDistanceArea (irrespective of the trajectory layer’s CRS), the initial regular grid used in step 2 in order to group the extracted points has to be specified in the trajectory layer’s CRS units – quite likely degrees. Instead, it may be best to transform everything into an equidistant projection before running any calculations.

It’s good to see that PyQGIS enables us to use the information encoded in PostGIS LinestringM features to perform spatio-temporal analysis. However, working with m or z values involves a lot of v2 geometry classes which work slightly differently than their v1 counterparts. It certainly takes some getting used to. This situation might get cleaned up as part of the QGIS 3 API refactoring effort. If you can, please support work on QGIS 3. Now is the time to shape the PyQGIS API for the following years!