In the last quarter of 2016 the German marketing team came up with a great way to follow the immense success of last year’s Tableau Stadium Tour: the Tableau Cinema Tour! After visiting ten cities all over Germany, Austria, and Switzerland, we are now considering rolling it out all over Europe. Stay tuned for that! Since we often got requests for the data used in the main demo, I decided to produce this write-up of how to extract the data from the Internet Movie Database (IMDb). Unfortunately copyright reasons make it impossible for us to just provide you the ready-made data. That said, with this walk-through everybody should be able to get the data!

For that last step I decided to join the two tables (plr is the spatial table here, while mss has the attributes) using the field plr_id, which exists in both tables and only once on each side (hence a plain vanilla 1:1 join).

Add vector join dialog window in QGIS 2.8

That works perfectly fine, except that somehow the order of the joined fields appears to get messed up:

QGIS attirbute table with erroneously shifted field contents

Some research revealed that this seems to be a problem caused by identical field names in the two joined tables other than the join field itself. In my case the aforementioned plr_id was used to join the two tables, but in addition both tables also had a field gid, as can be seen in the following screenshot on the left:

Table design in pgAdmin: original table including field gid on the left, fixed table without (unnecessesary) field gid on the right

Removing this field gid from the attribute table mss was no problem, since the 1:1 relation to the spatial data uses the key plr_id anyways. As can be seen in the screenshot above on the right, the new table mss2 is identical to mss, only without the field gid. And lo-and-behold – joining this attribute table to the spatial table plr in QGIS works flawlessly now:

QGIS attirbute table with correct field contents

This problem had already been identified in QGIS 2.0 in late 2013, and has been marked as fixed in the meantime. Removing fields with identical names in the two tables is one – admittedly rather radical way – to solve circumvent the issue. Another, more intuitive way would be to choose a meaningful table prefix in the Add vector join dialog which can be seen in the first image above. As you can see I checked the Custom field name prefix checkbox but left the field empty. I prefer this, since it keeps my field names nice and tidy, but in cases where homonymous fields exist in the two tables you will run into trouble – hence entering a prefix here would be a nice and easy fix for this issue.

Everything described above was performed on QGIS 2.8.1-Wien (64bit) on a Windows 7 machine and PostgreSQL 9.1.16 on a 64bit Ubuntu 4.6.3 server (PostGIS 1.5.3).

Whenever possible I recently try to get all my GIS work done in QGIS. Most of the time this is no problem at all. Sometimes it makes things even easier, such as when you’re trying to work with your geospatial data in a PostgreSQL/PostGIS database (good luck trying that in ArcGIS!). But sometimes you come across a task that is just so exotic that nobody has ever come across it. Or at least nobody wrote about coming across it…

I’m pretty sure everything I’m writing here is true for most other RDBMS, too, but since I’m currently using PostgreSQL I had a chance to test it and show some hard figures here.

The problem to solve is actually a common one and rather easy to solve conceptually: Take attributes from table1 and store them in corresponding rows in table2, using a common id to join them. The straight-forward (and almost direct) translation into SQL is therefore:

There’s nothing wrong with that statement and it’s going to do exactly what’s intended. Only, it’s not very clever, and hence not very performant. This obviously only matters if your tables are on the bigger end. In my case table1 (the one to update) has 576,806 rows while table2 (the one providing the attribute data) has a whopping 848,664,485 rows. Also I should mention that table2 contains multiple rows for each corresponding row in table1. In that concrete case table2 contains data about point locations (latitude, longitude, timestamp) of people whose attributes (age, gender, etc…) are stored in table1. And there is this one attribute which is wrongly and inefficiently stored with each point location, while it is only dependent on the person and should hence be stored there.