Tidy Geospatial Networks in R

2020-05-14


Thanks to active developer and user communities, R is becoming an increasingly popular language for interactive geographic data analysis. Large steps forward include the release and continued evolution of sf, which provides classes for representing and manipulating spatial vector data (points, lines and polygons). Unlike its predecessors, sf is compatible with the popular data science oriented packages that form the tidyverse, most notably the data manipulation toolbox dplyr, and more generally with the concept of tidy data.

R is also well-suited for network analysis, thanks to the R interface of the igraph library. The tidygraph package extends igraph into the domain of the tidyverse, enabling dplyr compatibility by treating a graph as a collection of two tidy data frames describing respectively the nodes and edges.

Given the strength of R in spatial and network analysis, one would expect it to be an ideal language for analysis of geospatial networks. Geospatial networks are simultaneously graph and spatial objects, with nodes and edges embedded in geographic space. Well known examples include transport networks, river basins, power grids, ecological networks and geolocated social networks.

Although several R packages address geospatial networks, they often focus on a specific application within the broader domain of geospatial network analysis, or complicate tidyverse compatibility, for example by using S4 classes or relying on the older sp package for spatial data handling. Hence, at present, no generally applicable, tidyverse-compatible classes exist for geospatial network data, which leaves a gap in R’s spatial and network analysis ecosystems (see the gRaphical Models and Spatial task views). sfnetworks was created to fill this gap and to form a connecting edge between the two worlds.

Installation

The sfnetworks package is still in very active development, and this vignette describes only the first part of our work. The package is therefore not yet on CRAN. Install it from GitHub using the remotes package.
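A minimal sketch of that installation step. The repository path `luukvdmeer/sfnetworks` is an assumption that the text above does not state; adjust it if the package is hosted elsewhere.

```r
# Install the remotes helper package if it is not available yet
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# Assumed GitHub repository path (not given in the text above)
remotes::install_github("luukvdmeer/sfnetworks")
```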

The sfnetwork data structure

The core of the sfnetworks package is the sfnetwork data structure. This is a class that subclasses tidygraph’s tbl_graph, which itself subclasses igraph. Therefore, sfnetwork objects are recognized by all network analysis algorithms that igraph offers, as well as by the tidy wrappers that tidygraph has built around them.

Philosophy

The philosophy of a tbl_graph is best described by the following paragraph from the tidygraph introduction: “Relational data cannot in any meaningful way be encoded as a single tidy data frame. On the other hand, both node and edge data by itself fits very well within the tidy concept as each node and edge is, in a sense, a single observation. Thus, a close approximation of tidyness for relational data is two tidy data frames, one describing the node data and one describing the edge data.”

Since sfnetwork subclasses tbl_graph, it shares the same philosophy. However, it extends it into the domain of geospatial data analysis, where each observation has a location in geographical space. For that, it brings sf into the game. An sf object stores the geographical coordinates of each observation in standardized format in a geometry list column, which has a Coordinate Reference System associated with it. Thus, in sfnetworks, we re-formulate the last sentence of the paragraph above to the following. “A close approximation of tidyness for relational geospatial data is two sf objects, one describing the node data and one describing the edge data.”

One note is needed here. In a geospatial network, the nodes always have coordinates in geographic space, and are thus always described by an sf object. The edges, however, can also be described by just the indices of their end-nodes. This still makes them geospatial, because they start and end at specific points in space, but the spatial information is not explicitly attached to them. Both representations can be useful. In road networks, for example, it makes sense to explicitly draw a line geometry between two nodes, while in geolocated social networks, it probably does not. An sfnetwork supports both types. It can either have edges with a geometry stored in a geometry list column, described by an sf object, or edges that only refer to node indices, described by a regular data frame. We refer to these types of edges as spatially explicit edges and spatially implicit edges respectively. In this vignette, we mainly focus on the first type.

Construction

An sfnetwork object can be constructed with the sfnetwork() function. This function expects the nodes and edges data as arguments, meeting the following requirements:

The nodes should be an object of class sf, solely containing features with POINT geometries.

When creating a network with spatially explicit edges, the edges should be an object of class sf, solely containing features with LINESTRING geometries. Otherwise, they can be a data.frame or tibble. In any case, they should contain the indices of their end-nodes as integers, either in columns named from and to or in the first two columns. Every node index refers to the position (i.e. the row number) of the node in the nodes table.

When creating a network with spatially explicit edges, the coordinates of the endpoints of the edge linestrings should match with the coordinates of the respective end-nodes. Otherwise, the network structure is not valid. Note that this also depends on the precision of the coordinates.

When creating a network with spatially explicit edges, the two provided sf objects should have the same CRS.

Additionally, one can set the directed argument, defining if the created network should be directed or not. This defaults to TRUE.
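The requirements above can be sketched with a tiny hand-made network. The coordinates, the `label` column and the object names are hypothetical and chosen only for illustration; the second call assumes that a plain data frame of edges yields spatially implicit edges, as described in the text.

```r
library(sf)
library(sfnetworks)

# Three hypothetical nodes as POINT geometries (longitude/latitude, EPSG:4326)
nodes <- st_sf(
  label = c("a", "b", "c"),
  geometry = st_sfc(
    st_point(c(7, 51)), st_point(c(7, 52)), st_point(c(8, 52)),
    crs = 4326
  )
)

# Spatially explicit edges: LINESTRINGs whose endpoints coincide exactly
# with the coordinates of the nodes referenced in from and to
edges <- st_sf(
  from = c(1L, 1L, 3L),
  to = c(2L, 3L, 2L),
  geometry = st_sfc(
    st_linestring(rbind(c(7, 51), c(7, 52))),
    st_linestring(rbind(c(7, 51), c(8, 52))),
    st_linestring(rbind(c(8, 52), c(7, 52))),
    crs = 4326
  )
)

net <- sfnetwork(nodes, edges, directed = TRUE)

# Spatially implicit edges: the same edges, but as a plain data frame
net_implicit <- sfnetwork(nodes, st_drop_geometry(edges), directed = TRUE)
```

Printing `net` shows the node and edge tables together with the active element.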

In addition, as_sfnetwork() has a method that directly converts an sf object into a network. Of course, a single sf object does not contain any explicit network information, so assumptions have to be made. This method should therefore be seen as a convenience function that will not suit every use-case. It works as follows:

When given an sf object with LINESTRING geometries: the lines are considered edges, and nodes are created at the endpoints of the lines. Identical endpoints become one single node. This addresses the use-case that was presented in this r-spatial blogpost.

When given an sf object with POINT geometries: the points are considered nodes, and edges are created between them sequentially. That is, node 1 will be connected to node 2, node 2 to node 3, et cetera. This addresses the use-case that was presented in this Stack Overflow thread.

See below an example with the Roxel dataset (which comes with sfnetworks and is a cleaned extract of OpenStreetMap roads in Münster-Roxel).
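A minimal sketch of that conversion, assuming the dataset is available as `roxel` once sfnetworks is loaded. The choice of an undirected network is made here only for illustration.

```r
library(sf)
library(sfnetworks)

# roxel is an sf object with LINESTRING geometries, so each line becomes
# an edge and nodes are created at the line endpoints
net <- as_sfnetwork(roxel, directed = FALSE)
net
```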

More methods will be added in the near future, such that other spatial network objects that exist in R can be directly converted into an sfnetwork, e.g. dodgr_streetnet from the dodgr package and SpatialLinesNetwork from the stplanr package.

Activation and extraction

Just as with a tbl_graph, one of the graph elements of an sfnetwork (i.e. nodes or edges) can be activated, and all subsequent computations will be performed on that active element. By default, the nodes are the active element. This can be changed with the activate() verb, which is borrowed from tidygraph.
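As a short sketch of activation, the following switches the active element and inspects it with tidygraph's `active()` helper. The object names are illustrative.

```r
library(sf)
library(sfnetworks)
library(tidygraph)

net <- as_sfnetwork(roxel, directed = FALSE)

# Nodes are the active element by default
active(net)

# Make the edges the active element for subsequent verbs
net_edges <- net %>% activate("edges")
active(net_edges)
```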

Thanks to activate(), a lot of data manipulation tools can be directly applied to the network. If you want to do something that goes further, you can always extract the active element as an sf object, simply by calling st_as_sf() - in the case of spatially implicit edges, extract them with as_tibble() instead.

For advanced visualization, we encourage you to extract nodes and edges as sf objects, and use one of the many ways to map those in R, either statically or interactively. Think of sf’s default plot method, ggplot2’s geom_sf(), tmap, mapview, et cetera.
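A minimal sketch of extraction followed by a static map, using only sf's base plot method. The styling arguments are arbitrary choices.

```r
library(sf)
library(sfnetworks)
library(tidygraph)

net <- as_sfnetwork(roxel, directed = FALSE)

# Extract each element as a regular sf object
nodes_sf <- net %>% activate("nodes") %>% st_as_sf()
edges_sf <- net %>% activate("edges") %>% st_as_sf()

# Draw the edges first, then overlay the nodes
plot(st_geometry(edges_sf), col = "grey50")
plot(st_geometry(nodes_sf), pch = 20, cex = 0.5, add = TRUE)
```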

Graph operations

Since sfnetwork subclasses tbl_graph, sfnetwork objects are recognized by all network analysis algorithms that igraph offers, as well as by the tidy wrappers that tidygraph has built around them.

A small example is the calculation of the betweenness centrality for each node:
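A sketch of such a calculation with tidygraph's `centrality_betweenness()`. The column name `bc` is an arbitrary choice, and edge weights are left at their default (unweighted).

```r
library(sf)
library(sfnetworks)
library(tidygraph)

net <- as_sfnetwork(roxel, directed = FALSE)

# Add the betweenness centrality of every node as a new node attribute
net_bc <- net %>%
  activate("nodes") %>%
  mutate(bc = centrality_betweenness())

# Inspect the result as an sf object
nodes_bc <- st_as_sf(net_bc)
head(nodes_bc)
```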

There is a large number of available algorithms, ranging from simple type queries to advanced community detection. To keep this vignette within an acceptable length, we will not cover all graph algorithms that can be used, but instead refer to the tidygraph introduction.

Spatial operations

Not only can sfnetwork objects be used in graph algorithms, they are also recognized by various spatial analytical functions of sf. Internally, this works as follows: the active element is converted to an sf object, on which the function is applied. Depending on the function, the results are merged back into the network, and an sfnetwork object is returned, or simply the raw output of the sf function is returned.

Some of the functions have effects also outside of the active element: when nodes are removed from the network, the edges terminating at those nodes will be removed too. This behavior is not symmetric: when removing edges, the nodes that are at the start or end of those edges remain, even if they are not adjacent to any other edge. This is because in graph theory, edges can never exist without end-nodes, while nodes can exist without having adjacent edges.

All supported sf functions of the form f(x,y) also accept an sfnetwork as the y parameter. Just as for object x, the active element of the network y will internally be converted to an sf object, and used as such inside the function.

Spatial filters

Networks can be filtered by a spatial predicate, using st_filter. For example, we want to only keep those nodes that intersect with a given polygon:
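A sketch of such a filter. The polygon `rect` is hypothetical: it is built here as the south-west quarter of the bounding box of the data, purely so that the example has something to filter with.

```r
library(sf)
library(sfnetworks)
library(tidygraph)

net <- as_sfnetwork(roxel, directed = FALSE)

# Hypothetical filter polygon: the south-west quarter of the data's bounding box
bb <- st_bbox(roxel)
rect <- st_as_sfc(st_bbox(
  c(
    xmin = bb[["xmin"]],
    ymin = bb[["ymin"]],
    xmax = (bb[["xmin"]] + bb[["xmax"]]) / 2,
    ymax = (bb[["ymin"]] + bb[["ymax"]]) / 2
  ),
  crs = st_crs(roxel)
))

# Keep only the nodes that intersect the polygon (the default predicate);
# edges that lose one of their end-nodes are dropped as well
filtered <- net %>%
  activate("nodes") %>%
  st_filter(rect)
```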

Spatial joins

Additional attributes can be joined spatially into the network with st_join, which is also based on spatial predicates. By default, this join is a left join, meaning that all features (i.e. nodes or edges) of x remain. If changed to an inner join (by adding left = FALSE), features get removed if they don’t have a match with any row in y.

The sfnetwork method for st_join has an important restriction. When the nodes are the active element of the network, only joins where there is at most one match per feature are allowed. Allowing multiple matches is a problem when joining on the nodes. For example, if node 1 in x has two matches in y, this creates two rows for node 1 in the resulting object. Firstly, this messes up the network structure: if the end-nodes of edge A were node 1 and node 2, these indices now suddenly refer to the same node, while the original node 2 becomes node 3. Secondly, even if we manage to update the from and to columns of the edges accordingly, should edge A go from the new node 1 to the new node 3, or from the new node 2 to the new node 3? In the future we might remove the restriction, but only after we decide on a good way to deal with multiple matches per node.

For the edges this problem does not exist, because the same nodes can be connected by multiple edges. Multiple matches will cause edges to be duplicated, which might not be so useful, but at least it does not break the network structure.
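A sketch of a spatial left join on the nodes. The polygon layer `zones` and its `zone` attribute are hypothetical; because it contains a single polygon, every node has at most one match and the restriction described above is respected.

```r
library(sf)
library(sfnetworks)
library(tidygraph)

net <- as_sfnetwork(roxel, directed = FALSE)

# Hypothetical polygon layer with one attribute to join onto the nodes
bb <- st_bbox(roxel)
rect <- st_as_sfc(st_bbox(
  c(
    xmin = bb[["xmin"]],
    ymin = bb[["ymin"]],
    xmax = (bb[["xmin"]] + bb[["xmax"]]) / 2,
    ymax = (bb[["ymin"]] + bb[["ymax"]]) / 2
  ),
  crs = st_crs(roxel)
))
zones <- st_sf(zone = "south-west", geometry = rect)

# Left join: all nodes remain, nodes outside the polygon get NA for zone
joined <- net %>%
  activate("nodes") %>%
  st_join(zones, join = st_intersects)

joined_nodes <- st_as_sf(joined)
```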

Using spatial predicates directly

Spatial predicates are very useful for filtering and joining, but can also be applied directly to an sfnetwork. This makes it possible, for example, to create new attribute columns based on such predicates.

```r
# Assumes `net` is an sfnetwork and `rect` is a polygon in the same CRS
net %>%
  mutate(in_rect = lengths(st_intersects(., rect)) > 0)
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
```

Setting or transforming CRS

Technically, the CRS functions are also applied to the active element of the network. However, when changing the CRS of the active element (either by setting or transforming), the same operation is applied to the other element (of course this does not apply to spatially implicit edges). This is because sfnetwork objects have the restriction that nodes and edges should always have the same CRS.
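A sketch of a transformation applied while the nodes are active, followed by a check that the edges were transformed too. The target CRS, EPSG:3035, is an arbitrary choice for illustration.

```r
library(sf)
library(sfnetworks)
library(tidygraph)

net <- as_sfnetwork(roxel, directed = FALSE)

# Transform the network while the nodes are the active element
net_3035 <- net %>%
  activate("nodes") %>%
  st_transform(3035)

# The edges carry the new CRS as well
edges_3035 <- net_3035 %>% activate("edges") %>% st_as_sf()
st_crs(edges_3035)
```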

Setting or replacing geometries

Replacing the geometry of nodes or edges comes with restrictions. First, there is for now a general restriction for sfnetwork objects where nodes can only be points, and edges (if spatially explicit) can only be lines. This is to keep things simple at first. If there are use-cases where other geometry types as nodes or edges make sense, we can always loosen this restriction further down the road.

Additionally:

You cannot replace the geometry of the nodes when edges are spatially explicit. This would mean that the coordinates of the nodes don’t match anymore with the endpoints of the edges. Hence, the network structure is broken.

You can only replace the geometry of spatially explicit edges when 1) the CRS of the geometries doesn’t change (otherwise you would create a network with different CRS for nodes and edges, which is restricted) and 2) the endpoints of the geometries don’t change (otherwise you would have endpoints of edges that don’t match the node coordinates anymore, and thus break the network structure).

Geometries can also be replaced using geometric unary operations, as long as they don’t break the restrictions mentioned above. This basically means that only st_reverse and st_simplify are supported. When calling st_reverse on the edges of a directed network, not only will the geometries be reversed, but the to and from columns of the edges will also be swapped. In the case of undirected networks the geometries are not reversed at all.
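A sketch of reversing the edges of a directed network and comparing the node indices before and after. Depending on the installed version, the call may emit a warning about the swapped columns, which is why it is wrapped in `suppressWarnings()` here.

```r
library(sf)
library(sfnetworks)
library(tidygraph)

net_dir <- as_sfnetwork(roxel, directed = TRUE)
edges_before <- net_dir %>% activate("edges") %>% st_as_sf()

# Reverse the edge geometries; in a directed network this also swaps from and to
net_rev <- suppressWarnings(
  net_dir %>%
    activate("edges") %>%
    st_reverse()
)
edges_after <- st_as_sf(net_rev)

# The former start node of each edge is now its end node
head(cbind(from_before = edges_before$from, to_after = edges_after$to))
```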

Non-supported functions

For the same reason that tidygraph does not support all dplyr functions, sfnetwork does not support all sf functions: there is a clear limitation in the relational data structure that requires rows to maintain their identity. For sfnetwork objects, this also means that the geometries of features should be maintained, or at least the endpoints of these geometries, in the case of the edges. That is, functions that summarise geometries, or (may) change their type, shape or position, are not supported. These are the geometric binary operations, most of the geometric unary operations, st_union, st_combine, st_cast, st_nearest_point, st_sample, st_line_sample, st_jitter, st_collection_extract and st_make_valid.

These functions cannot be applied directly to an sfnetwork object, but you can always escape the network structure, do some work, and merge the changes back into the network. For example, when you want to know the area of the Voronoi polygon of each node:
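One way to sketch this escape-and-merge pattern is shown below. It is an illustration under stated assumptions, not necessarily the authors' approach: the nodes are projected to EPSG:3035 (an arbitrary planar CRS) before the tessellation, and because `st_voronoi()` does not return polygons in the order of the input points, each node is matched to the cell that contains it. The column name `voronoi_area` is hypothetical.

```r
library(sf)
library(sfnetworks)
library(tidygraph)

net <- as_sfnetwork(roxel, directed = FALSE)

# Escape the network structure: nodes as sf, projected to a planar CRS
nodes_proj <- net %>%
  activate("nodes") %>%
  st_as_sf() %>%
  st_transform(3035)

# Voronoi tessellation of all nodes, returned as a set of polygons
cells <- st_collection_extract(st_voronoi(st_union(nodes_proj)), "POLYGON")

# Match every node to the one cell that contains it
hits <- st_intersects(nodes_proj, cells)
stopifnot(all(lengths(hits) == 1))
areas <- st_area(cells[unlist(hits)])

# Merge the result back into the network as a node attribute
net_area <- net %>%
  activate("nodes") %>%
  mutate(voronoi_area = as.numeric(areas))
```

Cells along the outer boundary are clipped by the envelope of the tessellation, so their areas depend on that envelope rather than on neighbouring nodes.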

In addition to those functions mentioned above, there are also no sfnetwork methods for:

All functions related to reading and writing data. Use sf for this, and then convert to an sfnetwork. Note: Maybe there are specific spatial network data formats that the package should be able to read from and write to? If yes, please let us know.

All functions related to M and Z coordinates. This is something for later.

Functions that retrieve/change the attributes of sf objects. Like st_agr. See this issue.

Tidyverse methods

As already shown in some of the examples above, a lot of dplyr functions work directly on sfnetwork objects, thanks to tidygraph. Since the active elements (if spatially explicit) are internally handled as being sf objects, automatically the sf methods for these functions are used. That means, for example, that the geometry column is ‘sticky’. Note: also the to and from column in the edges table are sticky, which does not come from sf, but from tidygraph. So now, we have three sticky columns!
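The sticky columns can be sketched by selecting a single attribute from the edges. The example assumes the `roxel` data contains `name` and `type` columns, which is how the dataset is understood to be shipped.

```r
library(sf)
library(sfnetworks)
library(tidygraph)

net <- as_sfnetwork(roxel, directed = FALSE)

# Select only the name attribute of the edges
edges_sel <- net %>%
  activate("edges") %>%
  dplyr::select(name) %>%
  st_as_sf()

# from, to and the geometry column are kept even though they were not selected
names(edges_sel)
```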

What is still to come?

Everything presented in this vignette is only a first glance at what sfnetworks will offer. The package is currently in very active development, and several features are planned to be introduced soon.

Better tidygraph integration

Because sfnetwork subclasses tbl_graph, sfnetwork objects are recognized by all of tidygraph’s functions. However, not all of these functions already run without errors. This is because of the geometry list column, which causes some operations to fail. Examples are the morph and bind functionalities. These issues should be solved, such that the integration of sfnetworks with tidygraph becomes - approximately - flawless.

Extending the functionalities of tidygraph and sf

Right now, sfnetworks serves as a connection between tidygraph and sf, in the sense that objects of class sfnetwork are recognized by functions of both packages. However, some problems may be so specific to spatial networks that they can be solved by neither tidygraph nor sf. It is the goal of sfnetworks to provide its own analytical functions that address such problems. Think of snapping spatial points to the edges of a network, or finding the nearest node to a given spatial feature. We want to prevent sfnetworks from focusing too much on a single application, since it should remain a lower-level package, but basic, widely used spatial network operations should be supported.

Adding more methods to as_sfnetwork

As said before, the goal is to provide as_sfnetwork() methods for a wide range of R classes for spatial networks.

Visualization

In the current version of sfnetworks, visualization of spatial networks does not yet play a big role. It would be great to see this change in the future, for example through good integration with ggraph. Contributions in this area are very welcome!

Contribute

We very much look forward to contributions to the package. These can take the form of pull requests (please direct them to the develop branch), but also of issues opened in the issue tracker. Such issues can be specific feature requests labeled with the feature label (e.g. “I would like to see a function f(x) that accepts an object x of class foo and returns < this >”). They can also be more generic descriptions of problems that you deal with when working with spatial networks in R (e.g. “I usually find it impossible/confusing/complicated to do < this >”). Please label such issues with the hackathon label, since we will be organising a spatial networks hackathon in which we hope to solve these kinds of problems. For more information on that event, see here (and join!).