RapidMiner 9.7 is Now Available
Lots of amazing new improvements including true version control! Learn more about what's new here.
Tutorial for the GeoProcessing extension
Our fictional scenario: We're working with the city of Vienna, Austria, to celebrate the long history of Vienna and the river Danube. For the celebrations, we would like to organize a boat race and a running event for children. We are working with geodata from the Open Data server of Vienna.
In the 1970s Vienna built an artificial island inside the Danube, called Donauinsel (Danube Island). Since then there's the Danube (left arm on the picture) and the New Danube (right). Here's a map to give you an idea:
We are only interested in the parts of the Danube and the New Danube that flow through Vienna. These are highlighted in the next map:
The boat race should be in the longest part of the Danube (or New Danube) through Vienna, so we want to determine the length of the river parts.
For the children's running event, we want to select the two bridges with the shortest distance between them. All bridges in Vienna are of course also available on the Open Data server:
We are obviously only interested in the bridges over the Danube, not every bridge in Vienna. So we will filter the data accordingly:
Then we will calculate the distance between every bridge and select the shortest one (ignoring very short distances of multi-part bridges).
In order to make RapidMiner capable of doing all this, install the GeoProcessing extension from the Marketplace. Make sure that you see the Geoprocessing folder in your Extensions in the Operators panel.
Some background knowledgeEarth is an irregular ellipsoid, but we like to look at maps in two dimensions, as these are more suitable for computer screens or paper. This transformation to two dimensions also allows the application of geometry calculations like distance, length, area and so on.
We express global coordinates in latitude and longitude degrees (counted from the equator and from the international 0 meridian in Greenwich). These are angles, so the distance between coordinates depends on the geographic position. We can't use these coordinates for calculating absolute sizes in our favourite measurement system (meters, yards, miles, ...).
The process of transforming coordinates to a new coordinate system (CRS, coordinate reference system) is called projection or reprojection. You can think of it as taking a photo from an airplane or a satellite to transform the three-dimensional earth surface to a two-dimensional picture. The projected coordinates can be measured in meters or other units, and geometry functions will give us the expected measurements.
Coordinate systems are referred to by EPSG codes. You can check epsg.io to find an appropriate coordinate system for the area you're working on.
It's not always necessary to reproject coordinates. If we only want to know if a geometry contains or touches another geometry, we can calculate that in the original coordinate system (if we ignore problems spanning the line between longitudes -180° and 180°).
Getting the dataThe Vienna open data server contains geodata in many formats. We can easily use the CSV version in RapidMiner. The example process loads the data directly from the web, you could of course save them locally if you need them more often.
This process contains standard RapidMiner operators only, the extension is not yet in use. The Read CSV operators are set up with the comma as the separator, and UTF-8 encoding, but otherwise with the default settings. The attribute names come from the first line, the data format is determined automatically.
We only keep a few attributes (the geometry and the object name) and rename them for later use. For example, the river geometry is renamed to riverGeom.
The standard for expressing geometries in textual form is called WKT, Well Known Text. The open data server delivers the geometries in this format, and this is also the format used by the GeoProcessing operators. If you have GIS data in a database, you can use ST_AsText in SQL to get them in this format.
The tutorial process
After reading the data, we first extract the parts of the Danube inside the boundaries of Vienna. We use Calculate Geometry Relation for this (Danube inside Vienna in the process). It has one input, so we need both the Vienna and the Danube coordinates in one example set. The easiest way to achieve this is a Cartesian join (it combines every row from the first example set with every row from the second one). We use the intersection function of Calculate Geometry Relation for getting the result. It returns the common part of the two geometries (a polygon and a line) as another geometry, in our case a shorter line (just the part of the Danube inside the Vienna polygon).
We then filter out the New Danube for the bridges, but keep both parts for the river part length calculation.
We want to get the length in meters here, not in ellipsoid degrees. So we reproject the original coordinates to a projection commonly used in Austria, ETRS89/Austria (EPSG code: 3416). This projection is appropriate here. If you work in a different geographical area, be sure to select an appropriate projection. (Choosing a wrong projection will lead to big distortions in the calculated measures.)
After reprojecting to EPSG:3416, we can calculate the length of the river arms with Calculate measures on a geometry (called Calculate river length here).
Now on to the bridges.
First we want to find bridges that cross the Danube. This is a geographic join operation if we apply it on two example sets.
We select the function crosses here. Other functions include contains/containedBy, intersects, overlaps, touches, etc. The function parameter stays empty here, it is only used by isWithinDistance.
Now we can create a distance "matrix" (not formatted as a matrix) for all the selected bridges. This happens in a subprocess.
To calculate distances, we will of course reproject the bridge coordinates to the Austrian meter-based coordinate system. We join the bridge table with itself using a Cartesian join so we get a row for every combination of bridges, but remove the row if it compares the bridge with itself.
Then we use Calculate Geometry Relation with the distance function on the projected geometries.
We then filter out everything with a distance of less than 100 meters to avoid returning irrelevant combinations (some smaller parts of the bridges are separate entries in the data).
Now we can sort the data by distance and return the first row. According to our data, "Steg an der Nordbahnbrücke" and "Floridsdorfer Brücke" would be the nearest ones, with a distance of 481 meters.
That's it, we are done with the analysis. We imported geodata from the Web, transformed coordinates, combined different example sets with different methods and calculated real-world measures on the geometries.
Some directions you could go from there:
- Use the operator Geometry to Coordinates to visualize data (it works best with point geometries, or if you have a large number of geometries)
- Try different ways to geographically join example sets
- Try out the different functions in Calculate Geometry Relation and Calculate measures on a geometry
I'm looking forward to your questions and remarks on the GeoProcessing extension and this tutorial.