Tutorial for the GeoProcessing extension

BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
edited February 2020 in Knowledge Base
There is a new extension called GeoProcessing in the RapidMiner Marketplace. To give you an idea of what you can do with this extension, here is a tutorial using some of the operators.
Our fictional scenario: We're working with the city of Vienna, Austria, to celebrate the long history of Vienna and the river Danube. For the celebrations, we would like to organize a boat race and a running event for children. We are working with geodata from the Open Data server of Vienna.
In the 1970s Vienna built an artificial island in the Danube, called Donauinsel (Danube Island). Since then there have been two arms: the Danube (the left arm in the picture) and the New Danube (the right arm). Here's a map to give you an idea:

We are only interested in the parts of the Danube and the New Danube that flow through Vienna. These are highlighted in the next map:

The boat race should be in the longest part of the Danube (or New Danube) through Vienna, so we want to determine the length of the river parts.
For the children's running event, we want to select the two bridges with the shortest distance between them. All bridges in Vienna are of course also available on the Open Data server:

We are obviously only interested in the bridges over the Danube, not every bridge in Vienna. So we will filter the data accordingly:

Then we will calculate the distance between every pair of bridges and select the pair with the shortest distance (ignoring the very short distances between the parts of multi-part bridges).
In order to make RapidMiner capable of doing all this, install the GeoProcessing extension from the Marketplace. Make sure that you see the Geoprocessing folder in your Extensions in the Operators panel.

Some background knowledge

Earth is an irregular ellipsoid, but we like to look at maps in two dimensions, as these are more suitable for computer screens or paper. This transformation to two dimensions also allows the application of geometry calculations like distance, length, area and so on. 
We express global coordinates in degrees of latitude and longitude (counted from the equator and from the prime meridian through Greenwich). These are angles, so the real-world distance between two coordinates depends on the geographic position. We can't use these coordinates directly for calculating absolute sizes in our favourite measurement system (meters, yards, miles, ...).
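To illustrate why degrees are not a usable unit of length, here is a minimal Python sketch (not part of the extension) that estimates the east-west length of one degree of longitude at different latitudes. It assumes a perfectly spherical Earth with a mean radius of 6,371,008.8 m, so the numbers are only approximations.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation (assumption)

def meters_per_degree_longitude(latitude_deg: float) -> float:
    """Approximate east-west length of one degree of longitude at a given latitude."""
    one_degree_on_equator = 2 * math.pi * EARTH_RADIUS_M / 360
    return one_degree_on_equator * math.cos(math.radians(latitude_deg))

# The same one-degree step shrinks as we move away from the equator.
for name, lat in [("Equator", 0.0), ("Vienna", 48.2), ("Latitude 60", 60.0)]:
    km = meters_per_degree_longitude(lat) / 1000
    print(f"{name}: about {km:.1f} km per degree of longitude")
```

Around Vienna, one degree of longitude is roughly a third shorter than at the equator, which is why we project the coordinates before measuring.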
The process of transforming coordinates to a new coordinate system (CRS, coordinate reference system) is called projection or reprojection. You can think of it as taking a photo from an airplane or a satellite to transform the three-dimensional earth surface to a two-dimensional picture. The projected coordinates can be measured in meters or other units, and geometry functions will give us the expected measurements.
Coordinate systems are referred to by EPSG codes. You can check epsg.io to find an appropriate coordinate system for the area you're working on.
It's not always necessary to reproject coordinates. If we only want to know if a geometry contains or touches another geometry, we can calculate that in the original coordinate system (if we ignore problems spanning the line between longitudes -180° and 180°).  

Getting the data

The Vienna open data server contains geodata in many formats. We can easily use the CSV version in RapidMiner. The example process loads the data directly from the web; you could of course save the files locally if you need them more often.

This process contains standard RapidMiner operators only; the extension is not yet in use. The Read CSV operators are set up with the comma as the separator and UTF-8 as the encoding, but otherwise use the default settings. The attribute names come from the first line, and the data format is determined automatically.
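Outside RapidMiner, the same loading step (comma separator, attribute names from the first line, keeping and renaming a few columns) could be sketched in Python as follows. The column names FID, NAME and SHAPE and the sample rows are invented for illustration; check the actual header of the file you download.

```python
import csv
import io

# Stand-in for a downloaded file; column names and rows are hypothetical.
# For a real file use: open(path, encoding="utf-8", newline="")
SAMPLE_CSV = (
    "FID,NAME,SHAPE\n"
    'r.1,Donau,"LINESTRING (16.2 48.3, 16.5 48.1)"\n'
    'r.2,Neue Donau,"LINESTRING (16.3 48.3, 16.5 48.2)"\n'
)

def read_and_rename(text, mapping):
    """Parse comma-separated text and keep only the columns in `mapping`, renamed."""
    reader = csv.DictReader(io.StringIO(text), delimiter=",")
    return [{new: row[old] for old, new in mapping.items()} for row in reader]

rivers = read_and_rename(SAMPLE_CSV, {"NAME": "riverName", "SHAPE": "riverGeom"})
for row in rivers:
    print(row)
```

Note that the WKT geometry contains commas itself, so it has to be quoted in the CSV file; a CSV parser handles this, while a naive split on commas would not.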
We only keep a few attributes (the geometry and the object name) and rename them for later use. For example, the river geometry is renamed to riverGeom.
The standard for expressing geometries in textual form is called WKT, Well Known Text. The open data server delivers the geometries in this format, and this is also the format used by the GeoProcessing operators. If you have GIS data in a database, you can use ST_AsText in SQL to get them in this format.

The tutorial process


After reading the data, we first extract the parts of the Danube inside the boundaries of Vienna. We use Calculate Geometry Relation for this (Danube inside Vienna in the process). It has one input, so we need both the Vienna and the Danube coordinates in one example set. The easiest way to achieve this is a Cartesian join (it combines every row from the first example set with every row from the second one). We use the intersection function of Calculate Geometry Relation for getting the result. It returns the common part of the two geometries (a polygon and a line) as another geometry, in our case a shorter line (just the part of the Danube inside the Vienna polygon). 
We then filter out the New Danube for the bridges, but keep both parts for the river part length calculation.
We want to get the length in meters here, not in degrees. So we reproject the original coordinates to a projection commonly used in Austria, ETRS89 / Austria Lambert (EPSG code: 3416). If you work in a different geographical area, be sure to select a projection that is appropriate for that area. (Choosing a wrong projection will lead to big distortions in the calculated measures.)
After reprojecting to EPSG:3416, we can calculate the length of the river arms with Calculate measures on a geometry (called Calculate river length here). 
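Conceptually, the length of a projected line is the sum of the straight-line lengths of its segments. The following Python sketch shows that idea; it is not the extension's implementation, and the coordinates are made-up values in meters, not real Danube data.

```python
import math

def linestring_length(coords):
    """Sum the straight-line lengths of consecutive segments (same unit as the input)."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))

# Hypothetical projected coordinates in meters.
river_arm = [(0.0, 0.0), (3000.0, 4000.0), (3000.0, 10000.0)]
print(linestring_length(river_arm))  # 11000.0
```

Applied to unprojected longitude/latitude values, the same function would return a number in "degrees", which has no fixed real-world meaning.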

Now on to the bridges. 
First we want to find bridges that cross the Danube. This is a geographic join operation if we apply it on two example sets. 

We select the function crosses here. Other functions include contains/containedBy, intersects, overlaps, touches, etc. The function parameter stays empty here; it is only used by isWithinDistance.
Now we can create a distance "matrix" (not formatted as a matrix) for all the selected bridges. This happens in a subprocess.
To calculate distances, we first reproject the bridge coordinates to the Austrian meter-based coordinate system. We then join the bridge table with itself using a Cartesian join so that we get a row for every combination of bridges, and remove the rows that compare a bridge with itself.
Then we use Calculate Geometry Relation with the distance function on the projected geometries. 

We then filter out everything with a distance of less than 100 meters to avoid returning irrelevant combinations (some smaller parts of the bridges are separate entries in the data). 
Now we can sort the data by distance and return the first row. According to our data, "Steg an der Nordbahnbrücke" and "Floridsdorfer Brücke" would be the nearest ones, with a distance of 481 meters.
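The logic of the subprocess (Cartesian self-join, removing self-pairs, distance calculation, filtering out distances below 100 meters, sorting and taking the first row) can be sketched in plain Python. This is a simplification under stated assumptions: each bridge is reduced to a single point, and the names and coordinates are hypothetical, whereas the real process measures distances between the bridge geometries.

```python
import itertools
import math

# Hypothetical bridges, each reduced to one point in a meter-based CRS.
bridges = {
    "Bridge A": (0.0, 0.0),
    "Bridge A ramp": (30.0, 40.0),   # separate entry belonging to Bridge A
    "Bridge B": (600.0, 800.0),
    "Bridge C": (0.0, 1500.0),
}

def nearest_pair(points, min_distance):
    """Return (distance, name1, name2) for the closest pair at least min_distance apart."""
    rows = [
        (math.dist(points[a], points[b]), a, b)
        for a, b in itertools.product(points, repeat=2)  # Cartesian join
        if a != b                                        # drop self-comparisons
    ]
    rows = [row for row in rows if row[0] >= min_distance]
    rows.sort()                                          # sort by distance
    return rows[0]                                       # first row = nearest pair

distance, first, second = nearest_pair(bridges, min_distance=100.0)
print(f"{first} and {second}: {distance:.0f} m")
```

Without the 100 meter filter, the sketch would return Bridge A and its ramp, which is exactly the kind of irrelevant combination the filter is meant to remove.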
That's it, we are done with the analysis. We imported geodata from the Web, transformed coordinates, combined different example sets with different methods and calculated real-world measures on the geometries.
Some directions you could go from there:
- Use the operator Geometry to Coordinates to visualize data (it works best with point geometries, or if you have a large number of geometries)
- Try different ways to geographically join example sets
- Try out the different functions in Calculate Geometry Relation and Calculate measures on a geometry

I'm looking forward to your questions and remarks on the GeoProcessing extension and this tutorial.

Comments

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    A downloadable version of the processes is attached here.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    edited February 2020
    And if you have RapidMiner 9.6+ running, you can click on this link to open the processes directly:

    Get Data (1st process shown above)

    Calculating Distances (2nd process shown above)
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    I also added the process in the 9.5 repository under /Community Data Science. (Extension Example ...)
  • ivane Member Posts: 4 Contributor I
    Hi Balazs, I created a number of Groovy scripts using your zip archive of libraries from geoscript and geotools. These processes have been working since you gave us (TCA) that geoscript library bundle in 2016 when you came down here to Melbourne. Now it seems that adding these jar files to the RapidMiner Studio lib folder causes the SQL connector to break in the latest version, 9.10.008. Is there a revised library bundle I can use to continue using processes with Groovy scripts in them?

    Strangely enough, it is still working on 9.10.001 studio version. However, when executed on AI Hub there is an issue with connection to the sql database on version 9.10.001 - after 1 hour and 45 minutes during execution the following error gets thrown: java.lang.IllegalAccessError: tried to access class com.microsoft.sqlserver.jdbc.SQLServerDriverIntProperty from class com.microsoft.sqlserver.jdbc.SQLServerDriver.

    RapidMiner support advised me to upgrade to 9.10.008, but when I add the bundle of geoscript jar files, the SQL connection breaks. Any help would be much appreciated. Note that I've also developed scripts that use geohash and interpolation for quicker data matching, so I would really need to keep using the Groovy scripts with geoscript (unless geohash and interpolation operators are also available as an extension).
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @ivane,

    nice to hear from you after such a long time.

    I haven't looked into updating the geotools and geoscript lately. However, I'm actively using the GeoProcessing extension which should have newer libraries, and accessing MySQL and PostgreSQL databases is not a problem in the latest Studio. I don't have MS SQL to test.

    I guess that updating the geo* library jars one by one to current versions is the best approach. Maybe some common logging or utility library is too old, it gets loaded when Studio starts, and then the MSSQL driver breaks.

    Regards,
    Balázs
  • ivane Member Posts: 4 Contributor I
    Hi @BalazsBarany

    I managed to get geoscript working without breaking the SQL connector: I only added the 113 jar files listed below, out of the 142 jar files you have in the package. The SQL connections still work (both in Studio and on AI Hub).

    bufr-4.6.2.jar
    c3p0-0.9.1.1.jar
    cdm-4.6.2.jar
    commons-beanutils-1.7.0.jar
    commons-dbcp-1.4.jar
    commons-jxpath-1.3.jar
    commons-pool-1.5.4.jar
    core-0.26.jar
    eastwood-1.1.1-20090908.jar
    ecore-2.6.1.jar
    ehcache-1.6.2.jar
    fop-0.94.jar
    gdal-1.11.2.jar
    geodb-0.7-RC2.jar
    GeographicLib-Java-1.44.jar
    geoscript-groovy-1.6.0.jar
    gt-api-14.0.jar
    gt-app-schema-resolver-14.0.jar
    gt-arcgrid-14.0.jar
    gt-brewer-14.0.jar
    gt-complex-14.0.jar
    gt-coverage-14.0.jar
    gt-coverage-api-14.0.jar
    gt-cql-14.0.jar
    gt-css-14.0.jar
    gt-data-14.0.jar
    gt-epsg-wkt-14.0.jar
    gt-geobuf-14.0.jar
    gt-geojson-14.0.jar
    gt-geopkg-14.0.jar
    gt-graph-14.0.jar
    gt-grassraster-14.0.jar
    gt-grid-14.0.jar
    gt-gtopo30-14.0.jar
    gt-jdbc-14.0.jar
    gt-jdbc-h2-14.0.jar
    gt-jdbc-mysql-14.0.jar
    gt-jdbc-postgis-14.0.jar
    gt-jdbc-spatialite-14.0.jar
    gt-main-14.0.jar
    gt-metadata-14.0.jar
    gt-ogr-core-14.0.jar
    gt-ogr-jni-14.0.jar
    gt-opengis-14.0.jar
    gt-process-14.0.jar
    gt-process-feature-14.0.jar
    gt-process-geometry-14.0.jar
    gt-process-raster-14.0.jar
    gt-property-14.0.jar
    gt-referencing-14.0.jar
    gt-shapefile-14.0.jar
    gt-swing-14.0.jar
    gt-transform-14.0.jar
    gt-wfs-ng-14.0.jar
    gt-wms-14.0.jar
    gt-xml-14.0.jar
    gt-xsd-core-14.0.jar
    gt-xsd-fes-14.0.jar
    gt-xsd-filter-14.0.jar
    gt-xsd-gml2-14.0.jar
    gt-xsd-gml3-14.0.jar
    gt-xsd-kml-14.0.jar
    gt-xsd-ows-14.0.jar
    gt-xsd-sld-14.0.jar
    jai_codec-1.1.3.jar
    jai_core-1.1.3.jar
    jai_imageio-1.1.jar
    json-simple-1.1.jar
    jsr-275-1.0-beta-2.jar
    jt-affine-1.0.6.jar
    jt-algebra-1.0.6.jar
    jt-attributeop-1.4.0.jar
    jt-bandcombine-1.0.6.jar
    jt-bandmerge-1.0.6.jar
    jt-bandselect-1.0.6.jar
    jt-binarize-1.0.6.jar
    jt-border-1.0.6.jar
    jt-buffer-1.0.6.jar
    jt-classifier-1.0.6.jar
    jt-colorconvert-1.0.6.jar
    jt-colorindexer-1.0.6.jar
    jt-contour-1.4.0.jar
    jt-crop-1.0.6.jar
    jt-errordiffusion-1.0.6.jar
    jt-format-1.0.6.jar
    jt-iterators-1.0.6.jar
    jt-jiffle-language-0.2.0.jar
    jt-jiffleop-0.2.0.jar
    jt-lookup-1.0.6.jar
    jt-mosaic-1.0.6.jar
    jt-nullop-1.0.6.jar
    jt-orderdither-1.0.6.jar
    jt-piecewise-1.0.6.jar
    jt-rangelookup-1.4.0.jar
    jt-rescale-1.0.6.jar
    jt-rlookup-1.0.6.jar
    jt-scale-1.0.6.jar
    jt-stats-1.0.6.jar
    jt-translate-1.0.6.jar
    jt-utilities-1.0.6.jar
    jt-utils-1.4.0.jar
    jt-vectorbin-1.0.6.jar
    jt-vectorbinarize-1.4.0.jar
    jt-vectorize-1.4.0.jar
    jt-warp-1.0.6.jar
    jt-zonal-1.0.6.jar
    jt-zonalstats-1.4.0.jar
    jts-1.13.jar
    net.opengis.fes-14.0.jar
    net.opengis.ows-14.0.jar
    net.opengis.wcs-14.0.jar
    net.opengis.wfs-14.0.jar
    netcdf4-4.6.2.jar
  • c_chee Member Posts: 18 Maven
    Nice writeup above. Can I confirm whether the following case can be done? Say I have a static geometry, e.g. the boundary of a resort. Then I have various point locations. The points come from, say, tagged animal sensors that periodically give latitude and longitude readings. Can I use the above extension to test, one point at a time, whether the point has CROSSED into the boundary of the resort?

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi!

    Yes. Crossing the boundary means that you had a point outside of the boundary and then in the next reading of that sensor it is inside.

    Another way of expressing the same thing is to have the boundary as a linestring instead of a polygon, and to unite the two points into a linestring. Then you would actually use the "crosses" operation on these linestrings. But you could get false positives with irregular shapes, so I would recommend the first solution.

    Regards,
    Balázs
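The first approach (compare consecutive readings of one sensor and report when an outside reading is followed by an inside reading) can be illustrated with a small Python sketch. It uses a basic ray-casting point-in-polygon test rather than the extension's operators, and the resort boundary and track are hypothetical values; readings that fall exactly on the boundary are not handled specially.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices, closing edge implied."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def entries(readings, polygon):
    """Indices of readings where the track changes from outside to inside."""
    flags = [point_in_polygon(p, polygon) for p in readings]
    return [i for i in range(1, len(flags)) if flags[i] and not flags[i - 1]]

# Hypothetical square boundary and one sensor's readings in time order.
resort = [(0, 0), (10, 0), (10, 10), (0, 10)]
track = [(-5, 5), (-1, 5), (2, 5), (6, 5), (12, 5), (8, 5)]
print(entries(track, resort))  # [2, 5]
```

The readings must be sorted by time and handled separately per sensor; otherwise an "entry" could be detected between two unrelated points.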
  • c_chee Member Posts: 18 Maven
    Hi, are there some simpler examples that show how to use GeoProcessing for the crossing-the-boundary problem?
    I started with data having Lat,Long as columns 
    e.g. 120.16006113532586 22.98360854837837
    120.1598941085565 22.98349548390813
    120.1599270061248 22.983467540620026
    120.1600903042156 22.983575467283245
    120.16006113532586 22.98360854837837

    I wanted to create a POLYGON in WKT, assuming this would allow me to use 'Calculate Geometry Relation'.

    So I used ReadCSV -> Coordinates To Geometry.   
    ... hoping to convert the above Lat, Long into POLYGON((120 22, 120 22, 120 22,. ...))
    But instead the result from Coordinates to Geometry is:
    POINT (120.16006113532586 22.98360854837837)
    POINT (120.1598941085565 22.98349548390813)
    POINT (120.1599270061248 22.983467540620026)
    POINT (120.1600903042156 22.983575467283245)
    POINT (120.16006113532586 22.98360854837837)

    Is the overall strategy correct? If so, how can I solve the part above?

    Thanks
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi!

    Yes, coordinates are just points, so Coordinates to Geometry only creates Points. 

    Grouping points to linestrings or polygons is not available in the Geoprocessing extension. You might be able to create the correct polygon WKT using Generate Attributes and Aggregate in RapidMiner. 

    These more complex operations are usually done at the data level in a GIS-enabled database like PostGIS, or in a tool like QGIS. RapidMiner, even with the Geoprocessing extension, is not a replacement for an entire GIS pipeline. 

    If your data are polygons, you should have them as polygons; then you can use Geoprocessing, e.g. for the matching process. 

    Regards,

    Balázs
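As an illustration of the string assembly that Generate Attributes and Aggregate would have to perform, here is a minimal Python sketch that joins ordered points into a POLYGON in WKT. The function name is hypothetical, and it assumes that the points are already in the correct order along the boundary and that x is the longitude and y the latitude.

```python
def polygon_wkt(points):
    """Build a WKT POLYGON from ordered (x, y) points, closing the ring if needed."""
    ring = list(points)
    if ring and ring[0] != ring[-1]:
        ring.append(ring[0])  # WKT rings must end where they start
    if len(ring) < 4:
        raise ValueError("a polygon ring needs at least three distinct points")
    body = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({body}))"

print(polygon_wkt([(0, 0), (4, 0), (4, 3)]))  # POLYGON((0 0, 4 0, 4 3, 0 0))
```

The order of the points matters: the same points in a different order can describe a different or self-intersecting shape, which is one reason this step is usually left to a GIS tool.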