
Request for advice on processing big data (geospatial) using RapidMiner

A_Houghstow Member Posts: 2 Newbie
Hi RM Community,

I am a newbie looking for some advice on getting started. I am currently trying to predict which locations around the world are most vulnerable to experiencing environmental conflict. My goal is to build a model that can predict this at a local (e.g. town/country/subdistrict) level. I've assembled a PostGIS database of global environmental, governance, development, and conflict data, including a lot of high-resolution global-scale rasters. The database is stored on AWS.

I recently tried importing a small subset of this data into RapidMiner Studio to see if I could run my first query. The import included one global raster mapping cropland, one point file of conflict locations, and one set of polygons (~25 sq km hexagons, global) to serve as boundaries of interest. The import took a really long time. I had to stop after a couple of hours to change locations, and because I was running Studio locally, that meant abandoning the import entirely.

I have been trying to figure out a workaround so I can ultimately work with all my data using RapidMiner. Perhaps running RapidMiner Studio on an AWS instance would work? (I am doing research with an academic license and don't need to deploy the model yet, so Server may be out of the picture at this point.) Maybe there is some intermediate step I should take to make working with the data easier for RapidMiner?

My background is in social science and stats, but I am new to big data, ML, and database architecture, so I would very much appreciate any advice on the challenge!

Thank you so much.

@sgenzer, putting this question on your radar. Thank you for answering my question about RapidMiner Server previously!


  • DocMusher Member Posts: 333 Unicorn
    I don't have a direct answer to your question, but have you looked at previous discussions such as:
    Hope this helps.

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    There's also the possibility of executing the most resource-intensive processes in the RapidMiner Cloud, which is available from your Studio.
    If it's a one-time thing (importing and processing the data), this could be sufficient.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,525 RM Data Scientist

    What is the bottleneck here? It feels like the bottleneck is the data transfer from AWS to your local computer. If this is the case, then it makes sense to either move closer to the cloud (e.g. our PAYG options) or use less data (= fewer attributes).
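
    One way to "use less data" might be to let PostGIS do the spatial join and send back only one small row per hexagon, instead of transferring the raw points and polygons. Below is a minimal sketch in Python that only builds such a query as a string. The table and column names (`hexagons`, `conflict_points`, `hex_id`, `geom`) are assumptions for illustration and are not taken from the original post; `ST_Contains` is the standard PostGIS containment test.

```python
import re

# Hypothetical table and column names; adjust them to the actual PostGIS schema.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name: str) -> str:
    """Reject anything that is not a plain SQL identifier."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def conflicts_per_hexagon_sql(
    hex_table: str = "hexagons",
    point_table: str = "conflict_points",
    hex_id: str = "hex_id",
    geom: str = "geom",
) -> str:
    """Build a query that counts conflict points inside each hexagon.

    The join and the aggregation run inside PostGIS, so the client
    receives one small row per hexagon rather than the raw geometries.
    """
    h, p = _ident(hex_table), _ident(point_table)
    hid, g = _ident(hex_id), _ident(geom)
    return (
        # COUNT over the point geometry yields 0 for hexagons with no match.
        f"SELECT h.{hid}, COUNT(p.{g}) AS conflict_count\n"
        f"FROM {h} AS h\n"
        f"LEFT JOIN {p} AS p ON ST_Contains(h.{g}, p.{g})\n"
        f"GROUP BY h.{hid}"
    )


if __name__ == "__main__":
    print(conflicts_per_hexagon_sql())
```

    The resulting text could be used as the query in whatever database import step is in place, so that only the aggregated table crosses the network. Whether this actually helps depends on where the bottleneck really is, and on spatial indexes existing for the geometry columns.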

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • A_Houghstow Member Posts: 2 Newbie
    Thank you everyone for your help! I have a lot of food for thought and potential solutions to try now. In particular, I am starting by working through the geospatial data tutorial posted by @BalazsBarany and shared by @DocMusher. I'll check back in and share what worked after taking some time to work through the tutorial.