Request for advice on processing big data (geospatial) using RapidMiner

A_Houghstow Member Posts: 2 Newbie
Hi RM Community,

I am a newbie looking for some advice on getting started. I am currently trying to predict which locations around the world are most vulnerable to experiencing environmental conflict. My goal is to build a model that can predict this at a local (e.g. town/country/subdistrict) level. I've assembled a PostGIS database of global environmental, governance, development, and conflict data, including a lot of high-resolution global-scale rasters. The database is stored on AWS.

I recently tried importing a small subset of this data into RapidMiner Studio to see if I could run my first query. The import included one global raster mapping cropland, one point file on conflict locations, and one set of polygons (~25 sq km hexagons, global) to serve as boundaries of interest. The import took a really long time. I had to stop after a couple of hours to change locations, which meant abandoning the import entirely since I was running Studio locally.

I have been trying to figure out a workaround so I can ultimately work with all my data using RapidMiner. Perhaps running RapidMiner Studio on an AWS instance would work? (I am doing research with an academic license and don't need to deploy the model yet, so Server may be out of the picture at this point.) Maybe there is some intermediate step I should take to make working with the data easier for RapidMiner?

My background is in social science and stats, but I am new to big data, ML, and database architecture, so I would very much appreciate any advice on the challenge!

Thank you so much.

@sgenzer, putting this question on your radar. Thank you for answering my question about RapidMiner Server previously!

Answers

  • DocMusher Member Posts: 287   Unicorn
    Hi,
    I can't give you a direct answer to your question, but have you looked at previous discussions such as: https://community.rapidminer.com/discussion/25118/geographic-operations-in-rapidminer?
    I hope this is of some help.
    Cheers
    Sven

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 382   Unicorn
    Hi,

    there's also the possibility of executing the most resource-intensive processes in the RapidMiner Cloud available from your Studio. 
    If it's a one-time thing (importing and processing the data), this could be sufficient.

    Regards,
    Balázs
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,251  RM Data Scientist
    Hi,

    what is the bottleneck here? It looks like the bottleneck is the data transfer from AWS to your local computer. If that is the case, it makes sense either to move closer to the cloud (e.g. our PAYG options) or to use less data (i.e. fewer attributes).

    BR,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
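One way to act on the "use less data" suggestion is to aggregate inside PostGIS and import only one row per hexagon, rather than the raw raster, points, and polygons. The following is a minimal sketch of that idea, not a tested recipe for this database: the table names (`hexagons`, `conflict_points`, `cropland_raster`) and column names (`hex_id`, `geom`, `rast`) are hypothetical, and it assumes all layers share the same SRID. It only builds the SQL text; `ST_Intersects`, `ST_Clip`, and `ST_SummaryStats` are standard PostGIS functions.

```python
def build_hexagon_summary_sql(hex_table="hexagons",
                              conflict_table="conflict_points",
                              cropland_table="cropland_raster"):
    """Return SQL that reduces the raw layers to one row per hexagon.

    All table and column names are hypothetical placeholders and must be
    trusted identifiers (they are interpolated directly into the SQL).
    """
    return f"""
    SELECT h.hex_id,
           -- number of conflict points falling inside the hexagon
           (SELECT COUNT(*)
              FROM {conflict_table} c
             WHERE ST_Intersects(c.geom, h.geom)) AS conflict_count,
           -- pixel-weighted mean cropland value over all raster tiles
           -- that touch the hexagon (sum of pixel values / pixel count)
           (SELECT SUM((ST_SummaryStats(ST_Clip(r.rast, h.geom))).sum)
                   / NULLIF(SUM((ST_SummaryStats(ST_Clip(r.rast, h.geom))).count), 0)
              FROM {cropland_table} r
             WHERE ST_Intersects(r.rast, h.geom)) AS mean_cropland
      FROM {hex_table} h
    """


if __name__ == "__main__":
    # Print the query so it can be pasted into psql or a database client.
    print(build_hexagon_summary_sql())
```

The result could be stored as a table or materialized view on the AWS side, so that Studio reads a small, flat table of attributes per hexagon instead of pulling rasters across the network.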
  • A_Houghstow Member Posts: 2 Newbie
    Thank you everyone for your help! I have a lot of food for thought and potential solutions to try now. In particular, I am starting by working through the geospatial data tutorial posted by @BalazsBarany and shared by @DocMusher. I'll check back in and share what worked after taking some time to work through the tutorial.