Request for advice on processing big data (geospatial) using RapidMiner

A_Houghstow · January 2019

Hi RM Community,

I am a newbie looking for some advice on getting started. I am currently trying to predict which locations around the world are most vulnerable to experiencing environmental conflict. My goal is to build a model that can predict this at a local (e.g., town/country/subdistrict) level. I've assembled a PostGIS database of global environmental, governance, development, and conflict data, including a lot of high-resolution global-scale rasters. The database is stored on AWS.

I recently tried importing a small subset of this data to RapidMiner Studio to see if I could run my first query. The import included one global raster mapping cropland, one point file on conflict locations, and one set of polygons (~25 sq km hexagons, global) to serve as boundaries of interest. The import took a really long time. I had to stop after a couple of hours and change locations, and this meant stopping the import entirely since I was running Studio locally.

I have been trying to figure out a workaround so I can ultimately work with all my data using RapidMiner. Perhaps running RapidMiner Studio on an AWS instance would work? (I am doing research with an academic license and don't need to deploy the model yet, so Server may be out of the picture at this point.) Maybe there is some intermediate step I should take to make working with the data easier for RapidMiner?

My background is in social science and stats, but I am new to big data, ML, and database architecture, so I would very much appreciate any advice on the challenge!

Thank you so much.

@sgenzer, putting this question on your radar. Thank you for answering my question about RapidMiner Server previously!

DocMusher · February 2019

Hi,
This is not a direct answer to your question, but have you looked at previous discussions such as this one? https://community.rapidminer.com/discussion/25118/geographic-operations-in-rapidminer
Hope this helps.
Cheers
Sven

BalazsBarany · February 2019

Hi,

there's also the possibility of executing the most resource-intensive processes in the RapidMiner Cloud available from your Studio.
If it's a one-time thing (importing and processing the data), this could be sufficient.

Regards,
Balázs

MartinLiebig · February 2019

Hi,

what is the bottleneck here? It sounds like the bottleneck is the data transfer from AWS to your local computer. If that is the case, then it makes sense either to move closer to the cloud (e.g. our PAYG options) or to use less data (i.e. fewer attributes).
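
As a rough illustration of the "use less data" option (a sketch that is not from the original posts), one possibility is to let PostGIS summarise the data before anything is transferred, so that only one row per hexagon leaves the database. All table and column names below (`hexagons`, `conflict_points`, `hex_id`, `geom`) are hypothetical placeholders; `ST_Contains` is a standard PostGIS function.

```python
# Sketch only: every table and column name here is a hypothetical placeholder.
# The identifiers are trusted constants, not user input, so plain string
# formatting is acceptable for this illustration.

def build_hexagon_summary_query(hex_table="hexagons",
                                point_table="conflict_points",
                                hex_id="hex_id",
                                hex_geom="geom",
                                point_geom="geom"):
    """Return SQL that counts conflict points per hexagon inside PostGIS.

    The result has one row per hexagon (an id and a count), so the
    geometries themselves never need to be transferred to the client.
    """
    return (
        f"SELECT h.{hex_id}, COUNT(p.{point_geom}) AS conflict_count\n"
        f"FROM {hex_table} AS h\n"
        # LEFT JOIN keeps hexagons that contain no conflict points (count 0).
        f"LEFT JOIN {point_table} AS p\n"
        f"  ON ST_Contains(h.{hex_geom}, p.{point_geom})\n"
        f"GROUP BY h.{hex_id}"
    )


query = build_hexagon_summary_query()
print(query)
```

The generated query could then be run with any PostgreSQL client or any tool that accepts a SQL query, and the small flat result imported instead of the raw layers. Whether this helps depends on whether the transfer really is the bottleneck.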

BR,

Martin

A_Houghstow · February 2019

Thank you everyone for your help! I have a lot of food for thought and potential solutions to try now. In particular, I am starting by working through the geospatial data tutorial posted by @BalazsBarany and shared by @DocMusher. I'll check back in and share what worked after taking some time to work through the tutorial.
