Using RapidMiner for building earth science models from satellite data
Hello RapidMiner developers and users,
My name is Brian Bonnlander, and I'm a research scientist with experience building earth science and ecological forecasting models from satellite data. I'm very interested in finding or helping develop a freely available data mining toolkit that can be used by earth scientists to explore and build forecasting models from large sets of gridded data, including satellite and ground observation data. Currently, much of the work in ecological forecasting is done through partnerships between earth scientists, who have the domain knowledge for ecological forecasting, and computer scientists (like myself), who write the code for data preprocessing and forecasting. The problem is that much of the code is built using proprietary tools such as Matlab, and these solutions are hard for earth scientists to understand, extend, and share with other scientists, partly because they are not coders, and partly because the supporting languages are not freely available.
It is my belief that research in earth science would be greatly enhanced if earth scientists could explore data themselves with an easy-to-use set of tools, and share their code and results for other scientists to build upon. There are grant opportunities within the U.S. for developing such tools, and I am interested in writing a grant proposal for extending a toolkit such as RapidMiner for processing large gridded datasets.
Please correct me if I am wrong, but my understanding is that what RapidMiner currently lacks is not necessarily the ability to handle large datasets, which can run into the tens of GB for earth science models, but functionality for processing data with a gridded structure. For example, a common preprocessing operation with gridded data involves spatial or temporal smoothing. Suppose that every data point is labeled with an (X,Y) location and a time T. Then a commonly used preprocessing operation would involve smoothing values for every location (X,Y) over a time window of (T-10, T+10), or smoothing values at time T over a two-dimensional neighborhood of values around (X,Y).
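To make the two smoothing operations concrete, here is a minimal sketch in plain Python. It is only an illustration, not RapidMiner code. It assumes the data are held as a nested list indexed as grid[t][y][x], that "smoothing" means an unweighted moving average, that the window includes both endpoints, and that windows are simply truncated at the edges of the grid. The function names are hypothetical.

```python
def temporal_smooth(grid, half_window=10):
    """Average each (x, y) cell over the time steps t-half_window .. t+half_window.

    grid[t][y][x] is the value at time index t and location (x, y).
    Assumption: the window is truncated at the start and end of the series.
    """
    n_t, n_y, n_x = len(grid), len(grid[0]), len(grid[0][0])
    smoothed = []
    for t in range(n_t):
        lo = max(0, t - half_window)
        hi = min(n_t, t + half_window + 1)
        layer = []
        for y in range(n_y):
            row = []
            for x in range(n_x):
                values = [grid[k][y][x] for k in range(lo, hi)]
                row.append(sum(values) / len(values))
            layer.append(row)
        smoothed.append(layer)
    return smoothed


def spatial_smooth(layer, radius=1):
    """Average each cell of one time slice over a square neighborhood.

    layer[y][x] is the value at location (x, y) for a fixed time T.
    Assumption: the neighborhood is a (2*radius+1)-wide square, truncated
    at the borders of the grid.
    """
    n_y, n_x = len(layer), len(layer[0])
    smoothed = []
    for y in range(n_y):
        row = []
        for x in range(n_x):
            values = [
                layer[j][i]
                for j in range(max(0, y - radius), min(n_y, y + radius + 1))
                for i in range(max(0, x - radius), min(n_x, x + radius + 1))
            ]
            row.append(sum(values) / len(values))
        smoothed.append(row)
    return smoothed
```

For datasets in the tens of GB, an array library and chunked I/O would be needed in practice; the nested lists here only keep the sketch free of dependencies.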
Once the data are preprocessed in these ways, they are often treated as standard training examples for machine learning. The only other step I've often performed is dividing the examples into separate training sets based on some categorical attribute (such as the landcover type at location (X,Y)), and training separate models for each category.
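The per-category training step could be sketched roughly as follows, again in plain Python and only as an illustration. It assumes each example is a dictionary of attribute values. The attribute names "landcover" and "target" and the helper names are hypothetical, and the mean-value "learner" is just a stand-in for whatever model would really be trained.

```python
from collections import defaultdict


def split_by_category(examples, key):
    """Group examples (dicts) by the value of one categorical attribute."""
    groups = defaultdict(list)
    for example in examples:
        groups[example[key]].append(example)
    return dict(groups)


def train_mean_model(examples, target):
    """Placeholder learner: always predicts the mean of the target attribute."""
    mean = sum(example[target] for example in examples) / len(examples)
    return lambda example: mean


def train_per_category(examples, key, target, learner=train_mean_model):
    """Train one model per category value and return them keyed by category."""
    groups = split_by_category(examples, key)
    return {category: learner(group, target) for category, group in groups.items()}


# Hypothetical preprocessed examples, one per grid cell and time step.
examples = [
    {"landcover": "forest", "target": 1.0},
    {"landcover": "forest", "target": 3.0},
    {"landcover": "grass", "target": 10.0},
]
models = train_per_category(examples, key="landcover", target="target")
```

At prediction time, the landcover type of a new example would select which of the trained models to apply.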
So my questions are the following:
1. Would it be difficult to add this kind of functionality to RapidMiner (if it does not already exist)?
2. Is anyone aware of past efforts to use RapidMiner for this type of earth science research?
3. Are there any geodata formats, such as GeoTIFF, HDF, or NetCDF, that are already supported by RapidMiner?
4. Would it be feasible for one or two full-time developers working for about a year to add support for these types of data and data operations?
I apologize if these questions are too general, but I have not been able to find answers through internet searches. I very much look forward to any replies.