Using RapidMiner for building earth science models from satellite data
bbonnlander
Member Posts: 1 Learner III
Hello RapidMiner developers and users,
My name is Brian Bonnlander, and I'm a research scientist with experience building earth science and ecological forecasting models from satellite data. I'm very interested in finding or helping develop a freely available data mining toolkit that can be used by earth scientists to explore and build forecasting models from large sets of gridded data, including satellite and ground observation data. Currently, much of the work in ecological forecasting is done through partnerships between earth scientists, who have the domain knowledge for ecological forecasting, and computer scientists (like myself), who write the code for data preprocessing and forecasting. The problem is that much of the code is built using proprietary tools such as Matlab, and these solutions are hard for earth scientists to understand, extend, and share with other scientists, partly because they are not coders, and partly because the supporting languages are not freely available.
It is my belief that research in earth science would be greatly enhanced if earth scientists could explore data themselves with an easy-to-use set of tools, and share their code and results for other scientists to build upon. There are grant opportunities within the U.S. for developing such tools, and I am interested in writing a grant proposal for extending a toolkit such as RapidMiner for processing large gridded datasets.
Please correct me if my assumption is wrong, but the piece that is currently missing from RapidMiner is not necessarily the ability to handle large datasets, which can run into the tens of GB for earth science models, but functionality for processing data with a gridded structure. For example, a common preprocessing operation with gridded data involves spatial or temporal smoothing. Suppose that every data point is labeled with an (X,Y) location and a time T. Then a commonly used preprocessing operation would involve smoothing values for every location (X,Y) over a time window of (T-10, T+10), or smoothing values at time T over a two-dimensional neighborhood of values around (X,Y).
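To make the two smoothing operations concrete, here is a minimal sketch in plain Python. It assumes the data fit in nested lists, that "smoothing" means an unweighted mean, that the time window includes both endpoints, and that windows are simply clipped at the edges of the data. The function names `smooth_time` and `smooth_space` are made up for illustration and are not RapidMiner operators.

```python
def smooth_time(series, half_window=10):
    """Moving average of one location's values over [t - half_window, t + half_window].

    The window is clipped at the start and end of the series.
    """
    n = len(series)
    out = []
    for t in range(n):
        lo = max(0, t - half_window)
        hi = min(n, t + half_window + 1)
        window = series[lo:hi]
        out.append(sum(window) / len(window))
    return out


def smooth_space(grid, radius=1):
    """Mean over the square neighborhood of the given radius around each (x, y).

    grid[y][x] holds the values at one time T; the neighborhood is clipped
    at the grid edges.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            cells = [
                grid[j][i]
                for j in range(max(0, y - radius), min(rows, y + radius + 1))
                for i in range(max(0, x - radius), min(cols, x + radius + 1))
            ]
            out[y][x] = sum(cells) / len(cells)
    return out
```

For tens of GB of data, one would likely want an array library and chunked reads rather than nested lists; the sketch only pins down what the operation computes.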
Once the data are preprocessed in these ways, they are often treated as standard training examples for machine learning. The only other step I've often performed is dividing the examples into separate training sets based on some categorical attribute (such as the landcover type at location (X,Y)), and training separate models for each category.
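The per-category step can be sketched as follows. This is only an illustration: examples are assumed to be dictionaries, the attribute names in the usage example (`landcover`, `ndvi`) are hypothetical, and `train_mean_model` is a deliberately trivial stand-in for whatever learner would really be used.

```python
from collections import defaultdict


def split_by_category(examples, key):
    """Group examples (dicts) by the value of a categorical attribute."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[key]].append(ex)
    return dict(groups)


def train_mean_model(examples, target):
    """Placeholder learner: always predicts the mean target of its training set."""
    values = [ex[target] for ex in examples]
    mean = sum(values) / len(values)
    return lambda ex: mean


def train_per_category(examples, key, target, trainer=train_mean_model):
    """Train one model per category and return {category: model}."""
    groups = split_by_category(examples, key)
    return {cat: trainer(group, target) for cat, group in groups.items()}


# Usage with made-up values:
examples = [
    {"landcover": "forest", "ndvi": 0.5},
    {"landcover": "forest", "ndvi": 0.75},
    {"landcover": "grass", "ndvi": 0.25},
]
models = train_per_category(examples, key="landcover", target="ndvi")
```

At prediction time, each new example would be routed to the model for its own category.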
So my questions are the following:
1. Would it be difficult to add this kind of functionality to RapidMiner (if it does not already exist)?
2. Is anyone aware of past efforts to use RapidMiner for this type of earth science research?
3. Are there any geodata formats, such as GeoTiff, HDF, or NetCDF, that are already supported by RapidMiner?
4. Would it be feasible for one or two full-time developers working for about a year to add support for these types of data and data operations?
I apologize if these questions are too general, but I have failed to find answers to these questions through internet search. I very much look forward to any replies.
Thank you!
--Brian
Answers
1. No, it is not that hard. We had a small student project here for some months that came up with a distributed version of RapidMiner. We found that it was not production-ready yet and decided to start again from scratch, but the project at least showed that it is possible to distribute the tasks across a grid and bring back and combine the results.
2. I do not know many details, but I know that people from the University of Bonn, Germany, work on geo mining. If I remember correctly, one of the names there was Till Rumpf.
3. Maybe, but I am not aware of any right now.
4. For the data: definitely yes. For the operations: this might depend on how familiar the developers are with distributed computing / mining or, even better, with RapidMiner. But it is probably also possible to come up with a system that is at least post-alpha after one year.
No need to apologize. I always find it interesting to learn what people are interested in and in which fields RapidMiner is used. I hope that my answers help at least a little bit...
Cheers,
Ingo
Maybe this will help with your computation and preprocessing issues:
http://www.inf.ufrgs.br/~vbogorny/software.html
Cheers,
Jean-Charles.
Would you see an interest in implementing a "spherical harmonics" operator in the "feature generation" category, a bit like "wavelets" in "time series"? They are so useful in earth topology, seismology, etc.
Whenever you have a physical field F satisfying Laplace's equation (Laplacian of F = 0) in spherical coordinates, it has been shown that a kind of "Fourier analysis" can be performed on F.
The basis for this analysis has a double index and is said to be orthonormal because it is built from Legendre polynomials with a specific orthonormalization process. Thus, if you have a "gridded" set of F values, you could compute many components of the model just by telling the operator where F is, and where rho, theta and phi are.
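As a small illustration of the orthogonality mentioned above, here is a sketch that evaluates Legendre polynomials with Bonnet's recurrence and checks numerically that the integral of P_m * P_n over [-1, 1] is 0 for m != n and 2/(2n+1) for m = n. It covers only the plain Legendre polynomials, not the full spherical harmonics (which also involve the associated Legendre functions of cos(theta) and a factor in phi), and the function names are made up for this example.

```python
def legendre(n, x):
    """P_n(x) via Bonnet's recurrence: (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}."""
    p_prev, p = 1.0, x
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p


def inner_product(m, n, steps=2000):
    """Composite Simpson approximation of the integral of P_m * P_n over [-1, 1].

    steps must be even.
    """
    h = 2.0 / steps
    total = 0.0
    for i in range(steps + 1):
        x = -1.0 + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * legendre(m, x) * legendre(n, x)
    return total * h / 3.0
```

Dividing P_n by the square root of 2/(2n+1) would give a basis with unit norm under this inner product, which is one form the "orthonormalization process" can take.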
Thomas.
Any suggestions for how to do this for older models stored as NetCDF-4 and NetCDF-3 time series? Preferably in tcloud instances.
2. Direct conversion to ARFF and Weka is possible:
https://github.com/fracpete/netcdf-converters-weka-package compiles,
but I cannot make it work in the latest RapidMiner versions. Any suggestions? Perhaps the .jars are outdated?
Sincerely, Rapdio
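For reference, the ARFF side of such a conversion is simple enough to sketch by hand. The example below assumes the gridded time series has already been read into a nested list `values[t][y][x]` (reading the NetCDF file itself is left out, since that needs a NetCDF library), and it flattens the grid into one row per cell. The function name and the column names `t`, `y`, `x`, `value` are made up for illustration, and this is not how the package linked above works internally.

```python
def grid_series_to_arff(relation, values):
    """Flatten values[t][y][x] into ARFF text with one row per (t, y, x) cell.

    Missing cells (None) are written as '?', the ARFF missing-value marker.
    """
    lines = [f"@relation {relation}", ""]
    for name in ("t", "y", "x", "value"):
        lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for t, plane in enumerate(values):
        for y, row in enumerate(plane):
            for x, v in enumerate(row):
                cell = "?" if v is None else v
                lines.append(f"{t},{y},{x},{cell}")
    return "\n".join(lines) + "\n"


# Usage with made-up values: two time steps of a 1 x 2 grid.
arff_text = grid_series_to_arff("demo", [[[1.5, 2.0]], [[3.0, None]]])
```

The resulting text could be saved with an .arff extension and loaded with an ARFF reader; this does not address the jar compatibility problem.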