Recommended way to load data from own Java code to RapidMiner?

wessel · November 2012

Dear All,

I have a double[][] in my own Java code.
I wish to load this double[][] in RapidMiner.
Currently I write this double[][] to a text file, and then parse back the numbers in RapidMiner.

Is there a better way to do this?
Maybe write out a binary file and somehow load this in RapidMiner?
This will save a lot of CPU time parsing and disk space, since my double[][] text file can easily be 10GB.

Best regards,

Wessel

MariusHelf · November 2012

Are you using RapidMiner as a library integrated into your code, or do you load the data generated by your application from the normal RapidMiner GUI?

Regards,
Marius

wessel · November 2012

I'm loading data into the normal GUI.

MariusHelf · November 2012

In this case I fear you have to use one of the standard import methods of RapidMiner. What about using a database as the storage system?

If you want to dive into the code of RapidMiner, you can try to create an ExampleSet programmatically and write it directly as a file into the RapidMiner repository, but this also involves writing files, and feels a bit hackish.

Best, Marius

wessel · November 2012

Hey,

I think creating an .ioo file (one you can load using the Retrieve operator) is by far the fastest way.
I don't think it's that hackish. I will try to get working code for this procedure. Can't be that hard, right?

When I write a large double[][] for personal use in my own code I always use ObjectOutputStream.
Does not feel like a hack at all. The code for reading and writing is extremely clean.
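
A minimal sketch of the ObjectOutputStream approach mentioned above, for a plain double[][] only. The class name MatrixSerialization and the temporary file are hypothetical, and the resulting file is an ordinary Java-serialized array; nothing here is claimed to match RapidMiner's .ioo format.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: stores a double[][] with standard Java serialization. */
class MatrixSerialization {

    /** Writes the whole matrix as one serialized object. */
    static void write(double[][] data, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(data);
        }
    }

    /** Reads the matrix back; the file must have been produced by write(). */
    static double[][] read(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (double[][]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("matrix", ".ser");
        double[][] data = {{1.0, 2.5}, {3.0, 4.5}};
        write(data, file);
        System.out.println(Arrays.deepToString(read(file))); // [[1.0, 2.5], [3.0, 4.5]]
        Files.delete(file);
    }
}
```

The buffered streams matter for large arrays, since unbuffered file streams would issue many small reads and writes.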

Best regards,

Wessel

MariusHelf · November 2012

Yes, the code will surely be clean, but it goes against the policy of not modifying the repository data structure by hand, and it only works with local repositories, not, for example, with a RapidAnalytics server. But as always: as long as the users are satisfied, everything is fine.

Depending on what you are going to do, it might also be worth considering a database, especially if you need random access to your data. Using CSV files, ioo files or the like always requires you to load the complete dataset.

Happy Coding!
~Marius

wessel · November 2012

I'm analyzing run time statistics of search algorithms.
So I measure N doubles at each time step, e.g. output of some heuristic.
Search algorithms can easily take 1M steps to complete.
So now I need to analyze a data-set of 1M * N doubles.
I don't see how a database system would help me here.
I need to analyze the entire data-set not just some small subset.
Right now I'm scaling to a point where just loading the data and parsing all the doubles takes more than a minute.
Simply retrieving the data-set later using the Retrieve operator takes less than a tenth of this time.

Why would creating RM-ioo in your own code not work with RapidAnalytics?

Maybe it's better to create a new "load binary data" Operator instead?

Best regards,

Wessel

MariusHelf · November 2012

wessel wrote:

Why would creating RM-ioo in your own code not work with RapidAnalytics?

What I wanted to say is that you can't simply put the ioo file into a folder, because the RapidAnalytics repositories are stored in a database. Of course you can also access the remote RapidAnalytics repository from your code, but it's more complicated than just writing a file. So my statement was maybe a bit misleading.

Maybe it's better to create a new "load binary data" Operator instead?

What should that operator do? Which should be the binary format?

Best, Marius

wessel · November 2012

Format?

A binary file is simply a sequence of bytes, right?
And a double is simply 8 bytes.

This is probably not the same for all programming languages, but Java uses doubleToLongBits and then writes that long value to the underlying output stream as an 8-byte quantity, high byte first.
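
The layout described above can be checked with a small sketch. It compares the bytes produced by DataOutputStream.writeDouble with a manual encoding built from Double.doubleToLongBits, most significant byte first. The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

/** Hypothetical demo of how Java encodes a double as 8 bytes. */
class DoubleBytes {

    /** Encodes one double with DataOutputStream.writeDouble. */
    static byte[] viaStream(double value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeDouble(value);
        }
        return buffer.toByteArray();
    }

    /** Same encoding by hand: raw IEEE 754 bits, high byte first. */
    static byte[] byHand(double value) {
        long bits = Double.doubleToLongBits(value);
        byte[] bytes = new byte[8];
        for (int i = 0; i < 8; i++) {
            // Shift so that byte i of the result is the i-th most significant byte.
            bytes[i] = (byte) (bits >>> (56 - 8 * i));
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        // 1.0 has the bit pattern 0x3FF0000000000000.
        System.out.println(Arrays.equals(viaStream(1.0), byHand(1.0))); // true
    }
}
```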

So the operator should load the file, and create a data-set containing 1 attribute with the corresponding double values.
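
A rough sketch of the file-handling half of such an operator, assuming the file holds nothing but doubles written with DataOutputStream.writeDouble (no header, no row or column counts). The class name RawDoubleFile is hypothetical, and turning the resulting double[] into a RapidMiner data-set is deliberately left out because it depends on the RapidMiner API.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical reader/writer for a headerless file of big-endian doubles. */
class RawDoubleFile {

    /** Writes each value as 8 bytes, high byte first. */
    static void write(double[] values, Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (double v : values) {
                out.writeDouble(v);
            }
        }
    }

    /** Reads the whole file; the value count is derived from the file size. */
    static double[] read(Path file) throws IOException {
        long size = Files.size(file);
        if (size % Double.BYTES != 0) {
            throw new IOException("File length is not a multiple of 8 bytes: " + size);
        }
        // Throws ArithmeticException if the file holds more values than an array can.
        double[] values = new double[Math.toIntExact(size / Double.BYTES)];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readDouble();
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("doubles", ".bin");
        write(new double[] {0.5, -1.25, 3.0}, file);
        System.out.println(Arrays.toString(read(file))); // [0.5, -1.25, 3.0]
        Files.delete(file);
    }
}
```

Compared with a text file, this needs no number parsing at all, and every value takes exactly 8 bytes on disk.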

MariusHelf · November 2012

Well, and here's where the problems start: there are only very limited cases where you would want to import a table with exactly one attribute, written by a Java tool. This probably won't work cross platform with data written from other languages because of byte ordering etc.
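
The byte-ordering concern can be illustrated with java.nio.ByteBuffer, which lets the reader state the byte order explicitly. This is only a sketch with hypothetical names; it assumes the other tool writes plain IEEE 754 doubles in little-endian order, which would have to be verified for any real producer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical demo: decoding raw doubles with an explicit byte order. */
class ByteOrderDemo {

    /** Encodes the values in the given byte order (stands in for a foreign writer). */
    static byte[] encode(double[] values, ByteOrder order) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Double.BYTES).order(order);
        for (double v : values) {
            buffer.putDouble(v);
        }
        return buffer.array();
    }

    /** Decodes raw bytes as doubles, interpreting them in the given byte order. */
    static double[] decode(byte[] raw, ByteOrder order) {
        if (raw.length % Double.BYTES != 0) {
            throw new IllegalArgumentException("Length is not a multiple of 8: " + raw.length);
        }
        double[] values = new double[raw.length / Double.BYTES];
        ByteBuffer.wrap(raw).order(order).asDoubleBuffer().get(values);
        return values;
    }

    public static void main(String[] args) {
        byte[] little = encode(new double[] {1.0, -2.5}, ByteOrder.LITTLE_ENDIAN);
        // Correct order recovers the values.
        System.out.println(Arrays.toString(decode(little, ByteOrder.LITTLE_ENDIAN))); // [1.0, -2.5]
        // Wrong order silently yields different numbers instead of an error.
        System.out.println(decode(little, ByteOrder.BIG_ENDIAN)[0] == 1.0); // false
    }
}
```

Because a wrong byte order does not raise an error, a general-purpose binary import would need the order (and the table shape) stated somewhere, for instance in a header or an operator parameter.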

Of course you can implement such an operator for your personal use in an extension. If you are capable of writing Java code, this will be rather easy and could probably even be done with the Execute Script operator.

Best, Marius
