Recommended way to load data from my own Java code into RapidMiner?
Dear All,
I have a double[][] in my own Java code.
I wish to load this double[][] in RapidMiner.
Currently I write this double[][] to a text file, and then parse back the numbers in RapidMiner.
Is there a better way to do this?
Maybe write out a binary file and somehow load this in RapidMiner?
This would save a lot of CPU time on parsing, as well as disk space, since my double[][] text file can easily be 10 GB.
Best regards,
Wessel
Answers
Regards,
Marius
If you want to dive into the code of RapidMiner, you can try to create an ExampleSet programmatically and write it directly as a file into the RapidMiner repository, but this also involves writing files, and feels a bit hackish.
Best, Marius
I think creating an .ioo file (one you can load using the Retrieve operator) is by far the fastest way.
I don't think it's that hackish. I will try to get working code for this procedure. Can't be that hard, right?
When I write a large double[][] for personal use in my own code I always use ObjectOutputStream.
Does not feel like a hack at all. The code for reading and writing is extremely clean.
Best regards,
Wessel
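For reference, the ObjectOutputStream approach mentioned above might look roughly like the sketch below. The class name MatrixSerializer and its methods are hypothetical, made up for illustration. The file it writes uses plain Java serialization; it is not assumed to be a valid RapidMiner .ioo file, so the Retrieve operator should not be expected to read it.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper class; the name and methods are illustrative only.
class MatrixSerializer {

    // Serializes the whole double[][] in one call using Java's built-in serialization.
    static void write(Path file, double[][] data) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(data);
        }
    }

    // Reads the array back; the cast only succeeds if the file was produced by write().
    static double[][] read(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (double[][]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[][] data = { {1.0, 2.5, -3.75}, {4.0, 5.0, 6.0} };
        Path file = Files.createTempFile("matrix", ".ser");
        write(file, data);
        double[][] back = read(file);
        // Compare the nested arrays element by element.
        System.out.println("Round trip ok: " + Arrays.deepEquals(data, back));
        Files.delete(file);
    }
}
```

The buffered streams matter for large arrays, since unbuffered file streams would issue many small reads and writes.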
Depending on what you are going to do, it might also be worth considering the use of a database, especially if you need random access to your data. Using csv files, ioo files, or the like always requires you to load the complete dataset.
Happy Coding!
~Marius
So I measure N doubles at each time step, e.g. the output of some heuristic.
Search algorithms can easily take 1M steps to complete.
So now I need to analyze a data-set of 1M * N doubles.
I don't see how a database system would help me here.
I need to analyze the entire data-set not just some small subset.
Right now I'm scaling to a point where just loading the data and parsing all the doubles takes more than a minute.
Simply retrieving the data-set later using the Retrieve operator takes less than a tenth of this time.
Why would creating an RM .ioo file in your own code not work with RapidAnalytics?
Maybe it's better to create a new "load binary data" Operator instead?
Best regards,
Wessel
Best, Marius
A binary file is simply a sequence of bytes, right?
And a double is simply 8 bytes.
This is probably not the same for all programming languages, but Java uses doubleToLongBits and then writes that long value to the underlying output stream as an 8-byte quantity, high byte first.
So the operator should load the file, and create a data-set containing 1 attribute with the corresponding double values.
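The file format described above (raw 8-byte doubles, high byte first) can be sketched with DataOutputStream and DataInputStream from the standard library. The class name RawDoubleFile and its methods are hypothetical, and the sketch stops at a plain double[]; turning that array into a RapidMiner data-set with one attribute is left out, since that part depends on the RapidMiner API.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the name and methods are illustrative only.
class RawDoubleFile {

    // Writes each double as 8 bytes, high byte first (what writeDouble does).
    static void write(Path file, double[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (double v : values) {
                out.writeDouble(v);
            }
        }
    }

    // Reads the file back; the number of values follows from the file size.
    static double[] read(Path file) throws IOException {
        long bytes = Files.size(file);
        if (bytes % Double.BYTES != 0) {
            throw new IOException("File size is not a multiple of 8 bytes: " + bytes);
        }
        // A Java array cannot hold more than Integer.MAX_VALUE elements.
        int count = Math.toIntExact(bytes / Double.BYTES);
        double[] values = new double[count];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (int i = 0; i < count; i++) {
                values[i] = in.readDouble();
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("doubles", ".bin");
        write(file, new double[] {1.0, -2.5, Math.PI});
        System.out.println("Bytes on disk: " + Files.size(file));
        System.out.println("Values read: " + read(file).length);
        Files.delete(file);
    }
}
```

Since the file carries no header, the reader has to know the number of columns N from elsewhere if the values represent a table rather than a single attribute.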
Of course you can implement such an operator for your personal use in an extension. If you are capable of writing Java code, this will be rather easy and could probably even be done with the Execute Script operator.
Best, Marius