Recommended way to load data from own Java code to RapidMiner?

wessel Member Posts: 537 Maven
edited September 2019 in Help
Dear All,

I have a double[][] in my own Java code.
I wish to load this double[][] in RapidMiner.
Currently I write this double[][] to a text file, and then parse back the numbers in RapidMiner.

Is there a better way to do this?
Maybe write out a binary file and somehow load this in RapidMiner?
This would save a lot of CPU time spent parsing, as well as disk space, since my double[][] text file can easily be 10 GB.

Best regards,

Wessel
Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Are you using RapidMiner as a library integrated into your code, or do you load the data generated by your application from the normal RapidMiner GUI?

    Regards,
    Marius
  • wessel Member Posts: 537 Maven
    I'm loading data into the normal GUI.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    In this case I fear you have to use one of the standard import methods of RapidMiner. What about using a database as the storage system?

    If you want to dive into the code of RapidMiner, you can try to create an ExampleSet programmatically and write it directly as a file into the RapidMiner repository, but this also involves writing files, and feels a bit hackish.

    Best, Marius
  • wessel Member Posts: 537 Maven
    Hey,

    I think creating an .ioo file (one you can load using the Retrieve operator) is by far the fastest way.
    I don't think it's that hackish. I will try to get working code for this procedure. Can't be that hard, right?

    When I write a large double[][] for personal use in my own code I always use ObjectOutputStream.
    Does not feel like a hack at all. The code for reading and writing is extremely clean.

    Best regards,

    Wessel
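
For readers unfamiliar with the approach mentioned in the post above, here is a minimal sketch of writing and reading a double[][] with ObjectOutputStream. The class name, method names, and temp file are hypothetical, chosen only for illustration. Note that this produces a plain Java serialization stream; whether RapidMiner's .ioo files can be produced this way is not confirmed anywhere in this thread.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper class, not part of RapidMiner.
public class MatrixSerialization {

    // Serialize the whole matrix as one Java object.
    static void write(Path file, double[][] data) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(data);
        }
    }

    // Read the matrix back; the cast assumes the file was written by write().
    static double[][] read(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (double[][]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[][] data = {{1.0, 2.5}, {-3.25, 4.0}};
        Path file = Files.createTempFile("matrix", ".bin");
        write(file, data);
        double[][] back = read(file);
        // Check that the round trip preserved every value.
        System.out.println(Arrays.deepEquals(data, back));
        Files.delete(file);
    }
}
```

The buffered streams matter for large arrays, since they avoid many small reads and writes against the file system.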
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Yes, surely the code will be clean, but it's against the policy of not messing with the repository data structure by hand, and e.g. won't work if you are using a RapidAnalytics server, but only with local repositories. But as always: as long as the users are satisfied, everything is fine :)

    Depending on what you are going to do, it might also be worth considering the use of a database, especially if you need random access to your data. Using CSV files, ioo files or whatever else always requires you to load the complete dataset.

    Happy Coding!
    ~Marius
  • wessel Member Posts: 537 Maven
    I'm analyzing run time statistics of search algorithms.
    So I measure N doubles at each time step, e.g. output of some heuristic.
    Search algorithms can easily take 1M steps to complete.
    So now I need to analyze a data-set of 1M * N doubles.
    I don't see how a database system would help me here.
    I need to analyze the entire data-set not just some small subset.
    Right now I'm scaling to a point where just loading the data and parsing all the doubles takes more than a minute.
    Simply retrieving the data-set later using the Retrieve operator takes less than a tenth of this time.

    Why would creating RM-ioo in your own code not work with RapidAnalytics?

    Maybe it's better to create a new "load binary data" operator instead?

    Best regards,

    Wessel
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    wessel wrote:

    Why would creating RM-ioo in your own code not work with RapidAnalytics?
    What I wanted to say is that you can't simply put the ioo file into a folder, because the RapidAnalytics repositories are stored in a database. Of course you can also access the remote RapidAnalytics repository from your code, but it's more complicated than just writing a file. So my statement was maybe a bit misleading.

    Maybe it's better to create a new "load binary data" operator instead?
    What should that operator do? What should the binary format be?

    Best, Marius
  • wessel Member Posts: 537 Maven
    Format?

    A binary file is simply a sequence of bytes, right?
    And a double is simply 8 bytes.

    This is probably not the same for all programming languages, but Java uses doubleToLongBits and then writes that long value to the underlying output stream as an 8-byte quantity, high byte first.

    So the operator should load the file, and create a data-set containing 1 attribute with the corresponding double values.
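
The format described in the post above matches what java.io.DataOutputStream.writeDouble does. As a minimal sketch (the class and method names are hypothetical), the following writes doubles as raw 8-byte values, high byte first, and reads them back with DataInputStream. It uses an in-memory byte array instead of a file to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper class for illustration only.
public class RawDoubles {

    // Each double becomes exactly 8 bytes, most significant byte first.
    static byte[] encode(double[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (double v : values) {
                out.writeDouble(v);
            }
        }
        return bytes.toByteArray();
    }

    // The value count is derived from the length; there is no header.
    static double[] decode(byte[] raw) throws IOException {
        double[] values = new double[raw.length / Double.BYTES];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readDouble();
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = encode(new double[] {1.0, -2.5});
        System.out.println(raw.length);          // 16 bytes for two doubles
        // 1.0 has the bit pattern 0x3FF0000000000000, so the first byte is 3F.
        System.out.printf("%02X%n", raw[0]);
        System.out.println(decode(raw)[1]);      // -2.5
    }
}
```

Because the stream carries no header, a reader has to know out of band how many values belong to each row if the data is really a table rather than a single column.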
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Well, and here's where the problems start: there are only very limited cases where you would want to import a table with exactly one attribute, written by a Java tool. This probably won't work cross-platform with data written from other languages because of byte ordering etc.

    Of course you can implement such an operator for your personal use in an extension. If you are capable of writing Java code, this will be rather easy and could probably even be done with the Execute Script operator.

    Best, Marius
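
To illustrate the byte-ordering concern raised above, here is a minimal sketch using java.nio.ByteBuffer with an explicit ByteOrder. The class and method names are hypothetical. It shows that the same 8 bytes decode to different values depending on the byte order the reader assumes, which is why a binary format needs to fix its byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class for illustration only.
public class ByteOrderDemo {

    // Encode doubles as raw 8-byte values in the given byte order.
    static byte[] encode(double[] values, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Double.BYTES).order(order);
        for (double v : values) {
            buf.putDouble(v);
        }
        return buf.array();
    }

    // Decode raw bytes, interpreting them in the given byte order.
    static double[] decode(byte[] raw, ByteOrder order) {
        double[] values = new double[raw.length / Double.BYTES];
        ByteBuffer.wrap(raw).order(order).asDoubleBuffer().get(values);
        return values;
    }

    public static void main(String[] args) {
        // Data as a little-endian writer (e.g. a typical C program on x86) would produce it.
        byte[] little = encode(new double[] {1.0}, ByteOrder.LITTLE_ENDIAN);

        // Matching byte order recovers the value.
        System.out.println(decode(little, ByteOrder.LITTLE_ENDIAN)[0]);

        // Java's default (big-endian) interpretation yields a different number.
        System.out.println(decode(little, ByteOrder.BIG_ENDIAN)[0]);
    }
}
```

A reader that lets the user choose the byte order, or a format that stores it in a small header, avoids this mismatch.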