Loading large data

wessel Member Posts: 537 Maven
edited November 2018 in Help
Hello,

An image dataset.
1250 features, 2000 positive, 2000 negative examples.

In .mat format 32MB.
In ASCII .csv format 58MB

Every time I start my RM process, this dataset takes 23 seconds to load.
Is there any way to keep the dataset cached?
Also I'd like to cache my PCA.
Or cache the transformed dataset.

PCA in MATLAB takes about 30 seconds; PCA in RapidMiner takes about 3 minutes.
That is a factor of 6.

Why is MATLAB faster?

Regards,

Wessel Luijben

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    there are several possible reasons why the MATLAB format is smaller: it might store the data in a binary format, or it might use a less precise numeric format. If you want a small RapidMiner format, you could store the data in binary using the IOObjectWriter, or wait for RapidMiner 5.

    No, there is currently no way to cache the data. Although caching sounds easy, it is not: data is modified during the process, so a cached version might get corrupted. To avoid this, you would need to keep a complete copy in memory, which would cause memory problems on exactly the larger data sets where a cache would be most helpful.
    To speed up loading, there are two possibilities: work on a subset of the data while designing the process and use the full data only for the final run, or use RapidMiner's binary format, which is not guaranteed to be compatible with future versions but should be faster and is fine for temporary copies.
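
    The trade-off described above (a cache is only safe if every load gets its own complete copy) can be sketched outside RapidMiner. The following is a minimal Python illustration, not a RapidMiner feature: the names parse_csv and load_cached, the file names, and the data are all made up for the example, and the cache is a plain pickle file that is reused as long as it is not older than the source file.

```python
import os
import pickle
import tempfile

def parse_csv(path, separator=";"):
    """Slow path: split every line and convert each field to a float."""
    with open(path) as fh:
        return [[float(field) for field in line.split(separator)]
                for line in fh if line.strip()]

def load_cached(source_path, cache_path, parse=parse_csv):
    """Return the parsed data, reusing the cache file while it is up to date."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)  # a fresh copy, independent of earlier loads
    data = parse(source_path)
    with open(cache_path, "wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return data

# Usage with a tiny, made-up file in a temporary directory.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "data.csv")
cache = os.path.join(workdir, "data.cache")
with open(source, "w") as fh:
    fh.write("1;2;3\n4;5;6\n")

first = load_cached(source, cache)   # parses the CSV and writes the cache
first[0][0] = -1.0                   # modifying the loaded copy...
second = load_cached(source, cache)  # ...does not affect the cached version
print(second)
```

    Because each load deserializes its own copy, changes made during a run cannot corrupt the cache, but the price is the full memory footprint for every copy, which is the problem mentioned above for large data sets.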

    MATLAB is written in C as far as I know, which gives it a fair performance advantage over a Java program like RapidMiner that is not compiled to native code. On the other hand, Java runs on nearly every computer platform available, even on my mobile phone...
    Besides this, there are many different algorithms for calculating the eigenvectors and eigenvalues needed for PCA. Chances are that MATLAB uses a highly tuned and optimized algorithm. Feel free to adapt such an algorithm for RapidMiner; we would gratefully include it in the core :)
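
    To make "many different algorithms" concrete, here is a small sketch of one of the simplest ones, power iteration, which finds only the first principal component. It is an illustration under stated assumptions: it is not the algorithm used by MATLAB or by RapidMiner, the data is made up, and the function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equally long numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def multiply(matrix, vector):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def power_iteration(matrix, iterations=200):
    """Approximate the dominant eigenvalue and eigenvector of a symmetric matrix."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d  # start vector; must not be orthogonal to the answer
    for _ in range(iterations):
        w = multiply(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, multiply(matrix, v)))  # Rayleigh quotient
    return eigenvalue, v

# Made-up data: four examples with two attributes.
rows = [[2.0, 1.0], [4.0, 3.0], [6.0, 3.0], [8.0, 5.0]]
cov = covariance_matrix(rows)
eigenvalue, component = power_iteration(cov)
print(eigenvalue, component)
```

    A full PCA needs all eigenvectors, and tuned numerical libraries use more sophisticated decompositions than this, which is one plausible source of large runtime differences between tools.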

    Greetings,
      Sebastian

  • wessel Member Posts: 537 Maven
    I used the IOObjectWriter to write a 54.2 MB .csv file, it became 137 MB! xD
    That is the opposite result :(
  • haddock Member Posts: 849 Maven
    Gosh Wessel, problems certainly seem to seek you out. If I run the following, it takes only a second to generate, write, and read back such an example set, and the file is only 39.5 MB.
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="simple non linear classification"/>
            <parameter key="number_examples" value="4000"/>
            <parameter key="number_of_attributes" value="1250"/>
        </operator>
        <operator name="IOObjectWriter" class="IOObjectWriter">
            <parameter key="object_file" value="C:\Users\CJFP\Documents\rm_workspace\wessel"/>
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="output_type" value="Binary"/>
        </operator>
        <operator name="IOObjectReader" class="IOObjectReader">
            <parameter key="object_file" value="C:\Users\CJFP\Documents\rm_workspace\wessel"/>
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
    </operator>
    Did you by any chance write out your CSV as XML as well?

  • wessel Member Posts: 537 Maven
    I'm sorry for running into problems.
    No, I did not make this mistake; I also used output_type: Binary!
    This is the dataset I used: http://77.93.77.78/download/MilkDataJoosten.csv

    MilkDataJoosten.csv 54.4 MB (57.064.974 bytes)  <-- load time 15s
    big.ioo 137 MB (144.474.351 bytes) <-- load time 6s

    Surprisingly big.ioo does load faster!


    <operator name="Root" class="Process" expanded="yes">
        <operator name="CSVExampleSource" class="CSVExampleSource">
            <parameter key="filename" value="D:\wessel\Desktop\MilkDataJoosten.csv"/>
            <parameter key="column_separators" value=";"/>
        </operator>
        <operator name="IOObjectWriter" class="IOObjectWriter">
            <parameter key="object_file" value="D:\wessel\Desktop\big.ioo"/>
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="output_type" value="Binary"/>
        </operator>
        <operator name="IOObjectReader" class="IOObjectReader" breakpoints="after">
            <parameter key="object_file" value="D:\wessel\Desktop\big.ioo"/>
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
    </operator>
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the result isn't very surprising if you take the internal encoding into account. When you create a data table in RapidMiner, its values are stored in double arrays. Even nominal values are stored this way: each one is an index that is mapped to a string. By the way, you can choose whether double, float or integer is used with the appropriate parameter of the loading operator.
    Your CSV file contains mainly integer values, for example "43". This is represented by two characters plus one separator character, hence 3 bytes. Each double consumes 8 bytes, so the required memory increases. Additionally, you have several missing values, which take up 1 byte in the CSV but 8 bytes as a double. Further memory is used for holding all the examples together and for storing additional information such as statistics. All of this is saved if you select the binary format.
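
    The size argument can be checked with a few lines of arithmetic. This is a rough sketch in Python rather than RapidMiner code: the helper names and the sample row are made up, and it counts only the raw values, ignoring the extra bookkeeping mentioned above.

```python
import struct

def csv_bytes(values, separator=";"):
    """Bytes needed to store the values as ASCII text, one separator per value."""
    return sum(len(str(v)) + len(separator) for v in values)

def double_bytes(values):
    """Bytes needed to store every value as an 8-byte IEEE 754 double."""
    return len(values) * struct.calcsize("<d")

# Hypothetical row: four two-digit integers and one missing value (empty field).
row = [43, 17, 99, "", 25]

text_size = csv_bytes(row)       # 4 values * 3 bytes + 1 byte for the empty field
binary_size = double_bytes(row)  # 5 values * 8 bytes

print(text_size, binary_size)
```

    For this made-up row the text needs 13 bytes and the doubles need 40 bytes. The files reported earlier in the thread (57.064.974 bytes as CSV, 144.474.351 bytes as .ioo) differ by a factor of about 2.5, which is in the same range.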

    But why does this load faster? Simply because Java only has to read the file and copy it directly into memory. No parsing, interpreting, object creation or repeated memory allocation is needed.
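
    The difference between the two loading routes can be illustrated as follows. This is a Python sketch of the general idea, not of RapidMiner's actual file format; the data is generated on the spot, and no timings are claimed.

```python
import array

# Made-up data: 10,000 small integer values stored as floats.
values = [float(i % 100) for i in range(10_000)]

# Text route: every field has to be split off and parsed into a number.
csv_text = ";".join(str(int(v)) for v in values)
parsed = [float(field) for field in csv_text.split(";")]

# Binary route: the raw 8-byte doubles are copied into an array in one step.
raw = array.array("d", values).tobytes()
loaded = array.array("d")
loaded.frombytes(raw)

print(len(csv_text), len(raw))
```

    Both routes recover the same values, but the binary one does no per-field work, while the text one is smaller on disk.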

    Greetings,
      Sebastian
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Let me add that the binary format of course has its disadvantages: if anything in the class structure changes, the bytes in the file can no longer be interpreted correctly, so the file cannot be read anymore. For storing information over a longer time, you must not use the binary format...
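
    One common safeguard for this kind of fragile binary file is a version tag, so that a stale file fails with a clear message instead of being misread. The sketch below is a generic Python illustration under that assumption; it does not describe what RapidMiner's binary format actually contains, and FORMAT_VERSION, dump_versioned and load_versioned are hypothetical names.

```python
import pickle

FORMAT_VERSION = 2  # hypothetical version number of the in-memory layout

def dump_versioned(obj, version=FORMAT_VERSION):
    """Serialize obj together with the layout version it was written with."""
    return pickle.dumps({"version": version, "payload": obj})

def load_versioned(blob):
    """Deserialize, refusing files written with a different layout version."""
    record = pickle.loads(blob)
    if record["version"] != FORMAT_VERSION:
        raise ValueError(
            f"file has layout version {record['version']}, "
            f"expected {FORMAT_VERSION}; re-export it from the original data"
        )
    return record["payload"]

current = dump_versioned([[43.0, 17.0], [99.0, 25.0]])
print(load_versioned(current))

stale = dump_versioned([[43.0, 17.0]], version=1)  # simulates an old temporary file
try:
    load_versioned(stale)
except ValueError as error:
    print("rejected:", error)
```

    This does not make old files readable; it only makes the failure explicit, which fits the advice to treat binary files as temporary copies.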

    Greetings,
      Sebastian
  • wessel Member Posts: 537 Maven
    Woa thanks!

    Your explanation is 100% clear.  :)
  • fischer Member Posts: 439 Maven
    Hi,

    Let me just add this:
    Reading text files like CSV is always a pain, and using Java serialization or XML serialization is even worse. This is what happens if you use the IOObjectWriter in 4.x.
    In 5, there is a custom serialization method for example sets, which should speed this up significantly.

    Cheers,
    Simon
  • Stefan_E Member Posts: 53 Maven
    Related to the subject heading - not necessarily to the thread content (Sorry!):

    In RM 5, it appears that Process / Validate Automatically is enabled by default and can't be configured to be off (at least I haven't found out how). I have a source with relatively slow DB queries, and the validation process seems to run them. I also keep forgetting to switch auto-validation off before it hits me...

    Greetings - Stefan
  • fischer Member Posts: 439 Maven
    Hi,

    I fixed this. Validate Automatically now remembers its state.

    Cheers,
    Simon
  • jwalter Member Posts: 7 Contributor II
    Sebastian Land wrote:

    Let me add that the binary format of course has its disadvantages: if anything in the class structure changes, the bytes in the file can no longer be interpreted correctly, so the file cannot be read anymore. For storing information over a longer time, you must not use the binary format...

    Greetings,
      Sebastian
    Are you referring to the file format (*.ioo / *.md) used for data in the repository concept?

    Why did you choose this format? It seems that repositories cannot hold, for example, CSV files.
    What is the meaning of the CONTENT files in a repository folder?
    How can I write such files with an external program? Can you provide documentation for that format?