
EXPORT Sparse Data

anagi Member Posts: 3 Contributor I
edited November 2018 in Help
Hello....

I am rather new to RapidMiner, so my apologies if this question is too basic.

I am trying to do some text mining on a relatively large dataset (>100 MB) with RapidMiner, and I would like to export the results (the TF-IDF vectors, after applying a tokenizer, a stemmer, and stop-word removal). The problem I have is that when I use a "CSV export" or "ARFF export" operator, the file I get is very large (>5 GB), even though the data is very sparse.

I am not sure whether CSV can hold sparse data at all, but WEKA writes sparse data in the ARFF file format, and RapidMiner can read sparse data.
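
For reference, this is roughly what a sparse ARFF file looks like: only the non-zero entries of each row are stored as "index value" pairs inside curly braces. (The attribute names and values below are just made up for illustration.)

    @relation tfidf_example
    @attribute apple numeric
    @attribute banana numeric
    @attribute cherry numeric
    @data
    {0 0.431, 2 0.178}
    {1 0.218}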

My question is: is it possible to instruct RapidMiner to make use of the sparsity of the data when exporting it to a file?

Cheers

Answers

    IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Of course this is possible (this is my default answer to all "is X possible" questions  ;D )

    The operator "Write Special Format" is your friend. Try the special format "$s[;][:]" for example if you want to separate the columns by ";" and the index of the attributes by ":". The "$s" means "sparse format". You can find more information in the help text of the operator.

    Here is a simple example process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
        <process expanded="true" height="145" width="279">
          <operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="write_special" compatibility="5.1.001" expanded="true" height="60" name="Write Special Format" width="90" x="179" y="30">
            <parameter key="example_set_file" value="C:\Users\Ingo\Desktop\sparse_result.txt.dat"/>
            <parameter key="special_format" value="$s[;][:]"/>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Write Special Format" to_port="input"/>
          <connect from_op="Write Special Format" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
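    With "$s[;][:]" the exported rows should then look roughly like "index:value" pairs for the non-zero entries, separated by ";", e.g. something along these lines for the Iris data (the exact layout, such as whether indices start at 0 or 1 and where the label ends up, may differ, so check the output on a small sample first):

    1:5.1;2:3.5;3:1.4;4:0.2
    1:4.9;2:3.0;3:1.4;4:0.2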
    Have fun!
    Ingo
    anagi Member Posts: 3 Contributor I
    Thank you very much for a quick and helpful reply... But if I may be greedy and ask another related question:

    The solution you provided writes the data without the attribute names (well, there is an option $v[name], but I am not sure how to use it?)

    What should I replace the name with? And if it is the name of an attribute (a column of the TF-IDF matrix), how do I populate this field without knowing a priori what the attribute names (the terms in the dictionary) are and how many of them there are?

    I want to produce a sparse ARFF file that contains the attribute names (similar to the one produced by WEKA). I would have thought that I could connect the output of an ARFF export operator to the input of the "Write Special Format" operator, or the other way around (mimicking a Unix pipe), but that does not produce the required output format.

    Any advice to a novice user will be much appreciated and very helpful for getting me going with RM  :)

    Cheers