
EXPORT Sparse Data

anagi Member Posts: 3 Contributor I
edited November 2018 in Help
Hello....

I am rather new to RapidMiner, so my apologies if this question is too basic.

I am trying to do some text mining on a relatively large dataset (>100 MB) with RapidMiner, and I would like to export the results (the TF-IDF vectors, after applying a tokenizer, a stemmer, and stop-word removal). The problem I have is that when I use a "CSV export" or "ARFF export" operator, the file I get is very large (>5 GB), even though the data is very sparse.

I am not sure whether CSV can hold sparse data at all, but WEKA writes sparse data in the ARFF file format, and RapidMiner can read sparse data.
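
For reference, this is roughly what a sparse ARFF file looks like: only the non-zero entries of each row are stored as "index value" pairs inside curly braces. (The attribute names and values below are just made up for illustration.)

    @relation tfidf_example
    @attribute apple numeric
    @attribute banana numeric
    @attribute cherry numeric
    @data
    {0 0.431, 2 0.178}
    {1 0.218}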

My question is: is it possible to instruct RapidMiner to make use of the sparsity of the data when exporting it to a file?

Cheers

Answers

    IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Of course this is possible (this is my default answer to all "is X possible" questions  ;D )

    The operator "Write Special Format" is your friend. Try the special format "$s[;][:]" for example if you want to separate the columns by ";" and the index of the attributes by ":". The "$s" means "sparse format". You can find more information in the help text of the operator.

    Here is a simple example process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
        <process expanded="true" height="145" width="279">
          <operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="write_special" compatibility="5.1.001" expanded="true" height="60" name="Write Special Format" width="90" x="179" y="30">
            <parameter key="example_set_file" value="C:\Users\Ingo\Desktop\sparse_result.txt.dat"/>
            <parameter key="special_format" value="$s[;][:]"/>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Write Special Format" to_port="input"/>
          <connect from_op="Write Special Format" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
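    With "$s[;][:]" the exported rows should then look roughly like "index:value" pairs for the non-zero entries, separated by ";", e.g. something along these lines for the Iris data (the exact layout, such as whether indices start at 0 or 1 and where the label ends up, may differ, so check the output on a small sample first):

    1:5.1;2:3.5;3:1.4;4:0.2
    1:4.9;2:3.0;3:1.4;4:0.2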
    Have fun!
    Ingo
    anagi Member Posts: 3 Contributor I
    Thank you very much for a quick and helpful reply... But if I may be greedy and ask another related question:

    The solution you provided writes the data without the attribute names (well, there is an option $v[name], but I am not sure how to use it?)

    What should I replace the name with? And if it is the name of an attribute (a column of the TF-IDF matrix), how do I populate this field without knowing a priori what the attribute names (the terms in the dictionary) are and how many of them there are?

    I want to produce a sparse ARFF file that contains the attribute names (similar to the one produced by WEKA). I would have thought that I could connect the output of an ARFF export operator to the input of the "Write Special Format" operator, or the other way around (mimicking a Unix pipe), but that does not produce the required output format.

    Any advice to a novice user will be much appreciated and very helpful for getting me going with RM  :)

    Cheers