"read from Excel/CSV"

CaptainChaos · September 2011

Hi Guys,

Can somebody explain to me how I can tell RapidMiner to treat each line under "A" as a separate document and each line under "B" as its ID?
I would like to add a Data to Similarity operator to it, but for that each line has to be classified as a document. Does anybody know an operator that can do this?

Thanks

MariusHelf · September 2011

Hello CaptainChaos,

Did you try the wizards in the Read Excel/Read CSV operators? There you can define the role of each column, so you can set the ID role for column B. Hope this helps; if not, please tell me what exactly a "document" in your files looks like.

Cheers,
Marius
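
As a rough illustration of the suggestion above: the roles chosen in the import wizard end up in the operator's data_set_meta_data_information list. The fragment below is only a sketch that belongs inside a process. The file path, the column names A and B, and the integer type for the ID column are assumptions; the entry format (name, selected flag, value type, role) follows the process XML posted further down in this thread.

```xml
<!-- Sketch only: file path, column names and the ID value type are placeholders. -->
<operator activated="true" class="read_csv" name="Read CSV">
  <parameter key="csv_file" value="C:\path\to\your\file.csv"/>
  <parameter key="first_row_as_names" value="false"/>
  <list key="annotations"/>
  <parameter key="encoding" value="UTF-8"/>
  <list key="data_set_meta_data_information">
    <!-- Column 0 holds the text and stays a regular attribute. -->
    <parameter key="0" value="A.true.text.attribute"/>
    <!-- Column 1 is given the id role. -->
    <parameter key="1" value="B.true.integer.id"/>
  </list>
</operator>
```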

CaptainChaos · September 2011

Hi Marius,

I tried all the wizards, but they don't help me do what I want. I know I can choose the attribute type for a column there, but this hasn't helped me so far.

At the moment I have just one column (I changed it) in Excel, column "A".
Each row of "A" contains some text. I would just like RapidMiner to treat each of them as its own document.

Thanks
Regards

JEdward · September 2011

So you have a document that splits the data across two rows?
There's probably a simpler way, but you could do it by converting into XML and then back again.

For example:
I created a CSV file called test.csv with the following structure:


Data
1
Record
2
Information
3

Then made the following process to convert it to XML in the following structure:

<Document><B>Data<A>1</A></B><B>Record<A>2</A></B><B>Information<A>3</A></B></Document>

The process then reads in the XML file and changes it into data.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <process expanded="true" height="657" width="748">
      <operator activated="true" class="read_csv" compatibility="5.1.011" expanded="true" height="60" name="Read CSV" width="90" x="45" y="75">
        <parameter key="csv_file" value="C:\Users\jedward\Desktop\test.csv"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <parameter key="encoding" value="UTF-8"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="att1.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="45" y="210"/>
      <operator activated="true" class="generate_attributes" compatibility="5.1.011" expanded="true" height="76" name="Generate Attributes" width="90" x="179" y="165">
        <list key="function_descriptions">
          <parameter key="XML" value="if((ceil((id/2))==(id/2)),concat(&quot;&lt;A&gt;&quot;,att1,&quot;&lt;/A&gt;&lt;/B&gt;&quot;),concat(&quot;&lt;B&gt;&quot;,att1,&quot;&quot;))"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.1.011" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="120">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="XML"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="generate_data_user_specification" compatibility="5.1.011" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="246" y="30">
        <list key="attribute_values">
          <parameter key="XML" value="&quot;&lt;Document&gt;&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="generate_data_user_specification" compatibility="5.1.011" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="380" y="390">
        <list key="attribute_values">
          <parameter key="XML" value="&quot;&lt;/Document&gt;&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="append" compatibility="5.1.011" expanded="true" height="112" name="Append" width="90" x="447" y="75"/>
      <operator activated="true" class="write_special" compatibility="5.1.011" expanded="true" height="60" name="Write Special Format" width="90" x="581" y="120">
        <parameter key="example_set_file" value="C:\Users\jedward\Desktop\testXML.xml"/>
        <parameter key="special_format" value="$a"/>
        <parameter key="add_line_separator" value="false"/>
        <parameter key="quote_nominal_values" value="false"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <operator activated="true" class="subprocess" compatibility="5.1.011" expanded="true" height="76" name="Subprocess" width="90" x="581" y="345">
        <process expanded="true" height="568" width="488">
          <operator activated="true" class="read_xml" compatibility="5.1.011" expanded="true" height="60" name="Read XML" width="90" x="190" y="218">
            <parameter key="file" value="C:\Users\jedward\Desktop\testXML.xml"/>
            <parameter key="xpath_for_examples" value="//Document/B"/>
            <enumeration key="xpaths_for_attributes">
              <parameter key="xpath_for_attribute" value="text()"/>
              <parameter key="xpath_for_attribute" value="A[1]/text()"/>
            </enumeration>
            <list key="namespaces"/>
            <parameter key="use_default_namespace" value="false"/>
            <parameter key="parse_numbers" value="false"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="DataHeader.true.polynominal.attribute"/>
              <parameter key="1" value="ID.true.integer.id"/>
            </list>
          </operator>
          <connect from_op="Read XML" from_port="output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 3"/>
      <connect from_op="Append" from_port="merged set" to_op="Write Special Format" to_port="input"/>
      <connect from_op="Write Special Format" from_port="through" to_op="Subprocess" to_port="in 1"/>
      <connect from_op="Subprocess" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Probably not at all what you were after, but it was a fun process to build & might be useful for other tasks.

Best regards,
JEdward.

colo · September 2011

Hi,

it seems hard to understand what you're after... If you have an example set, each line is an example, and this is usually the correct format for most operators. If you want to do something with each single example, then the operator "Loop Examples" is probably the right tool. Using IDs for examples is possible by creating new ones via "Generate ID" or by setting existing columns to the ID role using "Set Role".

When talking about documents this usually refers to the document datatype of the text processing extension and is only used in text and web mining context.

I am not familiar with the "Data to Similarity" operator, but this one requires an example set as input. So your data should already have the right format. If you want to do something for only one example isolated from all the others, use "Loop Examples" and put the example processing inside this operator.

For further support, it would be useful if you posted your process as far as you have built it and described where things are not working and what you would like to do differently.

Regards
Matthias

P.S. Please don't post similar questions to other forums if they are not answered immediately. Specific questions such as yours in particular should be posted here instead of in the general data mining forum.
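
A minimal sketch of the "Set Role" route mentioned above, meant to sit inside a process after a reader operator. The attribute name B and the operator name "Read Excel" are placeholders, and the parameter keys are written from memory of the 5.x process format, so they are best checked against a process exported from your own RapidMiner version.

```xml
<!-- Sketch only: "B" stands for whichever column holds the IDs. -->
<operator activated="true" class="set_role" name="Set Role">
  <parameter key="name" value="B"/>
  <parameter key="target_role" value="id"/>
  <list key="set_additional_roles"/>
</operator>
<!-- Feed the imported example set into Set Role. -->
<connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
```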

CaptainChaos · September 2011

Hi,

Look, I have an Excel file with data only in column A (A1:A3000).
The structure looks like this:

A
Text1........
Text2..........
Text3.......
..
...
Text3000

I know that I can loop through the file, but when I want to work with the data later on, the problem is that the operator takes the whole text of one row and compares it against another row (as if it were one term). I want each row to be recognized as a single document, so that the words inside this row/document can be compared to those of another row/document. At the moment my Process Documents operator just takes the whole row as one term and compares it against another row.
I hope I have made it a bit clearer what I want. I am posting my code here; maybe one of you can then understand what my problem is.



<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <process expanded="true" height="566" width="547">
      <operator activated="true" class="read_excel" compatibility="5.1.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
        <parameter key="excel_file" value="C:\Users\userDesktop\read\dok.xls"/>
        <parameter key="imported_cell_range" value="A1:A100"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="content.true.text.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="5.1.001" expanded="true" height="60" name="Data to Documents" width="90" x="200" y="183">
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" height="94" name="Process Documents" width="90" x="380" y="165">
        <parameter key="keep_text" value="true"/>
        <process expanded="true" height="580" width="593">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Thanks again. It seems that you are all having a hard time with me :P

colo · September 2011

Hi,

try adding the operator "Tokenize" inside the "Process Documents" operator. Otherwise the word vector consists of only one word (the whole text). You can also add other preprocessing operators at this place, e.g. "Transform Cases" or "Filter Stopwords".

Hope this is what you are looking for...

Regards
Matthias
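
For reference, a sketch of how the inner process of "Process Documents" might look with the suggested operators added. This is not taken from the original posts: the operator class keys (text:tokenize, text:transform_cases) and the port names are assumptions based on the 5.x Text Processing extension, and all parameters are left at their defaults.

```xml
<operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" name="Process Documents">
  <parameter key="keep_text" value="true"/>
  <process expanded="true">
    <!-- Split each document (one row of the Excel column) into word tokens. -->
    <operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" name="Tokenize"/>
    <!-- Optional preprocessing step on the tokens. -->
    <operator activated="true" class="text:transform_cases" compatibility="5.1.001" expanded="true" name="Transform Cases"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
  </process>
</operator>
```

With a tokenizer in place, the word vector should get one attribute per distinct token instead of a single attribute holding the whole row text, which is what a similarity comparison between rows needs.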

Altair RapidMiner Community