[SOLVED] Concatenation/grouping of rows based on a shared ID

nennat Member Posts: 9 Contributor II
edited November 2018 in Help
I have been looking for a while but I couldn't find a solution to my problem, so maybe you guys know.

I have performed a web crawl of a forum (with a separate web crawler), and the data from this crawl has been written to a CSV file.
My problem now is that every entry (original post and replies) is written on a separate row.

The format of my CSV is as follows: the first column holds the title of the topic, the next column the title of the post, and the last column the actual text of each post.

How can I either combine all the text of one topic into one row, or create separate files per topic containing all the text of that topic's posts?
I have added a link to a Google Docs spreadsheet with a small sample of my data: http://bit.ly/VDXeNU

Thanks a lot in advance!

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    Supposing your first column is called topic_name, you can use the following process to split the big file into smaller ones, each containing only the rows of one topic.

    Best,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="145" width="413">
          <operator activated="true" class="read_csv" compatibility="5.3.000" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="loop_values" compatibility="5.3.000" expanded="true" height="60" name="Loop Values" width="90" x="179" y="30">
            <parameter key="attribute" value="topic_name"/>
            <process expanded="true" height="562" width="718">
              <operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" height="76" name="Filter Examples" width="90" x="112" y="30">
                <parameter key="condition_class" value="attribute_value_filter"/>
                <parameter key="parameter_string" value="topic_name = %{loop_value}"/>
              </operator>
              <operator activated="true" class="write_csv" compatibility="5.3.000" expanded="true" height="76" name="Write CSV" width="90" x="246" y="30"/>
              <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_op="Write CSV" to_port="input"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Loop Values" to_port="example set"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
  • nennat Member Posts: 9 Contributor II
    I noticed that this way it only writes one very small CSV with a single topic. I could ask how to write multiple CSV files, but since I want to draw a sample later on, it might be smarter to ask how to write the results to multiple documents with the topic names as the file names. Is this possible? And if so, how?
    Thanks a lot in advance!
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi, to write one file per topic with the topic title as the file name, you have to use the loop macro of the Loop Values operator as the filename in Write CSV. It should already have been in my previous post, but somehow it got lost. So just enter %{loop_value} as the filename in Write CSV. If I understood your requirements correctly, this should do what you need.

    If you are not familiar with macros, you should experiment a bit with e.g. Set Macro etc.

    Happy Mining!
    ~Marius
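
    As a minimal sketch of that setting, the Write CSV operator from the process above could look roughly like this once the loop macro is used as the file name. The csv_file parameter key and the output folder C:/topics are assumptions made for illustration; they are not part of the process posted earlier.

    ```xml
    <operator activated="true" class="write_csv" compatibility="5.3.000" expanded="true" height="76" name="Write CSV" width="90" x="246" y="30">
      <!-- Assumed output folder; %{loop_value} expands to the topic_name of the current iteration -->
      <parameter key="csv_file" value="C:/topics/%{loop_value}.csv"/>
    </operator>
    ```

    With this in place, each iteration of Loop Values would write to a different file instead of overwriting the same one.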
  • nennat Member Posts: 9 Contributor II
    Thanks! It works, but I think it can't handle slashes. I tried to get rid of them, either by removing document parts or by replacing them, but neither approach seems to work with the CSV file as input, so I am stranded (again). But maybe my entire diagnosis is wrong, so I attached a screenshot of the error I got.
    [image: screenshot of the error message]
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    If you want to get rid of the slashes, you can use Generate Macro with the replace() function.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="538" width="692">
          <operator activated="true" class="set_macro" compatibility="5.3.000" expanded="true" height="76" name="Set Macro" width="90" x="45" y="30">
            <parameter key="macro" value="macro"/>
            <parameter key="value" value="macro/with/slashes\and\backslashes"/>
          </operator>
          <operator activated="true" class="generate_macro" compatibility="5.3.000" expanded="true" height="76" name="Generate Macro" width="90" x="179" y="30">
            <list key="function_descriptions">
              <parameter key="cleanedMacro" value="replace(macro(&quot;macro&quot;), &quot;/&quot;, &quot;_&quot;)"/>
            </list>
          </operator>
          <operator activated="true" class="generate_macro" compatibility="5.3.000" expanded="true" height="76" name="Generate Macro (2)" width="90" x="313" y="30">
            <list key="function_descriptions">
              <parameter key="cleanedMacro" value="replace(macro(&quot;cleanedMacro&quot;), &quot;\\&quot;, &quot;_&quot;)"/>
            </list>
          </operator>
          <operator activated="true" class="print_to_console" compatibility="5.3.000" expanded="true" height="76" name="Print to Console" width="90" x="447" y="30">
            <parameter key="log_value" value="cleaned: %{cleanedMacro}"/>
          </operator>
          <connect from_op="Set Macro" from_port="through 1" to_op="Generate Macro" to_port="through 1"/>
          <connect from_op="Generate Macro" from_port="through 1" to_op="Generate Macro (2)" to_port="through 1"/>
          <connect from_op="Generate Macro (2)" from_port="through 1" to_op="Print to Console" to_port="through 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
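
    The two ideas in this thread can be combined. The following is only a sketch of what the inner process of Loop Values might look like when the loop macro is cleaned before it is used as a file name. The macro name cleanedTopic, the folder C:/topics, and the csv_file parameter key are assumptions for illustration, and layout attributes are omitted.

    ```xml
    <!-- Sketch of the inner process of Loop Values (layout attributes omitted) -->
    <process expanded="true">
      <operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" name="Filter Examples">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="topic_name = %{loop_value}"/>
      </operator>
      <operator activated="true" class="generate_macro" compatibility="5.3.000" expanded="true" name="Generate Macro">
        <list key="function_descriptions">
          <!-- cleanedTopic is a hypothetical macro name: the topic title with "/" replaced by "_" -->
          <parameter key="cleanedTopic" value="replace(macro(&quot;loop_value&quot;), &quot;/&quot;, &quot;_&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="write_csv" compatibility="5.3.000" expanded="true" name="Write CSV">
        <!-- Assumed output folder; the cleaned macro is used instead of %{loop_value} -->
        <parameter key="csv_file" value="C:/topics/%{cleanedTopic}.csv"/>
      </operator>
      <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
      <!-- The example set is passed through Generate Macro so that it runs before Write CSV -->
      <connect from_op="Filter Examples" from_port="example set output" to_op="Generate Macro" to_port="through 1"/>
      <connect from_op="Generate Macro" from_port="through 1" to_op="Write CSV" to_port="input"/>
    </process>
    ```

    Backslashes or other characters that are not allowed in file names could be handled with a second Generate Macro, in the same way as in the process above.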