R-Script Output: File instead of ExampleSet

TobiasNehrig · December 2017

Hi Gurus,

I have an issue with the output of an R script and I can't find my mistake.

The subprocess Bigrams contains my two R scripts. The first R script prints to the console and generates an ExampleSet (Generated Bigrams), but the second R script only prints to the console and does not generate an ExampleSet (Count Bigrams). For the Count Bigrams script I only get the tab "File (Count Bigrams)".

Is there anyone here who can see my mistake and maybe help me?

Regards Tobias

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="8.0.001" expanded="true" height="82" name="Crawler" width="90" x="45" y="289">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="34">
            <parameter key="url" value="http://www.spiegel.de"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+www.spiegel.+"/>
              <parameter key="follow_link_with_matching_url" value=".+spiegel.+|.+de.+"/>
            </list>
            <parameter key="max_crawl_depth" value="10"/>
            <parameter key="retrieve_as_html" value="true"/>
            <parameter key="add_content_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Users\Knecht Ruprecht\Documents"/>
            <parameter key="max_pages" value="10"/>
            <parameter key="max_page_size" value="500"/>
            <parameter key="user_agent" value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0"/>
          </operator>
          <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="246" y="34">
            <parameter key="link_attribute" value="Link"/>
            <parameter key="page_attribute" value="link"/>
            <parameter key="random_user_agent" value="true"/>
          </operator>
          <connect from_op="Crawl Web" from_port="example set" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="45" y="187">
        <parameter key="keep_text" value="true"/>
        <parameter key="data_management" value="memory-optimized"/>
        <list key="specify_weights">
          <parameter key="link" value="1.0"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
            <parameter key="minimum_text_block_length" value="2"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize Token" width="90" x="179" y="34">
            <parameter key="mode" value="linguistic tokens"/>
            <parameter key="language" value="German"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens a-zA-Z" width="90" x="313" y="34">
            <parameter key="condition" value="matches"/>
            <parameter key="regular_expression" value="[a-zA-Z]+"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="34"/>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Tokenize Token" to_port="document"/>
          <connect from_op="Tokenize Token" from_port="document" to_op="Filter Tokens a-zA-Z" to_port="document"/>
          <connect from_op="Filter Tokens a-zA-Z" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="124" name="Process Doc2Data" width="90" x="45" y="34"/>
      <operator activated="true" class="subprocess" compatibility="8.0.001" expanded="true" height="82" name="Filter tf-idf" width="90" x="179" y="238">
        <process expanded="true">
          <operator activated="true" class="transpose" compatibility="8.0.001" expanded="true" height="82" name="Ingress Transpose" width="90" x="45" y="34"/>
          <operator activated="true" class="filter_example_range" compatibility="8.0.001" expanded="true" height="82" name="Filter Example Range" width="90" x="179" y="34">
            <parameter key="first_example" value="1"/>
            <parameter key="last_example" value="15"/>
            <parameter key="invert_filter" value="true"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="8.0.001" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
            <parameter key="invert_filter" value="true"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="id.equals.text"/>
            </list>
          </operator>
          <operator activated="true" class="transpose" compatibility="8.0.001" expanded="true" height="82" name="tf-idf Transpose" width="90" x="447" y="34"/>
          <connect from_port="in 1" to_op="Ingress Transpose" to_port="example set input"/>
          <connect from_op="Ingress Transpose" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="tf-idf Transpose" to_port="example set input"/>
          <connect from_op="tf-idf Transpose" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="subprocess" compatibility="8.0.001" expanded="true" height="124" name="Splitting" width="90" x="179" y="85">
        <process expanded="true">
          <operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="text"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="8.0.001" expanded="true" height="82" name="Generate ID" width="90" x="45" y="136"/>
          <operator activated="true" class="rename" compatibility="8.0.001" expanded="true" height="82" name="Rename ID" width="90" x="45" y="238">
            <parameter key="old_name" value="id"/>
            <parameter key="new_name" value="Document"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="45" y="340">
            <parameter key="attribute_name" value="Document"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="rename" compatibility="8.0.001" expanded="true" height="82" name="Rename" width="90" x="179" y="34">
            <parameter key="old_name" value="text"/>
            <parameter key="new_name" value="word"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <operator activated="true" class="split" compatibility="8.0.001" expanded="true" height="82" name="Split" width="90" x="179" y="136">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="word"/>
            <parameter key="include_special_attributes" value="true"/>
            <parameter key="split_pattern" value="\s+"/>
          </operator>
          <operator activated="true" class="transpose" compatibility="8.0.001" expanded="true" height="82" name="Splitting Output" width="90" x="313" y="34"/>
          <operator activated="true" class="write_csv" compatibility="8.0.001" expanded="true" height="82" name="Write CSV" width="90" x="447" y="187">
            <parameter key="csv_file" value="/home/knecht/Master2017/Korpus/17-12-15-Textmining-Split.csv"/>
          </operator>
          <connect from_port="in 1" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Rename ID" to_port="example set input"/>
          <connect from_op="Rename ID" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_op="Split" to_port="example set input"/>
          <connect from_op="Split" from_port="example set output" to_op="Splitting Output" to_port="example set input"/>
          <connect from_op="Split" from_port="original" to_port="out 3"/>
          <connect from_op="Splitting Output" from_port="example set output" to_port="out 1"/>
          <connect from_op="Splitting Output" from_port="original" to_op="Write CSV" to_port="input"/>
          <connect from_op="Write CSV" from_port="through" to_port="out 2"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
          <portSpacing port="sink_out 3" spacing="0"/>
          <portSpacing port="sink_out 4" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="subprocess" compatibility="8.0.001" expanded="true" height="103" name="Bigrams" width="90" x="313" y="136">
        <process expanded="true">
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Generate Bigrams" width="90" x="45" y="34">
            <parameter key="script" value="# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;rm_main = function(data)&#10;{&#10;&#9;library(dplyr)&#10;&#9;library(tidytext)&#10;&#10;&#9;spon_bigrams &lt;- data %&gt;%&#10;&#9;  unnest_tokens(bigram, word, token = &quot;ngrams&quot;, n = 2)&#10;&#9;print(spon_bigrams)&#10;&#10;    return(list(spon_bigrams))    &#10;}&#10;"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Generated Bigrams" width="90" x="179" y="34"/>
          <operator activated="true" class="r_scripting:execute_r" compatibility="7.2.000" expanded="true" height="82" name="Count Bigrams" width="90" x="313" y="85">
            <parameter key="script" value="# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;rm_main = function(data)&#10;{&#10;&#9;library(dplyr)&#10;&#9;library(tidytext)&#10;&#10;&#9;spon_bigrams &lt;- data %&gt;%&#10;&#9;  count(bigram,sort=TRUE)&#10;&#9;print(spon_bigrams)&#10;    &#10;    return(list(spon_bigrams))&#10;}&#10;"/>
          </operator>
          <connect from_port="in 1" to_op="Generate Bigrams" to_port="input 1"/>
          <connect from_op="Generate Bigrams" from_port="output 1" to_op="Generated Bigrams" to_port="input"/>
          <connect from_op="Generated Bigrams" from_port="output 1" to_port="out 1"/>
          <connect from_op="Generated Bigrams" from_port="output 2" to_op="Count Bigrams" to_port="input 1"/>
          <connect from_op="Count Bigrams" from_port="output 1" to_port="out 2"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
          <portSpacing port="sink_out 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Crawler" from_port="out 1" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Process Doc2Data" to_port="input"/>
      <connect from_op="Process Doc2Data" from_port="output 1" to_port="result 1"/>
      <connect from_op="Process Doc2Data" from_port="output 2" to_op="Splitting" to_port="in 1"/>
      <connect from_op="Process Doc2Data" from_port="output 3" to_op="Filter tf-idf" to_port="in 1"/>
      <connect from_op="Filter tf-idf" from_port="out 1" to_port="result 6"/>
      <connect from_op="Splitting" from_port="out 1" to_port="result 2"/>
      <connect from_op="Splitting" from_port="out 2" to_port="result 3"/>
      <connect from_op="Splitting" from_port="out 3" to_op="Bigrams" to_port="in 1"/>
      <connect from_op="Bigrams" from_port="out 1" to_port="result 4"/>
      <connect from_op="Bigrams" from_port="out 2" to_port="result 5"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
      <portSpacing port="sink_result 6" spacing="0"/>
      <portSpacing port="sink_result 7" spacing="0"/>
    </process>
  </operator>
</process>

MartinLiebig · December 2017

Hi @TobiasNehrig,

Our R extension translates every data.table object that a script returns into an RM ExampleSet. Everything that is not a data.table is returned as a file object. These file objects are not usable with native RM operators, but they can be used in other Execute R operators.

Most likely your second object is not of type data.table. I am not an R guru, so I can't check it myself. Maybe you can have a look, or @DArnu / @yyhuang can help.

Best,

Martin
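One way to check Martin's suggestion is to print the class of the object just before it is returned. The following is only a minimal sketch based on the Count Bigrams script from the question; it assumes dplyr is installed and that the input has a column named bigram. What the class turns out to be depends on the input and on the package versions, so no particular output is claimed here.

```r
# rm_main is the function that Execute R calls; `data` is the first input port.
rm_main = function(data)
{
    library(dplyr)

    # Same counting step as in the original Count Bigrams script.
    spon_bigrams <- data %>%
      count(bigram, sort = TRUE)

    # Diagnostic only: shows which class(es) the result carries,
    # so it can be compared with what the extension converts.
    print(class(spon_bigrams))

    return(list(spon_bigrams))
}
```

If the printed class is not one the extension converts, that would match the "File" tab described in the question.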

sgenzer · December 2017

Not an RM8 release issue. Moving to the normal RM Studio forum.

Scott

TobiasNehrig · December 2017

Hi @mschmitz,

Thank you very much for your hint. With counted_bigrams <- data.frame(count_bigrams) I now get my output.

regards,

Tobias
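For later readers, the fix Tobias describes could look roughly like the sketch below when applied to the Count Bigrams script from the question. This is a reconstruction, not his actual script: the variable names are illustrative, as.data.frame() is used here in place of the data.frame() call he mentions, and it assumes, as his result suggests, that the extension accepts a plain data frame.

```r
# rm_main is the function that Execute R calls; `data` is the first input port.
rm_main = function(data)
{
    library(dplyr)

    # Count each bigram, most frequent first.
    counted_bigrams <- data %>%
      count(bigram, sort = TRUE)

    # Assumption: coercing to a plain data.frame drops any extra classes
    # (for example a tibble class) that the extension may not convert.
    counted_bigrams <- as.data.frame(counted_bigrams)

    return(list(counted_bigrams))
}
```

The only change from the original script is the coercion step before the return.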
