
Issue With Loop Files Operator

thapli_64 Member Posts: 18 Maven
edited December 2018 in Help

Hi all,

 

I'm new to the forum and RapidMiner so excuse any redundancies or lack of details.

 

I am working with the process from Chapter 14 (Robust Language Identification) of RapidMiner: Data Mining Use Cases and Business Analytics Applications, published by CRC Press. The process was downloaded from here: http://rapidminerbook.com/index.php/chapter-downloads-13-24/chapter-14/

 

Attachment 1 shows a screenshot of the process and attachment 2 shows the Loop Files sub-process.

 

I successfully loaded the process, and downloaded the language corpora from  http://corpora.informatik.uni-leipzig.de/download.html

 

I changed the directory for the Loop Files operator to read from the folder where the corpora are stored. There are five files in the directory (German, English, French, Portuguese and Spanish). The Loop Files operator seems to be successfully reading all of them, but it gives a 6th output which seems nonsensical. Attachment 3 shows the expected output for any language file (English in this case). Attachment 4 shows the nonsensical output. Attachment 5 shows the error thrown, presumably by the nonsense output. Could someone tell me why it's happening and how to fix it? Thanks!

1.png 264.5K
2.png 292.5K
3.png 339.8K
4.png 237.5K
5.png 287K

Best Answer

  • thapli_64 Member Posts: 18 Maven
    Solution Accepted

    So, I was able to solve the issue (with some debugging help from a colleague- always good to have someone to talk things through with)! :D

     

    I set up regex filtering (.*\.txt$) in the Loop Files operator so that it only reads in the desired files, in this case the 5 language files ending in .txt.
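    To see why this filter helps, here is a minimal Python sketch (not RapidMiner code) that applies the same pattern to a made-up directory listing. The file names are hypothetical stand-ins for the five corpus files, and the sketch assumes the filter has to match the whole file name, which is what `re.fullmatch` does here.

```python
import re

# Pattern quoted in the post above; fullmatch requires the whole name to match.
TXT_ONLY = re.compile(r".*\.txt$")


def keep_text_files(names):
    """Return only the file names accepted by the .txt filter, in input order."""
    return [name for name in names if TXT_ONLY.fullmatch(name)]


# Hypothetical listing: five corpus files plus the hidden macOS metadata file.
listing = [
    ".DS_Store",
    "english.txt",
    "french.txt",
    "german.txt",
    "portuguese.txt",
    "spanish.txt",
]

kept = keep_text_files(listing)
print(kept)
```

    Only names ending in .txt survive, so a stray non-corpus file in the folder would never reach Read CSV.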

     

    There was, however, another error that cropped up after this was fixed: a duplicate attribute error with respect to the 'text' attribute. This was due to the 'select attributes and weights' parameter in the Data to Documents operator being checked but no value being provided for it. It seems this was the case with the process as it was downloaded and not introduced through human error (or so I'm telling myself :P )

     

     

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @thapli_64 - welcome to the community.  So first, it would be much easier if you could please share your XML process in this thread (see "Helpful Reminders" on the right when you reply), as then we can truly replicate what you are doing.  Second, I just looked at that process from RapidMinerBook, and the Loop Files operator that is used has been deprecated since the last release:

     

    [Screenshot: deprecated loop files operator on left]
    [Screenshot: here's the new loop files operator]

     

    So I would suggest moving the operators from inside the old "Loop Files" into a new "Loop Files", rewiring them there, and trying again.  Then paste your XML here and we will see what you have.


    Scott

      

  • thapli_64 Member Posts: 18 Maven

    Scott,

     

    Thanks for the welcome, reply and advice! :) I had already tried what you mentioned and it threw the same error. I have attached screenshots. See the XML below:

     

    Thanks!

    Racchit.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="false" class="loop_files" compatibility="6.4.000" expanded="true" height="124" name="Loop Files" width="90" x="715" y="340">
    <parameter key="directory" value="/Users/racchitthapliyal/Desktop/rapidMiner/RapidMiner_DataMiningBusinessAnalyticsApplications/Chapter 14/data/corpus"/>
    <parameter key="filtered_string" value="full path (including file name)"/>
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
    <parameter key="csv_file" value="%{file_path}"/>
    <parameter key="column_separators" value="\t"/>
    <parameter key="use_quotes" value="false"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations"/>
    <parameter key="encoding" value="UTF-8"/>
    <list key="data_set_meta_data_information"/>
    <parameter key="read_not_matching_values_as_missings" value="false"/>
    </operator>
    <operator activated="true" class="rename" compatibility="7.6.001" expanded="true" height="82" name="Rename" width="90" x="45" y="187">
    <parameter key="old_name" value="att2"/>
    <parameter key="new_name" value="text"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="30">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="210">
    <list key="function_descriptions">
    <parameter key="language" value="replace(&quot;%{file_name}&quot;,&quot;.txt&quot;,&quot;&quot;)"/>
    </list>
    </operator>
    <operator activated="true" class="set_role" compatibility="5.3.013" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
    <parameter key="attribute_name" value="language"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="split_data" compatibility="7.6.001" expanded="true" height="103" name="Split Data" width="90" x="313" y="289">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.3"/>
    <parameter key="ratio" value="0.7"/>
    </enumeration>
    <parameter key="sampling_type" value="shuffled sampling"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="447" y="210"/>
    <operator activated="true" class="nominal_to_text" compatibility="6.0.003" expanded="true" height="82" name="Nominal to Text (4)" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="581" y="30">
    <list key="specify_weights"/>
    </operator>
    <operator activated="true" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents" width="90" x="581" y="187"/>
    <connect from_port="file object" to_op="Read CSV" to_port="file"/>
    <connect from_op="Read CSV" from_port="output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Split Data" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 1" to_op="Multiply" to_port="input"/>
    <connect from_op="Split Data" from_port="partition 2" to_port="out 3"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Nominal to Text (4)" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_port="out 2"/>
    <connect from_op="Nominal to Text (4)" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
    <connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
    <connect from_op="Combine Documents" from_port="document" to_port="out 1"/>
    <portSpacing port="source_file object" spacing="0"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="180"/>
    <portSpacing port="sink_out 3" spacing="72"/>
    <portSpacing port="sink_out 4" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="concurrency:loop_files" compatibility="7.6.001" expanded="true" height="124" name="Loop Files (2)" width="90" x="45" y="85">
    <parameter key="directory" value="/Users/racchitthapliyal/Desktop/rapidMiner/RapidMiner_DataMiningBusinessAnalyticsApplications/Chapter 14/data/corpus"/>
    <parameter key="enable_macros" value="true"/>
    <parameter key="enable_parallel_execution" value="false"/>
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV (2)" width="90" x="112" y="242">
    <parameter key="csv_file" value="%{file_path}"/>
    <parameter key="column_separators" value="\t"/>
    <parameter key="use_quotes" value="false"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations"/>
    <parameter key="encoding" value="UTF-8"/>
    <list key="data_set_meta_data_information"/>
    <parameter key="read_not_matching_values_as_missings" value="false"/>
    </operator>
    <operator activated="true" class="rename" compatibility="7.6.001" expanded="true" height="82" name="Rename (2)" width="90" x="112" y="395">
    <parameter key="old_name" value="att2"/>
    <parameter key="new_name" value="text"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="246" y="238">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="246" y="418">
    <list key="function_descriptions">
    <parameter key="language" value="replace(&quot;%{file_name}&quot;,&quot;.txt&quot;,&quot;&quot;)"/>
    </list>
    </operator>
    <operator activated="true" class="set_role" compatibility="5.3.013" expanded="true" height="82" name="Set Role (2)" width="90" x="380" y="242">
    <parameter key="attribute_name" value="language"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="split_data" compatibility="7.6.001" expanded="true" height="103" name="Split Data (2)" width="90" x="380" y="497">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.3"/>
    <parameter key="ratio" value="0.7"/>
    </enumeration>
    <parameter key="sampling_type" value="shuffled sampling"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply (2)" width="90" x="514" y="418"/>
    <operator activated="true" class="nominal_to_text" compatibility="6.0.003" expanded="true" height="82" name="Nominal to Text (3)" width="90" x="514" y="242">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents (2)" width="90" x="648" y="238">
    <list key="specify_weights"/>
    </operator>
    <operator activated="true" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents (2)" width="90" x="648" y="395"/>
    <connect from_port="file object" to_op="Read CSV (2)" to_port="file"/>
    <connect from_op="Read CSV (2)" from_port="output" to_op="Rename (2)" to_port="example set input"/>
    <connect from_op="Rename (2)" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
    <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
    <connect from_op="Set Role (2)" from_port="example set output" to_op="Split Data (2)" to_port="example set"/>
    <connect from_op="Split Data (2)" from_port="partition 1" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Split Data (2)" from_port="partition 2" to_port="output 3"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Nominal to Text (3)" to_port="example set input"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_port="output 2"/>
    <connect from_op="Nominal to Text (3)" from_port="example set output" to_op="Data to Documents (2)" to_port="example set"/>
    <connect from_op="Data to Documents (2)" from_port="documents" to_op="Combine Documents (2)" to_port="documents 1"/>
    <connect from_op="Combine Documents (2)" from_port="document" to_port="output 1"/>
    <portSpacing port="source_file object" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    <portSpacing port="sink_output 4" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="7.6.001" expanded="true" height="82" name="Append Test" width="90" x="179" y="210"/>
    <operator activated="true" class="nominal_to_text" compatibility="6.0.003" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="313" y="210">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="store" compatibility="7.6.001" expanded="true" height="68" name="Store Test" width="90" x="447" y="210">
    <parameter key="repository_entry" value="data/language_test"/>
    </operator>
    <operator activated="true" class="append" compatibility="7.6.001" expanded="true" height="82" name="Append Train" width="90" x="179" y="120"/>
    <operator activated="true" class="nominal_to_text" compatibility="6.0.003" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="120">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="store" compatibility="7.6.001" expanded="true" height="68" name="Store Train" width="90" x="447" y="120">
    <parameter key="repository_entry" value="data/language_train"/>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="30">
    <parameter key="text_attribute" value="text"/>
    <parameter key="label_attribute" value="language"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="5.3.013" expanded="true" height="82" name="Set Role Label" width="90" x="313" y="30">
    <parameter key="attribute_name" value="language"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="store" compatibility="7.6.001" expanded="true" height="68" name="Store Concatenated" width="90" x="447" y="30">
    <parameter key="repository_entry" value="data/language_concatenated"/>
    </operator>
    <connect from_op="Loop Files (2)" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Loop Files (2)" from_port="output 2" to_op="Append Train" to_port="example set 1"/>
    <connect from_op="Loop Files (2)" from_port="output 3" to_op="Append Test" to_port="example set 1"/>
    <connect from_op="Append Test" from_port="merged set" to_op="Nominal to Text (2)" to_port="example set input"/>
    <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Store Test" to_port="input"/>
    <connect from_op="Append Train" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Store Train" to_port="input"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Set Role Label" to_port="example set input"/>
    <connect from_op="Set Role Label" from_port="example set output" to_op="Store Concatenated" to_port="input"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    </process>
    </operator>
    </process>

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    The error is pointing to the fact that the attribute called language is not in the data set after it’s been processed by the process documents from data operator. Double check whether the attribute language is in the output before it reaches the set role operator.
  • thapli_64 Member Posts: 18 Maven

    Thomas,

     

    It is indeed in the data pulled from the 5 expected data files (see German corpus screenshot attached). The error is being thrown by the 6th unexpected input. I need to figure out how to get rid of that. It shouldn't be there.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @thapli_64 - ok thanks for that (please next time use the </> tool to insert your XML :)  ).  So the best way to debug this is to use "breakpoints" and see what the data looks like right before the operator that is causing the trouble:

     

    [Screenshot: Add a breakpoint]
    [Screenshot: breakpoint added]

    My guess is the same as @Thomas_Ott - you will see that the sixth time the attribute "language" is not there.


    Scott

     

  • thapli_64 Member Posts: 18 Maven

    Scott,

     

    Thanks again for the advice- I'm learning as I go! :)

     

    Yes, I have been using breakpoints to figure out what was happening, and that's how I discovered the extra data being read in. You are right, the error is indeed being thrown because the language attribute is missing at that point. I had discovered this earlier but didn't share it because I felt the root cause was the 6th dataset being read in, which shouldn't be there in the first place. Am I wrong to assume that? Is it supposed to be there?

     

    See screenshots attached. As you can see, I added a breakpoint to see what's being fed into the Set Role Label operator. The results screenshot shows that the text and language attributes are there for the 5 expected documents but missing for the 6th one. Now obviously the Loop Files sub-process is set up to deal with the actual language corpus files effectively, but not this error case. So it seems the error case shouldn't even be there at all. Please correct me if I'm wrong.

  • thapli_64 Member Posts: 18 Maven

    So @sgenzer, the culprit seems to be a '.DS_STORE' file being read in by the Read CSV operator from the directory, which in turn spits out an erroneous result (attachment 1).

     

    Attachment 2 shows the expected results for the deutsch file.

     

    Attachment 3 shows that there are only 5 files in the folder. I couldn't find any hidden files.

     

    How do I get around this? Why is the Loop Files/Read CSV operator reading this file?

    1.png 235.6K
    2.png 269.3K
    3.png 159.2K
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    ah it's funny, I was almost going to ask if you were on a Mac, because if you are and don't use a regular expression to filter out the .DS_STORE file, you're going to have issues.  It's a hidden file that causes all sorts of challenges.  Glad you sorted it out yourself.  :)


    Scott
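    Since the file browser did not show the extra file, one way to check what a directory really contains is to list it programmatically. The following Python sketch is only an illustration outside RapidMiner: `hidden_entries` is a hypothetical helper, and the demo runs on a throwaway temporary folder with made-up file names rather than the real corpus directory.

```python
import os
import tempfile


def hidden_entries(directory):
    """Return the names in `directory` that start with a dot, sorted."""
    return sorted(name for name in os.listdir(directory) if name.startswith("."))


# Demo on a temporary folder containing one dot-file and two visible files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in (".DS_Store", "english.txt", "german.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    found = hidden_entries(demo_dir)

print(found)
```

    Pointing `hidden_entries` at the corpus folder would reveal any dot-files that a file loop could pick up even though the file browser hides them.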

     

  • thapli_64 Member Posts: 18 Maven

    Thanks Scott! This was particularly vexing but the process was educational. I learned a lot about each of the operators, and developed more confidence with RapidMiner, through this debugging process- my first serious one with RM.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    my pleasure @thapli_64.  Enjoy the RapidMiner ride.  It's a blast.  :)


    Scott
