Which Amazon instance to choose for a "loop in loop" process requiring a huge amount of memory

EL75 Member Posts: 43 Contributor II
Hi everyone,
I have a "loop in loop" process:
- A Loop Values operator containing a subprocess that loads an example set of 1000 reviews to filter
- The above is nested in a Loop Attributes operator that loads a dictionary: a dataset of 15 columns containing all the words to be found in the reviews. The largest attribute contains 2500 values (rows).

It's impossible to run this process in RapidMiner Studio, which freezes after a while because of the number of columns created by the Loop Values operator: one column per word for each attribute column of the dictionary, 12660 columns in total.

I first launched the process in RapidMiner AI Hub on an r4.xlarge instance, but it crashed. I then tried a more powerful one, r4.4xlarge (16 vCPUs and 122 GiB of memory), but it crashed again after a few minutes.

Is there a way to define the instance design, in consideration of the number of columns?

Thanks in advance for any suggestions :)

cheers
 


Best Answer

  • EL75 Member Posts: 43 Contributor II
    Solution Accepted
    Thanks a lot for the solution provided! Great approach :)

Answers

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hello @EL75,

    Loops inside loops are computationally expensive: I think it's O(n^2), since every dictionary word is checked against every review. That is fairly undesirable in any programming language, not just RapidMiner Studio. Perhaps there is a way to simplify the search by applying some tricks? Also, it sounds like you would benefit from tokenizing words rather than using columns for your searches.
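As a rough illustration of that suggestion, here is a minimal Python sketch (not RapidMiner code) that contrasts a nested-loop comparison with a tokenize-then-look-up approach. The sample reviews, the word list and the function names are hypothetical, and the tokenizer simply splits on non-letter characters.

```python
import re

# Runs of letters; everything else acts as a separator.
TOKEN = re.compile(r"[^\W\d_]+")

def tokenize(text):
    """Lowercase the text and split it into letter-only tokens."""
    return TOKEN.findall(text.lower())

def match_nested(reviews, words):
    """Compare every dictionary word with every token of every review."""
    result = []
    for review in reviews:
        tokens = tokenize(review)
        found = set()
        for word in words:
            for token in tokens:
                if token == word:
                    found.add(word)
        result.append(found)
    return result

def match_tokenized(reviews, words):
    """Tokenize each review once, then intersect with a set of words."""
    vocabulary = set(words)
    return [vocabulary.intersection(tokenize(review)) for review in reviews]

# Hypothetical sample data.
reviews = ["The app crashes on login", "Great food, fast delivery"]
words = ["app", "food", "delivery", "pizza"]

nested_hits = match_nested(reviews, words)
tokenized_hits = match_tokenized(reviews, words)
```

Both functions return the same matches; the second one avoids the inner loops by relying on set lookups, so its cost grows with the amount of text rather than with text size times dictionary size.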

    Would you mind sharing your process with us so that we can check whether there is anything we can do?

    About your question: is there a way to define the instance design, in consideration of the number of columns?

    The number of columns alone isn't a real measure of memory consumption unless you know exactly how large the data is and how it's composed. I think your big issue isn't memory but optimization, though (I may be wrong, but it's worth a shot).
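A back-of-the-envelope estimate can make this concrete. The sketch below assumes a dense table of 8-byte numeric cells with the 1000 rows and 12660 columns mentioned in the question; the cell size is an assumption, and the estimate ignores per-object overhead and any intermediate copies the loops may create.

```python
def dense_size_mib(rows, columns, bytes_per_cell=8):
    """Size of a dense rows x columns table in MiB (assumed 8 bytes per cell)."""
    return rows * columns * bytes_per_cell / 2**20

rows, columns = 1000, 12660
size = dense_size_mib(rows, columns)
print(f"{size:.1f} MiB")
```

Under these assumptions the final table is on the order of 100 MiB, far below the 122 GiB of an r4.4xlarge, which is consistent with the idea that the process design rather than the table size is the bottleneck.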

    All the best,

    Rod.
  • EL75 Member Posts: 43 Contributor II
    Rod,
    Thanks a lot for your reply.
    Enclosed are the process file and 2 Excel files (dictionary and dataset).
    Normally I use local data repositories to optimise access time (and to share the project with RapidMiner AI Hub), but to share it with you I've changed the process to use Excel files and Read Excel operators.
    The goal of this work is to create an automatic labelling process in order to create a validation dataset for a deep learning classification task. Then this dataset will be manually validated. Therefore, the output must be columns with labels (the categories of the dictionary) containing ones or zeros.
    The process contains notes regarding pending questions.
    Thanks a lot for any suggestions!
    Have a good day!
    Love RapidMiner's capabilities and the RapidMiner community :)
    Best,
  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    edited June 2021
    Hi @EL75

    Please check if this is what you are trying to achieve.

    You'll need to install the Text Mining extension if you don't already have it.

    In the process, pay special attention to the vector creation parameter, and to the prune method (especially for memory handling): it will help you keep only the columns that actually matter, dropping those with no occurrences.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.9.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.9.002" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="9.9.002" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
            <parameter key="excel_file" value="C:\Users\MarcoBarradas\Downloads\dictionary.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="EATERS.true.polynominal.attribute"/>
              <parameter key="1" value="EATEN.true.polynominal.attribute"/>
              <parameter key="2" value="OTHERS.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
          </operator>
          <operator activated="true" class="read_excel" compatibility="9.9.002" expanded="true" height="68" name="Read Excel (2)" width="90" x="112" y="289">
            <parameter key="excel_file" value="C:\Users\MarcoBarradas\Downloads\dataset.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Review ID.true.integer.attribute"/>
              <parameter key="1" value="Date.true.date.attribute"/>
              <parameter key="2" value="App Name.true.polynominal.attribute"/>
              <parameter key="3" value="App Store.true.polynominal.attribute"/>
              <parameter key="4" value="Language.true.polynominal.attribute"/>
              <parameter key="5" value="Country.true.polynominal.attribute"/>
              <parameter key="6" value="Rating.true.integer.attribute"/>
              <parameter key="7" value="Sentiment.true.polynominal.attribute"/>
              <parameter key="8" value="Version.true.polynominal.attribute"/>
              <parameter key="9" value="Subj&amp;Body.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="9.9.002" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="289">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Subj&amp;Body"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="9.9.002" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="313" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value="EATEN"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="9.3.001" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="447" y="34">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="Binary Term Occurrences"/>
            <parameter key="add_meta_information" value="false"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="select_attributes_and_weights" value="false"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases (2)" width="90" x="112" y="34">
                <parameter key="transform_to" value="lower case"/>
              </operator>
              <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize (2)" width="90" x="246" y="34">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
              <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
              <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="9.3.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="648" y="136">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="Term Occurrences"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="select_attributes_and_weights" value="false"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34">
                <parameter key="transform_to" value="lower case"/>
              </operator>
              <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <connect from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
          <connect from_op="Read Excel (2)" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
          <connect from_op="Process Documents from Data (2)" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
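The idea behind the process above (build a word list from the dictionary, then create binary occurrence vectors for the reviews restricted to that word list) can be sketched in Python. This is only an illustration under assumptions, not how the Text Mining extension is implemented: the sample texts and function names are hypothetical, and the pruning step simply drops all-zero columns rather than reproducing the operator's own prune options.

```python
import re

def binary_term_occurrences(documents, word_list):
    """One row per document, one 0/1 column per word in word_list."""
    vocabulary = sorted(set(w.lower() for w in word_list))
    rows = []
    for text in documents:
        tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
        rows.append([int(word in tokens) for word in vocabulary])
    return vocabulary, rows

def prune_empty_columns(vocabulary, rows):
    """Drop the columns whose word never appears in any document."""
    keep = [j for j in range(len(vocabulary)) if any(row[j] for row in rows)]
    kept_vocabulary = [vocabulary[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_vocabulary, kept_rows

# Hypothetical sample data.
documents = ["The app crashes on login", "Great app, great food"]
word_list = ["app", "food", "pizza"]

vocabulary, rows = binary_term_occurrences(documents, word_list)
pruned_vocabulary, pruned_rows = prune_empty_columns(vocabulary, rows)
```

Here "pizza" never occurs, so its column disappears after pruning, which is the kind of column reduction that helps with memory.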


  • EL75 Member Posts: 43 Contributor II
    Hi Marco,
    Thanks for the idea, and clearly, the process is fast!
    But 2 things don't fit my needs:
    1 - Loss of the grouping of word matches under category names (the column headers of the dictionary): I'm losing the ability to pivot at the end of the process in order to group results under the categories of the dictionary. Applying your process to my specific case (described at the beginning of the thread) returns a dataset containing 6742 columns, one for each dictionary word matched, and no category.
    2 - Loss of the findings resulting from a match on a character string. E.g. if I put « app » in the dictionary, my "loop in loop" process returns every verbatim containing this string of characters. In a way it operates like a stemming process (I don't mind false positives, because a manual verification will be done in a second step). The process you propose uses tokenization (the result of word processing), and as a consequence this capability is lost.

    The first point is a sine qua non, but the second could be an acceptable loss.
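The difference described in the second point can be shown with a small Python sketch. The sample reviews and function names are hypothetical; the point is only that whole-token matching and character-string matching give different results for the same dictionary entry.

```python
import re

def token_match(term, text):
    """True only if the term equals a whole token of the text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return term.lower() in tokens

def substring_match(term, text):
    """True if the term occurs anywhere in the text as a character string."""
    return term.lower() in text.lower()

# Hypothetical sample data.
reviews = ["The app is slow", "This application is great", "I am happy"]

token_hits = [token_match("app", review) for review in reviews]
substring_hits = [substring_match("app", review) for review in reviews]
```

With token matching only the first review is a hit; with substring matching all three are, including the false positive "happy", which is the behaviour the original nested loops had.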

    Cheers
  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Hi @EL75 ,

    I made a change to output the result you expect for the first point.

    For the second case, could you share an example dictionary and some examples? I guess we could use a replace dictionary with some regex magic to accomplish that.
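One possible shape for such a regex approach, sketched in Python under assumptions rather than as a RapidMiner process: build one alternation pattern per dictionary category and flag each review with 0 or 1 per category. The category names come from the shared dictionary file; the terms, the sample review and the function names are hypothetical.

```python
import re

def compile_category_patterns(dictionary):
    """Build one case-insensitive alternation pattern per category."""
    patterns = {}
    for category, terms in dictionary.items():
        cleaned = [t for t in terms if t]
        if not cleaned:
            patterns[category] = None  # nothing to search for
            continue
        # Escape each term so it is matched literally, not as regex syntax.
        alternation = "|".join(re.escape(t) for t in cleaned)
        patterns[category] = re.compile(alternation, re.IGNORECASE)
    return patterns

def label(text, patterns):
    """Return a 0/1 flag per category; substrings count as matches."""
    return {
        category: int(pattern is not None and pattern.search(text) is not None)
        for category, pattern in patterns.items()
    }

# Hypothetical terms under the category names used in the dictionary file.
dictionary = {"EATERS": ["cat", "dog"], "EATEN": ["mouse"], "OTHERS": ["app"]}
patterns = compile_category_patterns(dictionary)
labels = label("This application scares my cats", patterns)
```

Each review is scanned once per category instead of once per word, and the output is already grouped under the category names.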

    <?xml version="1.0" encoding="UTF-8"?><process version="9.9.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.9.002" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="9.9.002" expanded="true" height="68" name="Dictionary" width="90" x="45" y="34">
            <parameter key="excel_file" value="C:\Users\MarcoBarradas\Downloads\dictionary.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="EATERS.true.polynominal.attribute"/>
              <parameter key="1" value="EATEN.true.polynominal.attribute"/>
              <parameter key="2" value="OTHERS.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
          </operator>
          <operator activated="true" class="read_excel" compatibility="9.9.002" expanded="true" height="68" name="DataSet" width="90" x="45" y="238">
            <parameter key="excel_file" value="C:\Users\MarcoBarradas\Downloads\dataset.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Review ID.true.integer.attribute"/>
              <parameter key="1" value="Date.true.date.attribute"/>
              <parameter key="2" value="App Name.true.polynominal.attribute"/>
              <parameter key="3" value="App Store.true.polynominal.attribute"/>
              <parameter key="4" value="Language.true.polynominal.attribute"/>
              <parameter key="5" value="Country.true.polynominal.attribute"/>
              <parameter key="6" value="Rating.true.integer.attribute"/>
              <parameter key="7" value="Sentiment.true.polynominal.attribute"/>
              <parameter key="8" value="Version.true.polynominal.attribute"/>
              <parameter key="9" value="Subj&amp;Body.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.9.002" expanded="true" height="82" name="Set Role" width="90" x="179" y="238">
            <parameter key="attribute_name" value="Review ID"/>
            <parameter key="target_role" value="id"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="9.9.002" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="238">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Subj&amp;Body"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="remember" compatibility="9.9.002" expanded="true" height="68" name="Remember" width="90" x="447" y="238">
            <parameter key="name" value="Original_File"/>
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="store_which" value="1"/>
            <parameter key="remove_from_process" value="true"/>
          </operator>
          <operator activated="true" class="concurrency:loop_attributes" compatibility="9.9.002" expanded="true" height="103" name="Loop Attributes" width="90" x="246" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="EATEN|EATERS|OTHERS"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="attribute_name_macro" value="loop_attribute"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="false"/>
            <process expanded="true">
              <operator activated="true" class="select_attributes" compatibility="9.9.002" expanded="true" height="82" name="Select Attributes (3)" width="90" x="179" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="%{loop_attribute}"/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="nominal_to_text" compatibility="9.9.002" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="313" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="%{loop_attribute}"/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="nominal"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="file_path"/>
                <parameter key="block_type" value="single_value"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="single_value"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="text:process_document_from_data" compatibility="9.3.001" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="447" y="34">
                <parameter key="create_word_vector" value="true"/>
                <parameter key="vector_creation" value="Binary Term Occurrences"/>
                <parameter key="add_meta_information" value="false"/>
                <parameter key="keep_text" value="false"/>
                <parameter key="prune_method" value="none"/>
                <parameter key="prune_below_percent" value="3.0"/>
                <parameter key="prune_above_percent" value="30.0"/>
                <parameter key="prune_below_rank" value="0.05"/>
                <parameter key="prune_above_rank" value="0.95"/>
                <parameter key="datamanagement" value="double_sparse_array"/>
                <parameter key="data_management" value="auto"/>
                <parameter key="select_attributes_and_weights" value="false"/>
                <list key="specify_weights"/>
                <process expanded="true">
                  <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases (2)" width="90" x="112" y="34">
                    <parameter key="transform_to" value="lower case"/>
                  </operator>
                  <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize (2)" width="90" x="246" y="34">
                    <parameter key="mode" value="non letters"/>
                    <parameter key="characters" value=".:"/>
                    <parameter key="language" value="English"/>
                    <parameter key="max_token_length" value="3"/>
                  </operator>
                  <connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
                  <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
                  <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:process_document_from_data" compatibility="9.3.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="289">
                <parameter key="create_word_vector" value="true"/>
                <parameter key="vector_creation" value="Term Occurrences"/>
                <parameter key="add_meta_information" value="true"/>
                <parameter key="keep_text" value="true"/>
                <parameter key="prune_method" value="none"/>
                <parameter key="prune_below_percent" value="3.0"/>
                <parameter key="prune_above_percent" value="30.0"/>
                <parameter key="prune_below_rank" value="0.05"/>
                <parameter key="prune_above_rank" value="0.95"/>
                <parameter key="datamanagement" value="double_sparse_array"/>
                <parameter key="data_management" value="auto"/>
                <parameter key="select_attributes_and_weights" value="false"/>
                <list key="specify_weights"/>
                <process expanded="true">
                  <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34">
                    <parameter key="transform_to" value="lower case"/>
                  </operator>
                  <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
                    <parameter key="mode" value="non letters"/>
                    <parameter key="characters" value=".:"/>
                    <parameter key="language" value="English"/>
                    <parameter key="max_token_length" value="3"/>
                  </operator>
                  <connect from_port="document" to_op="Transform Cases" to_port="document"/>
                  <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
                  <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="9.9.002" expanded="true" height="82" name="Keep Numeric" width="90" x="447" y="289">
                <parameter key="attribute_filter_type" value="value_type"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="numeric"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="9.9.002" expanded="true" height="82" name="Remove The Rating" width="90" x="581" y="289">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="Rating"/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="true"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="generate_aggregation" compatibility="9.9.002" expanded="true" height="82" name="Total_Column" width="90" x="715" y="289">
                <parameter key="attribute_name" value="Total_%{loop_attribute}"/>
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <parameter key="aggregation_function" value="sum"/>
                <parameter key="concatenation_separator" value="|"/>
                <parameter key="keep_all" value="true"/>
                <parameter key="ignore_missings" value="true"/>
                <parameter key="ignore_missing_attributes" value="false"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="9.9.002" expanded="true" height="82" name="Keep_ID_and_Total" width="90" x="849" y="289">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value="Review ID|Total_%{loop_attribute}"/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <connect from_port="input 1" to_op="Select Attributes (3)" to_port="example set input"/>
              <connect from_port="input 2" to_op="Process Documents from Data" to_port="example set"/>
              <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Nominal to Text (2)" to_port="example set input"/>
              <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
              <connect from_op="Process Documents from Data (2)" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
              <connect from_op="Process Documents from Data" from_port="example set" to_op="Keep Numeric" to_port="example set input"/>
              <connect from_op="Keep Numeric" from_port="example set output" to_op="Remove The Rating" to_port="example set input"/>
              <connect from_op="Remove The Rating" from_port="example set output" to_op="Total_Column" to_port="example set input"/>
              <connect from_op="Total_Column" from_port="example set output" to_op="Keep_ID_and_Total" to_port="example set input"/>
              <connect from_op="Keep_ID_and_Total" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="source_input 3" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="9.9.002" expanded="true" height="68" name="Join with Previous DS" width="90" x="447" y="34">
            <parameter key="set_iteration_macro" value="false"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="true"/>
            <process expanded="true">
              <operator activated="true" class="recall" compatibility="9.9.002" expanded="true" height="68" name="Recall" width="90" x="179" y="34">
                <parameter key="name" value="Original_File"/>
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="remove_from_store" value="true"/>
              </operator>
              <operator activated="true" class="concurrency:join" compatibility="9.9.002" expanded="true" height="82" name="Join" width="90" x="246" y="187">
                <parameter key="remove_double_attributes" value="true"/>
                <parameter key="join_type" value="inner"/>
                <parameter key="use_id_attribute_as_key" value="true"/>
                <list key="key_attributes"/>
                <parameter key="keep_both_join_attributes" value="false"/>
              </operator>
              <operator activated="true" class="remember" compatibility="9.9.002" expanded="true" height="68" name="Remember (2)" width="90" x="380" y="34">
                <parameter key="name" value="Original_File"/>
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="store_which" value="1"/>
                <parameter key="remove_from_process" value="true"/>
              </operator>
              <connect from_port="single" to_op="Join" to_port="right"/>
              <connect from_op="Recall" from_port="result" to_op="Join" to_port="left"/>
              <connect from_op="Join" from_port="join" to_op="Remember (2)" to_port="store"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="recall" compatibility="9.9.002" expanded="true" height="68" name="Final Data Set" width="90" x="648" y="34">
            <parameter key="name" value="Original_File"/>
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="remove_from_store" value="true"/>
          </operator>
          <connect from_op="Dictionary" from_port="output" to_op="Loop Attributes" to_port="input 1"/>
          <connect from_op="DataSet" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Remember" to_port="store"/>
          <connect from_op="Remember" from_port="stored" to_op="Loop Attributes" to_port="input 2"/>
          <connect from_op="Loop Attributes" from_port="output 1" to_op="Join with Previous DS" to_port="collection"/>
          <connect from_op="Final Data Set" from_port="result" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>



  • EL75EL75 Member Posts: 43 Contributor II
    Hi MarcoBarradas,
    Thanks again, this looks much better! Something strange, though: when I iterate with my dictionary (26 columns / 16600 words) and my dataset (1000 rows), I should get verbatim rows with ZERO as the result on some labels. Instead, I get at least 30.

    It may be unrelated, but it seems to be correlated with the value of the column « verbatim size ».
    To build a validation dataset for a classification task, I need labels with ZERO values. I know that the output of the process we are working on delivers, for each category of the dict, the sum of the words found in each verbatim, and that I have to add an operator at the end to replace each non-« 0 » value with « 1 ».
    I tried different summing options within the "Generate Aggregation" operator, without success.
    Is there any reason for not having zeros?
    Best,
  • MarcoBarradasMarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Hi @EL75,

    You just need to adjust the "Remove The Rating" operator (a Select Attributes operator) inside the Loop Attributes operator.

    In that operator I removed the numeric attributes that existed before the new counting attributes were created.

    If you change the attribute filter type to subset, you can remove as many numeric attributes (Rating and Verbatim) as you need, so that no other numbers are added to your totals.
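
    To see why a pre-existing numeric attribute ends up in the totals, here is a minimal Python sketch (not RapidMiner code) of what summing over all remaining numeric columns does. The column names `Rating` and `Verbatim size`, the helper `total`, and all values are made up for illustration.

```python
# Hypothetical row after the word-vector step: two pre-existing numeric
# columns plus one count column per dictionary word (all values made up).
row = {"Rating": 4, "Verbatim size": 30, "cat": 0, "bird": 0}


def total(row, exclude=()):
    """Sum every numeric column in the row except the excluded ones."""
    return sum(value for name, value in row.items() if name not in exclude)


# Excluding only Rating still lets "Verbatim size" leak into the total.
leaky_total = total(row, exclude=("Rating",))

# Excluding both pre-existing columns leaves only the word counts.
clean_total = total(row, exclude=("Rating", "Verbatim size"))

print(leaky_total, clean_total)  # 30 0
```

    In this sketch the row contains no dictionary word, yet the total is non-zero until every pre-existing numeric column is excluded.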
     
  • rfuentealbarfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    @MarcoBarradas, awesome job!
  • EL75EL75 Member Posts: 43 Contributor II
    edited July 2021
    Yes, for sure :)
    One question remains:
    The process I need to implement counts the words of the dict that are matched in the "subj&body" column of each row of the dataset, but yours seems to sum the values of the numeric attributes.
    I have 26 columns in my dict and, let's say, 100 words per column; each column header is the name of a category of my dict.
    For iteration 1 of the "Loop Attributes" process, the « subj&body » cell containing the verbatim of row 1 of my dataset must be loaded. The process must then check whether each of the 100 words of column 1 of the dict is found (writing « 1 » for each word found and « 0 » otherwise), sum the words of column 1 found in « subj&body » of row 1, and return a final result. And so on until the last row of the dataset, each time with the 26 columns of the dict…
    Therefore, the final result should be a word count. Here it does not seem to be one: when one of the words of the dict is found, the returned value appears to be correlated with the value of the numeric attribute of the processed row, which the summing formula has taken into account.
    I’ve created a new attribute with a « 0 » value and selected only this one for the summing, and that returned « 0 » everywhere.
    Am I wrong? What am I missing?
    cheers,
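
    The matching described in this post can be sketched in plain Python as follows. This only illustrates the intended logic and is not the RapidMiner process: the category names, words, reviews, and the helpers `tokenize` and `count_matches` are invented, and tokenization is a simple lowercase word split.

```python
import re

# Hypothetical dictionary: one entry per category (column header), each
# holding that category's words. Names and words are invented.
dictionary = {
    "Pets": ["cat", "bird", "dog"],
    "Food": ["pizza", "bread"],
}

# Hypothetical reviews keyed by review ID; the text plays the role of "subj&body".
reviews = {
    1: "My cat chased a bird, then the cat slept.",
    2: "Nothing relevant here.",
}


def tokenize(text):
    """Lowercase the text and return the set of distinct words in it."""
    return set(re.findall(r"[a-z]+", text.lower()))


def count_matches(reviews, dictionary):
    """For each review and category, count how many dictionary words occur.

    Each word contributes 1 if present and 0 if absent, so a word that is
    repeated in a review is counted once.
    """
    results = {}
    for review_id, text in reviews.items():
        tokens = tokenize(text)
        results[review_id] = {
            category: sum(1 for word in words if word in tokens)
            for category, words in dictionary.items()
        }
    return results


totals = count_matches(reviews, dictionary)
print(totals)
```

    With this logic, a review that contains none of a category's words gets 0 for that category, regardless of any other numeric value attached to the row.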
  • MarcoBarradasMarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    @EL75 the process I shared counts the number of times each word of a category appears in the subj&body.
    For example, if the text is "I have a cat that I love. My cat is my closest friend and dislikes my bird",
    the count for cat will be 2, the count for bird will be 1, and Total_Eaters will be 3.
    That is controlled by the vector creation option of the Process Documents from Data operator. The option set there is "Term Occurrences"; if you change that parameter to "Binary Term Occurrences", the counts will change to cat 1 and bird 1, and Total_Eaters will be 2.
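
    The difference between the two counting modes can be reproduced on the example sentence with a short Python sketch. It mimics the two modes with a simple lowercase word split and is not the operator's actual implementation; the variable names are illustrative.

```python
import re
from collections import Counter

text = "I have a cat that I love. My cat is my closest friend and dislikes my bird"
category_words = ["cat", "bird"]  # words of the example category

# Count every token in the lowercased text.
counts = Counter(re.findall(r"[a-z]+", text.lower()))

# Term occurrences: how many times each word appears.
term_occurrences = {word: counts[word] for word in category_words}

# Binary term occurrences: 1 if the word appears at all, else 0.
binary_occurrences = {word: int(counts[word] > 0) for word in category_words}

print(term_occurrences, sum(term_occurrences.values()))      # {'cat': 2, 'bird': 1} 3
print(binary_occurrences, sum(binary_occurrences.values()))  # {'cat': 1, 'bird': 1} 2
```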
