Tokens in groups

startx25 Member Posts: 7 Contributor II
edited November 2018 in Help
Hi all,

I have read this wonderful tutorial on finding text needles in document haystacks:

https://docs.google.com/file/d/0BzlG_h9m5M7tVXUyeVl4cmhJZGc/edit?usp=sharing

It works fine, but now I want to add a second text needles file in step 1, with its own label value (e.g. Groupe2),
so that step 1 has two text files as input.

In the final result of the process, I want to identify which text needles file (Groupe1 or Groupe2) each word-list entry found in my text file from step 3 comes from.

Thank you for any help.



Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Is this what you need?
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="75">
            <parameter key="text" value="binominal parameter&#10;binominal attributes&#10;Binominal operator&#10;"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (3)" width="90" x="45" y="165">
            <parameter key="text" value="at this&#10;had the&#10;of the&#10;"/>
          </operator>
          <operator activated="true" class="collect" compatibility="5.3.008" expanded="true" height="94" name="Collect" width="90" x="179" y="120"/>
          <operator activated="true" class="loop_collection" compatibility="5.3.008" expanded="true" height="76" name="Loop Collection" width="90" x="313" y="120">
            <parameter key="set_iteration_macro" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents" width="90" x="112" y="75">
                <parameter key="vector_creation" value="Binary Term Occurrences"/>
                <parameter key="add_meta_information" value="false"/>
                <process expanded="true">
                  <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" name="Tokenize (3)">
                    <parameter key="mode" value="regular expression"/>
                    <parameter key="expression" value="[^a-zA-Z ]"/>
                  </operator>
                  <operator activated="true" class="text:replace_tokens" compatibility="5.3.000" expanded="true" name="Replace Tokens (3)">
                    <list key="replace_dictionary">
                      <parameter key=" " value="_"/>
                    </list>
                  </operator>
                  <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
                  <connect from_op="Tokenize (3)" from_port="document" to_op="Replace Tokens (3)" to_port="document"/>
                  <connect from_op="Replace Tokens (3)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (4)" width="90" x="112" y="300">
                <parameter key="text" value="This Example Process mostly focuses on the transform binominal parameter. &#10;All remaining parameters are mostly for selecting the attributes. &#10;The Select Attributes operator also has many similar parameters for selection of attributes.&#10;You can study the Example Process of the Select Attributes operator if &#10;you want an understanding of these parameters. The Retrieve operator is used to &#10;load the Golf data set. A breakpoint is inserted at this point so that you can &#10;have look at the data set before application of the Nominal to Binominal operator. &#10;You can see that the 'Outlook' attribute has three possible values &#10;i.e. 'sunny', 'rain' and 'overcast'. The 'Wind' attribute has two possible values &#10;i.e. 'true' and 'false'. All parameters of the Nominal to Binominal operator are &#10;used with default values. Run the process. First you will see the Golf data set. &#10;Press the run button again and you will see the final results. &#10;You can see that the 'Outlook' attribute is replaced by three binominal attributes, &#10;one for each possible value of the original 'Outlook' attribute. &#10;These attributes are ' Outlook = sunny', ' Outlook = rain', and ' Outlook = overcast'. &#10;Only the value of one of these attributes is true for a specific example, the value of &#10;the other attributes is false. Examples whose 'Outlook ' attribute had the value 'sunny'&#10;in the original ExampleSet, will have the attribute ' Outlook =sunny' value set to &#10;'true'in the new ExampleSet, the value of the 'Outlook =overcast' and 'Outlook =rain' &#10;attributes will be 'false'. The numeric attributes of the input ExampleSet remain &#10;unchanged. The 'Wind' attribute was not replaced by two binominal attributes, &#10;one for each possible value of the 'Wind' attribute because this attribute is already &#10;binominal. 
Still if you want to break it into two separate binominal attributes, &#10;this can be done by setting the transform binominal parameter to true.&#10;"/>
              </operator>
              <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents (2)" width="90" x="313" y="255">
                <parameter key="vector_creation" value="Term Occurrences"/>
                <process expanded="true">
                  <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" name="Tokenize (4)">
                    <parameter key="expression" value="\\r\\n"/>
                  </operator>
                  <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.000" expanded="true" name="Generate n-Grams (2)">
                    <parameter key="max_length" value="5"/>
                  </operator>
                  <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
                  <connect from_op="Tokenize (4)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
                  <connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="5.3.008" expanded="true" height="76" name="Generate Attributes" width="90" x="313" y="75">
                <list key="function_descriptions">
                  <parameter key="group" value="&quot;Group_%{iteration}&quot;"/>
                </list>
              </operator>
              <connect from_port="single" to_op="Process Documents" to_port="documents 1"/>
              <connect from_op="Process Documents" from_port="word list" to_op="Process Documents (2)" to_port="word list"/>
              <connect from_op="Create Document (4)" from_port="output" to_op="Process Documents (2)" to_port="documents 1"/>
              <connect from_op="Process Documents (2)" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document (2)" from_port="output" to_op="Collect" to_port="input 1"/>
          <connect from_op="Create Document (3)" from_port="output" to_op="Collect" to_port="input 2"/>
          <connect from_op="Collect" from_port="collection" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    regards

    Andrew
  • startx25 Member Posts: 7 Contributor II
    First, thank you for your help.

    This is not really what I need. Maybe my explanation was not precise.

    Here is a summary of what I need:

    input 1: text file with some keywords (groupe1), one word per line
    input 2: text file with some keywords (groupe2), one word per line
    input 3: a flat text file

    What I need is to count how many keywords from groupe1 and from groupe2 are present in my flat text file (input 3).
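
For illustration only, the counting described above can be sketched in Python, outside RapidMiner. The function name `count_group_keywords`, the case-insensitive matching, and the short sample text are assumptions made for this sketch; they are not part of any RapidMiner process.

```python
import re
from collections import Counter


def count_group_keywords(groups, text):
    """Count how often each group's keywords occur in a flat text.

    groups maps a group label to a list of single-word keywords.
    Matching is case-insensitive on alphabetic tokens (an assumption).
    """
    # Split the flat text into lowercase alphabetic tokens and count them.
    token_counts = Counter(re.findall(r"[a-z]+", text.lower()))
    totals = {}
    for label, keywords in groups.items():
        # De-duplicate keywords so a repeated keyword is not counted twice.
        unique = {k.strip().lower() for k in keywords if k.strip()}
        totals[label] = sum(token_counts[k] for k in unique)
    return totals


# Hypothetical sample data: two keyword groups and one flat text.
groups = {
    "groupe1": ["dog", "cat", "bird"],
    "groupe2": ["car", "train", "boat", "truck"],
}
text = "and dog cat bird and dog cat bird and dog cat bird house car car house"
print(count_group_keywords(groups, text))
```

The result is one total per group, which is the shape of output asked for here: a group label next to a summed count.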


    I think I need to add an Aggregate operator, but it does not count the correct value per group.
    Here is an example:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
            <parameter key="text" value="dog&#10;cat&#10;bird&#10;&#10;"/>
            <parameter key="add label" value="true"/>
            <parameter key="label_type" value="text"/>
            <parameter key="label_value" value="groupe1"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="255">
            <parameter key="text" value="This Example Process mostly focuses on the transform binominal parameter. &#10;All remaining parameters are mostly for selecting the attributes. &#10;The Select Attributes operator also has many similar parameters for selection of attributes.&#10;You can study the Example Process of the Select Attributes operator if &#10;you want an understanding of these parameters. The Retrieve operator is used to &#10;load the Golf data set. A breakpoint is inserted at this point so that you can &#10;have look at the data set before application of the Nominal to Binominal operator. &#10;You can see that the 'Outlook' attribute has three possible values &#10;i.e. 'sunny', 'rain' and 'overcast'. The 'Wind' attribute has two possible values &#10;i.e. 'true' and 'false'. All parameters of the Nominal to Binominal operator are &#10;used with default values. Run the process. First you will see the Golf data set. &#10;Press the run button again and you will see the final results. and dog cat  bird and dog cat  bird and dog cat  bird  house car car house  &#10;You can see that the 'Outlook' attribute is replaced by three binominal attributes, &#10;one for each possible value of the original 'Outlook' attribute. &#10;These attributes are ' Outlook = sunny', ' Outlook = rain', and ' Outlook = overcast'. &#10;Only the value of one of these attributes is true for a specific example, the value of &#10;the other attributes is false. Examples whose 'Outlook ' attribute had the value 'sunny'&#10;in the original ExampleSet, will have the attribute ' Outlook =sunny' value set to &#10;'true'in the new ExampleSet, the value of the 'Outlook =overcast' and 'Outlook =rain' &#10;attributes will be 'false'. The numeric attributes of the input ExampleSet remain &#10;unchanged. The 'Wind' attribute was not replaced by two binominal attributes, &#10;one for each possible value of the 'Wind' attribute because this attribute is already &#10;binominal. 
Still if you want to break it into two separate binominal attributes, &#10;this can be done by setting the transform binominal parameter to true.&#10;"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (3)" width="90" x="45" y="120">
            <parameter key="text" value="car&#10;train&#10;boat&#10;truck"/>
            <parameter key="add label" value="true"/>
            <parameter key="label_type" value="text"/>
            <parameter key="label_value" value="groupe2"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="112" name="Process Documents" width="90" x="246" y="75">
            <parameter key="vector_creation" value="Binary Term Occurrences"/>
            <parameter key="add_meta_information" value="false"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize (3)" width="90" x="45" y="30">
                <parameter key="mode" value="regular expression"/>
                <parameter key="expression" value="[^a-zA-Z ]"/>
              </operator>
              <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
              <connect from_op="Tokenize (3)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents (2)" width="90" x="313" y="255">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize (4)" width="90" x="45" y="30">
                <parameter key="expression" value="\\r\\n"/>
              </operator>
              <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
              <connect from_op="Tokenize (4)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents (2)" to_port="documents 1"/>
          <connect from_op="Create Document (3)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
          <connect from_op="Process Documents" from_port="word list" to_op="Process Documents (2)" to_port="word list"/>
          <connect from_op="Process Documents (2)" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Thank you for any help




  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    I think this might work for you

    I'll waive my usual fee of beer or money  ;)
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="75">
            <parameter key="text" value="binominal parameter&#10;binominal attributes&#10;Binominal operator&#10;"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (3)" width="90" x="45" y="165">
            <parameter key="text" value="at this&#10;had the&#10;of the&#10;"/>
          </operator>
          <operator activated="true" class="collect" compatibility="5.3.008" expanded="true" height="94" name="Collect" width="90" x="246" y="75"/>
          <operator activated="true" class="loop_collection" compatibility="5.3.008" expanded="true" height="76" name="Loop Collection" width="90" x="380" y="75">
            <parameter key="set_iteration_macro" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents" width="90" x="112" y="30">
                <parameter key="vector_creation" value="Binary Term Occurrences"/>
                <parameter key="add_meta_information" value="false"/>
                <process expanded="true">
                  <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" name="Tokenize (3)">
                    <parameter key="mode" value="regular expression"/>
                    <parameter key="expression" value="[^a-zA-Z ]"/>
                  </operator>
                  <operator activated="true" class="text:replace_tokens" compatibility="5.3.000" expanded="true" name="Replace Tokens (3)">
                    <list key="replace_dictionary">
                      <parameter key=" " value="_"/>
                    </list>
                  </operator>
                  <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
                  <connect from_op="Tokenize (3)" from_port="document" to_op="Replace Tokens (3)" to_port="document"/>
                  <connect from_op="Replace Tokens (3)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (4)" width="90" x="112" y="300">
                <parameter key="text" value="This Example Process mostly focuses on the transform binominal parameter. &#10;All remaining parameters are mostly for selecting the attributes. &#10;The Select Attributes operator also has many similar parameters for selection of attributes.&#10;You can study the Example Process of the Select Attributes operator if &#10;you want an understanding of these parameters. The Retrieve operator is used to &#10;load the Golf data set. A breakpoint is inserted at this point so that you can &#10;have look at the data set before application of the Nominal to Binominal operator. &#10;You can see that the 'Outlook' attribute has three possible values &#10;i.e. 'sunny', 'rain' and 'overcast'. The 'Wind' attribute has two possible values &#10;i.e. 'true' and 'false'. All parameters of the Nominal to Binominal operator are &#10;used with default values. Run the process. First you will see the Golf data set. &#10;Press the run button again and you will see the final results. &#10;You can see that the 'Outlook' attribute is replaced by three binominal attributes, &#10;one for each possible value of the original 'Outlook' attribute. &#10;These attributes are ' Outlook = sunny', ' Outlook = rain', and ' Outlook = overcast'. &#10;Only the value of one of these attributes is true for a specific example, the value of &#10;the other attributes is false. Examples whose 'Outlook ' attribute had the value 'sunny'&#10;in the original ExampleSet, will have the attribute ' Outlook =sunny' value set to &#10;'true'in the new ExampleSet, the value of the 'Outlook =overcast' and 'Outlook =rain' &#10;attributes will be 'false'. The numeric attributes of the input ExampleSet remain &#10;unchanged. The 'Wind' attribute was not replaced by two binominal attributes, &#10;one for each possible value of the 'Wind' attribute because this attribute is already &#10;binominal. 
Still if you want to break it into two separate binominal attributes, &#10;this can be done by setting the transform binominal parameter to true.&#10;"/>
              </operator>
              <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents (2)" width="90" x="246" y="120">
                <parameter key="vector_creation" value="Term Occurrences"/>
                <process expanded="true">
                  <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" name="Tokenize (4)">
                    <parameter key="expression" value="\\r\\n"/>
                  </operator>
                  <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.000" expanded="true" name="Generate n-Grams (2)">
                    <parameter key="max_length" value="5"/>
                  </operator>
                  <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
                  <connect from_op="Tokenize (4)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
                  <connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="5.3.008" expanded="true" height="76" name="Generate Attributes" width="90" x="380" y="120">
                <list key="function_descriptions">
                  <parameter key="group" value="&quot;Group_%{iteration}&quot;"/>
                </list>
              </operator>
              <operator activated="true" class="rename_by_generic_names" compatibility="5.3.008" expanded="true" height="76" name="Rename by Generic Names" width="90" x="380" y="210">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attributes" value="|group"/>
                <parameter key="invert_selection" value="true"/>
              </operator>
              <operator activated="true" class="generate_aggregation" compatibility="5.3.008" expanded="true" height="76" name="Generate Aggregation" width="90" x="380" y="300">
                <parameter key="attribute_name" value="sum"/>
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attributes" value="|group"/>
                <parameter key="invert_selection" value="true"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="5.3.008" expanded="true" height="76" name="Select Attributes" width="90" x="514" y="120">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attributes" value="group|sum|"/>
              </operator>
              <connect from_port="single" to_op="Process Documents" to_port="documents 1"/>
              <connect from_op="Process Documents" from_port="word list" to_op="Process Documents (2)" to_port="word list"/>
              <connect from_op="Create Document (4)" from_port="output" to_op="Process Documents (2)" to_port="documents 1"/>
              <connect from_op="Process Documents (2)" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_op="Rename by Generic Names" to_port="example set input"/>
              <connect from_op="Rename by Generic Names" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
              <connect from_op="Generate Aggregation" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="5.3.008" expanded="true" height="76" name="Append" width="90" x="514" y="75"/>
          <connect from_op="Create Document (2)" from_port="output" to_op="Collect" to_port="input 1"/>
          <connect from_op="Create Document (3)" from_port="output" to_op="Collect" to_port="input 2"/>
          <connect from_op="Collect" from_port="collection" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
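
As a rough reading aid, the logic of the process above can be approximated in Python. This is a sketch under assumptions, not a translation of the operators: the helper names `ngram_counts` and `group_sums` are hypothetical, group numbering is assumed to start at 1 as with the iteration macro, matching is assumed to be case-sensitive because the process has no case transformation, and the sample text is a shortened extract.

```python
import re


def ngram_counts(text, max_length=5):
    """Count all word n-grams (joined by '_') of up to max_length words."""
    words = re.findall(r"[A-Za-z]+", text)
    counts = {}
    for n in range(1, max_length + 1):
        for i in range(len(words) - n + 1):
            gram = "_".join(words[i:i + n])
            counts[gram] = counts.get(gram, 0) + 1
    return counts


def group_sums(needle_lists, text, max_length=5):
    """Return one row per needle list with the summed match count."""
    counts = ngram_counts(text, max_length)
    rows = []
    for iteration, needles in enumerate(needle_lists, start=1):
        # Multi-word needles become single terms, e.g. "of the" -> "of_the".
        terms = {"_".join(n.split()) for n in needles if n.strip()}
        # Case-sensitive lookup: "Binominal_operator" != "binominal_operator".
        total = sum(counts.get(term, 0) for term in terms)
        rows.append({"group": f"Group_{iteration}", "sum": total})
    return rows


needle_lists = [
    ["binominal parameter", "binominal attributes", "Binominal operator"],
    ["at this", "had the", "of the"],
]
text = (
    "A breakpoint is inserted at this point so that you can have a look "
    "at the data set before application of the Nominal to Binominal operator."
)
print(group_sums(needle_lists, text))
```

Each needle list yields one row with a group label and a sum, and the rows are collected into a single table, which mirrors the group and sum attributes kept by the process.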
    regards

    Andrew
  • startx25 Member Posts: 7 Contributor II
    Hi Andrew

    Great!
    Thank you for this, I really appreciate it.


    ;D

