
Extract a specific word from text

Steve8 Member Posts: 1 Learner III
edited June 2019 in Help
Hi, I'm having a problem that is hopefully straightforward.

One attribute in my dataset consists of product descriptions. I want to extract a specific keyword from the product descriptions and create a new attribute that contains just the keyword.

For example, let's say iSight is my keyword and the following is the description:

"New 8-megapixel iSight camera with 1.5µ pixels
Autofocus with Focus Pixels ƒ/2.2 aperture"

The new attribute would then contain just the extracted keyword, "iSight".

Regards,

Steve

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    The Generate Attributes operator is one possibility. Use the matches function with a suitable regular expression; this gives a true/false attribute.

    If you want the keyword itself to be the attribute value, you could use the Generate Extract operator.

    Here's a simple example showing both.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.1.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="112" y="75">
            <list key="attribute_values">
              <parameter key="text" value="&quot;find needle in haystack&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="112" y="300">
            <list key="attribute_values">
              <parameter key="text" value="&quot;nothing to see here&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="append" compatibility="6.1.000" expanded="true" height="94" name="Append" width="90" x="179" y="165"/>
          <operator activated="true" class="generate_attributes" compatibility="6.1.000" expanded="true" height="76" name="Generate Attributes" width="90" x="313" y="210">
            <list key="function_descriptions">
              <parameter key="needleBinominal" value="matches(text,&quot;.*needle.*&quot;)"/>
            </list>
          </operator>
          <operator activated="true" class="text:generate_extract" compatibility="6.1.000" expanded="true" height="60" name="Generate Extract" width="90" x="447" y="210">
            <parameter key="source_attribute" value="text"/>
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="needleNominal" value=".*(needle).*"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Append" from_port="merged set" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
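
To see what the two regular expressions in the process above do, here is a minimal plain-Java sketch. It assumes that RapidMiner's matches function and Generate Extract follow Java regex semantics, and that Generate Extract returns the first capture group; the class and method names (NeedleSketch, hasNeedle, extractNeedle) are made up for illustration and are not part of RapidMiner.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NeedleSketch {
    // Mirrors matches(text, ".*needle.*"): true only if the WHOLE value matches,
    // which is why the pattern needs ".*" on both sides of the keyword.
    static boolean hasNeedle(String text) {
        return text.matches(".*needle.*");
    }

    // Mirrors the Generate Extract query ".*(needle).*" under the assumption that
    // the first capture group is returned; null stands in for a missing value.
    static String extractNeedle(String text) {
        Matcher m = Pattern.compile(".*(needle).*").matcher(text);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The same two example rows that the process generates.
        String[] rows = {"find needle in haystack", "nothing to see here"};
        for (String text : rows) {
            System.out.println(text + " -> " + hasNeedle(text) + ", " + extractNeedle(text));
        }
    }
}
```

The first row gives true and "needle"; the second gives false and no extracted value, which corresponds to what the needleBinominal and needleNominal attributes are meant to hold.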
    Regards,

    Andrew
  • rajbanokhan Member Posts: 29 Maven

    Hi,

    How do I write an expression to get a specific list of words?

  • rajbanokhan Member Posts: 29 Maven

    How do I write an expression in the Generate Attributes operator to get a specific list of words?

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @rajbanokhan,

    I don't understand exactly what you want, but I can propose two possible answers:

    1. First, this process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="75">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordA&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="187">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordB&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="112" y="289">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordC&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (4)" width="90" x="112" y="391">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordB hello wordA good morning wordC&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="8.0.001" expanded="true" height="145" name="Append" width="90" x="380" y="187"/>
    <operator activated="true" class="nominal_to_text" compatibility="8.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="514" y="187"/>
    <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="648" y="187">
    <list key="function_descriptions">
    <parameter key="needleBinominal" value="contains(text,&quot;wordA&quot;)&amp;&amp;contains(text,&quot;wordB&quot;)&amp;&amp;contains(text,&quot;wordC&quot;)"/>
    </list>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="782" y="187">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
    <parameter key="expression" value="(word1\b)(word2\b)(word3\b)"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="124" name="Multiply" width="90" x="246" y="187"/>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="514" y="442">
    <parameter key="string" value="wordC"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="514" y="289">
    <parameter key="string" value="wordA"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="514" y="85">
    <parameter key="string" value="wordB"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Filter Tokens (3)" to_port="document"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Filter Tokens (2)" to_port="document"/>
    <connect from_op="Multiply" from_port="output 3" to_op="Filter Tokens (4)" to_port="document"/>
    <connect from_op="Filter Tokens (4)" from_port="document" to_port="document 3"/>
    <connect from_op="Filter Tokens (3)" from_port="document" to_port="document 2"/>
    <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    <portSpacing port="sink_document 3" spacing="0"/>
    <portSpacing port="sink_document 4" spacing="0"/>
    </process>
    </operator>
    <operator activated="false" class="text:generate_extract" compatibility="7.5.000" expanded="true" height="68" name="Generate Extract" width="90" x="514" y="391">
    <parameter key="source_attribute" value="text"/>
    <parameter key="query_type" value="Regular Expression"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries">
    <parameter key="word1_word2_word3_Nominal" value="(?=.*wordA)(?=.*wordB)(?=.*wordC)"/>
    </list>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
    <connect from_op="Generate Data by User Specification (4)" from_port="output" to_op="Append" to_port="example set 4"/>
    <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    2. Second, this process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="75">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordA&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="187">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordB&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="112" y="289">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordC&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (4)" width="90" x="112" y="391">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordB hello wordA good morning wordC&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="8.0.001" expanded="true" height="145" name="Append" width="90" x="380" y="187"/>
    <operator activated="true" class="nominal_to_text" compatibility="8.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="514" y="187"/>
    <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="648" y="187">
    <list key="function_descriptions">
    <parameter key="needleBinominal" value="contains(text,&quot;wordA&quot;)&amp;&amp;contains(text,&quot;wordB&quot;)&amp;&amp;contains(text,&quot;wordC&quot;)"/>
    </list>
    </operator>
    <operator activated="false" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="782" y="442">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
    <parameter key="expression" value="(word1\b)(word2\b)(word3\b)"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="124" name="Multiply" width="90" x="246" y="187"/>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="514" y="442">
    <parameter key="string" value="wordC"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="514" y="289">
    <parameter key="string" value="wordA"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="514" y="85">
    <parameter key="string" value="wordB"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Filter Tokens (3)" to_port="document"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Filter Tokens (2)" to_port="document"/>
    <connect from_op="Multiply" from_port="output 3" to_op="Filter Tokens (4)" to_port="document"/>
    <connect from_op="Filter Tokens (4)" from_port="document" to_port="document 3"/>
    <connect from_op="Filter Tokens (3)" from_port="document" to_port="document 2"/>
    <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    <portSpacing port="sink_document 3" spacing="0"/>
    <portSpacing port="sink_document 4" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:generate_extract" compatibility="7.5.000" expanded="true" height="68" name="Generate Extract" width="90" x="782" y="187">
    <parameter key="source_attribute" value="text"/>
    <parameter key="query_type" value="Regular Expression"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries">
    <parameter key="word1_word2_word3_Nominal" value="(?=.*wordA)(?=.*wordB)(?=.*wordC)"/>
    </list>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
    <connect from_op="Generate Data by User Specification (4)" from_port="output" to_op="Append" to_port="example set 4"/>
    <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Extract" to_port="Example Set"/>
    <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    For this last process, I don't understand why nothing is displayed in the new column for the rows that contain all the wanted words.
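
One possible explanation, sketched in plain Java under the assumption that Generate Extract follows Java regex semantics: the query consists only of lookaheads, which are zero-width, so even a successful match consumes no text and contains no capture group to return. The class and method names below (LookaheadSketch, firstMatch) are hypothetical, and the variant with a trailing (.*) is an untested suggestion, not a confirmed RapidMiner fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadSketch {
    // Returns the text consumed by the first match found, or null if nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String row = "test wordB hello wordA good morning wordC";
        String lookaheadsOnly = "(?=.*wordA)(?=.*wordB)(?=.*wordC)";

        // All three lookaheads succeed at position 0, but the match is the empty string.
        System.out.println("[" + firstMatch(lookaheadsOnly, row) + "]");

        // A full-string match fails outright, because nothing is consumed.
        System.out.println(Pattern.matches(lookaheadsOnly, row));

        // Adding a consuming, capturing part after the lookaheads returns visible text.
        System.out.println("[" + firstMatch(lookaheadsOnly + "(.*)", row) + "]");

        // A row with only one of the words does not match at all.
        System.out.println("[" + firstMatch(lookaheadsOnly, "test wordA") + "]");
    }
}
```

Whether the operator searches for a match or requires a full match, the lookahead-only query yields either an empty string or no match, which would be consistent with the empty column.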

     

    Regards,

    Lionel

  • rajbanokhan Member Posts: 29 Maven

    Hi,

    I want to search for a specific word in a text. For example, given a paragraph, I want to find a particular word in it.

    I am using the Generate Attributes operator. How do I write the expression that uses the matches function with a suitable regular expression?
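
As a rough illustration of what such a regular expression could look like, here is a plain-Java sketch. It assumes the matches function behaves like Java's String.matches, so the pattern must cover the whole text; the names KeywordSketch, containsWordRegex and containsWord are hypothetical, and the description is a shortened version of the iSight example from the original question.

```java
import java.util.regex.Pattern;

public class KeywordSketch {
    // Builds "(?s).*\bKEYWORD\b.*":
    //   (?s)  lets '.' also match line breaks, so multi-line paragraphs work
    //   \b    word boundaries, so only the whole word counts as a hit
    //   quote treats the keyword literally, even if it contains regex characters
    static String containsWordRegex(String keyword) {
        return "(?s).*\\b" + Pattern.quote(keyword) + "\\b.*";
    }

    static boolean containsWord(String text, String keyword) {
        return text.matches(containsWordRegex(keyword));
    }

    public static void main(String[] args) {
        String description = "New 8-megapixel iSight camera\nAutofocus with Focus Pixels";

        System.out.println(containsWord(description, "iSight")); // whole word present
        System.out.println(containsWord(description, "Sight"));  // only part of a word
        // Without (?s), '.' stops at the line break and the full match fails.
        System.out.println(description.matches(".*iSight.*"));
    }
}
```

In Generate Attributes the same idea would be the pattern .*keyword.* passed to matches, as in the earlier needle example; how backslashes such as \b must be escaped in the expression editor may differ from Java and should be checked in the operator documentation.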

     
