Extract a specific word from text

Steve8 Member Posts: 1 Contributor I
edited June 2019 in Help
Hi, I'm having a problem that is hopefully straightforward.

One attribute in my dataset consists of product descriptions. I want to extract a specific keyword from the product descriptions and create a new attribute that contains just the keyword.

For example, let's say iSight is my keyword and the following is the description:

"New 8-megapixel iSight camera with 1.5µ pixels
Autofocus with Focus Pixels ƒ/2.2 aperture"

The new attribute would then contain just the extracted keyword, "iSight".


Regards,

Steve

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    The Generate Attributes operator is one possibility. Use the matches function with a suitable regular expression; this gives a true/false attribute.

    If you want the keyword itself to be the attribute value, you could use the Generate Extract operator instead.

    Here's a simple example showing both.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.1.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="112" y="75">
            <list key="attribute_values">
              <parameter key="text" value="&quot;find needle in haystack&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="112" y="300">
            <list key="attribute_values">
              <parameter key="text" value="&quot;nothing to see here&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="append" compatibility="6.1.000" expanded="true" height="94" name="Append" width="90" x="179" y="165"/>
          <operator activated="true" class="generate_attributes" compatibility="6.1.000" expanded="true" height="76" name="Generate Attributes" width="90" x="313" y="210">
            <list key="function_descriptions">
              <parameter key="needleBinominal" value="matches(text,&quot;.*needle.*&quot;)"/>
            </list>
          </operator>
          <operator activated="true" class="text:generate_extract" compatibility="6.1.000" expanded="true" height="60" name="Generate Extract" width="90" x="447" y="210">
            <parameter key="source_attribute" value="text"/>
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="needleNominal" value=".*(needle).*"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Append" from_port="merged set" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    regards

    Andrew
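
The two regular expressions in the process above can be tried outside RapidMiner. The following is only a rough sketch in Python, not the operators themselves: it assumes that matches requires the pattern to cover the whole string and that Generate Extract keeps the first capturing group. The helper extract_first_group is a made-up name for illustration, and the line break in the product description is an assumption about how the data is stored.

```python
import re

rows = ["find needle in haystack", "nothing to see here"]

# Boolean flag, like matches(text, ".*needle.*"): the pattern has to cover
# the whole string, hence the ".*" on both sides of the keyword.
needle_binominal = [re.fullmatch(r".*needle.*", t) is not None for t in rows]

def extract_first_group(pattern, text):
    """Return capturing group 1, or None (a missing value) if there is no match."""
    m = re.fullmatch(pattern, text)
    return m.group(1) if m else None

# Keyword itself, like the Generate Extract query ".*(needle).*".
needle_nominal = [extract_first_group(r".*(needle).*", t) for t in rows]

# The description from the question, assumed here to contain a line break.
description = ("New 8-megapixel iSight camera with 1.5µ pixels\n"
               "Autofocus with Focus Pixels ƒ/2.2 aperture")
without_dotall = extract_first_group(r".*(iSight).*", description)
with_dotall = extract_first_group(r"(?s).*(iSight).*", description)

print(needle_binominal)  # [True, False]
print(needle_nominal)    # ['needle', None]
print(without_dotall)    # None
print(with_dotall)       # iSight
```

By default "." does not match a line break (this holds for Java regular expressions as well as Python), so if the descriptions span several lines, a pattern such as (?s).*(iSight).* may be needed. Whether that applies depends on the actual data.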
  • rajbanokhan Member Posts: 29 Maven

    Hi,

    How do I write an expression to get a specific list of words?

  • rajbanokhan Member Posts: 29 Maven

    How do I write an expression in the Generate Attributes operator to get a specific list of words?
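
For a list of words, one common regular-expression approach is an alternation inside a capturing group. The sketch below is plain Python rather than a RapidMiner expression; the word list and the function name find_keywords are placeholders. A similar alternation, for example .*(wordA|wordB|wordC).*, could presumably be passed to matches in Generate Attributes, but that has not been tested here.

```python
import re

keywords = ["wordA", "wordB", "wordC"]  # hypothetical word list

# \b keeps "wordA" from matching inside a longer token such as "wordAB";
# re.escape guards against keywords that contain regex metacharacters.
pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b")

def find_keywords(text):
    """Return the listed keywords found in text, in order of appearance."""
    return pattern.findall(text)

print(find_keywords("test wordB hello wordA good morning wordC"))
# ['wordB', 'wordA', 'wordC']
print(find_keywords("nothing to see here"))
# []
```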

     

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @rajbanokhan

    I don't understand exactly what you want, but here are two suggestions:

    1. First, this process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="75">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordA&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="187">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordB&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="112" y="289">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordC&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (4)" width="90" x="112" y="391">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordB hello wordA good morning wordC&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="8.0.001" expanded="true" height="145" name="Append" width="90" x="380" y="187"/>
    <operator activated="true" class="nominal_to_text" compatibility="8.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="514" y="187"/>
    <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="648" y="187">
    <list key="function_descriptions">
    <parameter key="needleBinominal" value="contains(text,&quot;wordA&quot;)&amp;&amp;contains(text,&quot;wordB&quot;)&amp;&amp;contains(text,&quot;wordC&quot;)"/>
    </list>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="782" y="187">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
    <parameter key="expression" value="(word1\b)(word2\b)(word3\b)"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="124" name="Multiply" width="90" x="246" y="187"/>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="514" y="442">
    <parameter key="string" value="wordC"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="514" y="289">
    <parameter key="string" value="wordA"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="514" y="85">
    <parameter key="string" value="wordB"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Filter Tokens (3)" to_port="document"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Filter Tokens (2)" to_port="document"/>
    <connect from_op="Multiply" from_port="output 3" to_op="Filter Tokens (4)" to_port="document"/>
    <connect from_op="Filter Tokens (4)" from_port="document" to_port="document 3"/>
    <connect from_op="Filter Tokens (3)" from_port="document" to_port="document 2"/>
    <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    <portSpacing port="sink_document 3" spacing="0"/>
    <portSpacing port="sink_document 4" spacing="0"/>
    </process>
    </operator>
    <operator activated="false" class="text:generate_extract" compatibility="7.5.000" expanded="true" height="68" name="Generate Extract" width="90" x="514" y="391">
    <parameter key="source_attribute" value="text"/>
    <parameter key="query_type" value="Regular Expression"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries">
    <parameter key="word1_word2_word3_Nominal" value="(?=.*wordA)(?=.*wordB)(?=.*wordC)"/>
    </list>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
    <connect from_op="Generate Data by User Specification (4)" from_port="output" to_op="Append" to_port="example set 4"/>
    <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    2. Second, this process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="75">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordA&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="187">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordB&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="112" y="289">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordC&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="68" name="Generate Data by User Specification (4)" width="90" x="112" y="391">
    <list key="attribute_values">
    <parameter key="text" value="&quot;test wordB hello wordA good morning wordC&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="8.0.001" expanded="true" height="145" name="Append" width="90" x="380" y="187"/>
    <operator activated="true" class="nominal_to_text" compatibility="8.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="514" y="187"/>
    <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="648" y="187">
    <list key="function_descriptions">
    <parameter key="needleBinominal" value="contains(text,&quot;wordA&quot;)&amp;&amp;contains(text,&quot;wordB&quot;)&amp;&amp;contains(text,&quot;wordC&quot;)"/>
    </list>
    </operator>
    <operator activated="false" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="782" y="442">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
    <parameter key="expression" value="(word1\b)(word2\b)(word3\b)"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="124" name="Multiply" width="90" x="246" y="187"/>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="514" y="442">
    <parameter key="string" value="wordC"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="514" y="289">
    <parameter key="string" value="wordA"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="514" y="85">
    <parameter key="string" value="wordB"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Filter Tokens (3)" to_port="document"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Filter Tokens (2)" to_port="document"/>
    <connect from_op="Multiply" from_port="output 3" to_op="Filter Tokens (4)" to_port="document"/>
    <connect from_op="Filter Tokens (4)" from_port="document" to_port="document 3"/>
    <connect from_op="Filter Tokens (3)" from_port="document" to_port="document 2"/>
    <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    <portSpacing port="sink_document 3" spacing="0"/>
    <portSpacing port="sink_document 4" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:generate_extract" compatibility="7.5.000" expanded="true" height="68" name="Generate Extract" width="90" x="782" y="187">
    <parameter key="source_attribute" value="text"/>
    <parameter key="query_type" value="Regular Expression"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries">
    <parameter key="word1_word2_word3_Nominal" value="(?=.*wordA)(?=.*wordB)(?=.*wordC)"/>
    </list>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
    <connect from_op="Generate Data by User Specification (4)" from_port="output" to_op="Append" to_port="example set 4"/>
    <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Extract" to_port="Example Set"/>
    <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    For this last process, I don't understand why nothing is displayed in the new column for the rows that contain all the wanted words.

    Regards,

     

    Lionel 
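
One possible explanation for the empty column, offered as a guess rather than a confirmed diagnosis: the query in the second process consists only of lookaheads, which check for the words without consuming any text, and it has no capturing group. If Generate Extract returns the first capturing group (an assumption based on the earlier .*(needle).* example), there would be nothing for it to return. The Python sketch below shows the regular-expression side of this; the trailing (.*) group is just one way to give the pattern something to capture.

```python
import re

text = "test wordB hello wordA good morning wordC"
lookaheads = r"(?=.*wordA)(?=.*wordB)(?=.*wordC)"

# The lookaheads succeed at position 0 but consume no characters,
# so the match is the empty string and there is no group 1 to return.
match = re.search(lookaheads, text)
print(repr(match.group(0)))   # ''
print(match.re.groups)        # 0

# Appending a capturing group gives the pattern something to return.
with_group = re.search(lookaheads + r"(.*)", text)
print(with_group.group(1))    # test wordB hello wordA good morning wordC

# A row that lacks one of the words still fails the lookaheads.
missing_word = re.search(lookaheads + r"(.*)", "test wordA")
print(missing_word)           # None
```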

     

     

     

  • rajbanokhan Member Posts: 29 Maven

    Hi,

    I want to search for a specific word in a text. For example, we have a paragraph and I want to search for a word in it.

    So I use the Generate Attributes operator, but how do I write the expression that uses the matches function and a suitable regular expression?

     
