Only search for a specific keyword from a text

tahsin Member Posts: 20 Contributor II
Hello,

I want to search a text for specific keywords and assign a type based on which keyword is found. I am using the Generate Attributes operator with a function expression to search for the keywords. The problem is that my list contains words like "liar", "lies", and "lied", and the expression I am using also picks up words like "families" and "familiar". I only want to match the whole words "lies" and "liar", not "families" or "familiar".

This was my approach:

if(matches(Notes,".*lies.*"),"Lies",
if(matches(Notes,".*liar.*"),"Lies",
if(matches(Notes,".*lied.*"),"Lies",
if(matches(Notes,".*lying.*"),"Lying","None"))))
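
The over-matching can be reproduced outside RapidMiner. The following is a minimal sketch in Python using only the standard `re` module, assuming that `matches` tests the whole string (which the `.*` padding in the expression suggests). The sample notes are made up for illustration. It compares a substring-style pattern with one that wraps the keywords in `\b` word boundaries.

```python
import re

# Made-up sample notes for illustration; not taken from the original data.
notes = [
    "He told lies",
    "She is familiar with the case",
    "Both families were present",
    "He was called a liar",
]

# Substring-style pattern: the keyword may sit inside a longer word.
substring = re.compile(r".*(?:lies|liar|lied).*")

# Word-boundary pattern: \b requires the keyword to start and end at a word edge.
whole_word = re.compile(r".*\b(?:lies|liar|lied)\b.*")

substring_hits = [bool(substring.fullmatch(n)) for n in notes]
whole_word_hits = [bool(whole_word.fullmatch(n)) for n in notes]

print(substring_hits)   # every note matches, including "familiar" and "families"
print(whole_word_hits)  # only the notes containing "lies" or "liar" as whole words match
```

The substring pattern matches "familiar" because it contains "liar", and "families" because it contains "lies". The `\b` anchors reject both.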

Any help is appreciated. Thanks

Best Answer

  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Solution Accepted
    Hello @tahsin

    You could use the Map operator with regular expressions enabled to replace the entire text of the attribute with a label. Because this overwrites the original values, you may want to work on a copy of the attribute.
    I'm pasting a process that could help you get there.
    Since you are doing text processing, I would also recommend going through the Text and Web Mining tutorials at the Academy:

    https://academy.rapidminer.com/learn/course/text-and-web-mining-with-rapidminer/text-and-web-mining/lets-get-started

    <?xml version="1.0" encoding="UTF-8"?><process version="9.9.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.9.002" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.9.002" expanded="true" height="68" name="Create ExampleSet" width="90" x="246" y="85">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="Text&#10;He said a couple of lies to us&#10;The wife pointed he was a liar and that was the reason for it&#10;Person lied about the reason he was at that place&#10;He was lying all the time"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="generate_copy" compatibility="9.9.002" expanded="true" height="82" name="Generate Copy" width="90" x="380" y="85">
            <parameter key="attribute_name" value="Text"/>
            <parameter key="new_name" value="Type"/>
          </operator>
          <operator activated="true" class="map" compatibility="9.9.002" expanded="true" height="82" name="Map" width="90" x="514" y="85">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Type"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <list key="value_mappings">
              <parameter key=".*\b(lies|liar|lied)\b.*" value="Liar"/>
              <parameter key=".*\b(lying)\b.*" value="Lying"/>
            </list>
            <parameter key="consider_regular_expressions" value="true"/>
            <parameter key="add_default_mapping" value="false"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Generate Copy" to_port="example set input"/>
          <connect from_op="Generate Copy" from_port="example set output" to_op="Map" to_port="example set input"/>
          <connect from_op="Map" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
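
The mapping logic of the process above can be sketched in plain Python for anyone who wants to check the patterns before running it. This is only an illustration under two assumptions: that the Map operator tests each regular expression against the whole attribute value, and that the first matching entry wins. The `classify` helper, the `"None"` fallback, and the last sample sentence are hypothetical additions; the process itself has no default mapping.

```python
import re

# Ordered (pattern, label) pairs mirroring the labels in the value_mappings list.
# Patterns are padded with ".*" so the keyword may appear anywhere in the text,
# including at the very start or end.
MAPPINGS = [
    (re.compile(r".*\b(lies|liar|lied)\b.*"), "Liar"),
    (re.compile(r".*\b(lying)\b.*"), "Lying"),
]


def classify(text, default="None"):
    """Return the label of the first mapping whose pattern matches the whole text."""
    for pattern, label in MAPPINGS:
        if pattern.fullmatch(text):
            return label
    # Hypothetical fallback: the original process does not define a default mapping.
    return default


samples = [
    "He said a couple of lies to us",
    "The wife pointed he was a liar and that was the reason for it",
    "Person lied about the reason he was at that place",
    "He was lying all the time",
    "Both families are familiar with the case",  # made-up negative example
]

labels = [classify(s) for s in samples]
for text, label in zip(samples, labels):
    print(f"{label:6} <- {text}")
```

The first four sentences come from the `input_csv_text` parameter of the process. The fifth is added to show that "families" and "familiar" fall through to the fallback label.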

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,527 RM Data Scientist
    Don't you want to use the contains function?

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • tahsin Member Posts: 20 Contributor II
    Hi Martin, I tried the contains function first, but it does the same thing: it picks up every word that contains the keyword.

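    This is actually how I do it in Python:
    import numpy as np  # df is an existing pandas DataFrame with a text column "Notes"
    # Non-capturing groups (?:...) avoid the pandas warning about match groups in str.contains
    df['Type'] = np.where(df.Notes.str.contains(r'\b(?:lies|liar|lied)\b'), 'Lies',
                 np.where(df.Notes.str.contains(r'\b(?:lying)\b'), 'Lying','None'))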

    Not sure how to do it in here. 