
Text Extraction/Tokenization

TCGJ Member Posts: 1 Newbie
Hi everyone!
I am trying to de-personalize some data that I have, and I figured I could use RM either to extract the names or to tokenize the text so that I can anonymize them.
I've tried the Tokenize and Extract Information operators, but they haven't given me the results I was looking for.
Has anyone done this before and/or can offer some advice?

Answers

  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Hi TCGJ, this is not a perfect solution, but it may give you an idea of what you can do with some operators inside RapidMiner:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.8.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.8.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.8.001" expanded="true" height="68" name="Text on a DataSet" width="90" x="112" y="34">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="TEXT&#10;Hello my name is Chandler Muriel Bing. I have a friend who is named Pieter van den Woude and he has another friend, A. A. Milne. Gandalf the Gray joins us. Together, we make up the Friends Cast and Crew."/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="replace" compatibility="9.8.001" expanded="true" height="82" name="Replace" width="90" x="246" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="replace_what" value=" [A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)"/>
            <parameter key="replace_by" value=" xxxxx"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="9.3.001" expanded="true" height="68" name="Create Document" width="90" x="112" y="136">
            <parameter key="text" value="Hello my name is Chandler Muriel Bing. I have a friend who is named Pieter van den Woude and he has another friend, A. A. Milne. Gandalf the Gray joins us. Together, we make up the Friends Cast and Crew."/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:replace_tokens" compatibility="9.3.001" expanded="true" height="68" name="Replace Tokens" width="90" x="246" y="136">
            <list key="replace_dictionary">
              <parameter key=" [A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)" value=" xxxxxxxxxxxxxx"/>
            </list>
          </operator>
          <connect from_op="Text on a DataSet" from_port="output" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
          <connect from_op="Create Document" from_port="output" to_op="Replace Tokens" to_port="document"/>
          <connect from_op="Replace Tokens" from_port="document" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <description align="center" color="yellow" colored="true" height="80" resized="true" width="698" x="42" y="214">Regex Expression copied from: &lt;br/&gt;https://stackoverflow.com/questions/7653942/find-names-with-regular-expression</description>
        </process>
      </operator>
    </process>
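
To experiment with the pattern outside RapidMiner, the Replace step of the process above can be approximated in plain Python. This is only a sketch: the regex and the sample sentence are copied from the process, while the names `NAME_PATTERN` and `mask_names` are made up for illustration. Java and Python regex engines can differ in edge cases, so treat the result as a quick check rather than an exact reproduction of the operator.

```python
import re

# Same shape-based pattern as in the Replace operator, including the
# leading space. It looks for runs of capitalized words or initials,
# optionally joined by up to two lowercase particles ("van den", "the").
NAME_PATTERN = re.compile(
    r" [A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*"
    r"(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)"
)


def mask_names(text, mask=" xxxxx"):
    """Replace every match of NAME_PATTERN with the mask string."""
    return NAME_PATTERN.sub(mask, text)


# Sample sentence taken from the process above.
SAMPLE = (
    "Hello my name is Chandler Muriel Bing. I have a friend who is named "
    "Pieter van den Woude and he has another friend, A. A. Milne. "
    "Gandalf the Gray joins us. Together, we make up the Friends Cast and Crew."
)

masked = mask_names(SAMPLE)
print(masked)
```

Because the pattern only looks at capitalization, it also masks capitalized phrases that are not names (here "Friends Cast and Crew"), and it cannot match a name at the very start of the text because of the leading space. That is part of why this is not a perfect solution.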
    You could also use the MeaningCloud extension to recognize some other entities.
    For a more advanced take on the problem, you could test the spaCy library with the Execute Python operator or with Jupyter Notebooks on the AI Hub platform.

    This link may help you:

    https://medium.com/analytics-vidhya/natural-language-processing-using-spacy-in-python-part-1-ac1bc4ad2b9c  
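
The spaCy route mentioned above could look roughly like this inside an Execute Python operator. It is a minimal sketch under several assumptions: spaCy and its small English model `en_core_web_sm` are installed in the Python environment RapidMiner uses, the text column is called `TEXT` (as in the sample data above), and the helper names `redact_spans` and `anonymize` are made up for illustration. `rm_main` is the entry point that Execute Python calls with the input as a pandas DataFrame.

```python
def redact_spans(text, spans, mask="xxxxx"):
    """Replace each (start, end) character span in text with the mask."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])  # keep text before the span
        pieces.append(mask)                # mask the span itself
        cursor = end
    pieces.append(text[cursor:])           # keep the remaining tail
    return "".join(pieces)


_NLP = None  # the model is loaded lazily, once per Python session


def _get_nlp():
    global _NLP
    if _NLP is None:
        import spacy  # assumption: spaCy is installed
        _NLP = spacy.load("en_core_web_sm")  # assumption: model is downloaded
    return _NLP


def anonymize(text):
    """Mask every entity that spaCy labels as PERSON."""
    doc = _get_nlp()(text)
    spans = [
        (ent.start_char, ent.end_char)
        for ent in doc.ents
        if ent.label_ == "PERSON"
    ]
    return redact_spans(text, spans)


def rm_main(data):
    # Assumption: the ExampleSet has a nominal column named "TEXT".
    data["TEXT"] = data["TEXT"].astype(str).apply(anonymize)
    return data
```

Unlike the regex, the model uses context to decide what is a person, so it should be less sensitive to capitalization alone, but it will still miss some names and mislabel others. Check the output on a sample of your own data before relying on it.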

