Extracting Information using Natural Language Processing

msacs09 Member Posts: 55 Contributor II
Experts,

I'm trying to extract a few keywords from the attached data sample. Is there a sample process I can refer to that would give me the desired results?

From the attached sample, I need to extract each individual's previous associations (entities) from the "notes" column. Below is the output I need from the attached sample. I have around 6K notes like this.

Previous_Associations 
(1) Mercury Marine
(2) Thrivent
(3) Pride System, Excel Capital
(4) Rinco
(5) Aero Network

Thanks for your time.

Answers

  • kayman Member Posts: 662 Unicorn
    There are several ways to approach this, but if your data doesn't get more complex than this example, I'd suggest using the Generate Attributes operator as in the process below. Since your record set is pretty small, it should run fast enough to get what you need.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.3.001" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="Notes&#10;Mr. Stewart Little is a Co-Founder and serves as the Chief Information Officer at Rinco. He is the faculyt of Department of Computer Science at Kinteic Enginerring Universtity. He was the Director at Mercury Marine.&#10;Ms. Lindsy Grahm is a Co-Founder and serves as a Head of Design at Mercedes. Lindsy is a Forbes 30 Under 30 recipient, and most recently was a Co-Founder of Thrivent, an enterprise datacenter management platform started from a Harvard University research project.&#10;Mr. Michael Johnson is a Co-Founder and Board Member of RedTraffic, where he serves as Chief Executive Officer. He is a Co-Founder of Majesic Solutions. Prior to this, he co-founded Pride Systems. He also co-founded and was a Board Member of Excel Capital. He also served as Chief Operating Officer at the New York Times Publishing Corporation.&#10;Mr. Shawn Reiding is a Co-Founder and serves as the Chief Executive Officer at Captial Corp. He was also the Co-Founder and CTO at Rinco.&#10;Ms. Sheena Jackson is a Co-Founder and serves as the Chief Executive Officer at MightyTalk. Previously, she served as a Director, Product Management at Aero Network.&#10;"/>
            <parameter key="column_separator" value="\t"/>
            <parameter key="parse_all_as_nominal" value="true"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="34">
            <list key="function_descriptions">
              <parameter key="Extract" value="if(contains([Notes],&quot;Mercury Marine&quot;),&quot;Mercury Marine&quot;,&quot;&quot;)"/>
              <parameter key="Extract" value="if(contains([Notes],&quot;Thrivent&quot;),concat(&quot;Thrivent&quot;,&quot;, &quot;,[Extract]),[Extract])"/>
              <parameter key="Extract" value="if(contains([Notes],&quot;Pride System&quot;),concat(&quot;Pride System&quot;, &quot;,&quot;,[Extract]),[Extract])"/>
              <parameter key="Extract" value="if(contains([Notes],&quot;Rinco&quot;),concat(&quot;Rinco&quot;,&quot;, &quot;,[Extract]),[Extract])"/>
              <parameter key="Extract" value="if(contains([Notes],&quot;Aero Network&quot;),concat(&quot;Aero Network&quot;,&quot;, &quot;,[Extract]),[Extract])"/>
              <parameter key="Extract" value="if(contains([Notes],&quot;Excel Capital&quot;),concat(&quot;Excel Capital&quot;,&quot;, &quot;,[Extract]),[Extract])"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" class="split" compatibility="9.3.001" expanded="true" height="82" name="Split" width="90" x="380" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Extract"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="split_pattern" value=","/>
            <parameter key="split_mode" value="ordered_split"/>
          </operator>
          <operator activated="true" class="trim" compatibility="9.3.001" expanded="true" height="82" name="Trim" width="90" x="514" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Split" to_port="example set input"/>
          <connect from_op="Split" from_port="example set output" to_op="Trim" to_port="example set input"/>
          <connect from_op="Trim" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
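
    For readers who prefer to see the logic outside the process XML, a rough Python equivalent of what the Generate Attributes / Split / Trim chain above does might look like the sketch below. The note strings are shortened and illustrative only, and the company list is hard-coded, exactly like the contains() expressions above.

    # Rough Python equivalent of the Generate Attributes + Split logic above:
    # for every note, collect the hard-coded company names it mentions.
    notes = [
        "He was the Director at Mercury Marine.",                    # shortened, illustrative notes
        "Previously, she served as a Director, Product Management at Aero Network.",
    ]
    companies = ["Mercury Marine", "Thrivent", "Pride System",
                 "Rinco", "Aero Network", "Excel Capital"]           # same hard-coded list as above

    for note in notes:
        matches = [company for company in companies if company in note]
        print(", ".join(matches))                                    # one comma-separated row per note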
    


  • msacs09 Member Posts: 55 Contributor II
    edited June 2019
    kayman, sorry, I might not have been clear about the requirement. I would like to programmatically extract those Previous_Associations using some kind of NLP approach in RapidMiner (NLTK-style), or perhaps we can start with the pattern that follows words like "previously", "recently" or "prior". I wouldn't know those entities up front, so I can't hard-code their names in the "contains" statement.
  • kayman Member Posts: 662 Unicorn
    Ah, that makes it a different story indeed :-)

    One relatively simple, theoretical approach might then be the following (a rough sketch is at the end of this post):
     
    - Tokenise your content so you get one sentence per line
    - Look for sentences that contain defined keywords like previous / was / recently etc. close to any of the given company names
    - Ignore the other sentences
    - Extract the company names

    I'll see if I can find the time to get a sample working.

    NLTK (and similar toolkits) ship with named entity recognition logic, but you would still have to train these to recognize your brands as well. On top of that, they will not understand the relation between a company and a current versus a previous position.

    You can of course train models to do this for you, but that requires training data, and since you only have 6,000 records you would almost have to tag them all manually to get a starter set, which is probably overkill here.
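
    Purely as an illustration of the bullet list above (and of what an off-the-shelf NLTK pipeline could contribute), a minimal Python sketch might look like the one below. It assumes NLTK's pre-trained models are downloaded; the ORGANIZATION tags it produces will miss or mislabel some of your company names, so treat it as a starting point, not a finished extractor.

    # Minimal sketch: sentence-split, keep "previous"-flavoured sentences, then let
    # NLTK's pre-trained NER propose ORGANIZATION chunks from those sentences.
    # Requires: nltk.download('punkt'), 'averaged_perceptron_tagger',
    #           'maxent_ne_chunker', 'words'
    import re
    import nltk

    KEYWORDS = re.compile(r"\b(was|previously|recently|prior|served)\b", re.IGNORECASE)

    def previous_associations(note):
        hits = []
        for sentence in nltk.sent_tokenize(note):        # one sentence per "line"
            if not KEYWORDS.search(sentence):            # ignore the other sentences
                continue
            tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
            for chunk in tree:                           # named-entity subtrees carry a label
                if hasattr(chunk, "label") and chunk.label() == "ORGANIZATION":
                    hits.append(" ".join(word for word, tag in chunk))
        return hits

    note = ("Mr. Stewart Little is a Co-Founder and serves as the Chief Information "
            "Officer at Rinco. He was the Director at Mercury Marine.")
    print(previous_associations(note))   # e.g. ['Mercury Marine'], if the NER model tags it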

  • kayman Member Posts: 662 Unicorn
    Try something like this:

    It will look for the companies you provide in a simple list, and if one of these appears in a sentence containing one of the defined keywords, it will be extracted as an entity.

    It does mean you need to know the companies up front, but you could use some other logic, like looking for terms such as 'founder of / CTO of / worked for', to get most of them up front as well (see the regex sketch after the process XML below). This is an exercise you will have to do either way.
    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.3.001" expanded="true" height="68" name="strings" width="90" x="112" y="289">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="Notes&#10;Mr. Stewart Little is a Co-Founder and serves as the Chief Information Officer at Rinco. He is the faculyt of Department of Computer Science at Kinteic Engineering Universtity. He was the Director at Mercury Marine.&#10;Ms. Lindsy Grahm is a Co-Founder and serves as a Head of Design at Mercedes. Lindsy is a Forbes 30 Under 30 recipient, and most recently was a Co-Founder of Thrivent, an enterprise datacenter management platform started from a Harvard University research project.&#10;Mr. Michael Johnson is a Co-Founder and Board Member of RedTraffic, where he serves as Chief Executive Officer. He is a Co-Founder of Majesic Solutions. Prior to this, he co-founded Pride Systems. He also co-founded and was a Board Member of Excel Capital. He also served as Chief Operating Officer at the New York Times Publishing Corporation.&#10;Mr. Shawn Reiding is a Co-Founder and serves as the Chief Executive Officer at Captial Corp. He was also the Co-Founder and CTO at Rinco.&#10;Ms. Sheena Jackson is a Co-Founder and serves as the Chief Executive Officer at MightyTalk. Previously, she served as a Director, Product Management at Aero Network.&#10;"/>
            <parameter key="column_separator" value="\t"/>
            <parameter key="parse_all_as_nominal" value="true"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.3.001" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="246" y="289">
            <list key="function_descriptions">
              <parameter key="Previous" value="[Notes]"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">duplicate the field to clean the data</description>
          </operator>
          <operator activated="true" class="utility:create_exampleset" compatibility="9.3.001" expanded="true" height="68" name="companies" width="90" x="112" y="493">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="from&#10;RedTraffic&#10;Mercury Marine&#10;Thrivent&#10;Pride System&#10;Rinco&#10;Aero Network&#10;Excel Capital&#10;New York Times Publishing Corporation&#10;Kinteic Engineering Universtity&#10;Captial Corp&#10;MightyTalk"/>
            <parameter key="column_separator" value="\t"/>
            <parameter key="parse_all_as_nominal" value="true"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">simple list with known companies</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.3.001" expanded="true" height="82" name="Generate Attributes (3)" width="90" x="246" y="493">
            <list key="function_descriptions">
              <parameter key="to" value="concat(&quot;[COMP:&quot;,[from],&quot;]&quot;)"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">create a simple tag so the system can recognize these as entities</description>
          </operator>
          <operator activated="true" class="replace_dictionary" compatibility="9.3.001" expanded="true" height="103" name="Replace (Dictionary)" width="90" x="447" y="289">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Previous"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="from_attribute" value="from"/>
            <parameter key="to_attribute" value="to"/>
            <parameter key="use_regular_expressions" value="false"/>
            <parameter key="convert_to_lowercase" value="false"/>
            <parameter key="first_match_only" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">add tags to recognized companies</description>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="581" y="289">
            <parameter key="attribute_name" value="Notes"/>
            <parameter key="target_role" value="original"/>
            <list key="set_additional_roles"/>
            <description align="center" color="transparent" colored="false" width="126">give original data a label so it travels through the process as metadata</description>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="9.3.001" expanded="true" height="82" name="Nominal to Text" width="90" x="715" y="289">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Previous"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">convert to text for processing</description>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="8.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="849" y="289">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="select_attributes_and_weights" value="false"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="8.2.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
                <parameter key="mode" value="linguistic sentences"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
                <description align="center" color="transparent" colored="false" width="126">split by sentence</description>
              </operator>
              <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.2.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="246" y="34">
                <parameter key="condition" value="matches"/>
                <parameter key="regular_expression" value=".*\b(was|previously|recently|prior|served)\b.*"/>
                <parameter key="case_sensitive" value="false"/>
                <parameter key="invert condition" value="false"/>
                <description align="center" color="transparent" colored="false" width="126">add all relevant keywords for filter</description>
              </operator>
              <operator activated="true" class="text:tokenize" compatibility="8.2.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="380" y="34">
                <parameter key="mode" value="specify characters"/>
                <parameter key="characters" value="[]"/>
                <parameter key="expression" value="\["/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.2.000" expanded="true" height="68" name="Filter Tokens (by Content) (2)" width="90" x="514" y="34">
                <parameter key="condition" value="contains"/>
                <parameter key="string" value="COMP:"/>
                <parameter key="regular_expression" value=".*\b(was|previously|recently|prior)\b.*"/>
                <parameter key="case_sensitive" value="false"/>
                <parameter key="invert condition" value="false"/>
                <description align="center" color="transparent" colored="false" width="126">add all relevant keywords for filter</description>
              </operator>
              <operator activated="true" class="text:replace_tokens" compatibility="8.2.000" expanded="true" height="68" name="Replace Tokens" width="90" x="648" y="34">
                <list key="replace_dictionary">
                  <parameter key="COMP:" value=", "/>
                </list>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
              <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
              <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Tokens (by Content) (2)" to_port="document"/>
              <connect from_op="Filter Tokens (by Content) (2)" from_port="document" to_op="Replace Tokens" to_port="document"/>
              <connect from_op="Replace Tokens" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">extract sentences containing keywords as previous etc</description>
          </operator>
          <operator activated="true" class="replace" compatibility="9.3.001" expanded="true" height="82" name="Replace" width="90" x="983" y="289">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="text"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="true"/>
            <parameter key="replace_what" value="^, "/>
            <description align="center" color="transparent" colored="false" width="126">some final cleanup</description>
          </operator>
          <operator activated="true" class="rename" compatibility="9.3.001" expanded="true" height="82" name="Rename" width="90" x="1117" y="289">
            <parameter key="old_name" value="text"/>
            <parameter key="new_name" value="Previous"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role (2)" width="90" x="1251" y="289">
            <parameter key="attribute_name" value="Notes"/>
            <parameter key="target_role" value="regular"/>
            <list key="set_additional_roles">
              <parameter key="Previous" value="regular"/>
            </list>
          </operator>
          <connect from_op="strings" from_port="output" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Replace (Dictionary)" to_port="example set input"/>
          <connect from_op="companies" from_port="output" to_op="Generate Attributes (3)" to_port="example set input"/>
          <connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Replace (Dictionary)" to_port="dictionary"/>
          <connect from_op="Replace (Dictionary)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
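
    As hinted at above, you can also bootstrap the company list itself by harvesting whatever follows role wordings like 'Co-Founder of', 'CTO at' or 'Director at'. A rough, purely illustrative Python regex sketch follows; the role phrases and the capitalized-phrase pattern are assumptions, and it will over- and under-match, so plan on a manual review pass.

    # Illustrative only: harvest candidate company names from phrases such as
    # "Co-Founder of X", "CTO at X", "Director at X". Review the output by hand.
    import re

    ROLE_PATTERN = re.compile(
        r"(?:Co-Founder of|Founder of|CTO at|Officer at|Director at|Board Member of|Management at)\s+"
        r"((?:[A-Z][\w&'-]*\s?)+)"        # a run of capitalized words = candidate name
    )

    def candidate_companies(notes):
        candidates = set()
        for note in notes:
            for match in ROLE_PATTERN.finditer(note):
                candidates.add(match.group(1).strip().rstrip("."))
        return sorted(candidates)

    notes = ["He was the Director at Mercury Marine.",
             "Previously, she served as a Director, Product Management at Aero Network."]
    print(candidate_companies(notes))     # ['Aero Network', 'Mercury Marine'] on this toy input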
    


  • msacs09 Member Posts: 55 Contributor II
    Thank you. Sadly, as you stated, the challenge is getting the training data (the known companies) without manually tagging them, since we do not have a list of known companies to train on :-(
  • kayman Member Posts: 662 Unicorn
    There are some ways to get these relatively quickly, but it all depends on the quality of the data. Since company names, for instance, usually start with a capital letter, you could try filtering out all the words that appear in lower case only; that gives you a much reduced word list you can go through faster.
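
    For example, a quick and dirty Python sketch of that capital-letter filter (just an illustration; it will still pick up personal names, job titles and sentence-initial words, so a human pass is still needed):

    # Reduce the word list to review: keep only capitalized words, ranked by frequency.
    import re
    from collections import Counter

    def capitalized_candidates(notes):
        counts = Counter()
        for note in notes:
            counts.update(re.findall(r"\b[A-Z][\w&'-]*\b", note))
        return [word for word, freq in counts.most_common()]   # most frequent first

    notes = ["He was the Director at Mercury Marine.",
             "Previously, she served at Aero Network."]
    print(capitalized_candidates(notes))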

    But yeah, it remains a challenge indeed...
  • msacs09 Member Posts: 55 Contributor II
    That’s a great thought.