Approach to standardize merchant names - Tagging

msacs09msacs09 Member Posts: 55 Contributor II
Experts,

I'm in the process of standardizing our transaction types and bucketing them into the correct categories.

For example, we have companies like the ones below. The biggest challenge is tagging them and putting them in the appropriate bucket. There are a lot of variations in the transaction types. What machine learning model can we use to tackle this monstrous tagging work? Are there any sample models built to address such use cases? Any reference is greatly appreciated.

CatgType      Matched             Actual Entry

HR            ADP                 Adp
Travel        Airbnb              Airbnb
Travel        Alaska Air          AlaskaAirlinesInc
HR            Allied Delta        Allied Delta
G&A           Amazon              Amazon
Server        AWS                 Amazon Web Services
Credit Crd    American Express    American Express
Travel        American Air        AmericanAirlines
Credit crd    American Express    Amex Epayment
Insurance     Anthem              Anthem Bc


Answers

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hmmm I have an idea but I'd like to test it out. Do you have a larger data set you can share?
  • msacs09msacs09 Member Posts: 55 Contributor II
    Sir, I sent the larger data set to your inbox. Thank you for all the support.
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    just wrote you back.
  • msacs09msacs09 Member Posts: 55 Contributor II
    Thank you, sir. What I was thinking is to get the industry/business category by scraping it from the Google search results page. For example, Toyota would be "automotive".

    Is there an example of how to scrape a Google results page to achieve this? Attached is what I wanted to extract.



  • msacs09msacs09 Member Posts: 55 Contributor II
    edited June 2019
    I think there are two things to be done for this use case. Again, experts, please correct me if I'm off track.

    First, name matching: group the different spellings of a company under one name, e.g. AWS, Amazon Web Services, Amazon Web Services Inc, Amazon Web Services Llc, etc. should all map to the same company.

    Second, pass the company names to Google Search or the Wikipedia API (which isn't as consistent as Google) and scrape the result. In the example below, it should come back as a courier delivery services company.

    https://en.wikipedia.org/w/api.php?action=opensearch&search=FEDEX&limit=1&format=json

    https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=FEDEX

    So I think I have the theory part, but how to do this in RM is where I have a BIG GAP. Any sample process to get me started is greatly appreciated.
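    For readers following along outside RapidMiner, the two Wikipedia calls above can be sketched in Python as follows. This is only an illustration: the function names are made up for this example, `exintro=1` / `explaintext=1` / `redirects=1` are written with explicit values (MediaWiki treats a boolean parameter as set whenever it is present), and the sample response at the end is hand-written to mirror the documented shape of a `prop=extracts` reply, not real API output.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def opensearch_url(name: str) -> str:
    """URL for the first call: resolve a raw merchant name to a page title."""
    params = {"action": "opensearch", "search": name, "limit": 1, "format": "json"}
    # urlencode escapes spaces and other special characters in the name.
    return f"{API}?{urlencode(params)}"


def extract_url(title: str) -> str:
    """URL for the second call: plain-text intro of the page with that title."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "redirects": 1,
        "titles": title,
    }
    return f"{API}?{urlencode(params)}"


def intro_text(response: dict) -> str:
    """Pull the intro text out of an already decoded prop=extracts response."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return ""  # page missing or no extract returned


# Hand-written sample in the shape of a prop=extracts reply (not real output).
SAMPLE_RESPONSE = {
    "query": {"pages": {"123": {"pageid": 123, "title": "FedEx", "extract": "sample intro text"}}}
}

print(opensearch_url("FEDEX"))
print(extract_url("FEDEX"))
print(intro_text(SAMPLE_RESPONSE))
```

    The intro text would still need a further step (keyword rules or a classifier) to turn it into a category such as "courier delivery services"; that step is not shown here.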
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hmm I don't really understand your theory here but if you want to grab those wikipedia JSONs in RapidMiner, that's not hard to do.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.3.000" expanded="true" height="68" name="Retrieve Transaction Category Tagging" width="90" x="45" y="85">
            <parameter key="repository_entry" value="Transaction Category Tagging"/>
          </operator>
          <operator activated="true" class="filter_example_range" compatibility="9.3.000" expanded="true" height="82" name="Filter Example Range" width="90" x="179" y="85">
            <parameter key="first_example" value="1"/>
            <parameter key="last_example" value="4"/>
            <parameter key="invert_filter" value="false"/>
          </operator>
          <operator activated="true" class="concurrency:loop_values" compatibility="9.3.000" expanded="true" height="82" name="Loop Values" width="90" x="313" y="85">
            <parameter key="attribute" value="Row Labels"/>
            <parameter key="iteration_macro" value="loop_value"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="false"/>
            <process expanded="true">
              <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="34">
                <parameter key="url" value="https://en.wikipedia.org/w/api.php?action=opensearch&amp;search=%{loop_value}&amp;limit=1&amp;format=json"/>
                <parameter key="random_user_agent" value="false"/>
                <parameter key="connection_timeout" value="10000"/>
                <parameter key="read_timeout" value="10000"/>
                <parameter key="follow_redirects" value="true"/>
                <parameter key="accept_cookies" value="all"/>
                <parameter key="cookie_scope" value="global"/>
                <parameter key="request_method" value="GET"/>
                <list key="query_parameters"/>
                <list key="request_properties"/>
                <parameter key="override_encoding" value="false"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <operator activated="true" class="delay" compatibility="9.3.000" expanded="true" height="82" name="Delay" width="90" x="179" y="34">
                <parameter key="delay" value="fixed"/>
                <parameter key="delay_amount" value="1000"/>
                <parameter key="min_delay_amount" value="0"/>
                <parameter key="max_delay_amount" value="1000"/>
              </operator>
              <operator activated="true" class="text:json_to_data" compatibility="8.1.000" expanded="true" height="82" name="JSON To Data" width="90" x="313" y="34">
                <parameter key="ignore_arrays" value="false"/>
                <parameter key="limit_attributes" value="false"/>
                <parameter key="skip_invalid_documents" value="false"/>
                <parameter key="guess_data_types" value="true"/>
                <parameter key="keep_missing_attributes" value="false"/>
                <parameter key="missing_values_aliases" value=", null, NaN, missing"/>
              </operator>
              <connect from_op="Get Page" from_port="output" to_op="Delay" to_port="through 1"/>
              <connect from_op="Delay" from_port="through 1" to_op="JSON To Data" to_port="documents 1"/>
              <connect from_op="JSON To Data" from_port="example set" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="subprocess" compatibility="9.3.000" expanded="true" height="82" name="Union Append" width="90" x="447" y="85">
            <process expanded="true">
              <operator activated="true" class="loop_collection" compatibility="9.3.000" expanded="true" height="82" name="Output (4)" width="90" x="45" y="34">
                <parameter key="set_iteration_macro" value="true"/>
                <parameter key="macro_name" value="iteration"/>
                <parameter key="macro_start_value" value="1"/>
                <parameter key="unfold" value="false"/>
                <process expanded="true">
                  <operator activated="false" breakpoints="after" class="select" compatibility="9.3.000" expanded="true" height="68" name="Select (5)" width="90" x="112" y="34">
                    <parameter key="index" value="%{iteration}"/>
                    <parameter key="unfold" value="false"/>
                  </operator>
                  <operator activated="true" class="branch" compatibility="9.3.000" expanded="true" height="82" name="Branch (2)" width="90" x="313" y="34">
                    <parameter key="condition_type" value="expression"/>
                    <parameter key="expression" value="%{iteration}==1"/>
                    <parameter key="io_object" value="ANOVAMatrix"/>
                    <parameter key="return_inner_output" value="true"/>
                    <process expanded="true">
                      <connect from_port="condition" to_port="input 1"/>
                      <portSpacing port="source_condition" spacing="0"/>
                      <portSpacing port="source_input 1" spacing="0"/>
                      <portSpacing port="sink_input 1" spacing="0"/>
                      <portSpacing port="sink_input 2" spacing="0"/>
                    </process>
                    <process expanded="true">
                      <operator activated="true" class="recall" compatibility="9.3.000" expanded="true" height="68" name="Recall (5)" width="90" x="45" y="187">
                        <parameter key="name" value="LoopData"/>
                        <parameter key="io_object" value="ExampleSet"/>
                        <parameter key="remove_from_store" value="true"/>
                      </operator>
                      <operator activated="true" class="union" compatibility="9.3.000" expanded="true" height="82" name="Union (2)" width="90" x="179" y="34"/>
                      <connect from_port="condition" to_op="Union (2)" to_port="example set 1"/>
                      <connect from_op="Recall (5)" from_port="result" to_op="Union (2)" to_port="example set 2"/>
                      <connect from_op="Union (2)" from_port="union" to_port="input 1"/>
                      <portSpacing port="source_condition" spacing="0"/>
                      <portSpacing port="source_input 1" spacing="0"/>
                      <portSpacing port="sink_input 1" spacing="0"/>
                      <portSpacing port="sink_input 2" spacing="0"/>
                    </process>
                  </operator>
                  <operator activated="true" class="remember" compatibility="9.3.000" expanded="true" height="68" name="Remember (5)" width="90" x="581" y="34">
                    <parameter key="name" value="LoopData"/>
                    <parameter key="io_object" value="ExampleSet"/>
                    <parameter key="store_which" value="1"/>
                    <parameter key="remove_from_process" value="true"/>
                  </operator>
                  <connect from_port="single" to_op="Branch (2)" to_port="condition"/>
                  <connect from_op="Branch (2)" from_port="input 1" to_op="Remember (5)" to_port="store"/>
                  <connect from_op="Remember (5)" from_port="stored" to_port="output 1"/>
                  <portSpacing port="source_single" spacing="0"/>
                  <portSpacing port="sink_output 1" spacing="0"/>
                  <portSpacing port="sink_output 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="select" compatibility="9.3.000" expanded="true" height="68" name="Select (6)" width="90" x="179" y="34">
                <parameter key="index" value="%{iteration}"/>
                <parameter key="unfold" value="false"/>
              </operator>
              <connect from_port="in 1" to_op="Output (4)" to_port="collection"/>
              <connect from_op="Output (4)" from_port="output 1" to_op="Select (6)" to_port="collection"/>
              <connect from_op="Select (6)" from_port="selected" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Transaction Category Tagging" from_port="output" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Loop Values" to_port="input 1"/>
          <connect from_op="Loop Values" from_port="output 1" to_op="Union Append" to_port="in 1"/>
          <connect from_op="Union Append" from_port="out 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
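    As a rough cross-check of what the process above does (loop over names, fetch the opensearch JSON, wait a second, turn each JSON into a row, append the rows), here is a minimal Python sketch. It is not a translation of the RapidMiner operators. The function names, the User-Agent string and the two sample documents are assumptions made for illustration; the samples are hand-written in the four-element shape an opensearch reply has (query, titles, descriptions, URLs), and the network function is defined but not called.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def fetch_opensearch(name, pause_seconds=1.0):
    """Fetch the raw opensearch JSON for one name (network call, not run below)."""
    query = urlencode({"action": "opensearch", "search": name, "limit": 1, "format": "json"})
    # The User-Agent value is a made-up placeholder; replace it with your own.
    request = Request(f"{API}?{query}", headers={"User-Agent": "merchant-tagging-example/0.1"})
    with urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8")
    time.sleep(pause_seconds)  # plays the role of the Delay operator
    return body


def parse_opensearch(raw):
    """Turn one opensearch JSON document into a flat row."""
    data = json.loads(raw)
    titles = data[1] if len(data) > 1 else []
    urls = data[3] if len(data) > 3 else []
    return {
        "query": data[0],
        "title": titles[0] if titles else None,  # None when nothing matched
        "url": urls[0] if urls else None,
    }


def collect(raw_documents):
    """Append the parsed rows, similar in spirit to the Union Append step."""
    return [parse_opensearch(raw) for raw in raw_documents]


# Hand-written samples in the opensearch shape (not real API output).
SAMPLE_DOCUMENTS = [
    '["FEDEX", ["FedEx"], [""], ["https://en.wikipedia.org/wiki/FedEx"]]',
    '["Zzqx Unknown Merchant", [], [], []]',
]

rows = collect(SAMPLE_DOCUMENTS)
for row in rows:
    print(row)
```

    To run it against the live API, something like `collect(fetch_opensearch(n) for n in names)` would replace the sample list.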
    


  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    These are both fairly complex tasks (whether in RapidMiner or any other platform).  You can do some text string matching (using similarity measures) to try to combine instances where one string is a subset or close match to another, but many of the examples you provide (such as AWS and Amazon matching) are going to be very difficult to accomplish programmatically.  You may want to look at adding a manual dictionary of token replacement for commonly used abbreviations and acronyms.  
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
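    A minimal sketch of the two ideas in the post above (a manual replacement dictionary for abbreviations plus a string similarity measure), written in Python rather than RapidMiner. The alias table, the suffix list, the 0.6 threshold and all function names are assumptions chosen for this illustration, and `difflib.SequenceMatcher` is just one of many possible similarity measures.

```python
import re
from difflib import SequenceMatcher

# Hypothetical manual dictionary of abbreviations; extend it by hand.
ALIASES = {"aws": "amazon web services", "amex": "american express"}
# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}


def normalize(name):
    """Lowercase, split CamelCase, drop legal suffixes, expand known aliases."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)  # AlaskaAirlinesInc -> Alaska Airlines Inc
    tokens = re.findall(r"[a-z0-9&]+", spaced.lower())
    tokens = [ALIASES.get(t, t) for t in tokens if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(entry, canonical_names, threshold=0.6):
    """Return the closest canonical name, or None if nothing is close enough."""
    if not canonical_names:
        return None
    score, name = max((similarity(entry, c), c) for c in canonical_names)
    return name if score >= threshold else None


CANONICAL = ["American Air", "American Express", "Amazon"]
print(normalize("AlaskaAirlinesInc"))
print(similarity("AWS", "Amazon Web Services Llc"))
print(best_match("AmericanAirlines", CANONICAL))
print(best_match("Toyota", CANONICAL))
```

    Pure similarity cannot connect "AWS" to "Amazon Web Services"; in this sketch that match only works because the alias dictionary expands the abbreviation first, which is the point made in the post.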
  • kaymankayman Member Posts: 662 Unicorn
    Yeah, the joy of entity recognition. All nice but you need to get the training data...

    I'd follow the advice above and work in 2 steps. First have a kind of 'translation list'  where I'd use regex to convert most known variations to a common label. So (AWS|Amazon.*web.*services) becomes AWS or so. Dirty job but someone has to do it. 
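    In Python, such a translation list could look roughly like the sketch below. The first pattern is the one from this post (with word boundaries added around AWS so it does not fire inside longer words); the second pattern and the function name are made-up additions for illustration.

```python
import re

# Ordered list of (pattern, common label); the first matching pattern wins.
TRANSLATIONS = [
    (re.compile(r"\bAWS\b|Amazon.*web.*services", re.IGNORECASE), "AWS"),
    # Hypothetical second rule, added only to show how the list grows.
    (re.compile(r"\bAmex\b|American\s*Express", re.IGNORECASE), "American Express"),
]


def to_common_label(entry):
    """Map a raw entry to its common label, or return it unchanged."""
    for pattern, label in TRANSLATIONS:
        if pattern.search(entry):
            return label
    return entry.strip()


for raw in ["Amazon Web Services Llc", "AWS", "Amex Epayment", "Amazon", "Anthem Bc"]:
    print(raw, "->", to_common_label(raw))
```

    Order matters in a list like this: the more specific Amazon Web Services rule has to be checked before any generic Amazon rule that might be added later.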

    Next I'd do something like the attached example, where you use a simple list of all the entities you'd like to find (I've made something similar to look for brands etc. in reviews) and the process 'tags' them in the text. This can be converted relatively easily into more official tagging, so you could for instance create your own entity recognition model in spaCy and integrate it using Python.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.2.001" expanded="true" height="68" name="Brands" width="90" x="246" y="136">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="brand&#10;canon&#10;nikon&#10;panasonic&#10;samsung&#10;sony&#10;jbl&#10;sonos&#10;bose"/>
            <parameter key="column_separator" value="\t"/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="9.2.001" expanded="true" height="82" name="Subprocess (6)" width="90" x="380" y="136">
            <process expanded="true">
              <operator activated="true" class="generate_attributes" compatibility="9.2.001" expanded="true" height="82" name="Generate Attributes (3)" width="90" x="45" y="34">
                <list key="function_descriptions">
                  <parameter key="brand" value="trim(lower(brand))"/>
                  <parameter key="first" value="prefix([brand],1)"/>
                  <parameter key="remain" value="suffix([brand],length([brand])-1)"/>
                </list>
                <parameter key="keep_all" value="true"/>
              </operator>
              <operator activated="true" class="remove_duplicates" compatibility="9.2.001" expanded="true" height="103" name="Remove Duplicates (2)" width="90" x="179" y="34">
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <parameter key="treat_missing_values_as_duplicates" value="false"/>
              </operator>
              <operator activated="true" class="aggregate" compatibility="9.2.001" expanded="true" height="82" name="Aggregate (2)" width="90" x="313" y="34">
                <parameter key="use_default_aggregation" value="false"/>
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <parameter key="default_aggregation_function" value="average"/>
                <list key="aggregation_attributes">
                  <parameter key="remain" value="concatenation"/>
                </list>
                <parameter key="group_by_attributes" value="first"/>
                <parameter key="count_all_combinations" value="false"/>
                <parameter key="only_distinct" value="false"/>
                <parameter key="ignore_missings" value="true"/>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="9.2.001" expanded="true" height="82" name="Generate Attributes (4)" width="90" x="447" y="34">
                <list key="function_descriptions">
                  <parameter key="from" value="concat(&quot;(?i)\\b(&quot;,[first],&quot;(?:&quot;,[concat(remain)],&quot;))\\b&quot;)"/>
                  <parameter key="to" value="&quot;&lt;:tag:brand:XTAG$1:&gt;&quot;"/>
                </list>
                <parameter key="keep_all" value="true"/>
              </operator>
              <connect from_port="in 1" to_op="Generate Attributes (3)" to_port="example set input"/>
              <connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Remove Duplicates (2)" to_port="example set input"/>
              <connect from_op="Remove Duplicates (2)" from_port="example set output" to_op="Aggregate (2)" to_port="example set input"/>
              <connect from_op="Aggregate (2)" from_port="example set output" to_op="Generate Attributes (4)" to_port="example set input"/>
              <connect from_op="Generate Attributes (4)" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="246" y="34">
            <parameter key="text" value="This is a string that includes some brands, like Sony, samsung and Panasonic"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="380" y="34">
            <parameter key="text_attribute" value="strings"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="replace_dictionary" compatibility="9.2.001" expanded="true" height="103" name="Replace (2)" width="90" x="782" y="34">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="strings"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="from_attribute" value="from"/>
            <parameter key="to_attribute" value="to"/>
            <parameter key="use_regular_expressions" value="true"/>
            <parameter key="convert_to_lowercase" value="false"/>
            <parameter key="first_match_only" value="false"/>
          </operator>
          <connect from_op="Brands" from_port="output" to_op="Subprocess (6)" to_port="in 1"/>
          <connect from_op="Subprocess (6)" from_port="out 1" to_op="Replace (2)" to_port="dictionary"/>
          <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_op="Replace (2)" to_port="example set input"/>
          <connect from_op="Replace (2)" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
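    For comparison, the tagging idea in the process above (build a case-insensitive regex from an entity list, then wrap every hit in a tag) can be sketched in Python. This is a simplification, not a port: it builds one alternation for all entities instead of grouping them by first letter as the process does, and the function names are assumptions. The brand list, the sample sentence and the tag format are taken from the process.

```python
import re

BRANDS = ["canon", "nikon", "panasonic", "samsung", "sony", "jbl", "sonos", "bose"]


def build_pattern(entities):
    """One case-insensitive, whole-word regex that matches any entity."""
    cleaned = {e.strip().lower() for e in entities if e.strip()}
    if not cleaned:
        raise ValueError("entity list is empty")
    # Longest first, so a longer entity is tried before a shorter one that is its prefix.
    ordered = sorted(cleaned, key=len, reverse=True)
    alternation = "|".join(re.escape(e) for e in ordered)
    return re.compile(rf"\b({alternation})\b", re.IGNORECASE)


def tag(text, pattern, label="brand"):
    """Wrap every match in the same tag format the process writes."""
    return pattern.sub(lambda m: f"<:tag:{label}:XTAG{m.group(1)}:>", text)


text = "This is a string that includes some brands, like Sony, samsung and Panasonic"
tagged = tag(text, build_pattern(BRANDS))
print(tagged)
```

    The matched text keeps its original casing inside the tag, so the tagged strings can later be converted into character offsets for training an entity recognition model.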
    

