Filter examples with dictionary

EL75 Member Posts: 43 Contributor II
Hi,
I'm looking for a way to filter examples out of a data set using a dictionary of words. It would be a "Filter Examples" operator that works like "Replace (Dictionary)": it would remove every example whose chosen attribute contains a word from the dictionary (or, with the "invert filter" option, keep only those examples).
Best, 

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,254 RM Data Scientist
    Hi,
    so Filter Tokens Using ExampleSet does not help?

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • EL75 Member Posts: 43 Contributor II
    edited December 2020
    Yes, it helps for filtering tokens, but I'd like to filter (remove or KEEP) the entire row.
    Let me explain: when you have a dictionary of words that you're sure only one category of people uses, such a functionality would let you isolate the entire rows (i.e. the speakers) and then filter or split the original data set. That could be very relevant for the subsequent analysis. I'd therefore like a functionality that matches a set of words, supplied via an Excel file (the dictionary), against the attribute containing the text (verbatim). The "Replace (Dictionary)" operator is the idea, but the output would be a filtered row rather than a replaced string.
    best,
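
To make the request concrete, here is a minimal sketch of the desired behaviour in plain Python rather than as a RapidMiner process. The function name `filter_by_dictionary`, the `keep_matches` flag, and the sample rows are hypothetical; only the column name "Subj&Body" comes from this thread. Matching is a case-insensitive substring test, which is an assumption of the sketch.

```python
def filter_by_dictionary(rows, text_key, words, keep_matches=True):
    """Keep (or, with keep_matches=False, drop) rows whose text contains
    at least one dictionary word. Matching is a case-insensitive
    substring test, so "book" also matches "booking".
    """
    lowered = [w.lower() for w in words if w]
    result = []
    for row in rows:
        text = str(row.get(text_key, "")).lower()
        hit = any(w in text for w in lowered)
        if hit == keep_matches:
            result.append(row)
    return result


# Invented sample data for illustration only.
rows = [
    {"id": 1, "Subj&Body": "Please confirm my booking for Monday"},
    {"id": 2, "Subj&Body": "Invoice attached"},
    {"id": 3, "Subj&Body": "I lost the book you lent me"},
]
dictionary = ["book", "ticket"]

kept = filter_by_dictionary(rows, "Subj&Body", dictionary)
removed = filter_by_dictionary(rows, "Subj&Body", dictionary, keep_matches=False)
```

The two calls split the data set into the rows that mention a dictionary word and the rows that do not, which corresponds to the "remove or KEEP" choice described above.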
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,254 RM Data Scientist
    Hi @El75,
    how is this different from the "contains" option of Filter Examples? We could just loop over your dictionary and filter for all rows.

    Best,
    Martin
  • EL75 Member Posts: 43 Contributor II
    edited December 2020
    Martin,
    you're definitely right, and I've done it. Thanks for the solution.
    A few questions remain.
    First, here's what I've done:
    1- LEVEL 0 => a "Loop Attributes" operator to loop over the attributes of the dictionary:


    2- LEVEL -1 => a "Loop Values" operator within the "Loop Attributes" operator:




    3- LEVEL -2 => inside the "Loop Values" operator:


    The remaining questions, for which I couldn't find a solution by myself:

    A- Inside "Loop Values": I've added a "Generate Attributes" operator after the "Filter Examples" operator so that the output keeps, for each row of the data set, first the names of the dictionary categories and second the number of times the words of the dictionary were matched. The problem is that I couldn't find a way to calculate the value of the generated attribute.
    Let's take an example:
    - A dictionary with 2 columns (attributes) whose headers hold the attribute names: « dic1 » contains 10 words, « dic2 » contains 20 words.
    - One data set of 100 rows (examples) with many columns, including one, « Subj&Body », that contains the verbatim to be analysed by the filter operator.
    During the first "Loop Values" pass (dic1), an attribute named « dic1 » is generated, and each of the 100 rows of the data set is evaluated against each of the 10 words of the first dictionary attribute (« dic1 »). If row No. 2 of the data set contains 5 of the 10 words, how can I calculate, at the end of the "Loop Values" pass for « dic1 », the value « 5 » as the value of the generated « dic1 » attribute?
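
The calculation asked for in question A can be sketched in plain Python, outside RapidMiner. The helper names `count_matched_words` and `add_category_counts` and the sample data are hypothetical; the category names « dic1 » and « dic2 » and the column « Subj&Body » follow the example above. The sketch counts how many distinct words of a category occur in a row, using a case-insensitive substring test (an assumption).

```python
def count_matched_words(text, words):
    """Number of distinct dictionary words that occur in text (substring match)."""
    lowered_text = text.lower()
    unique_words = {w.lower() for w in words if w}
    return sum(1 for w in unique_words if w in lowered_text)


def add_category_counts(rows, text_key, dictionary):
    """Return copies of rows with one count column per dictionary category."""
    out = []
    for row in rows:
        new_row = dict(row)
        text = str(row.get(text_key, ""))
        for category, words in dictionary.items():
            new_row[category] = count_matched_words(text, words)
        out.append(new_row)
    return out


# Invented sample data: two small categories instead of 10 and 20 words.
dictionary = {
    "dic1": ["book", "ticket", "flight"],
    "dic2": ["invoice", "refund"],
}
rows = [
    {"id": 1, "Subj&Body": "Book a flight and send the ticket"},
    {"id": 2, "Subj&Body": "Refund please"},
    {"id": 3, "Subj&Body": "Hello"},
]
counted = add_category_counts(rows, "Subj&Body", dictionary)
```

Every row gets every category column, with 0 where nothing matched, so all rows end up with the same attributes.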

    B- The filter operator stops after identifying a first match. Keep in mind that this process aims to identify the rows of a data set whose attribute contains dictionary words, with the final goal of filtering those rows, so nothing that would be done in a text processing operator (such as tokenization) is possible here. The filter therefore matches every string that contains a dictionary entry, including false positives (e.g. the filter retains « booking » for « book » in the dictionary).
    With that in mind, would it nevertheless be possible for the filter operator to match all occurrences within a cell of the verbatim and return the number of matches?
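
One possible way to get both whole-word matching and a match count without tokenization is a regular expression with word boundaries. The sketch below is plain Python, not a RapidMiner operator; `build_pattern` and `count_matches` are hypothetical helper names, and the sample sentence is invented.

```python
import re


def build_pattern(words):
    """Compile one case-insensitive pattern matching any dictionary word
    as a whole word. The \\b anchors stop "book" from matching "booking".
    """
    escaped = [re.escape(w) for w in words if w]
    if not escaped:
        raise ValueError("dictionary must contain at least one word")
    # Longest words first so longer entries win over their own prefixes.
    escaped.sort(key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def count_matches(text, pattern):
    """Count every occurrence in the cell, not just the first one."""
    return len(pattern.findall(text))


pattern = build_pattern(["book", "ticket"])
text = "Booking a book: one book, one ticket."

whole_word_hits = count_matches(text, pattern)
substring_hits = text.lower().count("book")  # plain "contains" logic, for comparison
```

Here the whole-word pattern ignores « Booking » but counts both occurrences of « book » and the one of « ticket », whereas the substring count for « book » also includes « Booking ».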

    C- The "loop in the loop" generates a very big data set as output. For each row of the data set that contains the verbatim, different new attributes are generated, and not all rows have the same generated attributes, depending on which words were matched. As a result, to aggregate the output I've used (LEVEL -1) an "Append (Superset)" operator, which accepts example sets with different attributes.

    Then, to remove duplicate rows, I've used a Pivot operator. But I did that in a separate process, because the pivot needs to be tuned with the names of the dictionary attributes selected in "Loop Attributes" (LEVEL 0).
    Why? Because, with a big dictionary and a big data set, the following parameters have a huge impact on memory and computing capacity:
    a) the number of dictionary attributes that are selected,
    b) the option of creating an attribute for each word matched by the filter (as was done for the dictionary categories),
    c) keeping the UNMATCHED rows in the final output with a « 0 » value in all generated attributes.

    Because of the hardware limitations mentioned above, I've decided to give up b) and c).
    But as a) also has a huge impact on performance, the dictionary categories must be selected manually, and the pivot must be tuned accordingly. This is really time-consuming (e.g. if the number of categories differs from one run to another, the pivot doesn't work...).

    Here is the question: could there be a way to automate the pivot process, taking into account the number of selected attributes and their names? That would allow us to keep the pivot process within the "loop in the loop" one.
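
The idea behind question C, a pivot that discovers its columns from the data instead of being tuned by hand, can be sketched in plain Python. This is not the RapidMiner Pivot operator; the function `pivot_counts`, the keys "id", "category" and "count", and the sample rows are all hypothetical.

```python
def pivot_counts(long_rows, all_ids=()):
    """Turn long (id, category, count) records into one wide row per id.

    Column names are read from the data, so the number of categories can
    change between runs. Duplicate (id, category) records are summed.
    Ids listed in all_ids but never matched get 0 in every column.
    """
    columns = sorted({r["category"] for r in long_rows})
    wide = {rid: dict.fromkeys(columns, 0) for rid in all_ids}
    for r in long_rows:
        row = wide.setdefault(r["id"], dict.fromkeys(columns, 0))
        row[r["category"]] += r["count"]
    return [{"id": rid, **wide[rid]} for rid in sorted(wide)]


# Invented output of the nested loops: one record per match, with duplicates.
long_rows = [
    {"id": 1, "category": "dic1", "count": 2},
    {"id": 1, "category": "dic2", "count": 1},
    {"id": 2, "category": "dic1", "count": 1},
    {"id": 1, "category": "dic1", "count": 3},
]
wide = pivot_counts(long_rows, all_ids=[1, 2, 3])
```

Passing `all_ids` corresponds to option c) above (unmatched rows kept with « 0 »); leaving it out returns only the rows that matched something.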

    Thanks in advance for all help!
    best,