Filter examples with dictionary

EL75 · December 2020

Hi,
I'm looking for a way to filter examples out of a data set using a dictionary of words. It would be a "Filter Examples" operator that works like "Replace (Dictionary)": it would filter out every example whose chosen attribute contains a word from the dictionary (or, with the "invert filter" option, keep only those examples).
Best,

MartinLiebig · December 2020

Hi

So "Filter Tokens Using ExampleSet" does not help?

Best,

Martin

EL75 · December 2020

Yep, it helps for token filtering, but I'd like to filter (remove or KEEP) the entire row.
Let me explain: when you have a dictionary of words that you're sure only one category of people uses, such a functionality would let you isolate the entire rows (i.e. the speakers) and then filter or split the original data set. That could be very relevant for the subsequent analysis. Therefore, I'd like a functionality that matches a set of words, supplied via an Excel file (the dictionary), against the attribute containing the text (verbatim). The "Replace (Dictionary)" operator is the idea, but the result would be filtering the entire row rather than replacing one string with another.
best,

MartinLiebig · December 2020

Hi @El75,

How is this different from the "contains" option of Filter Examples? We could just loop over your dictionary and filter the rows for each word.
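
To make the idea concrete outside RapidMiner, here is a minimal plain-Python sketch of "loop over the dictionary and keep the rows that contain any of its words". The function name `filter_rows`, the sample rows and the two-word dictionary are made up for illustration; the match is a simple case-insensitive substring test, which is an assumption about how a "contains" filter behaves.

```python
def filter_rows(rows, text_attr, words, invert=False):
    """Keep rows whose text contains at least one dictionary word.

    With invert=True, keep only the rows that match none of the words.
    Matching is a case-insensitive substring test.
    """
    lowered = [w.lower() for w in words]
    kept = []
    for row in rows:
        text = row[text_attr].lower()
        matched = any(w in text for w in lowered)
        if matched != invert:
            kept.append(row)
    return kept


# Hypothetical sample data; "Subj&Body" is the text attribute.
rows = [
    {"id": 1, "Subj&Body": "I would like to book a flight"},
    {"id": 2, "Subj&Body": "The invoice is attached"},
    {"id": 3, "Subj&Body": "Please cancel my booking"},
]
dictionary = ["book", "flight"]

matching = filter_rows(rows, "Subj&Body", dictionary)
others = filter_rows(rows, "Subj&Body", dictionary, invert=True)
print([r["id"] for r in matching])  # [1, 3]
print([r["id"] for r in others])    # [2]
```

Row 3 is kept because "booking" contains "book"; a substring test cannot tell the two apart.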

Best,

Martin

EL75 · December 2020

Martin,
you're definitely right, and I've done it. Thanks for the solution.
A few questions remain.
First, here's what I've done:
1- LEVEL 0 => a "Loop Attributes" operator to loop over the attributes of the dictionary:

Image: https://us.v-cdn.net/6030995/uploads/editor/dt/fh958lbr23ml.png

2- LEVEL -1 => "Loop Values" operator within the "Loop Attributes" operator =>

Image: https://us.v-cdn.net/6030995/uploads/editor/uh/adrteecxdakb.png

3- LEVEL -2 => inside the "Loop Values" operator:

Image: https://us.v-cdn.net/6030995/uploads/editor/mg/ucdetat1j4uk.png

The remaining questions, for which I couldn't find a solution by myself:

A- Inside "Loop Values": I've added a "Generate Attributes" operator after the "Filter Examples" operator so that the output keeps, for each row of the data set, first the names of the dictionary categories and second the number of times the words of the dictionary were matched. The problem is that I couldn't find a way to calculate the value of the generated attribute:

Let’s take an example:

- A dictionary that has 2 columns (attributes) whose headers hold the attribute names: « dic1 » contains 10 words, « dic2 » contains 20 words

- One data set that contains 100 rows (examples) with many columns, including one, « Subj&Body », that contains the verbatim to be analysed by the filter operator.

During the first « Loop Values » pass (dic1), an attribute named « dic1 » is generated, and each of the 100 rows of the data set is evaluated against each of the 10 words of the first dictionary attribute (« dic1 »). If row N°2 of the data set contains 5 of the 10 words, how can I compute, at the end of the « Loop Values » pass for « dic1 », the value « 5 » for the generated « dic1 » attribute?
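
To pin down the value I'm after, here is a small plain-Python sketch (not a RapidMiner process). For each row and each dictionary category it counts how many distinct words of that category occur in the text and stores the count under the category name. The helper `count_matched_words` and the sample words are hypothetical; matching is a case-insensitive substring test.

```python
def count_matched_words(text, words):
    """Number of distinct dictionary words that occur in text (substring match)."""
    text = text.lower()
    distinct = {w.lower() for w in words}
    return sum(1 for w in distinct if w in text)


# Hypothetical dictionary: one entry per dictionary attribute (category).
dictionary = {
    "dic1": ["refund", "invoice", "payment"],
    "dic2": ["flight", "hotel"],
}
rows = [
    {"id": 1, "Subj&Body": "I need a flight"},
    {"id": 2, "Subj&Body": "Refund the payment for my hotel invoice"},
]

# One generated attribute per category, holding the match count for that row.
for row in rows:
    for category, words in dictionary.items():
        row[category] = count_matched_words(row["Subj&Body"], words)

for row in rows:
    print(row["id"], row["dic1"], row["dic2"])
# 1 0 1
# 2 3 1
```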

B - The filter operator stops after the first match: keep in mind that this process aims to identify rows of a data set whose text attribute contains dictionary words, with the final goal of filtering those rows, so everything that could be done in a text-processing operator (such as tokenization) isn't possible here. Therefore, the filter matches every string that contains the dictionary entry, including false positives (e.g. the filter retains « booking » for « book » in the dictionary).

With that in mind, would it nevertheless be possible for the filter operator to match all occurrences in the verbatim within a cell and return the number of matches?
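
As an illustration of the behaviour I mean, here is a sketch in plain Python using the standard `re` module (not a RapidMiner operator). It counts every whole-word occurrence of any dictionary word in a cell, so « booking » is not counted for « book ». The function name `count_occurrences` and the sample sentence are made up for the example.

```python
import re


def count_occurrences(text, words):
    """Count every whole-word occurrence of any dictionary word in text."""
    if not words:
        return 0
    # re.escape protects words that contain regex metacharacters.
    alternatives = "|".join(re.escape(w) for w in words)
    # \b is a word boundary: "book" matches, "booking" does not.
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return len(pattern.findall(text))


text = "Book a room: my last booking failed, so I book again."
print(count_occurrences(text, ["book"]))  # 2
```

Here the two matches are "Book" and "book"; "booking" is skipped because no word boundary follows its "book" prefix.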

C- The "loop in the loop" generates a very big data set as output. For each row of the data set that contains the verbatim, different new attributes are generated, and not all rows have the same generated attributes, depending on which words were matched. As a result, to aggregate the output I've used (LEVEL -1) an "Append (Superset)" operator, which accepts example sets with different attributes.

Then, to remove duplicate rows, I've used a Pivot operator. But I did that in a separate process, because the pivot needs to be tuned with the names of the dictionary attributes selected in the "Loop Attributes" operator (LEVEL 0).

Why? Because, with a big dictionary and a big data set, the following parameters have a huge impact on memory and computing capacity:

a) the number of dictionary attributes that are selected,

b) the option of creating an attribute for each word matched by the filter (as has been done for the dictionary categories),

c) keeping the UNMATCHED values in the final output with a « 0 » value in all generated attributes.

Because of the hardware limitations mentioned above, I've decided to give up b) and c).

But as a) also has a huge impact on performance, the dictionary categories must be selected manually, and the pivot must then be tuned accordingly. This is really time consuming (e.g. if the number of categories differs from one run to another, the pivot doesn't work...).

Here is the question: could there be a way to automate the pivot, taking into account the number of selected attributes and their names? That would allow us to keep the pivot inside the « loop in the loop » process.
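
To show the output shape I'm hoping for, here is a plain-Python sketch (not a RapidMiner process, and only one possible design). It builds the wide table directly: one output row per input row, with one count column per selected dictionary category, so the column names follow the selection automatically and no separate pivot is needed. The function `build_wide_table`, its `selected` and `keep_unmatched` parameters, and the sample data are all hypothetical.

```python
def build_wide_table(rows, text_attr, dictionary, selected=None, keep_unmatched=True):
    """One output row per input row, one count column per selected category."""
    # Column names come from the selection, or from all dictionary headers.
    categories = list(selected) if selected is not None else list(dictionary)
    table = []
    for row in rows:
        text = row[text_attr].lower()
        counts = {
            cat: sum(1 for w in dictionary[cat] if w.lower() in text)
            for cat in categories
        }
        # Unmatched rows carry 0 in every column, or are dropped on request.
        if keep_unmatched or any(counts.values()):
            table.append({**row, **counts})
    return table


dictionary = {
    "dic1": ["refund", "invoice"],
    "dic2": ["flight", "hotel"],
    "dic3": ["login"],
}
rows = [
    {"id": 1, "Subj&Body": "Refund for my hotel invoice"},
    {"id": 2, "Subj&Body": "Hello, any news?"},
]

wide = build_wide_table(rows, "Subj&Body", dictionary, selected=["dic1", "dic2"])
matched_only = build_wide_table(
    rows, "Subj&Body", dictionary, selected=["dic1", "dic2"], keep_unmatched=False
)
print([(r["id"], r["dic1"], r["dic2"]) for r in wide])  # [(1, 2, 1), (2, 0, 0)]
print([r["id"] for r in matched_only])                  # [1]
```

Changing `selected` from one run to another changes the generated columns without any further tuning, which is the automation I'm asking about.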

Thanks in advance for any help!
best,
