TF-IDF and Aspect grouping with Rapid Miner

HeikoeWin786 · June 2020

Dear all,
I am new to RapidMiner and have a few questions; I would really appreciate your kind support.
I have an airline dataset with labelled sentiment data (pos, neg, and neutral).
I divided the dataset with a 75/25 split and performed the text processing (i.e. nominal to text, data to document, preprocess document with tokenization and stopword removal).
Q1: However, in the word output of the preprocess document operator, I found that the neg, pos and neutral columns all have zero values. Is this normal or am I missing something?
Q2: I want to perform aspect categorization, i.e. I have 5 topics as aspect groups (e.g. flight, service, ...). The output of TF-IDF consists of the highest-frequency words, and I want to group those words under the 5 topics. After that, I will perform Naive Bayes classification to get the sentiment for each aspect. Is there an efficient way to do this in RapidMiner?

I am a real beginner in RapidMiner, so I am sorry if these are very basic questions. I do hope for your kind support in helping me learn this.

Thanks and regards,
Heikoe

kayman · June 2020

In the workflow, pre-process --> TF-IDF is what the "process document" operator does, right? The exa output provides the dataset that will be the input for NBC, and the word output from the operator is just for the analysis of TF-IDF, correct? So, for NBC, we need to use the exa output of the "process document" operator.

Yes, but you can also use this word list as a filter for your unseen (new) data. It is not needed in the case of NBC, but some other models require it. It's always good practice.

But is the performance matrix (confusion matrix) meant to test the training model or the test model? Because I see it shows 83% for training but 0% for test.

You will have to ignore these figures, as they were created by a non-optimal workflow and are therefore not useful. I think in the example I provided the x-fold will give you both training and test accuracy, the single one only the test accuracy. The test accuracy is the most important anyway, but it's always good to compare training and test accuracy to see if there are major differences.
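
The "x-fold" (cross validation) mentioned here can be sketched in plain Python. This only illustrates how the folds are formed; it is not RapidMiner's implementation, and the function name `k_fold_indices` and the choice of 10 rows and 5 folds are assumptions made for the example.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train on", train_idx, "test on", test_idx)
```

Each row is used for testing exactly once and for training in the other folds, so every performance figure comes from rows the model did not see while it was trained for that fold.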

kayman · June 2020

You need to explicitly state that your label is a label, otherwise it's just considered a normal plain attribute. For this you need to use the [Set Role] operator, and assign the special label role to your labeled attribute.

Now the wordlist will return multiple columns, one per label.
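
As a rough illustration of what "one column per label" means, here is a plain Python sketch that counts, for each word, in how many documents of each label it appears. The rows are made up, `wordlist_by_label` is a hypothetical helper, and the exact counting RapidMiner performs for these columns is an assumption here.

```python
from collections import Counter, defaultdict

def wordlist_by_label(rows):
    """rows: iterable of (text, label). Returns {word: Counter({label: n_docs})}."""
    table = defaultdict(Counter)
    for text, label in rows:
        # set() so a word is counted once per document
        for word in set(text.lower().split()):
            table[word][label] += 1
    return table

# Made-up example rows, not taken from the airline dataset.
rows = [
    ("flight was late", "negative"),
    ("great flight", "positive"),
    ("late again", "negative"),
]
table = wordlist_by_label(rows)
labels = sorted({label for _, label in rows})
print("word", labels)
for word in sorted(table):
    print(word, [table[word][label] for label in labels])
```

Without a label there is nothing to group by, so there is nothing meaningful to put in the per-label columns.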

HeikoeWin786 · June 2020

@kayman
Dear Kayman, thanks for your answer. Yes, I did set my label column as label using the Set Role operator. I have attached the pre-processing steps I did. However, I find the output comes out as multiple columns, one for each tokenized word. Also, in the TF-IDF result, the word frequencies are there, but the three label columns (negative, positive and neutral) contain only 0 and no other value. Any idea what I could have done wrong? Thanks a lot.

Image: https://us.v-cdn.net/6030995/uploads/editor/vw/q0vmn4zxl39q.png

kayman · June 2020

Could you share your process (File -> export process)? No need to provide data, I just want to take a look at the parameters to see if something was overlooked.

There are a few typical candidates, like the labeling and nominal to text, but these appear covered indeed.

BTW, you can skip a few parts in your process. There is no need to use the multiply operators, as the Store operators also pass the content through, so generate ID -> Store -> next operator. Not a big change, but it reduces the clutter a bit.

You can also use the [process documents from data] operator. This way you do not need to convert the data to documents first, you just feed your data to the example port of the operator. Saves you again a few blocks.

Next, validate your process operator. Ensure you create your vectors, otherwise your example output will be empty, but also ensure you are not pruning too much. When in doubt, try to run without pruning first; it can take longer, but at least you can validate whether data is coming through.

Same goes for the processing parts inside the operator. In case of doubt, try with the minimum first, so just tokenizing on spaces and nothing more. It's going to give a lot of garbage, but at least you know it's doing something. Then you start improving up to the moment nothing gets through anymore, and you have your troublemaker.
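
The effect of pruning too much can be sketched in plain Python. This is only an illustration: the reviews are made up, and `prune_vocabulary`, `min_docs` and `max_docs` are hypothetical names, not RapidMiner parameters.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_docs=1, max_docs=None):
    """Keep words whose document frequency lies within [min_docs, max_docs]."""
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    upper = len(tokenized_docs) if max_docs is None else max_docs
    return sorted(w for w, df in doc_freq.items() if min_docs <= df <= upper)

# Minimal tokenizing: lowercase and split on spaces, nothing more.
docs = [text.lower().split() for text in
        ["flight was late", "great service on the flight", "late and rude service"]]

no_pruning = prune_vocabulary(docs)               # every word survives
moderate = prune_vocabulary(docs, min_docs=2)     # only words in 2+ documents
too_strict = prune_vocabulary(docs, min_docs=4)   # nothing survives

print(len(no_pruning), moderate, too_strict)
```

With only three documents, a threshold of four documents removes every word, which is the "empty example output" situation described above.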

HeikoeWin786 · June 2020

@kayman Sure, I definitely can! I have attached it here. Thanks a lot for your kind explanation. I will simplify the process. But I do believe I am missing something in the preprocessing part, because when I use the output of this preprocessing and apply the NBC classifier, I get 0.00 performance on the test dataset and 87.3% accuracy on the train dataset. Thank you so much for helping me out here.

kayman · June 2020

You are actually doing a TF-IDF on a TF-IDF set. The output of the process documents operator is already vectorised, so there is no point in doing it twice.

The operator has 2 outputs, one is the example set and if you use the vector option this becomes your bag of words with the TF-IDF score for each of them. No further processing is needed, this set is used for training as it is. I would suggest some pruning and further tuning in the operator itself to reduce the vector size but apart from that it should work.

The other one is your word list, and if your labels are set correctly this will split the words by associated label. You used an attribute called confident_score for your label; are you sure this is the right attribute (the one stating positive, negative or neutral)?

I've attached a streamlined version of your process, hope it helps.
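
To make the "bag of words with the TF-IDF score" concrete, here is a minimal Python sketch. It uses the textbook definition (term frequency times log of inverse document frequency) on made-up reviews; RapidMiner's exact formula and any normalization it applies may differ, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

def tokenize(text):
    # Deliberately crude: lowercase and split on whitespace only.
    return text.lower().split()

def tf_idf(documents):
    """Return (vocabulary, vectors) for a list of raw text documents."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        row = []
        for word in vocabulary:
            tf = counts[word] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[word])
            row.append(tf * idf)
        vectors.append(row)
    return vocabulary, vectors

# Made-up reviews, one row per document.
reviews = ["flight was late", "great service on the flight", "late and rude service"]
vocabulary, vectors = tf_idf(reviews)
print(vocabulary)
for row in vectors:
    print([round(value, 3) for value in row])
```

Each row is one document and each column one word, so the result is already the vector set a classifier can train on. Running the same step a second time on this output would not add anything.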

HeikoeWin786 · June 2020

@kayman
Dear Kayman,

Thanks a lot for highlighting what I was missing. I read more about what you advised and now I understand it better.
Just one thing I would like to understand is the labeling of the dataset. The original dataset has a "recommended" option where the user gives "yes/no/empty". I labeled the comment as positive for "yes", negative for "no" and neutral for "empty". However, my dataset is in Excel format and each comment is in an Excel row. Will this affect the pre-processing? I have attached a snapshot of the dataset for your reference.
Could you explain a bit more regarding this?
"the other one is your word list, and if your labels are set correctly this will split the words by associated label. You used an attribute called confident_score for your label, are you sure this is the right attribute (the one stating positive, negative or neutral) ?"

Do I need to put the sentiment score as a number (positive = 3, neutral = 2, negative = 1) or can I keep it as text?

Really, Kayman, I am so thankful for your kind input here

Regards,
Heikoe

Image: https://us.v-cdn.net/6030995/uploads/editor/so/t662qaj3dp6h.png

HeikoeWin786 · June 2020

Just one more thing, @kayman: I am not sure whether I need to check the performance for the training or the testing result. The steps I am doing are as below:

1. Label the dataset (i.e. confident_score = Positive, Negative, and Neutral)

2. Clean the dataset (remove duplicates, trim the whitespace)

3. Split the dataset (75 for training, 25 for testing)

4. Merge the dataset (75 with a value for the label and 25 with an empty value for the label)

5. Pre-process the dataset (tokenization, stopwords, TF-IDF) - this is exactly what you helped me with here

6. Preprocess the dataset for NBC (as shown in figure)

6.1. Filter the pre-processed dataset into train and test (i.e. confident_score column with a label value and confident_score column without a label value)

6.2. Apply model with NBC for the training dataset

6.3. Apply model with NBC for the test dataset

6.4. Join the test dataset result with the original test dataset

6.5. Performance matrix for the training dataset and test dataset

However, step 6.4 returns 0 records for me, and the performance matrix shows 83% for training but 0% for test. I am truly confused about what the mistake is here. Is it the dataset itself, the way I label, or am I missing any key steps?

It is a very nice discussion with you, and I would truly appreciate it if I could learn from you so that I can explore further with confidence.

Thanks, Kayman

kayman · June 2020

Hi Heikoe, it will work like this. I just asked the question because confident_score didn't sound very labelish, but now I see where it came from. You can therefore keep it as it is indeed.

The only important things are therefore:

1. make confident_score special attribute label
2. make customer_review text, as nominals will be treated as metadata and therefore not vectorized by the [process document] operator
3. Tune pruning and filters inside of the process operator to reduce the amount of low impact attributes (keywords)

But that's what you have so should be ok.

HeikoeWin786 · June 2020

@kayman

Thanks very much for your prompt reply. I just want to double-check that I understood correctly.
1. make confident_score special attribute label --> this is configured in "Set Role" by setting the column confident_score as label, correct? I.e. Target role = label.
2. make customer_review text, as nominals will be treated as metadata and therefore not vectorized by the [process document] operator ----> This one I am not sure about: where do I have to change that to nominal? It was set to polynominal when I loaded it. Is there any way I can change that to metadata so that it won't be vectorized?

Image: https://us.v-cdn.net/6030995/uploads/editor/du/syawghazgqlc.png

3. Tune pruning and filters inside of the process operator to reduce the amount of low impact attributes (keywords) ---> Yes, this one I got!

Thanks once again, Kayman!!!

kayman · June 2020

For training you only need your vectorset (so the exa output of your process documents operator) as this contains the TF-IDF data the NBC needs to construct its mathematical magic.

So in its simplest form it is vectorset -> shuffle (not a must, but always smart) -> split between train and validation -> train model (NBC) -> apply to test data -> check performance.

I'd suggest (based on my own experience) using cross validation as well, as this improves the results quite a bit.

Attached is a high-level framework showing both options (untested).
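
The simplest form described above (shuffle, split, train, apply, check performance) can be sketched in plain Python. Note the assumptions: the reviews are made up, and the classifier is a small multinomial Naive Bayes over raw word counts with add-one smoothing, which is a stand-in chosen to keep the example self-contained and not RapidMiner's Naive Bayes operator working on TF-IDF values.

```python
import math
import random
from collections import Counter, defaultdict

def train_nb(rows):
    """rows: list of (tokens, label). Returns a simple multinomial NB model."""
    label_counts = Counter(label for _, label in rows)
    word_counts = defaultdict(Counter)
    for tokens, label in rows:
        word_counts[label].update(tokens)
    vocabulary = {tok for tokens, _ in rows for tok in tokens}
    return label_counts, word_counts, vocabulary

def predict_nb(model, tokens):
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocabulary:
                continue  # words never seen in training carry no information
            # Add-one smoothing avoids log(0) for unseen word/label pairs.
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, rows):
    hits = sum(predict_nb(model, tokens) == label for tokens, label in rows)
    return hits / len(rows)

# Made-up labeled reviews; every row has a label.
data = [
    ("the flight was late", "negative"),
    ("rude staff and late departure", "negative"),
    ("lost my luggage", "negative"),
    ("terrible food", "negative"),
    ("great service", "positive"),
    ("friendly staff and great food", "positive"),
    ("comfortable seats", "positive"),
    ("smooth flight and friendly crew", "positive"),
]
rows = [(text.split(), label) for text, label in data]
random.Random(0).shuffle(rows)          # shuffle with a fixed seed
cut = int(len(rows) * 0.75)             # 75/25 split
train, test = rows[:cut], rows[cut:]

model = train_nb(train)
print("train accuracy:", accuracy(model, train))
print("test accuracy:", accuracy(model, test))
```

Both the train and the test rows keep their labels; the label is only hidden from the model at prediction time and then used to score the prediction.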

kayman · June 2020

customer_review must be changed to text; right now it's nominal, which means it will be happily ignored by the process operator. Only text format is used to generate TF-IDF. So polynominal needs to be changed to text, only for customer_review.

You can also use the nominal to text operator and just select customer_review. I think you already did that before so should be ok already.

HeikoeWin786 · June 2020

@kayman

That's well explained, Kayman. A few questions:
1) For training you only need your vectorset (so the exa output of your process documents operator) as this contains the TF-IDF data the NBC needs to construct its mathematical magic.
- The dataset I merged in pre-processing is 75% training data with the pos/neg/neu value in confident_score and 25% test data with no value in confident_score. Therefore, the processed data is the combination of that 75% and 25%, because I split the data before the text processing. Will this be an issue? Should I just put in the whole labeled dataset without the 25/75 split? In that case, the whole dataset will have label values and I am not sure how NBC will tell the training and test datasets apart. Test datasets must have an empty value in the label column, correct?

Thanks much!!

kayman · June 2020

- Your test data also needs to contain TF-IDF data, as these are the attributes NBC uses to make its predictions.
- While there is a label in your data, RapidMiner will ignore it for the prediction and only use it to compare its prediction (the outcome of NBC) with the provided label to check the accuracy.

So you preprocess 100% of the data first, then split the data (75/25 or 80/20, it doesn't matter that much).

Training starts on the 75%; RapidMiner validates against the label whether the logic is correct, and the model is created.
The test data uses this model, and the label is used to determine whether each prediction was correct; this is shown in the accuracy part.
If you are happy with the scores, you have a good model.

Now you can use this model with new unlabeled data, and trust the predictions. The unlabeled data will always have to use the same pre-processing flow as you did for training, but without labels this time and you do not need to create vectors anymore as you already have a model.

The only important thing therefore is that your text field is 'reshaped' the same way as you did with your training data. So tokenized, stopwords removed etc. You can use the saved wordlist file from your training process as a filter (since the model isn't trained on any new words so they have no value anyway) by adding it to the wordlist input of your process document operator.

kayman · June 2020

Ok, misread some things.

As you have 25% data without label you cannot use that for training, this will be your 'unseen' data.

The remainder (with valid label) will be your training data, and this will need to be split again in train (75~80%) and test (20~25%).

So the workflow is:

Training data -> Preprocess -> TF-IDF -> Train and validate -> save model
Unseen data -> Preprocess -> apply model

Preprocessing should be the same for both.
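
The separation into labeled training data and unlabeled "unseen" data can be sketched in plain Python. The rows are made up, and `split_by_label` and `train_test_split` are hypothetical helper names used only for this illustration.

```python
import random

def split_by_label(rows):
    """Separate rows that carry a label from rows that do not."""
    labeled = [row for row in rows if row["label"]]
    unseen = [row for row in rows if not row["label"]]
    return labeled, unseen

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the labeled rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Made-up rows; the last two have no label and therefore count as unseen data.
rows = [
    {"text": "flight was late", "label": "negative"},
    {"text": "great service", "label": "positive"},
    {"text": "average trip", "label": "neutral"},
    {"text": "lost luggage", "label": "negative"},
    {"text": "friendly crew", "label": "positive"},
    {"text": "seat was fine", "label": ""},
    {"text": "food was cold", "label": None},
]
labeled, unseen = split_by_label(rows)
train, test = train_test_split(labeled)
print(len(train), "train rows,", len(test), "test rows,", len(unseen), "unseen rows")
```

Accuracy can only be computed on rows in `train` and `test`, because those are the only rows with a label to compare the prediction against.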

HeikoeWin786 · June 2020

@kayman

Now this truly explains what I had been missing.
Sure, I will proceed with the advised workflow and tune some of the processing parts.
Just two more questions:

1) In the workflow, pre-process --> TF-IDF is what the "process document" operator does, right? The exa output provides the dataset that will be the input for NBC, and the word output from the operator is just for the analysis of TF-IDF, correct? So, for NBC, we need to use the exa output of the "process document" operator.
2) But is the performance matrix (confusion matrix) meant to test the training model or the test model? Because I see it shows 83% for training but 0% for test.

Thanks very much for all your kind explanation and patience with me here, Kayman!

HeikoeWin786 · June 2020

@kayman
Bravo, Kayman!! Now I have the complete picture of the whole process. I truly appreciate it.
Just one last question: the performance input is showing a connection error. Is it OK if the mod output from apply model 2 is not connected to the per input of performance 2?

Image: https://us.v-cdn.net/6030995/uploads/editor/lx/a6yobou0cfrw.png

HeikoeWin786 · June 2020

Hi @kayman
Thanks a lot. I ran your suggested framework and it works!!
I got an accuracy of 75%, and I plan to optimize this with subjectivity detection.

At least I have an idea now and can look into optimization.
I really appreciate it!
Stay safe and take care

kayman · June 2020

You don't need to connect the mod output indeed, it's just the same data as your mod input (pass through).
In RapidMiner you can (in general) only connect the same types with each other. So if your output port is labeled mod, it can only be attached to an input port labeled mod. Anything else will give you an error message.

There are a few exceptions to this rule, for instance the Store operators accept a lot of formats, but typically input and output formats should be the same if you want to connect operators together.

HeikoeWin786 · June 2020

@kayman

Hello Kayman, I am sorry, but I have one issue and would like to ask for your advice.
I am trying to convert all the examples of an attribute in an Excel file that contains customer reviews so that I get a list of sentences, which I will use as input for the "extract sentiment" operator.
If possible, I want to split the reviews into sentences but keep the sentiment_score as assigned, and generate a sentence ID for each sentence.

I tried my best to create the workflow, but it is throwing an error. I would truly appreciate it if you could advise me on what I am doing wrong here.

Thanks a lot for sharing all your know-how,
Heikoe

HeikoeWin786 · June 2020

@kayman
Hello Kayman, I re-read our conversation and came across one point of confusion.
As you mentioned above: "The only important thing therefore is that your text field is 'reshaped' the same way as you did with your training data. So tokenized, stopwords removed etc. You can use the saved wordlist file from your training process as a filter (since the model isn't trained on any new words so they have no value anyway) by adding it to the wordlist input of your process document operator."

Could you kindly explain what is meant by adding the word file?
After pre-processing I received 2 files, one is the exa file and one is the word file, and then I use the exa file to run my NBC.
However, I believe this exa file contains the TF-IDF vectors as well, and the word file is a list of words from TF-IDF. For NBC on unseen data, do I need to input both the exa file and the wordlist?

thanks and regards,
Heikoe

kayman · June 2020

Your model is trained on data coming from the original dataset, this means that new words in your unseen data have no effect, as they are not in the model.

Therefore the wordlist as originated by your model can be used as a filter for unseen data, as this will reduce the number of attributes the process needs to take into consideration.

For unseen data you need to run the complete (and same) pre-processing flow (so generate a new example set). No need to create a new wordlist here.

HeikoeWin786 · June 2020

@kayman
Hello Kayman,

Thanks for your kind clarification here.
I understood that the unseen data can use the wordlist generated from the training dataset, as they are independent.
However, my confusion is where we use this wordlist and what its purpose is, because I didn't include the wordlist (i.e. the word output from process document) in my NBC training. I only used the exa output from process document as the input for my NBC training.
Pardon me if I am slow to understand here, because this is my first time trying to learn modelling concepts.

Thanks so much for all your kind explanation,
Regards,
Heikoe

kayman · June 2020

When you train you have nothing, just text.

You pre-process this text to create an exampleset, aka vectorset (bag of words) that will be used to create an NBC model.

Your training process also outputs this bag of words as a wordlist. Or in other words, a list with all the words that are relevant for your model. Any word not in this list will not be used by the model.

Your new data (unseen) can contain new words, these were not in the training bag of words and are therefore not used by your model, so they are just redundant by default. Here is where your (saved) training wordlist can be handy, in your prediction flow you can use this as an input (so left side of the operator) so any token in your unseen data that is not in the wordlist will be ignored.

Using the wordlist therefore means you could simplify your prediction preprocess part a bit, because you don't need to use additional filters anymore, the wordlist handles that part. But the tokenizing part needs to be exactly the same as your training setup.
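
The filtering role of the training wordlist can be sketched in plain Python. The texts are made up and `build_wordlist` and `vectorize` are hypothetical helpers; the sketch uses raw counts instead of TF-IDF to stay short, so it only shows which tokens get through, not the scores RapidMiner would compute.

```python
from collections import Counter

def tokenize(text):
    # The same tokenizer must be used for training and for unseen data.
    return text.lower().split()

def build_wordlist(training_texts):
    """Collect every token seen during training, in a fixed (sorted) order."""
    return sorted({tok for text in training_texts for tok in tokenize(text)})

def vectorize(text, wordlist):
    """Count only tokens that are in the training wordlist; ignore the rest."""
    allowed = set(wordlist)
    counts = Counter(tok for tok in tokenize(text) if tok in allowed)
    return [counts[word] for word in wordlist]

# Training phase: the wordlist is built once and saved.
wordlist = build_wordlist(["flight was late", "great service"])

# Prediction phase: new words such as "cancelled" are simply dropped.
unseen_vector = vectorize("the flight was cancelled and late late", wordlist)
print(wordlist)
print(unseen_vector)
```

The unseen text ends up with exactly the same columns, in the same order, as the training data, which is what a saved model expects as input.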

HeikoeWin786 · June 2020

@kayman Hello Kayman, thanks for your explanation. I took some time to try to understand this, but I still cannot visualize how this is done in RM, e.g. how the wordlist from the pre-processed training dataset (i.e. the word output) is used as an input in the design.
Retrieve (unseen data) --> Preprocess --> Apply Model (the model we saved during the training process).

Is that so? And is the wordlist added as an input in the pre-process phase?

Sorry for taking very long to understand this. I couldn't visualize how to design the flow for unseen data.

And one more thing: when I run NBC and cross-validation for different datasets but receive the same performance result, is that typical? I expected different results since they are different models. Please correct me if my understanding is wrong here.

Thanks much,
Heikoe
