Problem Processing Data and Filter Stopwords for LDA

LaraNeu · January 2021

Hi,

I really need this community's help. I have already tried all the solutions suggested to others in community posts about the filter stopwords operator, but nothing has worked so far. I have reviews from which I want to extract topics with LDA. I followed tutorials on how to pre-process the data, filter stopwords, etc., but unfortunately it does not seem to work. Despite transforming cases to lowercase, I still have words with capital letters in my output, and the stopwords I attached in the .txt file are not filtered out. Also, the replace token operator does not seem to work. Because I use the filter Tokens by POS operator (which takes a lot of time), I used a sample of only 100 (which can be enabled at any time). I also tried it without the filter tokens by POS operator and with the whole data set. Unfortunately, it just does not seem to work. I attached all my files and processes. Could you please help me with my process? Thank you so much!

I am not sure if this goes too far for one post, but can someone also tell me how to find the ideal number of topics for LDA?

Thank you, Larissa

jacobcybulski · January 2021

You have a number of issues in your process. If you do want to use Process Documents from Data to pre-process the text before LDA, then you need to keep the text it generates (the "keep text" option). This also means that LDA must then process the attribute "text" and not your original "Review". "Review" is polynominal and is not automatically of type text, so your intuition to use Nominal to Text was correct, but you need to apply it to "Review". Next, you cannot filter the tokens by POS as you have not done any stemming, so no POS tags are present (you would need dictionary stemming to get these). Finally, all your stop words would be eliminated by the default English stop word filter anyway, so do you really need your custom list? Good luck!

LaraNeu · January 2021

Thanks a lot @jacobcybulski! All your tips were awesome and I now understand what the problem was. Would you say POS filtering is necessary? I only have problems with the replace token operator now, as it does not replace the token but seems to add to it, e.g. replacing Harry with HarryPotter now gives HarryHarryPotter. Any tips for that?

kayman · January 2021

It actually does replace, but the replace operator has no word boundaries by default, so it will (rightly) also replace the harry in harrypotter with harrypotter. You therefore get multiple replacements that may look like additions.

One way to avoid this is to order your replacements carefully, or to use regular expressions. Something like \bharry\b will only replace harry when it is a word on its own.
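To make the boundary effect concrete, here is a minimal sketch in plain Python (not the RapidMiner operator itself). The sample sentence is made up for illustration; the \b pattern is the one suggested above.

```python
import re

# Made-up sample text containing both the bare word and the longer word.
text = "harry met harrypotter"

# Plain substring replacement also hits the "harry" inside "harrypotter".
naive = text.replace("harry", "harrypotter")

# \b matches only at a word boundary, so "harry" must stand on its own.
bounded = re.sub(r"\bharry\b", "harrypotter", text)

print(naive)    # harrypotter met harrypotterpotter
print(bounded)  # harrypotter met harrypotter
```

The unbounded version turns the existing harrypotter into harrypotterpotter, which is the kind of result that looks like an addition rather than a replacement.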

For LDA there is no real need for POS filtering. In a traditional NLP flow it makes sense, but the power of LDA is that it 'sees' the relations between words, which reduces the need to normalise to an extreme level. Even filtering is just an option. I'd suggest doing just some basic cleaning to get rid of the most obvious dirt and letting LDA do the rest for you.
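As a rough idea of what "basic cleaning" could look like outside RapidMiner, here is a small Python sketch that lowercases a review, tokenizes it, and drops stop words. The stop word set and the sample review are hypothetical; in the process discussed here the list would come from the attached .txt file or the default English filter.

```python
import re

# Hypothetical stop word list, for illustration only.
STOPWORDS = {"the", "a", "is", "and", "it"}

def clean(review):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    # Lowercasing first means the stop word lookup is case-insensitive.
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in STOPWORDS]

cleaned = clean("The Book is GREAT, and it arrived quickly!")
print(cleaned)  # ['book', 'great', 'arrived', 'quickly']
```

Note the order: if stop words are filtered before the case transformation, capitalized words such as "The" slip past a lowercase list.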

jacobcybulski · January 2021

Topics are formed purely from the statistical properties of the text. So POS would be useful only if you wanted to exclude some parts of speech and focus on others, e.g. to use only nouns and verbs, or nouns and adjectives, etc.
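If you did want to keep only certain parts of speech, the filtering step itself is simple once tokens carry tags. The sketch below assumes tokens have already been tagged by some POS tagger with Penn Treebank style tags (NN* for nouns, VB* for verbs); the tagged list and the helper name keep_by_pos are hypothetical.

```python
# Hypothetical pre-tagged tokens; a real flow would get these from a POS tagger.
tagged = [
    ("book", "NN"),
    ("arrived", "VBD"),
    ("quickly", "RB"),
    ("great", "JJ"),
    ("story", "NN"),
]

# Tag prefixes to keep: nouns and verbs.
KEEP_PREFIXES = ("NN", "VB")

def keep_by_pos(tagged_tokens, prefixes=KEEP_PREFIXES):
    """Return only the tokens whose tag starts with one of the prefixes."""
    return [tok for tok, tag in tagged_tokens if tag.startswith(prefixes)]

kept = keep_by_pos(tagged)
print(kept)  # ['book', 'arrived', 'story']
```

Passing ("NN", "JJ") instead would keep nouns and adjectives.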

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Problem Processing Data and Filter Stopwords for LDA

Best Answer

Answers