"Text Mining- Select Token based on Dictionary File"
Hi everyone,
I'm working on a text mining workflow that should filter content in a specific language, based on a dictionary for that language (most probably a TXT file).
I was able to filter stopwords with the "Filter Stopwords (Dictionary)" operator, which removes the tokens listed in a dictionary. What I still need is the opposite: selecting (keeping) tokens based on a dictionary. The only operator I can find is "Filter Tokens (by Content)", which selects tokens based on a regular expression; it has no option for selecting tokens from a dictionary file.
Does an operator exist that can do this, or am I missing something? Any ideas would be appreciated.
Thank you in advance
Answers
Hi,
I have the same problem. Did you solve it?
Regards
Do you have a practical, high-level example?
What I sometimes do is use a reverse logic flow. If the dictionary can be used to remove the 'good content' (for example, the language you are looking for), then what remains is the 'bad content'. That remainder can then be used as a filter on your original set, so you identify the good tokens by subtracting the outcome of the token process from the original.
Not sure if that's helpful. There may be other and better ways, but more details would help in that case.
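To make the reverse logic a bit more concrete, here is a rough sketch in plain Python rather than RapidMiner operators. It assumes the dictionary is a TXT file with one word per line and that matching is case-insensitive. All function names are made up for illustration and are not part of any RapidMiner API. The last helper builds an alternation pattern that could perhaps be pasted into a regex-based token filter; I have not tested that against RapidMiner itself.

```python
import re
from pathlib import Path


def load_dictionary(path):
    """Read one dictionary entry per line (assumed format), skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def select_tokens(tokens, dictionary):
    """Direct selection: keep only the tokens that appear in the dictionary."""
    return [t for t in tokens if t.lower() in dictionary]


def select_tokens_reverse(tokens, dictionary):
    """Reverse logic: remove dictionary words first (as a stopword filter would),
    then subtract what remains (the 'bad content') from the original tokens."""
    remaining = {t for t in tokens if t.lower() not in dictionary}
    return [t for t in tokens if t not in remaining]


def dictionary_to_regex(dictionary):
    """Build an alternation pattern that matches any single dictionary entry."""
    return "|".join(re.escape(word) for word in sorted(dictionary))


# Small in-memory demo; a real run would call load_dictionary("your_file.txt").
demo_dictionary = {"hello", "world"}
demo_tokens = ["Hello", "bonjour", "world", "monde"]
print(select_tokens(demo_tokens, demo_dictionary))          # ['Hello', 'world']
print(select_tokens_reverse(demo_tokens, demo_dictionary))  # ['Hello', 'world']
print(dictionary_to_regex(demo_dictionary))                 # hello|world
```

Both selection functions return the same tokens. The reverse variant only mirrors the two-step flow described above, which is useful when the tool offers a dictionary-based remove step but no dictionary-based keep step.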