"Text processing results into decision tree?"
Hey guys. After tokenizing some PDF documents, I now want to use the results to induce a decision tree. Any ideas how this can be done? As far as I can see, the decision tree induction operator needs an example set as input. How do I generate this from my results?
Thanks in advance
Answers
Thanks for the reply.
I tried to use the "Decision Tree" operator, which is found under decision tree induction in the Modeling category. Actually, I have no idea how to do that. I am new.
To tokenize the PDFs I used the "Process Documents from Files" operator, and inside it I used the "Tokenize" operator.
You are probably using one of the Process Documents operators. Those operators output an example set, which you can use to induce a decision tree. However, in text classification you usually have a huge number of attributes (one attribute for each word in your corpus), and decision trees perform quite badly on data with that many attributes. You should consider a linear SVM instead.
If you have problems setting up the process, please post the xml of what you have so far as described in my signature.
Best regards,
Marius
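To make the "one attribute for each word" point concrete, here is a minimal sketch in plain Python of how tokenized documents turn into an example set. It is not what the Process Documents operators do internally; `tokenize` and `build_example_set` are hypothetical helper names, and the tokenizer is a simplified stand-in that splits on non-letters.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-letter characters (simplified)."""
    tokens, current = [], []
    for ch in text.lower():
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def build_example_set(documents):
    """Return (vocabulary, rows): one row per document, one column per word."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # Each cell holds how often the word occurs in that document.
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


docs = ["The cat sat on the mat.", "The dog chased the cat."]
vocabulary, rows = build_example_set(docs)
print(len(vocabulary), "attributes for", len(docs), "documents")
for word, column in zip(vocabulary, zip(*rows)):
    print(word, column)
```

Even two short sentences already produce more attributes than documents, which is why the attribute count grows so quickly on a real corpus.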
this is the xml of my process
Now I want to insert a decision tree operator. I have saved the example set that the previous process created, and in a different process I did the following, which is not working. Any ideas?
Without knowing your expectations and your data it's hard to see where your problems occur.
Best regards,
Marius
Two separate ways.
I know I have made two separate processes (one for the tokenization and one for the tree). Maybe this could be done with a single one.
To get these indicator attributes/patterns, usually the Decision Tree is a good choice, however, with so many attributes, it may be of limited use. Anyway, it should work - which error do you get when running the process that creates the tree?
Instead of using the tree, you could also create a linear SVM model for each of your 10 classes that separates that class from all other classes (keyword: "1 vs. all classification"). When inspecting a model you will see a weight associated with each attribute/word. Large absolute values indicate a strong influence of that word: a negative weight pulls towards one class, a positive weight towards the other.
Best regards,
Marius
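The "1 vs. all" idea and the reading of the weights can be sketched in plain Python. Note the assumptions: a simple perceptron is used here as a stand-in for the linear SVM (it also learns one weight per word, but it is a different training algorithm), the four-word data set is invented for illustration, and `train_perceptron`, `one_vs_all` and `top_words` are hypothetical helper names, not RapidMiner operators.

```python
def train_perceptron(rows, targets, epochs=20):
    """Learn one weight per attribute; targets are +1 (this class) or -1 (rest)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            score = bias + sum(w * v for w, v in zip(weights, x))
            if y * score <= 0:  # misclassified (or on the boundary): update
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
    return weights, bias


def one_vs_all(rows, labels):
    """Train one linear model per class, separating it from all other classes."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if label == cls else -1 for label in labels]
        models[cls] = train_perceptron(rows, targets)
    return models


def top_words(vocabulary, weights, k=2):
    """Return the k words with the largest absolute weight."""
    ranked = sorted(zip(vocabulary, weights), key=lambda pair: -abs(pair[1]))
    return ranked[:k]


# Toy word counts (invented): one row per document, one column per word.
vocabulary = ["goal", "match", "election", "vote"]
rows = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 1], [0, 1, 1, 2]]
labels = ["sports", "sports", "politics", "politics"]

models = one_vs_all(rows, labels)
for cls, (weights, bias) in models.items():
    print(cls, top_words(vocabulary, weights))
```

For each class, words with a large positive weight speak for that class and words with a large negative weight speak for "all other classes", which is the inspection described above.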
When trying to use the decision tree with my example set, I get an error saying that the metadata is underspecified. I have no idea why this happens.
I will also try what you suggested again tomorrow. I hope it will work, so I can offer it as an alternative solution.
Does the same apply to association rules?
Thank you again!
Happy Mining!
~Marius
Just ignore it and try to hit the big blue Run button.
If an error occurs during actual execution, please let us know and we'll try to give you further assistance.
Best regards,
Marius
Edit: I solved this problem by using the operator multiple times. If there is a more efficient way, please let me know.
Another question: I want to export the results (the word list, actually) to XLS format. Is that possible? I am searching for such an option but cannot find it.
You can experiment with the prune parameters of the Process Documents operator to remove words that appear too often or too seldom.
Best regards,
Marius
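As a rough illustration of what pruning by frequency means, the sketch below keeps only words whose document frequency lies between two thresholds. It is plain Python, not the operator's implementation; `prune_vocabulary`, `min_fraction` and `max_fraction` are hypothetical names chosen for this example, and the documents are invented.

```python
def prune_vocabulary(tokenized_docs, min_fraction=0.0, max_fraction=1.0):
    """Keep words that occur in a share of documents within the given bounds."""
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        return []
    doc_freq = {}
    for tokens in tokenized_docs:
        for word in set(tokens):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(
        word
        for word, df in doc_freq.items()
        if min_fraction <= df / n_docs <= max_fraction
    )


tokenized_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

kept = prune_vocabulary(tokenized_docs, min_fraction=0.5, max_fraction=0.75)
print(kept)
```

Here "the" is dropped because it appears in every document, and words that appear in only one document are dropped as too rare, which shrinks the attribute set before any model is trained.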
Does that help you?
Best regards,
Marius