"text processing results into decision tree?"

margkw · December 2012

Hey guys. After tokenizing some PDF documents, I now want to use the results to induce a decision tree. Any ideas how this can be done? As far as I can see, the decision tree operator needs an example set as input. How do I generate this from my results?
Thanks in advance.

kasper2304 · December 2012

Can you give a short description of which nodes you used?

margkw · December 2012

Hi!
Thanks for the reply.

I tried to use the "Decision Tree" operator, which is found under Modeling in the decision tree induction group. Actually, I have no idea how to do that. I am new.

To tokenize the PDFs I used the "Process Documents from Files" operator, and inside it I used the "Tokenize" operator.

MariusHelf · December 2012

Hi,

you are probably using one of the Process Documents operators. Those operators output an example set, which you can use to induce a decision tree. However, in text classification you usually have a huge number of attributes (one attribute for each word in your corpus). Decision trees, on the other hand, perform quite badly on data with many attributes. You should consider a linear SVM instead.

If you have problems setting up the process, please post the XML of what you have so far, as described in my signature.

Best regards,
Marius

margkw · December 2012

Thank you, Marius. I will try that.

margkw · January 2013

Hi Marius!

This is the XML of my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="251" width="346">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="BPM2000" value="C:\Users\s102738\Desktop\BPM-text-analysis\BPM 2000"/>
          <parameter key="BPM2003" value="C:\Users\s102738\Desktop\BPM-text-analysis\BPM 2003 Eindhoven"/>
          <parameter key="BPM2004" value="C:\Users\s102738\Desktop\BPM-text-analysis\BPM 2004 Potsdam"/>
          <parameter key="BPM2005" value="C:\Users\s102738\Desktop\BPM-text-analysis\BPM 2005 Nancy"/>
          <parameter key="BPM2006" value="C:\Users\s102738\Desktop\BPM-text-analysis\BPM 2006 Vienna"/>
          <parameter key="BPM2007" value="C:\Users\s102738\Desktop\BPM-text-analysis\BPM 2007 Brisbane"/>
          <parameter key="BPM2008" value="C:\Users\s102738\Desktop\BPM-text-analysis\BPM 2008 Milan"/>
          <parameter key="BPM2009" value="C:\Users\s102738\Desktop\BPM-text-analysis\BPM 2009 Ulm"/>
          <parameter key="BPM2010" value="C:\Users\s102738\Desktop\BPM-text-analysis\BPM 2010 Hoboken NY"/>
          <parameter key="BPM2011" value="C:\Users\s102738\Desktop\BPM-text-analysis\BPM 2011 Clermont Ferrand"/>
        </list>
        <process expanded="true" height="618" width="710">
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="86" y="177"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="165"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Now I want to insert a Decision Tree operator. I saved the example set that the previous process created, and in a different process I did the following, which is not working:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="618" width="710">
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="71" y="281">
        <parameter key="repository_entry" value="//NewLocalRepository/BPM-text-analysis/EXAMPLESETFULL"/>
      </operator>
      <operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" height="76" name="Decision Tree" width="90" x="313" y="165"/>
      <connect from_op="Retrieve" from_port="output" to_op="Decision Tree" to_port="training set"/>
      <connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
      <connect from_op="Decision Tree" from_port="exampleSet" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Any ideas?

MariusHelf · January 2013

As stated above, decision trees are far from the optimal choice for text classification. However, please describe in a bit more detail what exactly is not working, what you are expecting, and what actually happens.
Without knowing your expectations and your data it's hard to see where your problems occur.

Best regards,
Marius

margkw · January 2013

My data are ten folders of PDF files. With the first process I tokenize them and also apply stopword filtering. After that, I saved the example set that was created and tried to build a decision tree (in the second process) that would help me see some kind of pattern in those documents. For example, if we see the words "process", "network" and "on line", that would lead us to the 6th folder. I was asked to do this in two separate ways: with a decision tree and with association rules.

I know I have made two separate processes (one for the tokenization and one for the tree). Maybe this could be done with a single one.

MariusHelf · January 2013

Whether you do it with one or with two processes does not matter. But it is generally a good idea to separate preprocessing and model creation, so you are fine with using two processes.
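For reference, a single-process variant could look roughly like the fragment below. This is only a sketch assembled from the two processes posted above: it reuses their operator classes and port names, leaves out the layout attributes, and assumes the directory list and the Tokenize / Filter Stopwords subprocess stay exactly as in the first process.

```xml
<!-- Sketch: these elements go inside the top-level <process> element. -->
<operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" name="Process Documents from Files">
  <!-- text_directories list and inner Tokenize / Filter Stopwords process as in the first process -->
</operator>
<operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" name="Decision Tree"/>
<!-- Feed the example set directly into the learner instead of storing and retrieving it -->
<connect from_op="Process Documents from Files" from_port="example set" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
```

The drawback of this layout is that all PDFs are re-read every time the tree parameters change, which is the practical reason for keeping preprocessing in its own process.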

To get these indicator attributes/patterns, the Decision Tree is usually a good choice; however, with so many attributes it may be of limited use. Anyway, it should work. Which error do you get when running the process that creates the tree?

Instead of using the tree, you could also create a linear SVM model for each of your 10 classes that separates that class from all other classes (keyword "1 vs. all classification"). When inspecting the model you will see a weight associated with each attribute/word. Large absolute values there indicate a strong influence of that word: a negative weight points to one class, a positive weight to the other.

Best regards,
Marius

margkw · January 2013

First of all thank you Marius for all the great help.

When trying to use the decision tree with my example set, I get an error that says that the metadata is underspecified.

No idea why this happens.

I will also try what you suggested again tomorrow. I hope it will work, so I can present it as an alternative solution.
Does it work the same way for the association rules?

Thank you again!

MariusHelf · January 2013

Just hit the Run button, and your process will run. The Problems view at the bottom only lists *possible* problems; sometimes it is too pessimistic and the process runs fine nevertheless.

Happy Mining!

~Marius

margkw · February 2013

margkw wrote:

First of all thank you Marius for all the great help.

When trying to use the decision tree with my example set, I get an error that says that the metadata is underspecified. No idea why this happens.

Thank you again!

It also says "cannot check precondition" (?)

MariusHelf · February 2013

That only means that the decision tree does not know what kind of data it will receive until you actually execute the process. That is because the text processing has to read all the documents to know which words will be part of the text body, and that is done only when executing the process. Until then, the so-called meta-data is unknown, resulting in the quoted error.

Just ignore it and try to hit the big blue Run button.
If an error occurs during actual execution, please let us know and we'll try to give you further assistance.

Best regards,
Marius

margkw · February 2013

You were totally right about the decision tree! It worked, thank you. Another text mining question now: while I am tokenizing a file, which is the best filter to use to remove certain words that occur too often? I am already using the "Filter Stopwords" operator, but I need to remove more. If I use the filter by content operator, can I remove multiple words?

Edit: I solved this problem by using the operator multiple times. If there is a more efficient way, please let me know.
Another question: I want to export the results (the word list, actually) to XLS format. Is that possible? I am searching for such an option but I cannot find it.

MariusHelf · February 2013

Hi,

you can experiment with the prune parameters of the Process Documents operator to remove words that appear too often/too seldom.
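As a rough sketch, the pruning settings would sit on the Process Documents from Files operator like this. The parameter keys and the threshold values shown are assumptions for illustration; please check the exact names and sensible values in the Parameters panel of your version.

```xml
<operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" name="Process Documents from Files">
  <!-- Assumed keys: drop words that occur in fewer than 3% or more than 30% of the documents -->
  <parameter key="prune_method" value="percentual"/>
  <parameter key="prune_below_percent" value="3.0"/>
  <parameter key="prune_above_percent" value="30.0"/>
  <!-- text_directories list and inner process unchanged -->
</operator>
```

Lowering the upper threshold removes more of the very frequent words, which should replace most of the repeated filter operators.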

Best regards,
Marius

margkw · April 2013

Is there a way to export the results to Excel format?

MariusHelf · April 2013

You can write an example set, i.e. a data table, to an Excel file with the Write Excel operator.
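Since the word list itself is not an example set, it probably has to be converted first. The fragment below is only a sketch under assumptions: it presumes the Text Processing extension offers a WordList to Data operator with the class key and port names shown, that the `excel_file` key is correct for Write Excel, and the file path is a placeholder.

```xml
<!-- Sketch: add inside the top-level <process> of the tokenization process. -->
<!-- Assumed operator: converts the word list into an example set -->
<operator activated="true" class="text:wordlist_to_data" compatibility="5.2.004" expanded="true" name="WordList to Data"/>
<operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" name="Write Excel">
  <!-- Placeholder path, adjust to your own location -->
  <parameter key="excel_file" value="C:\path\to\wordlist.xls"/>
</operator>
<connect from_op="Process Documents from Files" from_port="word list" to_op="WordList to Data" to_port="word list"/>
<connect from_op="WordList to Data" from_port="example set" to_op="Write Excel" to_port="input"/>
```

In the original process the word list output is wired to "result 2", so that connection would have to be removed or rerouted when adding these operators.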

Does that help you?

Best regards,
Marius
