Testing/Training data separation

nawafpower · April 2011

Hi All,
Does anybody know how RapidMiner separates the data fed to it into training and testing sets? I fed my data in as folders and assigned a label to each folder path. How is the data inside the folders split into training and testing? Any help is appreciated.

homburg · April 2011

Hi nawafpower.

There are several ways to use test and training data. First, if your data is already split into two different resources, you can create two repository entries for them and connect the corresponding Retrieve operators to a ModelApplier operator. Second, and this is the most common case, let RapidMiner do the work for you: just use an operator like XValidation. This operator automatically splits your data into the subsets needed for cross-validation. In the Repositories view you will find a folder called Samples, which contains some sample experiments, including processes that use XValidation for performance measurement. Simply open one of them and double-click on the XValidation operator to see how things are connected internally.
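
To make the splitting idea concrete, here is a rough sketch in Python (not RapidMiner code) of how a k-fold cross-validation partitions example indices. The function name k_fold_indices, the fold count and the seed are illustrative assumptions; RapidMiner's actual implementation may differ in detail.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the i-th one is used for training.
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, test_idx


# 10 examples, 5 folds: each example lands in the test set exactly once.
splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Each run trains on k-1 folds and tests on the remaining one, so the reported performance is an average over all folds rather than the result of one particular split.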

Greetings,
Helge

nawafpower · April 2011

Thanks for the reply. I found the X-Validation split and used it, but I have another issue now. When the model classifies the text files, it reports the outcome as a confusion matrix with precision and accuracy for each class. What I need to find out is: if a file was misclassified as another class, how can I know which file it was? The result only shows the number of files correctly classified and, for misclassifications, which classes were confused and how many files were affected. If there is any way to do this I will be grateful. Also, after I have trained my model, how can I test a single file to see which class it fits?
I appreciate any help.

homburg · April 2011

XValidation does several runs over different sample partitions of the data. Therefore it only provides some statistical data. Nevertheless, this is quite useful for comparing the performance and reliability of different learners and parameter settings.
Of course you can keep things a little simpler. Use the Split Data operator to split your data into a training and a test partition, connect the training data output to a learner operator and feed the test data into an Apply Model operator. Finally, connect the model output of your learner to the applier, and the applier's output for labeled data to one of the main result (res) ports of your process. You should now receive a data view of your test partition which is enriched by a column called prediction, showing you how your model classified each example. You may add a Performance operator to the last connection to see the confusion matrix for this particular run. If you change the partition or sampling type parameters of the Split Data operator you will get different learning results. This is why XValidation, with its cross-validation ability, is often used to get more reliable performance values.
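
As an illustration only, the following Python sketch (not RapidMiner code) shows what a stratified hold-out split does: every class is divided with the same ratio, so both partitions keep the class balance. The function name stratified_split, the 0.7 ratio, the seed and the file names are assumptions made up for this example.

```python
import random
from collections import defaultdict


def stratified_split(examples, ratio=0.7, seed=0):
    """Split (name, label) pairs so each label keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)  # a different seed gives a different partition
        cut = round(len(group) * ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical data: 10 files for class A and 10 files for class B.
examples = [(f"A_{i}.txt", "A") for i in range(10)]
examples += [(f"B_{i}.txt", "B") for i in range(10)]
train, test = stratified_split(examples)
print(len(train), "training examples,", len(test), "test examples")
```

Changing the seed changes which files end up in the test partition, which is the reason a single split gives less stable performance values than a cross-validation.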

Greetings,
Helge

nawafpower · April 2011

Hi Helge,
I did what you said and it works fine, but my problem remains: how can I know which specific file was misclassified? Say I have 30 files in folder A, which are supposed to come out as class A. The model splits the files and shows me the confusion matrix for the ratio I selected, but what about the test files? Where are they, and how can I check their classification? Is there a way to show a detailed list of each file with its class? I know that is a lot of questions, and anyone who knows an answer is welcome to reply here with my full appreciation.

homburg · April 2011

As far as I understand your problem, you want to look at the classification of each individual example / file. To do this you can do what I mentioned in my last post. Connect the output of the model applier directly to the process main port or, in case you used a performance measure operator, connect the performance output (per) and the example output (exa) to the main ports. This should ensure that you receive not only the confusion matrix but also a data view of your test partition showing the label and prediction columns. For every row the prediction should match its label; otherwise the model made an error on that particular example.
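
The row-by-row check can be sketched like this in Python (not RapidMiner code). It assumes each row of the result carries some identifier of the source file next to the label and prediction columns; the column name "file" and the file names are hypothetical.

```python
# Hypothetical result table of the test partition after applying the model.
rows = [
    {"file": "A/doc01.txt", "label": "A", "prediction": "A"},
    {"file": "A/doc02.txt", "label": "A", "prediction": "A"},
    {"file": "B/doc01.txt", "label": "B", "prediction": "B"},
    {"file": "B/doc02.txt", "label": "B", "prediction": "A"},
]

# A row is misclassified when prediction and label disagree.
misclassified = [row for row in rows if row["label"] != row["prediction"]]
accuracy = (len(rows) - len(misclassified)) / len(rows)

for row in misclassified:
    print(row["file"], "is labeled", row["label"],
          "but was predicted as", row["prediction"])
print("accuracy:", accuracy)
```

The confusion matrix only aggregates these comparisons into counts, whereas the data view keeps one row per example.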

Greetings,
Helge

nawafpower · April 2011

Hi Helge,
Is it possible for RapidMiner to enter an infinite loop? I did what you told me about connecting the second output of the performance operator (exa) to the (res) output. I added a Split Data operator to split my data into a training part, which goes to the classifier, and a testing part, which goes to the (unl) port of Apply Model. So far RapidMiner has been running for 2 days, 2 hours and 30 minutes (50 hours and 30 minutes), so I am worried. Should I wait? For how long? Or should I just terminate? Please help. I appreciate all your help so far.

nawafpower · April 2011

My model is still running; 55 minutes from now the total running time will be 4 DAYS. Should I terminate? I think I should, it is too much time. Even if the imaginary outcomes, which may never come, were perfect, the performance of the model is still terrible.

land · April 2011

Hi,
I can't say much about this. Runtimes of days, weeks or even months can frequently occur if you are training a computationally complex model, say neural nets or SVMs, on huge data sets.
So, without seeing the process that shows me what you did, and without the data specs, I have no clue whether it's a good idea to wait or not.
In general it's a good idea to take a look at the status bar, where each operator is shown with its number of executions and its runtime. So is there just one single operator that has been running all four days? Or is it the 1000000th execution of the same operator?

Greetings,
Sebastian

nawafpower · April 2011

Here is the issue. If I feed my data set (339 text files, 4 KB each) to the Split Validation that contains the classifier, Apply Model and Performance, the results come out in less than 5 minutes with an accuracy of 97%. BUT my problem was that I don't know which specific file was misclassified. After I got the reply from Helge I changed the process to connect the second output of Performance (exa) to the output (res). That forced me to delete the Split Validation and use Split Data after the data processing step, giving one output to the classifier and the second output to the (unl) input of Apply Model. With this specific setting I got 19% accuracy using SVM and 90% using Naive Bayes.
The 4 days of processing, which I eventually canceled, happened with a different setting: I manually set up separate folders for training and for testing, fed these folders to two Process Documents from Files blocks, and removed the Split Data.
I wonder why the SVM did so badly in this setting. Any feedback is appreciated.

nawafpower · April 2011

And, for the model setting described in my previous post, the data is split into training and testing with the predefined ratio. NOW my BIG requirement is: if I need to test a SPECIFIC text file to see where it fits, how can I do that? What I have tried so far is to put this file in one of the data folders, but the first run didn't show the file, most probably because it ended up among the training files, since the output shows only the test part of the data.

homburg · April 2011

Here is a sample process for a simple classification task. Data is retrieved and split 0.7/0.3 using stratified sampling. The first partition is used to train a decision tree model and the second partition serves as the test set for this model. Finally the model, its validation and the test set including prediction values are returned.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Root"> 
    <process expanded="true" height="584" width="815">
      <operator activated="true" class="retrieve" compatibility="5.1.006" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="../../data/Sonar"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="5.1.006" expanded="true" height="94" name="Split Data" width="90" x="246" y="30">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
        <parameter key="sampling_type" value="stratified sampling"/>
      </operator>
      <operator activated="true" class="decision_tree" compatibility="5.1.006" expanded="true" height="76" name="DecisionTree" width="90" x="447" y="30"/>
      <operator activated="true" class="apply_model" compatibility="5.1.006" expanded="true" height="76" name="Apply Model" width="90" x="581" y="165">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="5.1.006" expanded="true" height="76" name="Performance" width="90" x="648" y="30">
        <list key="class_weights"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="DecisionTree" to_port="training set"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="DecisionTree" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 3"/>
      <connect from_op="Performance" from_port="performance" to_port="result 1"/>
      <connect from_op="Performance" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

In case you want to classify another data set, use the model applier to compute more predictions. So that I can give you better advice, you may post your process here and / or provide some more information about your data.

Regards,
Helge

nawafpower · May 2011

Your process is exactly the same as the one I have. The only difference is that you have the Apply Model (mod) output connected to the (res) output. What does this do?
And there is still my issue: if I need to test a specific extra file that has not been in the data set before, how can I test that file to see where it fits? This is my BIGGEST issue now.
Thanks

homburg · May 2011

When you connect the mod output to the final res connector the model is delivered as a process result. In the sample process a decision tree model is shown in the result tab.
But now to your last question: what do you mean by testing an extra file? Does this file contain another data set? If this data set has the same structure, you can use the model you have already trained. There are operators that allow you to write a model to disk and read it back (WriteModel & ReadModel) and others to store it in your repository (Store & Retrieve). Simply build a process which loads the data, retrieves the model and applies the latter to the former. Connect the lab output of the applier and you should receive your prediction values.
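
The train once, store, load and apply workflow can be sketched in Python as follows (not RapidMiner code). The "model" here is a deliberately trivial stand-in that always predicts one label, and the function apply_model, the file name model.pkl and the document names are made up for the example.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: it always predicts the same label.
model = {"type": "majority_class", "predicted_label": "A"}


def apply_model(trained_model, examples):
    """Return one prediction per example using the stored model."""
    return [trained_model["predicted_label"] for _ in examples]


# First process: write the trained model to disk.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as handle:
    pickle.dump(model, handle)

# Second process: read the model back and apply it to new data,
# without training again.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

new_examples = ["unseen_doc_1.txt", "unseen_doc_2.txt"]
predictions = apply_model(restored, new_examples)
print(list(zip(new_examples, predictions)))
```

The important point is that the second step never sees the training data: it only needs the stored model and new data with the same structure.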

Cheers,
Helge

nawafpower · May 2011

Hi Helge,
Thanks for your reply, but I didn't quite follow it. I have 10 folders containing texts by different authors; if the model is correct, each folder should be classified as its author. I read these folders and, after preprocessing, supply the output to Split Data so I get a training and a testing portion. I send the training output of Split Data to the classifier and the other output, as testing data, to the (unl) input of Apply Model, and the output of the classifier to the (mod) input of Apply Model. Then I send the (lab) output of Apply Model to Performance and the (mod) output of Apply Model to the output (res).
Now I have added a Write Model to this setup: the (mod) output goes to the (inp) of Write Model, and the (thr) output of Write Model goes to the (mod) input of Apply Model. This writes the model to a location I can choose.
Now in detail: what did you mean about saving my data to the repository, and what do I do if I need to test a text file that is not in my data set, or more than one text file, to see who the author of this/these text(s) is? I already tried Read Model with one input text and Apply Model, and it has now been running for 5 days. So what did I miss? Does Read Model get the model trained on my data set? Please help.
Regards and thanks for your time.

nawafpower · May 2011

Sorry my previous reply was for homburg.

nawafpower · May 2011

Where is everybody? Is there some kind of vacation for the forum? No replies for a long time?

land · May 2011

Hi,
yes, in fact I have been on vacation. And after coming back, an overwhelming number of open threads welcomed me.

Following my heuristic that a thread with 17 replies should already be solved, I didn't take a closer look here in my first sweep.

So, could you please summarize the problem and the current state? It would help me a lot if I didn't have to read all the previous posts...

Greetings,
Sebastian

homburg · May 2011

Hi again.

Your last post made things a bit clearer. You want to classify text data by predicting the author who might have written it. Such an approach needs text mining techniques to work with unstructured data. Please make sure that you have installed the Text Processing Extension (via Help -> Update RapidMiner). Unfortunately the analysis of unstructured data like texts is a more complex task. Therefore some preliminary steps are needed before you can start with things like learning models or validating results. It may be a good idea to take a closer look at some useful introduction videos dealing with this topic. A video which shows how to classify texts dealing with different topics can be found here:
http://rapidminerresources.com/index.php?page=text-mining-3
In addition to that Neil McGuigan produced a great series of five videos dealing with RapidMiner and Text-Mining which are available via his blog:
http://vancouverdata.blogspot.com/2010_11_01_archive.html

Greetings,
Helge

el_chief · May 2011

thanks Helge!

nawafpower · May 2011

Thanks, Helge, for the links. I have seen these videos before and they were a great help in understanding how to deal with text mining, but they still don't solve my problem. And Sebastian, thanks for your willingness to help. To make it as simple as possible: I have 10 folders of text files, one folder per author, so I have 10 authors with at least 10 text files each. I used Process Documents from Files and did the preprocessing (unify case, stop words, etc.) inside it. After that I send the output to Split Data, which feeds the classifier and the (unl) input of Apply Model, with Performance as the final stage. With this setup the model accuracy is around 93%, but testing files that were never part of this process is hard. I made the process write the model as a first step and then used another setup to read the model and apply it to different data. Sometimes it works with very low accuracy, around 18%; sometimes it runs for days with no outcome. This is my issue: I need to train the model to identify the author of some text files and then be able to identify different files by the same authors with the same model, without training again. Can you help me, Sebastian, Helge (homburg), or Neil? Anybody is welcome to help.

el_chief · May 2011

you have to save your wordlist that you trained on

then, when you are predicting authorship, you use the old wordlist instead of training a new one
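
A small Python sketch (not RapidMiner code) of why this matters: the wordlist fixes which word belongs to which column of the document vector, and the model was trained on exactly those columns. The helper names build_wordlist and vectorize and the sample texts are made up for illustration.

```python
def build_wordlist(documents):
    """Map every distinct word to a fixed column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def vectorize(document, wordlist):
    """Count word occurrences using the columns defined by the wordlist."""
    vector = [0] * len(wordlist)
    for word in document.lower().split():
        index = wordlist.get(word)
        if index is not None:  # words unknown to the wordlist are ignored
            vector[index] += 1
    return vector


training_docs = ["the whale swims", "the raven speaks"]
new_doc = "a raven swims"

# Correct: reuse the wordlist from training, so columns line up with the model.
train_wordlist = build_wordlist(training_docs)
reused = vectorize(new_doc, train_wordlist)

# Wrong: a fresh wordlist built on the new text has different columns.
fresh_wordlist = build_wordlist([new_doc])
rebuilt = vectorize(new_doc, fresh_wordlist)

print("columns from training:", sorted(train_wordlist, key=train_wordlist.get))
print("vector with reused wordlist:", reused)
print("vector with fresh wordlist: ", rebuilt)
```

With a fresh wordlist the vector has a different length and a different meaning per column, so a model trained on the old columns cannot interpret it sensibly.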

nawafpower · May 2011

What I am trying right now is to write the model in a first step and then open another process that reads the previous model with Read Model and takes text files as input through Process Documents from Files. The input is one folder containing files by different authors that the model has never seen, to test whether the model works well. I just tried it: the first stage gave me an accuracy of 99.1%, while the second stage identified only around 6 out of 37 files correctly, which is really bad, and I don't know why.
Your suggestion is to save the word list. How, and at what stage? Do I have to insert a block like Write Model? And when I use this word list in the next stage, is that without reading the model, or what? Sorry for my many questions.

nawafpower · June 2011

Hello Guys,
Any update regarding my last post? I haven't heard a response from Neil or anyone. Is there anything in my post that needs to be clarified? Let me know; I appreciate any help.

roya67 · November 2012

nawafpower wrote:

Hello Guys,
Any update regarding my last post? I haven't heard a response from Neil or anyone. Is there anything in my post that needs to be clarified? Let me know; I appreciate any help.

Hi, I read this whole topic to see if I could find an answer to my problem, but I did not, so I have to describe it here. I hope you can provide me with an answer.

First, about nawafpower's problem: I think you can find your answer in http://www.youtube.com/watch?v=9I0BcMuhPe8

I'm also doing text classification. I have stored a model and a wordlist, and I used them to classify new documents. For example, I have 1000 test documents. I can see what the model predicts for them, but I cannot save the predictions to a file to see what is predicted for a certain document.
I want to save a file in which one entry is the name of the document and another entry is the predicted label.
I really appreciate your time.
Please help me.
