[SOLVED] Optimizing Parameters over Several ExampleSets

daniel Member Posts: 12 Contributor II
edited November 2018 in Help
Hello everyone,

I am stuck with a problem, and I hope I can get some ideas here on how to handle this in RapidMiner.

I have 100 examples of binary classification tasks that I want to evaluate some algorithms on. I figured out how to build a process that performs validation and measures the performance.

Now I would like to find the best parameters for some of the algorithms, for example the SVM. I saw that there is the Optimize Parameters (Grid) operator, which can help me figure out the best parameters for a learner on a given input.

However, I don't see how it can help me optimize the parameters across my 100 examples instead of just one. I can feed several ExampleSets in through the input ports and run multiple validations inside, but the operator only offers one output port for performance. I could also loop over every ExampleSet, but I haven't found a way to combine multiple performance vectors.

Can anyone kindly point me in the right direction?

Thanks in advance,

Daniel

Answers

  • wessel Member Posts: 537 Maven
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="391" width="435">
          <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <operator activated="true" class="optimize_parameters_grid" compatibility="5.2.008" expanded="true" height="94" name="Optimize Parameters (Grid)" width="90" x="180" y="30">
            <list key="parameters">
              <parameter key="SVM.max_iterations" value="[1;10000;100;linear]"/>
            </list>
            <parameter key="parallelize_optimization_process" value="true"/>
            <process expanded="true" height="391" width="165">
              <operator activated="true" class="x_validation" compatibility="5.2.008" expanded="true" height="112" name="Validation" width="90" x="45" y="30">
                <process expanded="true" height="391" width="303">
                  <operator activated="true" class="support_vector_machine" compatibility="5.2.008" expanded="true" height="112" name="SVM" width="90" x="106" y="30">
                    <parameter key="convergence_epsilon" value="1.0E-4"/>
                    <parameter key="max_iterations" value="10000"/>
                  </operator>
                  <connect from_port="training" to_op="SVM" to_port="training set"/>
                  <connect from_op="SVM" from_port="model" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true" height="391" width="303">
                  <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                    <list key="application_parameters"/>
                  </operator>
                  <operator activated="true" class="performance" compatibility="5.2.008" expanded="true" height="76" name="Performance" width="90" x="174" y="30"/>
                  <connect from_port="model" to_op="Apply Model" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                  <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="log" compatibility="5.2.008" expanded="true" height="76" name="Log" width="90" x="287" y="63">
                <list key="log">
                  <parameter key="p" value="operator.Validation.value.performance"/>
                  <parameter key="d" value="operator.Validation.value.deviation"/>
                  <parameter key="m" value="operator.SVM.parameter.max_iterations"/>
                </list>
              </operator>
              <connect from_port="input 1" to_op="Validation" to_port="training"/>
              <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="performance"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log_to_data" compatibility="5.2.008" expanded="true" height="94" name="Log to Data" width="90" x="315" y="30"/>
          <connect from_op="Retrieve" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
          <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_op="Log to Data" to_port="through 1"/>
          <connect from_op="Log to Data" from_port="exampleSet" to_port="result 1"/>
          <connect from_op="Log to Data" from_port="through 1" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
  • daniel Member Posts: 12 Contributor II
    Hello Wessel,

    Thanks so much for posting the process!!! It's basically how my process looks now, except that I load data from a different source than the repository.

    I have checked the repository entry you used, and it looks to me like it contains only one example set, consisting of 208 example rows.

    I was probably misleading when I talked about 100 examples. Basically, those 100 examples are 100 files, each of which contains a list of prelabeled examples I want to evaluate the algorithms on. So one file holds one ExampleSet with several rows, and I have 100 of these.

    Or am I missing the point? Can I give a collection of example sets to the optimizer? The reason is that the optimal parameter I get for one file might not perform so well across the whole collection. I want to find the value that is optimal across all 100 example sets. Is that possible?

    Thanks again for clearing up my confusion,

    Daniel
  • wessel Member Posts: 537 Maven
    I'm not sure I understand.

    First you build your setup without any optimization.
    So if you need performance over multiple data sets, you build exactly that.
    In the end you produce a single number; that is your measure of performance.

    Then you copy this entire setup into the "Optimize Parameters" operator.
    Only if it is too slow do you optimize further, for example by moving the data loading to the outside.

    Best regards,

    Wessel
  • daniel Member Posts: 12 Contributor II
    Thanks for your reply, I really appreciate your help.

    That is exactly where my problem is. In my current setup I run only one example set, because I don't know how to combine the performance vectors of multiple learning and testing iterations.

    How do I measure the performance across multiple data sets? I can create an X-Validation for each of my 100 subproblems and this will return 100 performance vectors. But how do I boil them down into one Performance Vector that I can pass on to the Optimization Process?

    I have tried working with the loop operators to loop over several example sets, but when I output the performance vector that the loop produces, it is not a collection or a combined performance vector; it is only the performance vector of the last X-Validation that was run.

    Kind regards,

    Daniel
  • wessel Member Posts: 537 Maven
    If you don't know how, it probably doesn't make sense to do it.
    I can give you literature on how to do it, but probably you don't want to do it in the first place.
    My guess is that what you want to do is simply rerun your experiments D times, where D is the number of data sets.
    Then you see how the optimal parameters vary per data set.
  • daniel Member Posts: 12 Contributor II
    Hello Wessel,

    Well, I really do need to find the optimal parameters for the algorithm over those 100 examples. I am sure what you said makes perfect sense, and I would really appreciate it if you could give me a hint about how to achieve this in RapidMiner.

    I don't want to rerun the experiments for each data set, because then I only get 100 separately optimal parameter values, which don't necessarily have to be the optimal parameters for the overall problem.

    Could you please provide me with that literature, or a hint on how to combine the performance vectors in RapidMiner? This would really save me!

    Even if you are right that rerunning is the better approach, I would still like to explore the other possibility.

    Thank you!

    Daniel
  • daniel Member Posts: 12 Contributor II
    Hello,

    As I see it, the arithmetic mean of the performance measures should do the job. I just haven't found a way to generate it in RapidMiner.

    Can anyone help out?

    Thanks,

    Daniel

    EDIT: Never mind. I found the Collections operators; that seems to be what I was looking for. Thank you anyway.
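    For anyone else searching later: the logic I was after is essentially "for each parameter value, cross-validate on every data set and average the results". Outside RapidMiner, a minimal Python/scikit-learn sketch of that idea (with synthetic data standing in for my 100 files and a placeholder grid for the SVM's C parameter) would look like this; it is only an illustration, not my actual process:

    # Sketch: for every candidate parameter value, run one cross-validation
    # per data set and take the arithmetic mean of the accuracies, then keep
    # the value with the best mean. The synthetic data sets below only stand
    # in for the real prelabeled ExampleSets.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    datasets = [make_classification(n_samples=200, n_features=20, random_state=i)
                for i in range(5)]            # stand-in for the 100 files
    candidate_c = [0.01, 0.1, 1.0, 10.0]      # placeholder grid for the SVM's C

    best_c, best_avg = None, -np.inf
    for c in candidate_c:
        per_dataset = [cross_val_score(SVC(C=c), X, y, cv=10).mean()
                       for X, y in datasets]
        avg = float(np.mean(per_dataset))     # mean performance across data sets
        if avg > best_avg:
            best_c, best_avg = c, avg

    print("best C:", best_c, "average accuracy:", best_avg)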
  • wessel Member Posts: 537 Maven
    You talk about 100 examples.
    Typically 100 examples are contained within a single data set!
  • daniel Member Posts: 12 Contributor II
    Sorry if I was misleading. I realized later that "examples" does not properly describe my problem: I have 100 files, each of which contains a different number of examples.

    I wrote in my original post that I was looking for a way to combine performance vectors though. Maybe that got lost in our discussion.
    daniel wrote:

    I could also perform a loop for every exampleSet but I haven't found a way to combine multiple performance vectors.
  • wessel Member Posts: 537 Maven
    An example set is typically called a data set.
    A row in a data set is typically referred to as an instance, or as an example.

    You can take the average performance on each data set.
    Performance is the value of some performance measure, e.g. accuracy, correlation, RMSE.
    Average performance on multiple data sets very rarely has any useful interpretation.
    Since you are obviously a beginner, I'm warning you that it is very likely you are making some mistakes in your analysis.

    To resolve this, I suggest you give a complete problem description.
    Give the full context of your problem.
    Mention whether you are doing regression, classification, or maybe unsupervised learning.
    Mention the domain your problem is from.
    Mention some estimate of the number of examples per data set.
    Paste in a few data rows with some form of description, class distribution, etc.
    All of this should be no more than five minutes' work.


  • haddock Member Posts: 849 Maven
    Hi,

    I think the last comment is harsh, because:

    1. The original post already states: "I have 100 examples of binary classification tasks that I want to evaluate some algorithms on."
    2. Providing performance for unsupervised learning is a contradiction in terms.

    My two cents: I would suggest combining all the example sets, even in slices, and optimising over them with X-Validation, rather than combining the performance vectors. I did this recently over 32 million examples, provided in 37 SQL tables. Just an idea from another beginner.
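    In rough Python/scikit-learn terms (only a sketch with made-up arrays standing in for the real tables, not something from my actual setup), the idea is simply to stack everything into one example set and run a single cross-validated grid search over it:

    # Sketch of the "merge everything, validate once" idea.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    parts = [make_classification(n_samples=200, n_features=20, random_state=i)
             for i in range(5)]                  # stand-ins for the slices/tables
    X = np.vstack([x for x, _ in parts])         # one combined example set
    y = np.concatenate([t for _, t in parts])

    search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=10)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)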
  • daniel Member Posts: 12 Contributor II
    Hello,

    Thanks for your input. About my problem:

    @wessel:

    I have a set of authors. There are 100 of them, and they are split up across 18 different names. In each of the files I created, I have a list of documents; the list contains the documents written by all authors who share the same name. Each of the files represents a binary classification task: the documents of one author get the first label, and the rest get the second label. The number of documents differs for each name.

    Currently I am only doing classification tasks, and I want to optimize the parameters of the SVM learner across the whole lot of datasets. This is because I don't want a learner that performs well on one dataset and badly on another, but one that performs as well as possible across all my datasets. Before starting the training process I read the data from my XML files, turn them into documents, and calculate the tf-idf weights for the tokens. I also apply a stop word filter and stem the tokens.
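    (Just to make that preprocessing concrete: in scikit-learn terms it corresponds roughly to the sketch below. The two placeholder strings stand in for my parsed XML documents, and my actual process uses the RapidMiner text operators instead; stemming is left out because TfidfVectorizer has no built-in stemmer, though one could be plugged in via a custom tokenizer.)

    # Rough equivalent of the preprocessing: tokenize, drop English stop
    # words, and compute tf-idf weights, one feature vector per document.
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = ["first placeholder document text",
                 "second placeholder document text"]   # stand-ins for the parsed files

    vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
    features = vectorizer.fit_transform(documents)
    print(features.shape)                              # (num documents, vocabulary size)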

    I hope this gives enough insight into my problem. Since the Optimize Parameters operator expects a single performance vector, if I want to optimize across all my datasets I see no way around taking the average of the performance vectors and feeding that to the optimization operator. If you have an idea of how else this could be achieved, I would really appreciate your input.


    @haddock:

    Thanks for your suggestion. I am not sure how this would work out. To measure the performance I have to compare the prelabeled data with the prediction the classifier returns. If I merge the datasets, I either have to change the labels (which turns my binary classification into a multi-class classification task) or keep the current labels, in which case multiple different authors share a label. I don't think that would work. Maybe I am missing something? I am very grateful for all your suggestions!

    Thanks,

    Daniel
  • haddock Member Posts: 849 Maven
    Hi,

    Without seeing some examples I cannot comment further.
  • wessel Member Posts: 537 Maven
    If I understand correctly:
    You roughly have 2000 documents.
    A single document is written by a single author.
    There are a total of 100 different authors (these are the class labels).
    All documents are converted to feature vectors using some tf-idf scheme.

    You wish to build a model that is able to predict the author of a given document.



    My suggestion here would be to create a single file that contains all documents and all 100 authors (class labels).
    As a measure of performance, use Cohen's kappa: http://en.wikipedia.org/wiki/Cohen%27s_kappa
    This measure is built into RapidMiner.

    There are some boosting schemes that allow you to do binary classification on a data set containing 100 class labels.
    These schemes can do 1-out-of-k encoding (which is what you are doing), but also other forms of encoding with repair, which are far more efficient.
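    As a rough illustration of the single-file idea (scikit-learn rather than RapidMiner, with synthetic data standing in for your documents and authors), you would train one multi-class model over all authors and score it with kappa instead of plain accuracy:

    # Sketch: one multi-class model over all authors, evaluated with
    # Cohen's kappa. Ten synthetic classes stand in for the 100 authors.
    from sklearn.datasets import make_classification
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=1000, n_features=50, n_informative=20,
                               n_classes=10, random_state=0)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LinearSVC().fit(X_tr, y_tr)     # handles multi-class one-vs-rest internally
    print("kappa:", cohen_kappa_score(y_te, model.predict(X_te)))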

    Best regards,

    Wessel