Question

user1 Member Posts: 4 Contributor I
edited November 2018 in Help
I was wondering how the maximal number of XValidation runs embedded in an EvolutionaryParameterOptimization
can be determined.

My settings for the evolutionary parameter optimization are:
"max_generations" value="5"
"generations_without_improval" value="-1" (on purpose to make things more clear)
"population_size" value="20"
"tournament_fraction" value="0.3"

And for the XValidation, the parameter "number_of_validations" is set to 2.

Here is the corresponding code:

<operator name="Root" class="Process" expanded="yes">
  <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="../data/polynomial.aml"/>
    </operator>
    <operator name="ParameterOptimization" class="EvolutionaryParameterOptimization" expanded="yes">
        <list key="parameters">
          <parameter key="LibSVMLearner.C" value="0.1:100"/>
          <parameter key="LibSVMLearner.degree" value="2:7"/>
        </list>
        <parameter key="max_generations" value="5"/>
        <parameter key="generations_without_improval" value="-1"/>
        <parameter key="population_size" value="20"/>
        <parameter key="tournament_fraction" value="0.3"/>
        <parameter key="local_random_seed" value="2001"/>
        <parameter key="show_convergence_plot" value="true"/>
        <operator name="Validation" class="XValidation" expanded="yes">
            <parameter key="number_of_validations" value="2"/>
            <parameter key="sampling_type" value="shuffled sampling"/>
            <operator name="LibSVMLearner" class="LibSVMLearner">
                <parameter key="svm_type" value="epsilon-SVR"/>
                <parameter key="kernel_type" value="poly"/>
                <parameter key="C" value="76.53909856172457"/>
                <list key="class_weights">
                </list>
            </operator>
            <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                <operator name="Test" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="Performance" class="Performance">
                </operator>
            </operator>
        </operator>
        <operator name="Log" class="ProcessLog">
            <parameter key="filename" value="paraopt.log"/>
            <list key="log">
              <parameter key="C" value="operator.LibSVMLearner.parameter.C"/>
              <parameter key="degree" value="operator.LibSVMLearner.parameter.degree"/>
              <parameter key="performance" value="operator.Validation.value.performance"/>
              <parameter key="iterations" value="operator.Validation.value.iteration"/>
            </list>
        </operator>
    </operator>
</operator>
I would expect that for each individual (within a population) 2 validations are performed. Since the
population size is 20, there are 2*20 = 40 validations in each generation. Using 5 generations, I would
expect 200 validations in total.

But when I check the output of the ProcessLog operator, the parameter optimization computes 248 performance
values, each of which in my opinion should represent one individual, with 2 iterations (the two runs of the validation).
Thus, 2*248 = 496 validations are performed in total. Why not just 200?
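
Just to spell out the arithmetic I have in mind (a quick Python sketch; 248 is simply the number of rows the ProcessLog wrote for this process):

# Expected vs. observed number of XValidation fold runs for the process above.
number_of_validations = 2    # folds per XValidation run
population_size = 20
max_generations = 5

expected_individuals = population_size * max_generations              # 100
expected_validations = expected_individuals * number_of_validations   # 200

observed_log_rows = 248      # performance values written by ProcessLog
observed_validations = observed_log_rows * number_of_validations      # 496

print(expected_validations, observed_validations)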

Marcus

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Marcus,
    I think the population_size parameter specifies the size of the initial population, which might change in the next generations. This might cause the deviation from the expected number.

    Greetings,
      Sebastian
  • user1 Member Posts: 4 Contributor I
    Sebastian,

    Evolutionary algorithms with a variable population size are IMHO
    not that common. Do you by any chance have a reference
    (paper/URL/book) that describes the principles you are using in
    RapidMiner for this parameter optimization?

    So, does this mean that the total number of validations cannot be
    bounded in advance?

    Marcus
  • keith Member Posts: 157 Maven

    I had asked a similar question a few months ago, and Ingo gave a little more background on what RM does behind the scenes with evolutionary algorithms:


    http://rapid-i.com/rapidforum/index.php/topic,344.0.html

    Hope this helps,
    Keith
  • user1 Member Posts: 4 Contributor I
    I quote Ingo's answer (from the mentioned thread):

    1. pairs of individuals are randomly selected and crossover is performed with a certain probability --> depending on this probability a random number of additional individuals (children) will be produced and have to be evaluated
    2. on those children mutations might be applied which again will deliver some additional individuals (since RM keeps both the original and the mutated search point) --> again some more individuals to evaluate
    3. on the other hand, individuals which did not change will not be re-evaluated --> this can even drop the number of evaluations
    What I don't understand is why the individuals created in 1. and 2. have to be evaluated in addition to the individuals
    already present in the generation.

    I would assume that you have evaluated individuals from generation n, then select some of them for cross-over and mutation, and finally put these possibly new individuals in generation n+1. In the next round, all individuals (if new) from generation n+1 are then evaluated. Thus, I would expect that in each generation at most p individuals, with p being the population size, are evaluated. But this seems to be wrong. It seems to me that the new offspring individuals (after crossover and mutation) are evaluated but their fitness values are dropped such that they have to be re-evaluated
    in generation n+1.
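
    To get a feeling for how Ingo's three effects interact, I put together a rough simulation (my own Python sketch, not RapidMiner's actual code; the crossover and mutation probabilities are pure assumptions):

    import random

    # Rough illustration of why the number of fitness evaluations per generation
    # can differ from population_size: crossover and mutation create extra
    # individuals that must be evaluated, while unchanged individuals are not
    # re-evaluated. The probabilities are assumptions, not RapidMiner defaults.
    def count_evaluations(pop_size=20, generations=5,
                          p_crossover=0.9, p_mutation=0.25, seed=2001):
        random.seed(seed)
        total = pop_size                      # the initial population is evaluated once
        for _ in range(generations - 1):
            new_individuals = 0
            for _ in range(pop_size // 2):    # pairs selected for crossover
                if random.random() < p_crossover:
                    new_individuals += 2      # two children to evaluate
            # mutation keeps the original AND adds a mutated copy
            new_individuals += sum(1 for _ in range(new_individuals)
                                   if random.random() < p_mutation)
            total += new_individuals          # unchanged parents are not re-evaluated
        return total

    print(count_evaluations())                # typically != pop_size * generations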

    Marcus
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Marcus,
    It might happen that two individuals both mutate AND cross over, so that the number of evaluations can exceed the population size.

    Greetings,
      Sebastian
  • user1 Member Posts: 4 Contributor I
    So, let me sum up to make sure I got it right.

    After the first generation with an initial population, the fitness values for each individual are computed.
    Then, in the selection phase, the fittest individuals (a fraction specified for example by 'tournament_fraction')
    are determined. From those, pairs are randomly selected and crossover is performed with probability 'crossover_prob'.
    For these new individuals the fitness must be evaluated. So, after this step we possibly have some more
    individuals due to the additional children.

    Next, mutation is applied to these children. For these mutated individuals, fitness evaluation must again
    be performed. So, in addition to the "crossover" children we may get new "mutation" children.
    Together with the parents, these individuals represent the offspring.

    Finally, the reinsertion step is performed by selecting the fittest individuals from the offspring and inserting them
    into the next generation. Which reinsertion strategy are you actually using? Depending on the strategy, I assume
    that, as Sebastian wrote previously, the population size might become larger or smaller in the following generation.
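
    In Python-like pseudo-code, this is how I currently picture one generation (just my reading of the steps above; evaluate, crossover and mutate are placeholders, e.g. evaluate would be one run of the XValidation, and individuals are parameter tuples such as (C, degree)):

    import random

    # One generation as summarised above: tournament selection, crossover,
    # mutation (keeping the originals), evaluation of new individuals only,
    # and elitist reinsertion of the fittest pop_size individuals.
    # evaluate/crossover/mutate are placeholder callables.
    def next_generation(population, fitness, pop_size, tournament_fraction,
                        crossover_prob, mutation_prob,
                        evaluate, crossover, mutate):
        k = max(2, int(tournament_fraction * pop_size))
        offspring = []
        for _ in range(pop_size // 2):
            parents = [max(random.sample(population, k), key=fitness.get)
                       for _ in range(2)]
            if random.random() < crossover_prob:
                offspring.extend(crossover(*parents))      # children need evaluation
        # mutation keeps both the original and the mutated search point
        offspring += [mutate(ind) for ind in offspring
                      if random.random() < mutation_prob]
        for ind in offspring:
            if ind not in fitness:                         # only new individuals are evaluated
                fitness[ind] = evaluate(ind)
        # reinsertion: keep the fittest pop_size individuals for generation n+1
        survivors = sorted(set(population) | set(offspring),
                           key=fitness.get, reverse=True)
        return survivors[:pop_size]

    Whether the reinsertion really is elitist like this is exactly what I am asking above.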


    The evolutionary parameter optimization has the nice feature 'show_convergence_plot'. How is
    the blue curve actually computed? I would imagine that for each generation the average performance is computed.
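
    If it really is the per-generation average, it could be reconstructed from logged values roughly like this (my guess only; the generation index is not in my ProcessLog above, so it would have to be logged as well):

    # Hypothetical reconstruction of the convergence curve: average (and best)
    # performance per generation from (generation, performance) pairs.
    def convergence_curve(rows):
        by_generation = {}
        for generation, performance in rows:
            by_generation.setdefault(generation, []).append(performance)
        return [(g, sum(v) / len(v), max(v))
                for g, v in sorted(by_generation.items())]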

    Marcus
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Marcus,
    I'm sorry, but I'm neither a specialist in this topic nor have I written these operators. Everybody who participated in writing this part of RapidMiner is currently out of office for various reasons, so I cannot give a definitive answer...


    Greetings,
      Sebastian