Training set ordering matters??

simplyfat Member Posts: 3 Contributor I
edited November 2018 in Help
Hi guys,
i'm pretty new to RapidMiner and data mining in general, and i wanted to ask about a strange behavior i'm seeing with RapidMiner 5.0 beta:

as you can see from the screenshot, i am fetching a training set from a MySQL database and passing it to a neural network. so far so good.
image
you can see the training set in the other screenshot:
image
it is labeled with a value between 0 and 1, which should be predicted afterwards. now the strange thing: i copied and pasted the training set SQL query from somewhere else, and it still had an "ORDER BY `label` DESC" at the end. i thought this was rather useless (i am not limiting the number of SQL results), so i removed the ORDER BY clause. but that made my prediction worse! far worse! i could not believe it, so i reproduced it many times...

can somebody tell me why the neural network depends on the ordering of its training set, and why it is better with DESC than with ASC?

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    Neural networks depend on the ordering of their training inputs. Training works as follows:
    1. For each training example x
      1.1.  Feed x forward
      1.2. Calculate error
      1.3. Propagate error back and adapt weights of neurons.
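The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a single linear neuron instead of a full network, synthetic data instead of the protein data set, and a hypothetical helper `train_online`. It is not how RapidMiner implements its neural net operator. It shows that the same examples, presented in ascending versus descending label order, leave the weights in different places after one pass.

```python
import random


def train_online(examples, epochs=1, lr=0.1):
    """Per-example (online) gradient descent for one linear neuron y = w*x + b."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            pred = w * x + b      # 1.1 feed x forward
            err = pred - y        # 1.2 calculate error
            w -= lr * err * x     # 1.3 propagate error back and adapt weights
            b -= lr * err
    return w, b


# Synthetic training set (an assumption, not the poster's data).
rng = random.Random(0)
data = []
for _ in range(50):
    x = rng.random()
    data.append((x, 0.5 * x + rng.gauss(0.0, 0.1)))

# Same examples, two orderings, like ORDER BY label ASC / DESC.
asc = sorted(data, key=lambda e: e[1])
desc = sorted(data, key=lambda e: e[1], reverse=True)

w_asc, b_asc = train_online(asc)
w_desc, b_desc = train_online(desc)
print("ASC :", w_asc, b_asc)
print("DESC:", w_desc, b_desc)
```

Because every update starts from the weights left by the previous example, the last examples seen have the largest influence on the final weights in this sketch.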

    So you can see that the result depends heavily on the ordering. That one sort direction gives better accuracy than the other is probably just chance. You could add a noise attribute with the Noise operator and use the Sort operator to sort by it. Repeating this several times with different random seeds for the Noise operator would show how the performance varies.
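The noise-and-sort experiment can also be sketched outside RapidMiner. The Python below is a rough illustration only: the single linear neuron, the synthetic data, and the helper names `order_by_noise`, `train_online`, `make_data` and `mse` are all assumptions made for the example, not RapidMiner operators. Each seed attaches a random noise value to every example, sorts by it, trains once, and records the error on held-out data.

```python
import random


def train_online(examples, lr=0.1):
    """One pass of per-example gradient descent for a linear neuron y = w*x + b."""
    w, b = 0.0, 0.0
    for x, y in examples:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b


def mse(model, examples):
    """Mean squared error of model (w, b) on a list of (x, y) pairs."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


def make_data(n, rng):
    """Synthetic (x, y) pairs; a stand-in for the real training set."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, 0.5 * x + rng.gauss(0.0, 0.1)))
    return data


def order_by_noise(examples, seed):
    """Attach a random noise value to each example and sort by it."""
    rng = random.Random(seed)
    keyed = [(rng.random(), ex) for ex in examples]
    keyed.sort(key=lambda pair: pair[0])
    return [ex for _, ex in keyed]


data_rng = random.Random(42)
train = make_data(200, data_rng)
test = make_data(200, data_rng)

# One training run per noise seed, evaluated on the same held-out set.
errors = {
    seed: mse(train_online(order_by_noise(train, seed)), test)
    for seed in range(10)
}
for seed, err in errors.items():
    print(f"seed {seed}: test MSE = {err:.5f}")
print(f"spread: {max(errors.values()) - min(errors.values()):.5f}")
```

If the spread across seeds is about as large as the gap between the ASC and DESC runs, the difference between the two sort directions is probably not meaningful.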

    By the way, this is one of the properties of neural nets that keeps me from using them...

    Greetings,
      Sebastian
  • simplyfat Member Posts: 3 Contributor I
    ah! thank you very much. i thought ordering dependency would go away if you have more training cycles than training set entries.

    what you do with noise is the same as an "ORDER BY RAND()", right? i tried it and it was bad, too.

    what learner would you recommend instead for my purpose? you can see the data set in the screenshot. it is about proteins..
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I would recommend using LibSVM. You will probably have to optimize the kernel type and its dependent parameters, but in nearly all cases a tuned SVM is at least as good as a neural net.
    And although linear regression is a very simple learner, it's always worth a try in combination with feature generation. It's quick, so many iterations can be performed to find a suitable attribute set. And compared with heavyweight learning schemes like SVM and neural nets, it produces understandable results.
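To illustrate the point about linear regression with feature generation, here is a small Python sketch. Everything in it is an assumption made for the example: the synthetic data (a target that depends on x squared), the generated attribute `x*x`, and the helper names `solve`, `fit_linear` and `mse`. It fits ordinary least squares via the normal equations, once on the raw attribute and once with the generated attribute added, and compares the training error.

```python
import random


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def mse(coef, rows, targets):
    """Mean squared error of the linear model on the given rows."""
    preds = (sum(c * v for c, v in zip(coef, r)) for r in rows)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(rows)


# Synthetic data (an assumption): the label depends on x squared.
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [0.2 + 0.6 * x * x + rng.gauss(0.0, 0.02) for x in xs]

plain_rows = [[1.0, x] for x in xs]            # bias + raw attribute
gen_rows = [[1.0, x, x * x] for x in xs]       # bias + raw + generated attribute

coef_plain = fit_linear(plain_rows, ys)
coef_gen = fit_linear(gen_rows, ys)
mse_plain = mse(coef_plain, plain_rows, ys)
mse_gen = mse(coef_gen, gen_rows, ys)

print("plain coefficients    :", coef_plain, "MSE:", mse_plain)
print("generated coefficients:", coef_gen, "MSE:", mse_gen)
```

The fitted coefficients can be read directly as the weight of each attribute, which is the kind of understandable result meant above. Note that this sketch compares training error only; a real comparison would use held-out data or cross-validation.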

    Greetings,
      Sebastian