"Preprocessing Data for Decision Tree (Weights)"

mmaelzer Member Posts: 2 Contributor I
edited May 2019 in Help
Hi,

I have a special problem because of the characteristics of my data. The attributes are:

- ID (declared as ID)
- contact (nominal and declared as regular)
- product (nominal and declared as regular)
- execution (nominal and declared as label)
- quantity (numerical and declared as weight)

The data covers all possible combinations of contact, product and execution. If a combination doesn't exist, the quantity is zero; if the quantity is 300, the case occurred 300 times (in reality, but not in the data sheet). Because of this, building a decision tree or rules doesn't lead to the desired results. I tried to declare the quantity attribute as weight, but that doesn't seem to be the right way. Can someone tell me how to weight the data correctly?

Thanks a lot!

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I would have suggested declaring the quantity as weight. This should work with learners that support weights. What went wrong?
    By the way:
    I would filter out all examples with quantity = 0 using the example filter operator. This would at least make things faster.

    Greetings,
     Sebastian
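
The two suggestions above (drop zero-quantity rows, treat quantity as a weight) can be sketched outside RapidMiner in plain Python. This is only an illustration: the rows and the `p2` values are made up, and it assumes that a weight of n is meant to behave like n identical copies of the example.

```python
from collections import Counter

# Hypothetical rows: (contact, product, execution, quantity)
rows = [
    ("c1", "p1", "e1", 2),
    ("c1", "p1", "e2", 500),
    ("c1", "p2", "e1", 0),
    ("c1", "p2", "e2", 300),
]

# Drop combinations that never occurred (quantity == 0).
observed = [r for r in rows if r[3] > 0]

# Option A: keep one row per combination and use quantity as its weight.
weighted_counts = Counter()
for contact, product, execution, quantity in observed:
    weighted_counts[execution] += quantity

# Option B: expand each row into `quantity` identical examples.
expanded = [
    (contact, product, execution)
    for contact, product, execution, quantity in observed
    for _ in range(quantity)
]
expanded_counts = Counter(execution for _, _, execution in expanded)

# Both views yield the same class distribution.
assert weighted_counts == expanded_counts
print(weighted_counts)
```

If a learner or a validation step turns out not to honor weights, expanding the rows as in Option B is a blunt but transparent fallback, at the cost of a much larger table.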
  • mmaelzer Member Posts: 2 Contributor I
    Hi Sebastian,

    Filtering out examples with quantity 0 reduces the classification error (to 75%). When I don't filter out these examples, the classification error is 99%. Because of this I thought that the weights are not used or declared correctly.
    At first I used X-Validation; as I understand it, this splits the dataset into two or more disjoint datasets (problematic because every case appears just once). Classification error: 89% with filter / 99% without filter.
    Then I tried to split the data manually into two datasets (month1, month2), each covering all cases, and used month1 as the training set for the learner and month2 as the test set, applying the model to the test set. Classification error: 75% with filter / 99% without filter.
    The tree doesn't represent the data, for example:

    contact - product - execution - quantity
    c1      - p1      - e1        - 2
    c1      - p1      - e2        - 500

    leads to this path in the tree: c1 -> p1 -> e1
    It seems like the learner takes the first combination and ignores the weights.
    I tried it with decision tree and CHAID.

    Regards,

    M. Mälzer
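
The symptom described above can be reproduced with a tiny sketch of how a leaf label might be chosen. It is an assumption, not a statement about RapidMiner's implementation: `leaf_label` is a hypothetical helper that picks the majority class in a leaf, once counting each row as 1 and once counting it by its quantity.

```python
from collections import defaultdict

# The two rows from the example: (contact, product, execution, quantity)
rows = [
    ("c1", "p1", "e1", 2),
    ("c1", "p1", "e2", 500),
]

def leaf_label(rows, use_weights):
    """Return the majority execution among the rows that reach a leaf."""
    totals = defaultdict(float)
    for _, _, execution, quantity in rows:
        totals[execution] += quantity if use_weights else 1
    # max() returns the first key with the highest total,
    # so on a tie the label that was seen first wins.
    return max(totals, key=totals.get)

unweighted = leaf_label(rows, use_weights=False)  # tie 1:1, first row wins
weighted = leaf_label(rows, use_weights=True)     # 2 vs 500

print(unweighted, weighted)
```

Without weights, every combination contributes exactly one row per execution, so each leaf is a tie and the first row decides, which matches the path c1 -> p1 -> e1. With weights, the same leaf would be labeled e2.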
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    sorry, but I don't see any need for doing classification anyway. If you have each combination of the nominal attributes and each combination is assigned a label, where's the need for learning? It seems to me, the list of combinations with labels is a perfect classifier?

    Greetings,
      Sebastian
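
The idea of using the list of combinations directly can be sketched as a lookup table that stores, for each (contact, product) pair, the execution with the largest quantity. The function names and the month1/month2 numbers below are hypothetical. The sketch also compares an error rate that counts rows with one that counts quantities, which may explain the very high errors reported above if the evaluation ignored the weights (an assumption, not something the thread confirms).

```python
from collections import defaultdict

def fit_lookup(rows):
    """Map each (contact, product) to its most frequent execution."""
    totals = defaultdict(lambda: defaultdict(int))
    for contact, product, execution, quantity in rows:
        totals[(contact, product)][execution] += quantity
    return {
        key: max(counts, key=counts.get)
        for key, counts in totals.items()
        if sum(counts.values()) > 0  # skip pairs that never occurred
    }

def error_rate(model, rows, use_weights):
    """Share of misclassified rows, optionally weighted by quantity."""
    total = wrong = 0
    for contact, product, execution, quantity in rows:
        w = quantity if use_weights else 1
        total += w
        if model.get((contact, product)) != execution:
            wrong += w
    return wrong / total

# Made-up data: (contact, product, execution, quantity)
month1 = [("c1", "p1", "e1", 2), ("c1", "p1", "e2", 500),
          ("c1", "p2", "e1", 40), ("c1", "p2", "e2", 0)]
month2 = [("c1", "p1", "e1", 5), ("c1", "p1", "e2", 495),
          ("c1", "p2", "e1", 30), ("c1", "p2", "e2", 20)]

model = fit_lookup(month1)
unweighted_error = error_rate(model, month2, use_weights=False)
weighted_error = error_rate(model, month2, use_weights=True)
print(model, unweighted_error, weighted_error)
```

Because the table lists every execution for every combination, any model that predicts one label per combination must be wrong on all other rows of that combination when rows are counted equally. Counting by quantity gives a much lower and more meaningful error in this toy data.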