The right tool for the job

Bananamaster Member Posts: 4 Contributor I
Hey guys,

I have an idea of what I want to do, but I'm not too sure if RapidMiner is the right tool for the job.

I have a dataset which basically describes my transactions. It has a lot of different attributes, like whether the sold good was used or new, how long I had it in storage, whether it was a bike or a boat, and many more. Additionally, for each of these transactions I have my margin, which is negative in some unfortunate cases. I now want to find out which attribute or combination of attributes the transactions I lose money on have in common. The perfect result would be a tree reading somewhat like this:
I lose $1000 across all of my transactions. 1% of that is lost on boats, so that is not the problem. The remaining 99% splits into 15% lost on new goods and 84% on used goods. The 15% on new goods splits into 1% with a storage time under 5 days and 14% with a storage time of 5 days or more, whereas the 84% on used goods splits into 12% on cars and 72% on bicycles. And so on.
Not a very complicated algorithm if each attribute takes its values from a discrete and limited set.
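
A minimal sketch of such a breakdown in plain Python (not RapidMiner) might look like the following. It assumes that "loss" means the summed margin of transactions with a negative margin, and that the attributes are split in a fixed order. The names `loss_breakdown`, `print_tree` and the sample rows are hypothetical and only serve as illustration.

```python
from collections import defaultdict

def loss_breakdown(rows, attributes, label="margin", total=None):
    """Map (attribute, value) -> (share of total loss, subtree) for losing rows."""
    losing = [r for r in rows if r[label] < 0]
    if total is None:
        # Total loss over all losing transactions, as a positive number.
        total = -sum(r[label] for r in losing)
    if not attributes or not losing or total == 0:
        return {}
    first, rest = attributes[0], attributes[1:]
    groups = defaultdict(list)
    for r in losing:
        groups[r[first]].append(r)
    tree = {}
    for value, members in groups.items():
        loss = -sum(r[label] for r in members)
        tree[(first, value)] = (loss / total,
                                loss_breakdown(members, rest, label, total))
    return tree

def print_tree(tree, indent=0):
    # Largest share of the loss first.
    for (attr, value), (share, sub) in sorted(tree.items(),
                                              key=lambda kv: -kv[1][0]):
        print(" " * indent + f"{attr}={value}: {share:.0%} of total loss")
        print_tree(sub, indent + 2)

# Made-up transactions, for illustration only.
transactions = [
    {"condition": "used", "category": "bicycle", "margin": -720},
    {"condition": "used", "category": "car", "margin": -120},
    {"condition": "new", "category": "car", "margin": -150},
    {"condition": "new", "category": "boat", "margin": -10},
    {"condition": "used", "category": "car", "margin": 400},
]
tree = loss_breakdown(transactions, ["condition", "category"])
print_tree(tree)
```

Every share is relative to the overall loss rather than to the parent node, which matches the way the percentages are described above. Numeric attributes such as storage time would have to be binned first (for example "under 5 days" versus "5 days or more").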

Now, is RapidMiner a tool that can help me do that? If so, is this a standard function, so that I only have to import my data and apply it, or do I have to build a function like this on my own using RapidMiner?

Thanks guys

Lukas

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi Bananamaster,

    I am sure RapidMiner can solve this, but I do not exactly understand the problem.

    Let me describe what I understand:
    You have a table like this:

    negative  Bike    new
    positive  NoBike  used
    negative  NoBike  used

    ...

    And now you want to find the attributes that most influence your margin (positive/negative).

    If so, you are in the area of feature selection. There are quite a few algorithms around. If the problem is as easy as you suspect, I would try the Weight by Information Gain, Weight by Gini Index or Weight by Correlation operators first. The result is a weight vector that represents the importance of each attribute. (Be careful with correlations: anti-correlations are also "important", so be sure to square the value.)
    Note: Most of the Weight by operators ignore interactions between attributes. So if your data contains a pattern like "is new AND is a bike", they might fail. In that case, you might try a feature selection using Weight by SVM or a Forward Selection.

    Best,

    Martin

    Edit: Maybe you need to convert your nominal values to numerical first. Try Nominal to Numerical using dummy coding.
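
    To illustrate the idea behind dummy coding followed by a squared correlation weight, here is a small plain-Python sketch. It is not RapidMiner's implementation of Weight by Correlation or Nominal to Numerical; the function names and the sample rows are hypothetical, and the label is assumed to be a numeric margin.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def dummy_code(rows, attributes):
    """One 0/1 column per (attribute, value) pair."""
    columns = {}
    for attr in attributes:
        for value in sorted({r[attr] for r in rows}):
            columns[f"{attr}={value}"] = [
                1.0 if r[attr] == value else 0.0 for r in rows
            ]
    return columns

def squared_correlation_weights(rows, attributes, label="margin"):
    target = [float(r[label]) for r in rows]
    # Squaring makes strong negative correlations count as "important" too.
    return {name: pearson(col, target) ** 2
            for name, col in dummy_code(rows, attributes).items()}

# Made-up transactions, for illustration only.
rows = [
    {"category": "bike", "condition": "new", "margin": -200},
    {"category": "bike", "condition": "used", "margin": -300},
    {"category": "boat", "condition": "used", "margin": 150},
    {"category": "boat", "condition": "new", "margin": 250},
]
weights = squared_correlation_weights(rows, ["category", "condition"])
for name, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.3f}")
```

    In this toy data the category columns get a much higher weight than the condition columns. As noted above, a per-column weight like this cannot see interactions such as "is new AND is a bike".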
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Bananamaster Member Posts: 4 Contributor I
    Hi Martin,

    thanks a lot for the help.

    You seem to be on the right track. My table looks like yours, but instead of negative or positive I have an integer in the first column describing how much money I gained or lost. This matters because how much money I lost is more important than how many transactions I lost that money on. This seems to be where my problems start, because neither the decision tree nor the weight operators can handle integer labels. I wonder why, because right now the operator is probably comparing count(positive) to count(negative), and if it were able to compare sum(positive) to sum(negative), it would probably deliver exactly what I'm looking for.
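
    The difference between counting losing transactions and summing the money lost can be seen in a small plain-Python sketch. The helper name `count_and_sum_by_value` and the rows are made up for illustration; this is not how any RapidMiner operator works internally.

```python
from collections import defaultdict

def count_and_sum_by_value(rows, attribute, label="margin"):
    """Per attribute value: number of losing transactions and net margin."""
    stats = defaultdict(lambda: {"losing_count": 0, "net_margin": 0})
    for r in rows:
        entry = stats[r[attribute]]
        entry["net_margin"] += r[label]
        if r[label] < 0:
            entry["losing_count"] += 1
    return dict(stats)

# Made-up transactions: many small losses on bikes, one big loss on a boat.
rows = [
    {"category": "bike", "margin": -10},
    {"category": "bike", "margin": -20},
    {"category": "bike", "margin": -15},
    {"category": "boat", "margin": -900},
    {"category": "boat", "margin": 100},
]
stats = count_and_sum_by_value(rows, "category")
for value, entry in stats.items():
    print(value, entry)
```

    By count, bikes look like the problem (three losing transactions against one), but by sum the boats account for almost all of the money lost.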

    But perhaps only from a beginner's perspective ;)

    Do you have another idea?

    Thanks again

    Lukas
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi again

    I don't know why you want to use a decision tree, but a decision tree is usually not the algorithm you would use for regression.
    Anyway - why do you want to use a learner?

    A learner would be a way for you to score automatically. You could score the products in stock, so you would know the margin of an item beforehand.

    If that's what you want to do: Try a regression algorithm like Linear Regression, k-NN, Neural Net or SVM.
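
    As a rough illustration of what "scoring" a stock item with a regression learner means, here is a toy k-NN regressor in plain Python. It is only a sketch and not RapidMiner's k-NN operator: it assumes purely nominal attributes, uses a simple mismatch count as distance, and all names and rows are hypothetical.

```python
def mismatches(a, b, attributes):
    """Number of attributes on which two items differ."""
    return sum(a[attr] != b[attr] for attr in attributes)

def knn_predict_margin(history, item, attributes, k=3, label="margin"):
    """Average margin of the k past transactions most similar to the item."""
    nearest = sorted(history,
                     key=lambda r: mismatches(r, item, attributes))[:k]
    return sum(r[label] for r in nearest) / len(nearest)

# Made-up past transactions, for illustration only.
history = [
    {"category": "bike", "condition": "used", "margin": -300},
    {"category": "bike", "condition": "used", "margin": -200},
    {"category": "bike", "condition": "new", "margin": -50},
    {"category": "boat", "condition": "new", "margin": 250},
    {"category": "boat", "condition": "used", "margin": 150},
]
stock_item = {"category": "bike", "condition": "used"}
predicted = knn_predict_margin(history, stock_item,
                               ["category", "condition"], k=2)
print(predicted)
```

    The prediction is the mean margin of the two most similar past transactions, so an item resembling past loss-makers gets a negative predicted margin.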

    The Weight by operators should work on numeric labels. I am not sure about integer labels, but you can simply convert them to numerical. They might have problems with polynominal values, but as I said, you can use Nominal to Numerical beforehand.

    Best,

    Martin