
Related attributes

michield Member Posts: 8 Contributor II
edited November 2018 in Help
Hello everyone,

In the following task I could really use some tips or hints from other miners

My dataset does not satisfy the rule of thumb that the number of samples should be more than 10 times the number of variables per sample and the number of classes of the label variable. In other words:
I have only very few samples and a lot of variables per sample, but I would still like to do something useful with the data set, namely the following:

- Try to find which variables (attributes) give related information, and if possible show this in some kind of graphical manner
- From all the unrelated variables, try to find a few combinations that best describe a model for the label variable

Of course, since the number of samples is so low, there could be many such combinations. The attributes are not all of the same type (some are bins, some are numbers, some are text).

A little about the data set:
Suppose I am a car maker with 6 car models, some of which have design flaws that I would like to find. I try to parametrise each design as a set of variables (attributes; right now there are only 6, plus CarModel, which shouldn't be used for mining, but imagine that there are 300 attributes).
CarModel | WheelSize | WheelBrand | EngineType | EnginePower | EngineBrand   | Failure (label)
Corvega  | 18        | Brimstone  | Nitro      | 6 GigaWatt  | RollsDavidson | CarExploded

A little about the way I tried to do this before:
- To see which attributes are directly related to the label attribute, I used a correlation matrix. I then looked at all variables whose correlation is (in an absolute sense) closest to 1 and regarded those as important attributes. The drawback is that I could not look at combinations of attributes.
- In parallel I tried to create a decision tree. The problem with this approach was that there were many possibilities, and the program just took the first attribute it encountered in the data set on which it could classify well. So what I did was remove that attribute, look at which attribute came next, and inspect the model again.
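The correlation-matrix step above can be sketched outside RapidMiner as well. Here is a minimal Python/pandas version on made-up car data (attribute values are purely illustrative, and the label is encoded as 0/1 so a Pearson correlation applies):

```python
# Sketch of the correlation-matrix approach described above (pandas,
# not RapidMiner). The data values are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "WheelSize":   [18, 16, 18, 20, 16, 18],
    "EnginePower": [600, 300, 550, 700, 280, 620],
    "Failure":     [1, 0, 1, 1, 0, 1],   # label encoded as 0/1
})

# Pairwise (Pearson) correlations; look for |r| close to 1.
corr = df.corr()
print(corr["Failure"].abs().sort_values(ascending=False))
```

As noted, this only scores attributes one at a time against the label; it cannot reveal a *combination* of attributes that is jointly predictive.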

Could anyone please give hints in how to better approach this problem than I did before?
Thank you!

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    this is a common but still complex problem, and I doubt we can discuss all possible approaches here. RapidMiner offers many useful tools for solving it:
    You could use the more heuristic attribute weighting schemes, or the wrapper-based approach using Forward or Backward Selection with an integrated learning algorithm.
    You could even select subsets at random inside a loop and try to draw conclusions from the random results.
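As a conceptual sketch of what wrapper-based Forward Selection does (in Python with scikit-learn, not RapidMiner itself): greedily add whichever attribute most improves the cross-validated performance of an inner learner. The iris data and the decision-tree learner here are stand-ins for illustration only:

```python
# Greedy forward selection with an inner learner, scored by cross-validation.
# This mirrors the idea behind RapidMiner's Forward Selection operator.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

selected, best_score = [], 0.0
improved = True
while improved:
    improved = False
    for f in range(n_features):
        if f in selected:
            continue
        trial = selected + [f]
        score = cross_val_score(
            DecisionTreeClassifier(random_state=0), X[:, trial], y, cv=5
        ).mean()
        if score > best_score:          # remember the candidate that helps most
            best_score, best_trial = score, trial
            improved = True
    if improved:
        selected = best_trial           # commit the best addition this round

print("selected feature indices:", selected, "cv accuracy:", round(best_score, 3))
```

Note that with very few samples (as in the original question) the cross-validated scores become noisy, which is exactly why many different subsets may look equally good.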

    You will have to experiment with the combinations of these possibilities to find the one most suitable for your (very) specific task. For orientation, there are many sample processes for weighting and selecting attributes in the Samples repository.

    Greetings,
    Sebastian
  • michield Member Posts: 8 Contributor II
    Thank you Sebastian!

    By "Samples", do you mean the Community Extension, or is there another location with samples in it that I am not aware of yet?

    Which of the many possible approaches would you recommend yourself? (If possible, a web link to an example, or a search term I could
    use in the Community Extension, would make things much easier for me.)
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    take a look at the Sample Repository that is delivered with RapidMiner. There are several examples for this!

    Greetings,
    Sebastian
  • michield Member Posts: 8 Contributor II
    Hello everyone,

    Thank you for the helpful advice, I found the samples (and browsed the community processes). Now I have the following question:

    I managed to assign a weight to the attributes (Weight by Information Gain), and I can cap off the data set with Select by Weights, then
    send the result to a Decision Tree. Now I wonder whether there is also an operator that can order the data set by weight instead of
    capping it off (like Select by Weights does), so that my decision tree encounters the most relevant attribute first, then the next, etc.

    Does such an operator ("re-order by weight") exist, and if so, what is its name? (I tried to find one by typing "weight" into the operator search, etc.)
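For comparison, here is a rough Python analogue of the "Weight by Information Gain" → "Select by Weights" chain, using scikit-learn's mutual-information estimate as the weight (an approximation of information gain, not RapidMiner's exact formula) on illustrative iris data:

```python
# Weight attributes by mutual information with the label, then keep the
# top-k -- roughly what "Weight by Information Gain" followed by
# "Select by Weights" (top-k) does in RapidMiner.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

weights = mutual_info_classif(X, y, random_state=0)

k = 2
top_k = np.argsort(weights)[::-1][:k]   # indices of the k heaviest attributes
print("weights:", np.round(weights, 3))
print("kept attribute indices:", sorted(top_k))
```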
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    what you suggest would not have any effect: the Decision Tree takes ALL attributes into account at every split, regardless of their ordering.

    Greetings,
    Sebastian
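Sebastian's point can be illustrated with a toy best-split chooser: a tree scores every attribute and takes the maximum information gain, so reordering the columns cannot change the pick (barring exact ties). A minimal Python sketch on a made-up two-attribute data set:

```python
# Toy demonstration that the best split by information gain is independent
# of attribute order. Data set is invented; labels are the "Failure" label.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    gain = entropy(labels)
    n = len(labels)
    for value in set(column):
        subset = [l for v, l in zip(column, labels) if v == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def best_attribute(data, labels):
    # Scores EVERY attribute and takes the maximum, so the result is
    # order-independent (up to exact ties).
    return max(data, key=lambda name: info_gain(data[name], labels))

labels = ["exploded", "ok", "exploded", "ok"]
data = {"WheelBrand": ["A", "A", "B", "B"],
        "EngineType": ["Nitro", "Gas", "Nitro", "Gas"]}
reordered = {"EngineType": data["EngineType"], "WheelBrand": data["WheelBrand"]}

print(best_attribute(data, labels), "==", best_attribute(reordered, labels))
```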