Learning / Recognizing a Distribution

Ghostrider Member Posts: 60 Contributor II
It's pretty straightforward to learn / categorize labels associated with a small collection of numerical or categorical attributes.  However, is there a way to categorize distributions?  I know a distribution can be described by attributes such as quantiles, median, mean, standard deviation, etc.  But say I don't know which of these would be useful.  Is there a learning algorithm that can classify distributions?  Maybe a distribution that is skewed to the right always corresponds to some label.  Is there an automated way to detect / learn this without a human having to recognize that trait?  I think optical character recognition does something very similar... Is there a way to classify distributions in RapidMiner?  How would they be input?
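
As a rough illustration of the "skewed to the right" trait mentioned above, skewness can itself be computed as a numeric attribute and handed to an ordinary learner alongside the mean, quantiles and so on. The sketch below is plain Python, not a RapidMiner operator; the function name `skewness` and the two example lists are made up for illustration, and it assumes the usual moment-based definition (mean of cubed z-scores).

```python
import statistics

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores.

    Positive values indicate a longer right tail, negative values a
    longer left tail. Returns 0.0 for constant data (assumed convention).
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical example data: one symmetric set, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

print("symmetric:", skewness(symmetric))
print("right-tailed:", skewness(right_tailed))
```

Whether skewness is actually predictive for a given label is something the learner would have to discover from the data; this only makes the trait available as an attribute.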


    dan_agape Member Posts: 106 Maven

    Interesting comment! Identifying a distribution usually belongs to the field of statistics and is done with statistical techniques: one tests the hypothesis that the values of a variable/attribute follow a specified theoretical distribution. In the software I have seen so far, the statistician's involvement is essential, since he or she formulates the hypothesis before it is tested.

    However, it would be interesting to automate this and offer it as a generic feature (and perhaps data mining and statistical software will do so at some point, since it is useful in some applications). Such a feature could consist of automatically trying several theoretical or user-specified distributions, setting up for each one the hypothesis that the data follows that distribution, calculating the p-values when testing those hypotheses, and then selecting the distribution with the largest p-value as the result of the categorisation. Statisticians might not appreciate this very much, arguing that the result is not really the distribution the data follows, but rather the distribution that is least likely to be inconsistent with the data. That distinction may not upset data miners, since we are fond of heuristics and less theoretically founded methods as long as they prove useful in practice.
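
The procedure described above can be sketched in a few lines. This is a minimal illustration in plain Python, not an existing RapidMiner or R feature: it assumes the one-sample Kolmogorov-Smirnov test as the goodness-of-fit test, the asymptotic Kolmogorov series for the p-value, and only three candidate families (normal, exponential, uniform). All function names are hypothetical. Because the parameters are estimated from the same data being tested, the p-values are optimistic, so they are best read as a ranking heuristic only.

```python
import math
import statistics

def ks_statistic(sorted_values, cdf):
    """Largest gap between the empirical CDF and a candidate CDF."""
    n = len(sorted_values)
    d = 0.0
    for i, x in enumerate(sorted_values):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_pvalue(d, n):
    """Asymptotic Kolmogorov p-value: 2 * sum_k (-1)^(k-1) exp(-2 k^2 t^2)."""
    t = d * math.sqrt(n)
    if t < 0.2:  # the tail probability is essentially 1 in this range
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))

def rank_candidates(values):
    """Return (name, p_value) pairs, largest p-value first."""
    xs = sorted(values)
    n = len(xs)
    mean, sd = statistics.fmean(xs), statistics.stdev(xs)
    lo, hi = xs[0], xs[-1]
    normal = statistics.NormalDist(mean, sd)
    # Candidate CDFs, with parameters estimated from the data (assumption).
    candidates = {
        "normal": normal.cdf,
        "exponential": lambda x: 1.0 - math.exp(-x / mean) if x > 0 else 0.0,
        "uniform": lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo))),
    }
    scored = [(name, ks_pvalue(ks_statistic(xs, cdf), n))
              for name, cdf in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Synthetic example: exact quantiles of an exponential distribution (rate 1).
n = 200
data = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
ranking = rank_candidates(data)
best = ranking[0][0]
print(best, ranking)
```

The exponential candidate is only meaningful for positive data with a positive mean, and the uniform candidate needs at least two distinct values; a real implementation would have to guard against both cases.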

    Coming back to your question: as far as I know (I am still exploring the software), RM does not do what you are asking for. In the meantime, the good news is that RM has made an expected move and come closer to the statistics world by incorporating R as a plug-in (to be available shortly, as announced). This makes RM one of the most broadly applicable packages across the increasingly integrated worlds of data mining and statistics. Excellent move!

    Ghostrider Member Posts: 60 Contributor II
    Hmm... it's not so much about recognizing a distribution, actually. Say I have a time series or even a set of elements.  For now, assume that I have many, many samples of variable size.  Some sample sets have 50 items / members, some have only 6.  I want to categorize each set as either having or not having some quality.  How would I do that?  Since the size varies, I cannot simply treat each item in the set as an attribute.

    I think this problem might be similar to the optical character recognition problem where words are recognized as groups of lines / shapes of variable length.
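
One common workaround for the variable-size problem described above is to map every set to a fixed-length vector of summary statistics, so that a set of 6 items and a set of 50 items end up with the same attributes. The sketch below is plain Python and only an illustration under stated assumptions: the chosen features (mean, standard deviation, five quantiles), the nearest-centroid classifier, the labels "narrow" and "wide", and all data are hypothetical, and ordering information in a time series is deliberately ignored.

```python
import statistics

PROBS = (0.0, 0.25, 0.5, 0.75, 1.0)  # min, quartiles, max

def quantile(sorted_values, p):
    """Linear-interpolation quantile of a sorted, non-empty list."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def encode(sample):
    """Map a set of any size to a vector of 7 numbers."""
    s = sorted(sample)
    return ([statistics.fmean(s), statistics.pstdev(s)]
            + [quantile(s, p) for p in PROBS])

def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    return [statistics.fmean(col) for col in zip(*vectors)]

def classify(vector, centroids):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(vector, centroids[label]))

# Hypothetical training sets of different sizes (6 and 50 items per label).
training = {
    "narrow": [[5, 5, 6, 5, 6, 5], [5 + (i % 2) for i in range(50)]],
    "wide": [[0, 10, 2, 9, 1, 8], [i % 11 for i in range(50)]],
}
centroids = {label: centroid([encode(s) for s in sets])
             for label, sets in training.items()}

query = [4, 6, 5, 5, 6, 4, 5, 5]  # 8 items, a size not seen in training
predicted = classify(encode(query), centroids)
print(predicted)
```

Once each set is encoded this way, any standard learner that expects a fixed attribute list can be used in place of the nearest-centroid rule shown here.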