Statistical Significance

ema Member Posts: 33  Guru
Hi all,

I am running a standard classification cross-validation, shown below:
<operator name="Root" class="Process" expanded="yes">
   <description text="#ylt#p#ygt#This process is very similar to the process #yquot#03_XValidation_Numerical.xml#yquot#. The basic process setup is exactly the same, i.e. the first inner operator must produce a model from the given training data set and the second inner operator must be able to handle this model and the test data and must provide a PerformanceVector. #ylt#/p#ygt# In contrast to the previous process we now use a classification learner (J48) which is evaluated by several nominal performance criteria.#ylt#/p#ygt#  #ylt#p#ygt# The cross validation building block is very common for many (more complex) RapidMiner processes. However, there are several more validation schemes available in RapidMiner which will be dicussed in the next sample processes. #ylt#/p#ygt#"/>
   <parameter key="logfile" value="C:\knn.txt"/>
   <operator name="TextInput (4)" class="TextInput" expanded="no">
       <list key="texts">
         <parameter key="b" value=".."/>
         <parameter key="P" value=".."/>
       </list>
       <parameter key="default_content_encoding" value="utf8"/>
       <parameter key="default_content_language" value="utf8"/>
       <parameter key="prune_below" value="3"/>
       <list key="namespaces">
       </list>
       <parameter key="create_text_visualizer" value="true"/>
       <operator name="StringTokenizer (4)" class="StringTokenizer">
       </operator>
       <operator name="TokenLengthFilter (4)" class="TokenLengthFilter">
           <parameter key="min_chars" value="3"/>
       </operator>
   </operator>
   <operator name="XValidation (3)" class="XValidation" expanded="yes">
       <operator name="NearestNeighbors" class="NearestNeighbors">
           <parameter key="k" value="3"/>
           <parameter key="measure_types" value="NumericalMeasures"/>
           <parameter key="numerical_measure" value="CosineSimilarity"/>
       </operator>
       <operator name="OperatorChain (3)" class="OperatorChain" expanded="yes">
           <operator name="ModelApplier (3)" class="ModelApplier">
               <list key="application_parameters">
               </list>
           </operator>
           <operator name="ClassificationPerformance (3)" class="ClassificationPerformance">
               <parameter key="accuracy" value="true"/>
               <parameter key="classification_error" value="true"/>
               <parameter key="kappa" value="true"/>
               <parameter key="weighted_mean_recall" value="true"/>
               <parameter key="weighted_mean_precision" value="true"/>
               <parameter key="spearman_rho" value="true"/>
               <parameter key="kendall_tau" value="true"/>
               <parameter key="absolute_error" value="true"/>
               <parameter key="relative_error" value="true"/>
               <parameter key="relative_error_lenient" value="true"/>
               <parameter key="relative_error_strict" value="true"/>
               <parameter key="normalized_absolute_error" value="true"/>
               <parameter key="root_mean_squared_error" value="true"/>
               <parameter key="root_relative_squared_error" value="true"/>
               <parameter key="squared_error" value="true"/>
               <parameter key="correlation" value="true"/>
               <parameter key="squared_correlation" value="true"/>
               <parameter key="cross-entropy" value="true"/>
               <parameter key="margin" value="true"/>
               <parameter key="soft_margin_loss" value="true"/>
               <parameter key="logistic_loss" value="true"/>
               <list key="class_weights">
               </list>
           </operator>
       </operator>
   </operator>
</operator>


My question is: other than XValidation, does RapidMiner have any way to calculate "statistical significance"?

Thank you

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,526   Unicorn
    Hi Ema,
    RapidMiner provides operators for checking whether one set of results is statistically significantly better than another. You will find them in the Validation / Significance group: an ANOVA operator and a T-Test operator, both of which compare performance vectors.

    Is that what you were looking for?

    Greetings,
      Sebastian
  • lindawu Member Posts: 5 Contributor II
    Hi,

    I just downloaded RapidMiner and was impressed by all the data mining methods in it. However, is there another way to test significance, such as Fisher's test? For example, consider a rule:

    A1=> A0, i.e., prob(A0|A1) > prob(A0)

    we can rewrite it as

    prob(A0|A1) *prob(A1) > prob(A0) * prob(A1)
    prob(A0&A1) > prob(A0)*prob(A1)

    Therefore, we can test hypothesis H0

    H0: prob(A0&A1) = prob(A0)*prob(A1)

    against alternative hypothesis H1

    H1: prob(A0&A1) != prob(A0)*prob(A1)

    If H0 cannot be rejected, then A1 => A0 is not a statistically significant rule.

    Is there any functionality for this test?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,526   Unicorn
    Hi,
    Where would you like to add this feature? Should it apply to Association Rules or to the Rule model? Testing general data mining models this way could be a little difficult, since we don't have a probability there. Or am I misunderstanding something?

    Greetings,
      Sebastian
  • lindawu Member Posts: 5 Contributor II
    Yes, I think it would be useful to add it to the rule learner.
  • steffen Member Posts: 347  Guru
    Hello lindawu

    I understand what you are implying. Speaking as a Bayesian, you want to test whether the occurrence of an attribute (or of a specific value of an attribute) is independent of the occurrence of another attribute (or of a specific value of another attribute). This is in general a good idea, however...
    • most learners are constructed so that significant combinations are weighted more heavily than insignificant ones, which improves overall quality and reduces overfitting
    • I would not mind a model containing only insignificant rules (in the sense of a statistical hypothesis test), as long as it delivers well-tested (!) low error rates
    So ... if you think that the quality of rule models can be improved significantly, why don't you try it out by coding it yourself :) ? You see, the RapidMiner guys have a lot of stuff to do, so the best way to persuade them to include "new" approaches is to provide code and an example demonstrating the power of the idea.

    happy mining,

    steffen
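
For anyone who wants to try the independence test discussed above outside of RapidMiner, here is a minimal sketch in plain Python (standard library only). It is not a RapidMiner operator and not part of any RapidMiner API. The function name `fisher_exact_two_sided` and the counts in the usage example are made up for illustration. It assumes you have already counted the four cells of the 2x2 contingency table for the rule A1 => A0, and it computes the two-sided Fisher exact p-value for H0: prob(A0&A1) = prob(A0)*prob(A1).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table

                 A0    not A0
        A1        a      b
        not A1    c      d
    """
    row1, row2 = a + b, c + d      # examples with A1 / without A1
    col1 = a + c                   # examples with A0
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell,
        # given that all row and column sums are fixed.
        return comb(row1, k) * comb(row2, col1 - k) / total

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_obs = prob(a)

    # Sum the probabilities of all tables that are at most as likely as
    # the observed one (small relative tolerance for float comparison).
    p_value = sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_obs * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only:
    # 30 examples with A1 and A0, 10 with A1 only,
    # 20 with A0 only, 40 with neither.
    p = fisher_exact_two_sided(30, 10, 20, 40)
    print(f"two-sided p-value: {p:.6f}")
```

A small p-value means H0 (independence) is rejected, so the rule would count as statistically significant in the sense described above. Since the rule claims a positive association, a one-sided variant (summing only over k >= a) may be the more natural choice. The two-sided version is shown because it matches H1 as it was stated in the question.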