Statistical Significance

ema · June 2009

Hi all,

I am running a regular classification validation, shown below:
<operator name="Root" class="Process" expanded="yes">
  <description text="#ylt#p#ygt#This process is very similar to the process #yquot#03_XValidation_Numerical.xml#yquot#. The basic process setup is exactly the same, i.e. the first inner operator must produce a model from the given training data set and the second inner operator must be able to handle this model and the test data and must provide a PerformanceVector. #ylt#/p#ygt# In contrast to the previous process we now use a classification learner (J48) which is evaluated by several nominal performance criteria.#ylt#/p#ygt# #ylt#p#ygt# The cross validation building block is very common for many (more complex) RapidMiner processes. However, there are several more validation schemes available in RapidMiner which will be dicussed in the next sample processes. #ylt#/p#ygt#"/>
  <parameter key="logfile" value="C:\knn.txt"/>
  <operator name="TextInput (4)" class="TextInput" expanded="no">
    <list key="texts">
      <parameter key="b" value=".."/>
      <parameter key="P" value=".."/>
    </list>
    <parameter key="default_content_encoding" value="utf8"/>
    <parameter key="default_content_language" value="utf8"/>
    <parameter key="prune_below" value="3"/>
    <list key="namespaces">
    </list>
    <parameter key="create_text_visualizer" value="true"/>
    <operator name="StringTokenizer (4)" class="StringTokenizer">
    </operator>
    <operator name="TokenLengthFilter (4)" class="TokenLengthFilter">
      <parameter key="min_chars" value="3"/>
    </operator>
  </operator>
  <operator name="XValidation (3)" class="XValidation" expanded="yes">
    <operator name="NearestNeighbors" class="NearestNeighbors">
      <parameter key="k" value="3"/>
      <parameter key="measure_types" value="NumericalMeasures"/>
      <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <operator name="OperatorChain (3)" class="OperatorChain" expanded="yes">
      <operator name="ModelApplier (3)" class="ModelApplier">
        <list key="application_parameters">
        </list>
      </operator>
      <operator name="ClassificationPerformance (3)" class="ClassificationPerformance">
        <parameter key="accuracy" value="true"/>
        <parameter key="classification_error" value="true"/>
        <parameter key="kappa" value="true"/>
        <parameter key="weighted_mean_recall" value="true"/>
        <parameter key="weighted_mean_precision" value="true"/>
        <parameter key="spearman_rho" value="true"/>
        <parameter key="kendall_tau" value="true"/>
        <parameter key="absolute_error" value="true"/>
        <parameter key="relative_error" value="true"/>
        <parameter key="relative_error_lenient" value="true"/>
        <parameter key="relative_error_strict" value="true"/>
        <parameter key="normalized_absolute_error" value="true"/>
        <parameter key="root_mean_squared_error" value="true"/>
        <parameter key="root_relative_squared_error" value="true"/>
        <parameter key="squared_error" value="true"/>
        <parameter key="correlation" value="true"/>
        <parameter key="squared_correlation" value="true"/>
        <parameter key="cross-entropy" value="true"/>
        <parameter key="margin" value="true"/>
        <parameter key="soft_margin_loss" value="true"/>
        <parameter key="logistic_loss" value="true"/>
        <list key="class_weights">
        </list>
      </operator>
    </operator>
  </operator>
</operator>

My question is: other than XValidation, does RapidMiner have any way to calculate "statistical significance"?

Thank you

land · June 2009

Hi Ema,
RapidMiner provides operators for checking whether one set of results is statistically significantly better than another; you will find them in the Validation / Significance group. Specifically, there is an ANOVA operator and a T-Test operator for comparing performance vectors.

Is that what you were looking for?

Greetings,
Sebastian
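
For readers who want to see what such a comparison involves outside RapidMiner, here is a minimal Python sketch of a paired t-test on per-fold accuracies. It is not the implementation of RapidMiner's T-Test operator; the function name paired_t_statistic and the fold accuracies are made up for illustration. Only the t statistic is computed, and it is compared against the standard two-sided 5% critical value for 9 degrees of freedom (2.262).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t statistic for two equally long score lists."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    # Work on the per-fold differences between the two learners.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Hypothetical accuracies from a 10-fold cross-validation (made-up numbers).
learner_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
learner_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.77, 0.79, 0.81, 0.78, 0.80]

T_CRIT_DF9 = 2.262  # two-sided critical value, alpha = 0.05, 9 degrees of freedom
t = paired_t_statistic(learner_a, learner_b)
print(f"t = {t:.3f}, significant at the 5% level: {abs(t) > T_CRIT_DF9}")
```

Keep in mind that cross-validation folds share training data, so the fold results are not fully independent and a test like this should be read as a rough guide only.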

lindawu · February 2010

Hi,

I just downloaded RapidMiner and was impressed by all the data mining methods in it. However, is there another way to test significance, such as Fisher's test? For example, consider a rule:

A1=> A0, i.e., prob(A0|A1) > prob(A0)

we can rewrite it as

prob(A0|A1) *prob(A1) > prob(A0) * prob(A1)
prob(A0&A1) > prob(A0)*prob(A1)

Therefore, we can test hypothesis H0

H0: prob(A0&A1) = prob(A0)*prob(A1)

against alternative hypothesis H1

H1: prob(A0&A1) != prob(A0)*prob(A1)

If H0 cannot be rejected, then A1 => A0 is not a statistically significant rule.

Is there any functionality for this test?

land · February 2010

Hi,
where would you like to add this feature? Should it apply to Association Rules or the Rule model? Testing general data mining models could be a little difficult with that, since we don't have a probability there. Or am I misunderstanding something?

Greetings,
Sebastian

lindawu · March 2010

Yes, I think it would be useful to add it to the rule learner.

steffen · March 2010

Hello lindawu

I understand what you are implying. Speaking as a Bayesian, you want to test whether the occurrence of an attribute (or a specific value of an attribute) is independent of the occurrence of another attribute (or a specific value of another attribute). This is in general a good idea, however...

Most learners are constructed so that significant combinations are weighted more heavily than insignificant ones, which improves overall quality and reduces overfitting.
I would not mind a model containing only insignificant rules (in the sense of a statistical hypothesis test), as long as it delivers well-tested (!) low error rates.

So ... if you think that the quality of rule models can be improved significantly, why don't you try it out by coding it yourself? You see, the RapidMiner guys have a lot of stuff to do, so the best way to persuade them to include "new" approaches is to provide code and an example demonstrating the power of the idea.

happy mining,

steffen
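
As a starting point for anyone who wants to try this, the independence test described earlier in the thread (H0: prob(A0&A1) = prob(A0)*prob(A1)) can be sketched as Fisher's exact test on a 2x2 table of counts. This is a minimal standalone Python sketch, not RapidMiner code; the function names and the example counts are hypothetical. The cells are a = rows with A1 and A0, b = rows with A1 but not A0, c = rows with A0 but not A1, d = rows with neither.

```python
from math import comb


def table_probability(k, row1, row2, col1):
    """Hypergeometric probability that the top-left cell equals k,
    given fixed row totals (row1, row2) and first-column total col1."""
    return comb(row1, k) * comb(row2, col1 - k) / comb(row1 + row2, col1)


def fisher_exact(a, b, c, d):
    """Return (two_sided_p, one_sided_p) for the 2x2 table [[a, b], [c, d]].

    The one-sided value is for the alternative that A1 and A0 co-occur
    more often than independence predicts, i.e. prob(A0|A1) > prob(A0).
    """
    row1, row2, col1 = a + b, c + d, a + c
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Probability of every table that has the same margins as the observed one.
    probs = {k: table_probability(k, row1, row2, col1)
             for k in range(k_min, k_max + 1)}
    p_obs = probs[a]
    # Two-sided: sum all tables that are at most as likely as the observed one
    # (the small factor guards against floating-point ties).
    two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    # One-sided: tables with at least as many joint occurrences as observed.
    greater = sum(p for k, p in probs.items() if k >= a)
    return min(two_sided, 1.0), min(greater, 1.0)


# Hypothetical counts for a rule A1 => A0 over 100 examples (made-up numbers).
two_sided_p, one_sided_p = fisher_exact(a=30, b=20, c=10, d=40)
print(f"two-sided p = {two_sided_p:.6f}, one-sided p = {one_sided_p:.6f}")
```

A small p-value means H0 (independence) is rejected, so the rule would count as statistically significant in the sense discussed above. Enumerating all tables is fine for moderate counts; for very large data sets a chi-square test would be the usual approximation.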
