# Inferential Statistics - R, Python or Extension

RapidMiner Certified Analyst, Member Posts: 46 Guru
edited September 2019 in Help
As a partner, I am looking to use RapidMiner to integrate inferential statistical methods such as hypothesis testing, confidence intervals, and chi-square tests as part of a client implementation. I see there is a paid extension for this work, but given the simplicity of these methods and the unwanted burden of managing a paid subscription for only occasional use, is there a no-charge library of operators available, or do I need to leverage R or Python and create my own? In short, are there other options besides R, Python, or the paid extension? Thanks!

• Options
RapidMiner Certified Analyst, Member Posts: 46 Guru
Solution Accepted
I normally calculate the z test statistic by taking the sample mean (or median) minus the null hypothesis value (what I'm testing), all divided by the standard error, assuming the conditions of the central limit theorem hold. For the SE, I usually use the sample standard deviation divided by the square root of the sample size. I then compare this result with the critical z value (about 1.645 for a one-tailed test at a 5% significance level) to decide whether to reject or fail to reject the null hypothesis. The math is quite simple; I was just looking for a simple operator to automate the work, given how important testing our data and results is to our particular use cases. I believe I can make all of this work with your suggestions above.
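For anyone who ends up scripting this in Python instead, the calculation described above could be sketched roughly as follows. This is only an illustration using the standard library: the function name `one_sample_z_test` and the sample data are made up, and it assumes a one-tailed (upper) test on the sample mean.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_z_test(sample, null_value, alpha=0.05):
    """One-tailed (upper) z test on the sample mean.

    Hypothetical helper for illustration; assumes the sample is large
    enough for the central limit theorem to apply.
    """
    n = len(sample)
    # Standard error: sample standard deviation / sqrt(sample size)
    standard_error = stdev(sample) / math.sqrt(n)
    # Test statistic: (sample mean - null hypothesis value) / SE
    z = (mean(sample) - null_value) / standard_error
    # Critical value for the chosen significance level (one tail)
    critical_z = NormalDist().inv_cdf(1 - alpha)
    # Upper-tail p-value for the observed statistic
    p_value = 1 - NormalDist().cdf(z)
    reject_null = z > critical_z
    return z, critical_z, p_value, reject_null


# Made-up measurements, tested against a null value of 50
sample = [52.1, 49.8, 53.4, 51.0, 50.7, 52.9, 48.6, 51.8, 53.0, 50.2]
z, critical_z, p_value, reject_null = one_sample_z_test(sample, 50.0)
print(f"z={z:.3f}, critical z={critical_z:.3f}, p={p_value:.4f}, reject={reject_null}")
```

With only ten values a t-test would normally be the safer choice; the small sample here is just to keep the example short.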

• Options
Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
edited September 2019
Hi Michael,

I've just added (last Thursday) an operator called 'Compare Distributions' to the SMILE extension. It provides the KS test, chi-square test, F-test, and t-test. Would this already help?

BR,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
• Options
RapidMiner Certified Analyst, Member Posts: 46 Guru
Awesome, you're several steps ahead of me as usual. It looks like this will work, and I'll review the documentation. Could you also point me in the right direction for calculating a z test statistic?
• Options
Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
Hi Michael,

So the idea is to get the number of standard deviations from the mean? I don't think we have that yet.

But the Tukey Test in Operator Toolbox is fairly similar and, in my opinion, superior. It's defined as:

For each selected attribute a confidence of the Tukey Test is calculated. This confidence is defined as the distance between the current value to the median, divided by the distance of the lower/upper 'Tukey Test boundary' to the median.

So instead of the mean and standard deviation, we take the interquartile range and the median. The median is more robust to outliers than the mean, so I and many stats people prefer it.

Can you have a look at the Tukey Test? We could write the same thing with mean and std_dev if that's what you need.
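To make the quoted definition a bit more concrete, here is a rough Python sketch of such a score. It is not the Operator Toolbox implementation: it assumes the 'Tukey Test boundaries' are the usual fences at Q1 - 1.5·IQR and Q3 + 1.5·IQR, and the function name `tukey_confidence` is made up for illustration.

```python
from statistics import median, quantiles


def tukey_confidence(values, k=1.5):
    """Distance to the median, scaled by the distance from the median
    to the lower/upper Tukey fence (assumed here to be Q1 - k*IQR and
    Q3 + k*IQR). Scores above 1 lie outside the fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; score is undefined")
    med = median(values)
    lower, upper = q1 - k * iqr, q3 + k * iqr
    scores = []
    for v in values:
        # Use the fence on the same side of the median as the value
        boundary = upper if v >= med else lower
        scores.append(abs(v - med) / abs(boundary - med))
    return scores


# Made-up data with one obvious outlier (100)
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
for value, score in zip(data, tukey_confidence(data)):
    print(value, round(score, 3))
```

Because the quartiles and the median barely move when a single extreme value is added, the outlier gets a large score without distorting the scores of the other points, which is the robustness argument made above.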

Cheers,

Martin

- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
• Options
Member Posts: 2 Contributor I
Hello, I am trying to use the Compare Distributions operator to run the t-test, F-test, and Kolmogorov-Smirnov test, but I cannot find the significance level that is being used or where I can change it.
• Options
Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
Hi CB123,
I could be wrong here, but the operator should return the test statistic and the p-value for that statistic. There is no significance level involved, as far as I remember. Isn't the significance level only used to decide whether to reject the hypothesis for a given p-value?
Best,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
• Options
Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
Hi @CB123,

In the KS test, the KS statistic and the p-value are returned, as Dr. Martin mentioned above. What significance level do you usually use in practice?
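For reference, the two quantities mentioned above can be sketched in plain Python as below. This is only an illustration of a two-sample KS test with the textbook asymptotic p-value; the function name `ks_two_sample` is made up, and the operator's own computation may differ (for example, by applying a small-sample correction).

```python
import math


def ks_two_sample(a, b, terms=100):
    """Two-sample KS statistic D and an asymptotic p-value.

    D is the largest gap between the two empirical CDFs. The p-value
    uses the Kolmogorov series, which is an approximation that is
    most reasonable for larger samples.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past x so ties are handled together
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = math.sqrt(n * m / (n + m)) * d
    if lam == 0:
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                 for k in range(1, terms + 1))
    p_value = min(max(2 * series, 0.0), 1.0)
    return d, p_value


# Made-up samples
d, p_value = ks_two_sample([0.1, 0.4, 0.7, 1.2, 1.9], [0.9, 1.5, 2.2, 2.8, 3.1])
print(f"D={d:.3f}, p={p_value:.4f}")
```

Note that no significance level appears anywhere in the calculation; it only enters when the returned p-value is interpreted.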

The common alpha values (significance levels) of 0.05 and 0.01 are simply based on tradition.

When a p-value is less than or equal to the significance level, you reject the null hypothesis. So we take the p-value from the statistical test and compare it to the common significance levels. For example, a p-value of 0.03112 is statistically significant at an alpha level of 0.05, but not at the 0.01 level.
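Since the operator returns only the statistic and the p-value, the decision step has to be applied afterwards. A minimal Python sketch of that rule, using the p-value from the example above (the helper name `is_significant` is just for illustration):

```python
def is_significant(p_value, alpha):
    """Reject the null hypothesis when p-value <= significance level."""
    return p_value <= alpha


p_value = 0.03112
for alpha in (0.05, 0.01):
    decision = "reject" if is_significant(p_value, alpha) else "fail to reject"
    print(f"alpha={alpha}: {decision} the null hypothesis")
```

In RapidMiner the same comparison could be done on the operator's output with whatever threshold fits your use case.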

KStest http://haifengl.github.io/api/java/smile/stat/hypothesis/KSTest.html

Hope it helps.

YY
• Options
Member Posts: 2 Contributor I