Inferential Statistics - R, Python or Extension

michaelgloven RapidMiner Certified Analyst, Member Posts: 46 Guru
edited September 2019 in Help
As a partner, I am looking to use RapidMiner to integrate inferential statistical methods such as hypothesis testing, confidence intervals, and chi-square tests as part of a client implementation. I see there is a paid extension for this work, but given how simple these methods are, and the unwanted burden of managing a paid subscription for only occasional use, is there a no-charge library of operators available, or do I need to use R or Python and create my own? We only need a few methods, and I'd like to know whether there are other options besides R, Python, or the paid extension. Thanks!
Best Answer

  • michaelgloven RapidMiner Certified Analyst, Member Posts: 46 Guru
    Solution Accepted
    I normally calculate the z test statistic by taking the sample mean (or median) minus the null hypothesis value (what I'm testing), all divided by the standard error, assuming the conditions of the central limit theorem hold. For the SE I usually use the sample standard deviation divided by the square root of the sample size. I then compare the result with the critical z value (about 1.645 for a one-tailed test at a 5% significance level) to decide whether to reject or fail to reject the null hypothesis. The math is quite simple; I was just looking for a simple operator to automate the work, given how important testing our data and results is to our particular use cases. I believe I can make all of this work with your suggestions above.
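
For anyone who ends up scripting this in Python, the calculation described above can be sketched with the standard library alone. This is only a rough illustration, not a RapidMiner operator: the function name `one_sample_z_test` and its parameters are made up for this example, and it assumes a one-tailed (upper-tail) test on the sample mean with a sample large enough for the central limit theorem to apply.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sample_z_test(sample, mu0, alpha=0.05):
    """One-tailed (upper) z test of H0: mean == mu0 (hypothetical helper)."""
    n = len(sample)
    # Standard error: sample standard deviation / square root of sample size.
    se = stdev(sample) / sqrt(n)
    # Test statistic: (sample mean - null hypothesis value) / standard error.
    z = (mean(sample) - mu0) / se
    # Critical value of the standard normal, about 1.645 for alpha = 0.05.
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Upper-tail p-value for the observed statistic.
    p_value = 1 - NormalDist().cdf(z)
    return z, z_crit, p_value, z > z_crit


z, z_crit, p_value, reject = one_sample_z_test(
    [52, 48, 55, 51, 49, 53, 50, 54, 47, 56], mu0=50
)
print(f"z={z:.3f}, critical z={z_crit:.3f}, p={p_value:.4f}, reject H0: {reject}")
```

Inside RapidMiner, something along these lines could presumably be run from a Python scripting operator, but that wiring is not shown here.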

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    edited September 2019
    Hi Michael,

    I just added (last Thursday) an operator called 'Compare Distributions' to the SMILE extension. It provides the KS test, chi-square test, F-test, and t-test. Would this already help?

    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • michaelgloven RapidMiner Certified Analyst, Member Posts: 46 Guru
    Awesome, you're several steps ahead of me as usual. It looks like this will work, and I'll review the documentation. Could you also point me in the right direction for calculating a z test statistic?
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi Michael,

    So the idea is to get the number of standard deviations from the mean? I don't think we have that yet.

    But the Tukey Test in Operator Toolbox is fairly similar and, in my opinion, superior. It's defined as:

    For each selected attribute a confidence of the Tukey Test is calculated. This confidence is defined as the distance between the current value to the median, divided by the distance of the lower/upper 'Tukey Test boundary' to the median.

    So instead of the mean and standard deviation we take the interquartile range and the median. The median is more robust to outliers than the mean, so I and many statisticians prefer it.
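
To make the definition above concrete, here is a rough Python sketch of that ratio. It is not the Operator Toolbox implementation. It assumes the 'Tukey Test boundaries' are the conventional Tukey fences, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, which the operator may or may not use, and the function name `tukey_confidence` and the multiplier `k` are invented for this example.

```python
from statistics import median, quantiles


def tukey_confidence(values, k=1.5):
    """Distance to the median divided by the distance from the median
    to the lower/upper Tukey fence (hypothetical helper)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    med = median(values)
    iqr = q3 - q1
    # ASSUMPTION: conventional Tukey fences; the operator may differ.
    lower, upper = q1 - k * iqr, q3 + k * iqr
    scores = []
    for v in values:
        boundary = upper if v >= med else lower
        span = abs(boundary - med)
        # Guard against constant data, where the fence equals the median.
        scores.append(abs(v - med) / span if span > 0 else 0.0)
    return scores


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
for value, score in zip(data, tukey_confidence(data)):
    print(value, round(score, 3))
```

Under these assumptions, a score above 1 means the value lies outside the fence on its side of the median, which is how the extreme value 100 shows up in this sample.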

    Can you have a look at the Tukey Test? We could write the same thing with mean and standard deviation if that's what you need.


    Cheers,

    Martin


  • CB123 Member Posts: 2 Contributor I
    Hello, I am trying to use the Compare Distributions operator to run the t-test, F-test, and Kolmogorov-Smirnov test, but I cannot find which significance level is being used or where I can change it.
    Thank you in advance
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi CB123,
    I may be wrong here, but the operator should return the test statistic and its p-value. There is no significance level involved, as far as I remember. Isn't the significance level only used to decide whether to reject the hypothesis for a given p-value?
    Best,
    Martin
  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Hi @CB123,

    In the KS test, the KS statistic and p-value are returned, as Dr Martin mentioned above. What significance level do you usually use in practice?

    The common alpha values (significance levels) of 0.05 and 0.01 are simply based on tradition.

    When a p-value is less than or equal to the significance level, you reject the null hypothesis. So you take the p-value from the statistical test and compare it to the common significance levels. For example, a p-value of 0.03112 is statistically significant at an alpha level of 0.05, but not at the 0.01 level.
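
Since the operator returns only the statistic and the p-value, the comparison described above is something you apply yourself afterwards. A minimal Python sketch of that decision rule follows; the function name `significant_at` is made up for this example and is not part of RapidMiner or SMILE.

```python
def significant_at(p_value, alphas=(0.05, 0.01)):
    """For each alpha, report whether H0 is rejected (p-value <= alpha)."""
    return {alpha: p_value <= alpha for alpha in alphas}


print(significant_at(0.03112))  # {0.05: True, 0.01: False}
```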

    KStest http://haifengl.github.io/api/java/smile/stat/hypothesis/KSTest.html

    Hope it helps.

    YY
  • CB123 Member Posts: 2 Contributor I
    Thank you very much for your answers!
    My problem is that I am trying to automate the steps of the t-test and F-test, and I need more than the p-value: I also need the T and F statistics and the critical region.
    Is there any way to calculate columns using the F and t distributions, like in Excel?

    Thank you!