Statistical Significance
CraigBostonUSA
Employee, Member Posts: 34 RM Team Member
What are some good rules of thumb, online resources, or measurement standards for knowing when you have enough data, a healthy sample size, and statistical significance?
(Embedded post: "Simulating 6,000 Die Rolls - Visualization Created with R (source code included - see comments) [OC]" from r/dataisbeautiful)
Best Answer
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
Well, this is a very interesting question. I'm sure you are aware that there is a major debate raging in the world of statistics these days over the core concepts of statistical significance between frequentists, Bayesians, and their various offshoots. It seems like these debates were just getting acrimonious 15 years ago when I was studying applied statistics in graduate school, and they've only gotten more heated and bitter ever since, as a simple web search will demonstrate! So I'd be very careful wading into those types of discussions without a clear understanding of what the frame of reference is for the question.
Having said that, here are a couple of simple online calculators that do classic power and sample-size calculations based on the key parameters you are trying to estimate (a short code sketch of the same idea follows the links):
http://powerandsamplesize.com/Calculators/
https://www.stat.ubc.ca/~rollin/stats/ssize/n2.html
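If you'd rather script it, the same kind of calculation those calculators do is available in Python's statsmodels. This is just a minimal sketch; the effect size, alpha, and power values are placeholders to show the mechanics, not recommendations:

```python
# Required sample size per group for a two-sample t-test.
# The inputs below are illustrative placeholders.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,   # Cohen's d you expect to detect (assumed)
    alpha=0.05,        # significance level
    power=0.8,         # desired power (1 - beta)
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:.0f}")
```

With those particular inputs you get roughly 64 examples per group; plug in your own expected effect size and desired power.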
In my opinion, the biggest problem with predictive modeling projects these days isn't usually with sample sizes that are too small--it's with poorly framed business questions (sometimes accompanied by overly-academic concern for classical statistical tests). With modern data collection practices, unless you are working in the field of outlier detection or prediction of very rare events, you are likely to have samples that are "large enough" (at least by classical statistics standards).
Here are a few additional rules of thumb drawn from my own experience (note: not academic research into statistical significance, just what has and hasn't worked on my projects). If you have several hundred observations (per class if you are doing classification, or overall for numerical prediction), with the number of attributes << the number of examples, that should be sufficient for the traditional ML algorithms. Some of the more "modern" (i.e., computationally intensive) ML algorithms can be more data-hungry (deep learning being a prime example), and you may need orders of magnitude more input data to get well-fitting models. Other input data "problems" (e.g., high matrix sparsity, multi-collinearity, severe heteroscedasticity) can also complicate things; likewise having many more attributes than cases, which can lead to overfitting with many classical techniques.
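For example, here is a rough, purely illustrative way to check those thresholds on a pandas DataFrame. The function name, cutoff values, and toy data are my own assumptions, not a standard:

```python
# Illustrative sanity check for the rules of thumb above
# (thresholds are rough numbers from this post, not hard statistical limits).
import pandas as pd

def sample_size_check(df: pd.DataFrame, label_col: str,
                      min_per_class: int = 300, min_ratio: float = 10.0) -> None:
    """Print class counts and the examples-to-attributes ratio, with warnings."""
    n_examples, n_columns = df.shape
    n_attributes = n_columns - 1  # everything except the label column
    class_counts = df[label_col].value_counts()

    print("Examples per class:")
    print(class_counts.to_string())
    print(f"Examples / attributes ratio: {n_examples / n_attributes:.1f}")

    if (class_counts < min_per_class).any():
        print(f"Warning: at least one class has fewer than {min_per_class} examples.")
    if n_examples / n_attributes < min_ratio:
        print("Warning: attribute count is large relative to the number of examples.")

# Toy example with made-up data:
df = pd.DataFrame({"f1": range(1000), "f2": range(1000),
                   "label": ["yes", "no"] * 500})
sample_size_check(df, label_col="label")
```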
It is also important to remember that there are diminishing marginal returns to massive datasets when building scorecards: much more demand on computation time and hardware, with very few offsetting gains. For instance, if there is a robust signal-vs-noise pattern you want to capture, it is very unlikely that you would need hundreds of thousands of input examples when thousands will demonstrate the same pattern adequately. Sampling should be used liberally when the initial dataset is very large (setting aside issues related to balancing the sample when classes are very imbalanced, which is another matter).
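One easy way to see this effect for yourself is a learning curve: fit the same model on growing subsamples and watch the validation score flatten out. Here is a sketch using scikit-learn; the synthetic dataset, model choice, and subsample sizes are all just illustrative:

```python
# Learning curve on synthetic data: validation accuracy vs. training set size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=20000, n_features=20,
                           n_informative=10, random_state=42)

train_sizes, _, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y,
    train_sizes=np.linspace(0.05, 1.0, 8),  # 5% up to 100% of the training folds
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{n:6d} training examples -> mean CV accuracy {score:.3f}")
```

Once the curve is flat, extra rows are mostly buying you runtime, which is exactly the situation where liberal sampling pays off.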