Computing 1 - p_value in the Weight by Chi Squared Statistic operator
It would be very useful if the above operator offered 1 - p_value as the weight of an attribute, as an alternative to the weight given by the chi square statistic for the same attribute. Ideally, a button would let the user choose between 1 - p_value and the statistic itself for all the input attributes.
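To make the request concrete, the proposed weight can be written as follows. The notation is mine and not taken from the operator's documentation: $O_{rc}$ and $E_{rc}$ are the observed and expected counts in the contingency table of attribute $j$ against the label, $R_j$ is the number of distinct values of the attribute, and $C$ is the number of label classes.

```latex
\[
X_j^2 = \sum_{r=1}^{R_j} \sum_{c=1}^{C} \frac{(O_{rc} - E_{rc})^2}{E_{rc}},
\qquad
\nu_j = (R_j - 1)(C - 1),
\]
\[
p_j = P\!\left(\chi^2_{\nu_j} \ge X_j^2\right),
\qquad
w_j = 1 - p_j .
\]
```

The current operator reports a weight based on $X_j^2$; the suggestion is to optionally report $w_j$ instead.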
With this option, one can select the attributes for which there is statistical evidence that they are not independent of the label attribute. One would choose 1 - p_value as the weight per attribute in the above operator, and then select all the attributes whose weight is at least 0.95 (that is, whose p value is at most 0.05).
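The selection rule above can be sketched in plain Python. This is only an illustration under my own assumptions, not how RapidMiner computes its weights: the attribute names and data are made up, all attributes are nominal, no continuity correction is applied, and the helper names `chi2_sf` and `complement_p_weight` are hypothetical.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """P(X >= x) for a chi-square variable with integer df >= 1 (closed form)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite correction series
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range((df - 1) // 2):
        total += term
        term *= x / (2 * k + 3)
    return total


def complement_p_weight(values, labels):
    """Return 1 - p_value of Pearson's chi-square test of independence."""
    n = len(values)
    observed = Counter(zip(values, labels))
    row, col = Counter(values), Counter(labels)
    stat = 0.0
    for v in row:
        for c in col:
            expected = row[v] * col[c] / n
            stat += (observed[(v, c)] - expected) ** 2 / expected
    df = (len(row) - 1) * (len(col) - 1)
    if df == 0:
        return 0.0  # constant attribute or single class: no evidence either way
    return 1.0 - chi2_sf(stat, df)


# Made-up example data: 40 rows, binary label.
labels = ["yes"] * 20 + ["no"] * 20
attributes = {
    "outlook": ["sunny"] * 20 + ["rain"] * 20,  # perfectly associated with label
    "windy": ["true", "false"] * 20,            # evenly spread over both classes
}

weights = {name: complement_p_weight(vals, labels) for name, vals in attributes.items()}
selected = [name for name, w in weights.items() if w >= 0.95]
print(weights)
print(selected)
```

Only the attribute that is associated with the label passes the 0.95 threshold; the evenly spread one gets a weight of 0.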
Moreover, this option would give a clear, statistically supported indication of whether the input attributes are likely to have predictive power with respect to the label attribute. For example, if all the input attribute weights (calculated as 1 - p values, that is, as complements of p values) were under, say, 0.4 in a dataset, then the classification models one would try to build would likely perform poorly, since the data is consistent with the hypothesis that the input attributes are independent of the label attribute. One cannot draw this conclusion from the chi square statistic values alone, because their scale depends on the degrees of freedom, which differ with the number of distinct values of each attribute. The statistics need to be converted into p values first (or, as suggested, into complements of p values) to give more insight into the dataset to mine.
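A small sketch of why the raw statistic is hard to interpret on its own: the same chi square value corresponds to very different complements of p values depending on the degrees of freedom. The function name `weight_even_df` is hypothetical, and for brevity it only handles even degrees of freedom, where the chi square tail probability has a simple closed form.

```python
import math


def weight_even_df(stat, df):
    """Return 1 - p_value for a chi-square statistic with even df >= 2."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this sketch only supports even df >= 2")
    half = stat / 2.0
    # Tail probability: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total


# Same statistic, different degrees of freedom.
for df in (2, 8, 16):
    print(df, round(weight_even_df(10.0, df), 3))
```

With a statistic of 10, the weight is above 0.99 for 2 degrees of freedom but drops below 0.75 for 8 and lower still for 16, so an attribute with many distinct values can show a large statistic without providing real evidence of dependence.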
So the weights computed as complements of the p values from Pearson's chi square test can in many cases signal that a dataset is inappropriate for a given classification problem, saving the time spent building various poorly performing models in search of a good one that likely does not exist. When the dataset is appropriate, these weights identify the attributes for which there is statistical evidence that they are not independent of the label attribute (those with large complements of p values), so that they can be used in building the model. Moreover, sorting attributes by the complements of p values is similar to sorting them by the less meaningful chi square statistic weights (that is, one can choose the top k attributes as usual, etc.). So why not also compute the weights as the complements of the p values in the Weight by Chi Squared Statistic operator, or simply add a new operator, "Weight by Chi Square Complement p Value"?