
# computing 1-p_value in Weight by Chi Squared Statistic operator

Hi,

An option to compute 1 - p_value as the weight for an attribute in the above operator, as an alternative to the weight given by the chi square statistic value for the same attribute, would be very useful. A switch to choose between 1 - p_value and the statistic itself, for all the input attributes, would be ideal.

With this facility, one can select the attributes for which there is evidence, from the statistical reasoning point of view, that they are not independent with respect to the label attribute. Indeed, one would choose the computation of 1-p_value as a weight per attribute in the above operator, and then would select all the attributes whose weight is at least 0.95.

Moreover, this facility would allow a clear indication, which is statistically supported, whether or not the input attributes are likely to have predictive power with respect to the label attribute. For example if all the input attribute weights (calculated as 1-p values, so as complements of p values) were under let us say 0.4 in a dataset, then the classification models one would try to build would likely perform poorly, since the data is consistent with the hypothesis that the input attributes are independent with respect to the label attribute. It is not possible to say this based on the chi square statistic values alone. These need to be converted into p values first (or, as suggested, into complements of p values) for more insight on the dataset to mine.

So the weights computed as complements of the p values from Pearson's chi square statistical test can in many cases signal that a dataset is inappropriate for a given classification problem (saving time spent trying to build various poorly performing models in an attempt to find a good one that is actually unlikely to exist). When the dataset is appropriate, these weights can differentiate attributes for which there is statistical evidence that they are not independent of the label attribute (corresponding to large complements of p values), so that they can be used in the process of building the model. Moreover, sorting attributes according to the complements of p values as weights is similar to sorting attributes according to the less meaningful chi square statistic value weights (that is, one can choose the top k attributes as usual, etc). So why not compute the weights also as the complements of the p values in the Weight by Chi Squared Statistic operator, or simply add a new - Weight by Chi Square Complement p Value - operator?
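To make the proposal concrete, here is a minimal sketch in plain Python (not RapidMiner code) of how such a weight could be computed from a contingency table of attribute values against label classes. The function names `chi2_sf` and `chi2_weight` and the example counts are assumptions made for illustration only; the p-value uses the closed-form upper tail of the chi square distribution for integer degrees of freedom.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability P(X >= stat) for chi square with integer df >= 1."""
    x = stat / 2.0
    if df % 2 == 0:
        # Even df: exp(-x) * sum_{j < df/2} x^j / j!
        return math.exp(-x) * sum(x**j / math.factorial(j) for j in range(df // 2))
    # Odd df: erfc term plus terms with half-integer gamma values
    k = (df - 1) // 2
    return math.erfc(math.sqrt(x)) + math.exp(-x) * sum(
        x**(j + 0.5) / math.gamma(j + 1.5) for j in range(k))

def chi2_weight(table):
    """table[i][j] = number of examples with attribute value i and label class j.

    Returns (statistic, degrees of freedom, 1 - p_value).
    Assumes every row and column total is non-zero.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df, 1.0 - chi2_sf(stat, df)

# Hypothetical counts: a binary attribute against a binary label.
stat, df, weight = chi2_weight([[10, 20], [20, 10]])
print(stat, df, weight)
```

With these made-up counts the weight comes out above 0.95, so the attribute would pass the selection rule described above; a perfectly balanced table such as `[[10, 10], [10, 10]]` gives a weight of 0.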

Dan

## Answers

849MavenThe Wikipedia entry on P-values http://en.wikipedia.org/wiki/P-value is quite explicit... As the article explains.. So could you explain a bit further the benefits of the operator you propose, because I'm sure I'm missing something?

Many thanks

106MavenReading an introductory statistics book would help to clarify the fundamentals of statistical reasoning for you.

Statisticians use p values rather than their complements (1-p_value), obviously. One rationale of proposing complements of p values as alternative weights in the mentioned RM operator is suggested from the following equivalence that holds for a given degrees of freedom value of the chi square distribution:

bigger chi square statistic <-> smaller p value <-> bigger complement of p value

Practically speaking, RM already employs the chi square statistic whose values are seen as weights that can be used to select desirably the best input attributes in a classification problem. For instance you may pick up the top 10 attributes with the highest chi square statistic values to do your analysis with, as input attributes. Instead of this you may pick the top 10 attributes with the highest complements of p values. The complements of p values as weights may do a similar (if not a better) job to that of the chi square statistic weights.
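Note that the equivalence above is stated for a fixed number of degrees of freedom. The sketch below (plain Python, with the hypothetical helper `chi2_sf` and made-up statistic values) illustrates both cases: for fixed degrees of freedom the p-value decreases as the statistic grows, whereas for attributes with different degrees of freedom the ranking by statistic and the ranking by complement of p value can disagree.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability P(X >= stat) for chi square with integer df >= 1."""
    x = stat / 2.0
    if df % 2 == 0:
        return math.exp(-x) * sum(x**j / math.factorial(j) for j in range(df // 2))
    k = (df - 1) // 2
    return math.erfc(math.sqrt(x)) + math.exp(-x) * sum(
        x**(j + 0.5) / math.gamma(j + 1.5) for j in range(k))

# Fixed degrees of freedom: larger statistic, smaller p-value.
p_fixed = [chi2_sf(s, 2) for s in (1.0, 4.0, 9.0)]
print(p_fixed)

# Different degrees of freedom (hypothetical attributes):
# attr_b has the larger statistic but also the larger p-value.
p_a = chi2_sf(6.0, 1)   # e.g. binary attribute, binary label
p_b = chi2_sf(7.0, 4)   # e.g. 3-valued attribute, 3 classes
print(1.0 - p_a, 1.0 - p_b)
```

So ranking by the raw statistic would place the second attribute first, while ranking by the complement of the p value would place the first attribute first.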

However p-values can do more, as explained initially, and for a clear understanding of the explanations provided before about the usefulness of the light extension that was proposed, you may need some proper understanding of the foundations of the statistical reasoning - so I advise you to read a good foundation book in statistics first before tackling the subject further.

Finally, to conclude: obviously, 1-p_value is not the probability that the alternative hypothesis is true. It is the complement of the probability that the chosen statistic (seen as a random variable) is greater than or equal to the value of the statistic computed from the data sample, assuming that the null hypothesis is true. If p_value <= 0.05 (or equivalently the complement of the p_value >= 0.95) then the null hypothesis is rejected (and implicitly the alternative hypothesis is accepted) at the 0.05 level of significance. When the p_value is bigger than the threshold of 0.05 (or equivalently the complement of the p value is smaller than 0.95 - and an example value was chosen as 0.4) then this is an indication that the data sample is consistent with the null hypothesis. This situation corresponds to smaller values of the chi square statistic, or equivalently to smaller weights for the attributes, as computed by RM's mentioned operator. And since you have done some work in Data Mining, you know that especially when your data has a high dimensionality (and not only then) you would wish to select attributes with higher weights as computed by this operator for instance (which correspond to higher complements of p values) to get a good model built by employing a part of your dataset only. [[Here the null hypothesis was: an input attribute is independent w.r.t. the class attribute. The alternative hypothesis was the negation of the null hypothesis.]]
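The decision rule in the paragraph above can be written compactly. The symbols are notation introduced here for illustration: the observed statistic is $\chi^2_{\mathrm{obs}}$, the degrees of freedom are $\nu$, the significance level is $\alpha$, and $w$ is the proposed weight.

```latex
p = \Pr\!\left(\chi^2_{\nu} \ge \chi^2_{\mathrm{obs}} \,\middle|\, H_0\right),
\qquad w = 1 - p,
\qquad \text{reject } H_0 \iff p \le \alpha \iff w \ge 1 - \alpha .
```

With $\alpha = 0.05$ this gives the threshold $w \ge 0.95$ used in the text.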

Dan

849MavenA dirty dozen: twelve p-value misconceptions.

Goodman S.

Source

Departments of Oncology, Epidemiology, and Biostatistics, Johns Hopkins Schools of Medicine and Public Health, Baltimore, MD, USA. Sgoodman@jhmi.edu

Abstract

The P value is a measure of statistical evidence that appears in virtually all medical research papers. Its interpretation is made extraordinarily difficult because it is not part of any formal system of statistical inference. As a result, the P value's inferential meaning is widely and often wildly misconstrued, a fact that has been pointed out in innumerable papers and books appearing since at least the 1940s. This commentary reviews a dozen of these common misinterpretations and explains why each is wrong. It also reviews the possible consequences of these improper understandings or representations of its meaning. Finally, it contrasts the P value with its Bayesian counterpart, the Bayes' factor, which has virtually all of the desirable properties of an evidential measure that the P value lacks, most notably interpretability. The most serious consequence of this array of P-value misconceptions is the false belief that the probability of a conclusion being in error can be calculated from the data in a single experiment without reference to external evidence or the plausibility of the underlying mechanism.

At least I know how little I know, being rude advertises ignorance.

:-*

1,751RM Founderplease stay calm and fair - asking that somebody should first acquire basic knowledge before starting a (from my point of view meaningful and useful) discussion with you is certainly not a good style of discussion. Haddock certainly is knowing what he is talking about!

Ok, let's come back to the discussion instead of insulting each other, please...

Cheers,

Ingo

106Maven@Ingo: My posting was limited to arguments on the subject, and was clearly fair. Moreover, I have made a pertinent recommendation to someone, in this case Haddock, to read suitable relevant material since he seemed interested in the subject, and seemed to need such a recommendation (no offence). If he finds inconvenient to be recommended a material that would help to improve the own knowledge on the subject, then I can do nothing about it. It is common on this forum to recommend people to follow introductory tutorials or documentation, when it seems they would need it and benefit from. There is nothing rude in this and I see Haddock himself did such recommendations to people not familiar with RM or with its documentation, on a number of occasions.

“Haddock certainly is knowing what he is talking about!”

No offence, perhaps he was less convincing on this occasion, at least the question in the posting suggested so. The extension of RapidMiner operator I suggested was based on a statistical tool which is fundamental for decision tree algorithms as CHAID and QUEST, whose details Haddock didn’t seem to be aware of. CHAID and QUEST depend heavily on the use of p values and Pearson’s chi square test, and the same test plus ANOVA and Levene F- tests, respectively. Moreover the main ideas in these algorithms capture the much simplified idea the operator extension I suggested is based on. So it is certain that knowing and understanding a bit of the mechanism of CHAID or QUEST would make my proposed operator extension seem trivially clear. However these algorithms (or the simple extension I proposed) can be understood better only assuming some good understanding of statistical tests, thus the recommendation I made for consulting a good introductory Stats book. Finally, statistical tests (including p values) are current standard in Stats, are studied at least by Maths/Stats students in all the universities, and are certainly used by two of the major players in the commercial Data Mining software as IBM SPSS Modeler and SAS Enterprise Miner (and not only by them) - see for instance the implementations of CHAID and QUEST. RapidMiner too uses the chi square statistic (in a limited way, unfortunately) which is an inseparable component element of the Pearson’s test mentioned above, as the concept of p value is.

For Haddock and for those interested in details on these algorithms, various good documentation is available also online, including from SAS Institute, IBM SPSS (which use them also in their statistical software). For instance these show how p values are employed in selecting predictors / input attributes:

http://support.spss.com/productsext/spss/documentation/statistics/algorithms/14.0/TREE-CHAID.pdf

http://support.spss.com/productsext/spss/documentation/statistics/algorithms/14.0/TREE-QUEST.pdf

@Haddock: Childish manner to end a posting (with that emoticon), for a respected veteran of this forum ...

Regarding the paper you quoted above (written by a medical doctor and researcher in oncology, according to his webpage), indeed, it illustrates usual problems researchers in medicine may have with using Statistics properly, in particular statistical tests (and p values as a component concept). These frequent problems of improper use are encountered in other scientific communities whose members are large statistics consumers (e.g. Social Sciences). There has been some controversy regarding the pluses and minuses of statistical tests (and p values), much of it supported also by poor understanding and improper application of these tools, as illustrated by the paper you cite. However statistical tests (and p values) are part of the standard in the field of statistical inference (and these tools are what students in Maths/Stats from Harvard, Cambridge, and everywhere else, are currently taught) and remain so as there is no widely accepted better approach.

I am sure we are all busy with our work and/or study of Data Mining so let’s focus on this and on related subjects, only, on this forum.

Dan

849MavenSteven N. Goodman, M.D., M.H.S., Ph.D., is Professor of Oncology in the Division of Biostatistics of the Johns Hopkins Kimmel Cancer Center, with appointments in the Departments of Pediatrics, Biostatistics and Epidemiology in the Johns Hopkins Schools of Medicine and Public Health. Dr. Goodman received a B.A. from Harvard, an M.D. from NYU, trained in Pediatrics at Washington University in St. Louis, received his M.H.S. in Biostatistics, and his Ph.D. in Epidemiology from Johns Hopkins University. He served as co-director of the Johns Hopkins Evidence-Based Practice Center, is on the board of directors of the Society for Clinical Trials, was co-director of the Baltimore Cochrane Center from 1994 to 1998, and is on the core faculties of the Johns Hopkins Berman Bioethics Institute, the Center for Clinical Trials, the Graduate Training Program in Clinical Investigation and the Johns Hopkins Center for the History and Philosophy of Science. He is the editor of Clinical Trials: Journal of the Society for Clinical Trials, has been Statistical Editor of the Annals of Internal Medicine since 1987 and for the Journal of General Internal Medicine from 1999 to 2000. He has served on a wide variety of national panels, including the Institute of Medicine's Committee on Veterans and Agent Orange, and is currently on the IOM Committee on Vaccine Safety, the Medicare Coverage Advisory Commission, and the Surgeon General's committees to write the 2001 and 2002 reports on Smoking and Health. He currently chairs a panel assessing the long-term outcomes of assisted reproductive technologies, established by the Genetics and Public Policy Institute and sponsored by the American Society for Reproductive Medicine and the American Academy of Pediatrics (AAP). He represents the AAP on the Medical Advisory Panel of the National Blue Cross/Blue Shield Technology Evaluation program, and served as a consultant to the President's Advisory Commission on Human Radiation Experiments.
He has published over 90 scientific papers, and writes and teaches on evidence evaluation and inferential, methodological, and ethical issues in epidemiology and clinical research.

Tough call !

1,751RM FounderDan, I do not have any problem with your recommendation to read more material per se. I just meant that starting your first answer to Haddock's reaction with this recommendation might heat the discussion too much. At this point of time, Haddock just has asked for more information and pointed out the fact that there indeed is some discussion about the usefulness of this measure. I got your point that this question of his actually was the reason for your recommendation but please understand that Haddock - who just wanted to start a fruitful discussion with you - probably looks for more than the answers a) read more material and b) it's also used by others hence it has to be good.

Haddock, your last answer did also not really help to calm things down. So please: If you have the feeling that the discussion is not giving you the expected information, ask again or simply ignore it.

Ok, back to the original topic: I don't have any problem with the suggested extension at all. It is quite straightforward and might help you and others. One of my major concerns with this statistic is the fact that it only takes into account a single feature at a time and does not look at feature subsets. A single feature might not explain anything, a combination of two or more features might explain everything. The most simple and prominent example probably is the XOR-function. Hence, the statement "For example if all the input attribute weights (calculated as 1-p values) were under let us say 0.4 in a dataset, then the classification models one would try to build would likely perform poorly" is not necessarily true.
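The XOR case can be checked numerically. The following is a minimal sketch in plain Python, not RapidMiner code; the helper names (`chi2_sf`, `contingency`, `weight`) and the toy dataset of 100 rows are assumptions made for illustration. Each binary feature on its own is independent of the label, while the pair treated as a single four-valued attribute determines it completely.

```python
import math
from collections import Counter

def chi2_sf(stat, df):
    """Upper-tail probability P(X >= stat) for chi square with integer df >= 1."""
    x = stat / 2.0
    if df % 2 == 0:
        return math.exp(-x) * sum(x**j / math.factorial(j) for j in range(df // 2))
    k = (df - 1) // 2
    return math.erfc(math.sqrt(x)) + math.exp(-x) * sum(
        x**(j + 0.5) / math.gamma(j + 1.5) for j in range(k))

def contingency(values, labels):
    """Count table: one row per distinct attribute value, one column per class."""
    counts = Counter(zip(values, labels))
    distinct_values = sorted(set(values))
    classes = sorted(set(labels))
    return [[counts[(v, c)] for c in classes] for v in distinct_values]

def weight(table):
    """1 - p_value of Pearson's chi square test of independence."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
               / (row_tot[i] * col_tot[j] / n)
               for i in range(len(row_tot)) for j in range(len(col_tot)))
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return 1.0 - chi2_sf(stat, df)

# Toy data: 25 copies of each (a, b) combination, label = a XOR b.
rows = [(a, b) for a in (0, 1) for b in (0, 1)] * 25
labels = [a ^ b for a, b in rows]

w_a = weight(contingency([a for a, _ in rows], labels))   # feature a alone
w_b = weight(contingency([b for _, b in rows], labels))   # feature b alone
w_ab = weight(contingency(rows, labels))                  # the pair (a, b)
print(w_a, w_b, w_ab)
```

Both single-feature weights come out as 0, yet the joint weight is practically 1, which is the situation in which low single-attribute weights do not imply a poor model.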

However, this is true for almost all other feature evaluation schemes as well and nevertheless many people (including me) find them useful in certain applications. In fact, there are literally hundreds of operators I wouldn't use since I believe (by knowledge or experience and sometimes even by prejudice) that there are better alternatives. But at the same time those are exactly the operators used a lot by others and it's good that they are part of RM.

What do you mean by that? That there is some error or that the statistic is not available at other places as well? Do you have some recommendation where you think it is missing in this case?

Yes, please let's do so!

Cheers,

Ingo

106MavenThanks for your comments. That's rather a busy period, but I'll get back on your points.

Thanks,

Dan

106MavenA few remarks regarding your points. Got your point. Note however that you omitted one essential aspect: there was more than a) and b). Primarily some technical details and explanations for the main idea (i.e. to use p value complements as weights to select predictive attributes) had been provided in the initial posting (including justifications based on Math Statistics). If the lengthy posting was insufficient, then certainly some intro Stats reading would have helped I guess.

Regarding your point b) above, perhaps the best remark I can make here is that most of the software appearing in the upper half of the results of KDnuggets' 2010 poll http://www.kdnuggets.com/polls/2010/data-mining-analytics-tools.html regarding the use of Data Mining/analytics tools makes use of statistical tests and p values in its algorithms. I refer here to:

R, Excel, Statsoft Statistica, SAS (Stats), SAS Enterprise Miner, IBM SPSS Statistics, IBM SPSS Modeler, Matlab, Microsoft SQL Server, Oracle Data Mining, Weka.

Notable names from this upper half poll result that seem not to make use of p values (yet) include RapidMiner and KNIME. Perhaps I will come back on this list. However, when there is such an omnipresent use of statistical tests and p values, even those of us that would not have a background in Mathematical Statistics and/or Computer Science (in order to better understand them and their use in Data Mining algorithms) are likely to realise that there must be something good about these statistical concepts. Obviously why p values are good to use has primarily been justified with other arguments than just saying: others use them. Sounds good, thanks. Obviously in feature selection, the only proven best method is that in which all subsets of input attributes are evaluated. Since this method is extremely impractical, we come to the use of heuristics (for the non computer scientists on the forum, heuristics are algorithms that provide approximate or partial solutions, that are computationally cheaper, so they may be good alternatives to a computationally expensive method that would provide a complete, exact solution). So yes, the heuristic based on chi-square test considers one attribute at a time, which sometimes may prove to be a disadvantage, although in practice it works very well. But all the heuristics do have their disadvantages, don’t they?, including your preferred ones, that’s why they are just heuristics. Moreover, it is hard to demonstrate that a heuristic in feature selection performs better than another one in all circumstances. Method A can work better than method B on a dataset, and worse than method B on another dataset. In particular I doubt that you could demonstrate that your favourite feature selection heuristic gives a better solution than the chi-square test heuristic on each dataset. That would have been an outstanding research paper I guess

In practice the best would be to possibly try some feature selection heuristics and stick to one that works fast enough and provides a good result in that particular problem. I often use the statistical tests (chi-square) to select the best features, and this works very well for most of my problems. In addition to the support it gets from its theoretical foundation, one notable plus of this heuristic is that it is very cheap computationally, so very fast.

People interested in details regarding this method may want to have a look in Han’s Data Mining book (for newcomers in the field, this is one of the most used textbooks in Data Mining university courses, and popular among practitioners - tools users and tools implementers). In the Data Pre-processing chapter, where the chi-square statistical test is presented, it says “the ‘best’ (and ‘worst’) attributes are typically determined using tests of statistical significance ”. This obviously refers to the statistical tests; moreover significance here is related to the so called significance levels (typically 0.05 or 0.01), that are thresholds for the p value.

In a next posting I will probably refer to another very popular book in (Statistical) Data Mining and Machine Learning, which is, no doubt, known by most of you guys on this forum, especially if you have a Computer Science or Math Stats background – it’s Hastie’s book. That’s an excellent reading, and there we can see again statistical tests and p values at work. Hopefully we will finally see p values at work in RM too. I mean that in RM the chi square statistic could be better used together with p values in feature weighting / selection as discussed so far, and primarily in the implementation of decision tree algorithms as CHAID and QUEST for instance (along with other statistical tests and their own statistics measures).

Regards,

Dan

537MavenWhere exactly should Rapid Miner display p-values?

In the weights by Chi Squared Statistic?

Like for the iris data set, it should have an extra column, with p-value?

a2 0.0

a1 0.28633153931967253

a3 0.8971556723299764

a4 1.0

Best regards,

Wessel

106MavenSorry for my late reply. For the calculation of the p-value one should consider non-normalised weights (yours seem to be normalised). In addition, the number of distinct values of the nominal or discretized numeric attribute for which we compute the p-value, and the number of classes, need to be taken into account in the calculation.
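A small sketch of why normalised weights are not enough, in plain Python. It assumes, purely for illustration, that the normalisation is a min-max rescaling to [0, 1] (the posted output ranging from 0.0 to 1.0 is consistent with that, but this is an assumption about the operator, not a documented fact), and the raw statistic values and the helper names `chi2_sf` and `min_max` are made up. Two quite different sets of raw statistics give the same normalised weights but very different p-values.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability P(X >= stat) for chi square with integer df >= 1."""
    x = stat / 2.0
    if df % 2 == 0:
        return math.exp(-x) * sum(x**j / math.factorial(j) for j in range(df // 2))
    k = (df - 1) // 2
    return math.erfc(math.sqrt(x)) + math.exp(-x) * sum(
        x**(j + 0.5) / math.gamma(j + 1.5) for j in range(k))

def min_max(values):
    """Assumed normalisation: rescale so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw chi square statistics for three attributes in two datasets.
raw_weak = [0.5, 1.0, 2.0]
raw_strong = [50.0, 100.0, 200.0]
print(min_max(raw_weak), min_max(raw_strong))   # same normalised weights

# Degrees of freedom: (distinct attribute values - 1) * (classes - 1),
# here assumed to be (3 - 1) * (3 - 1) = 4 for every attribute.
df = (3 - 1) * (3 - 1)
p_weak = [chi2_sf(s, df) for s in raw_weak]
p_strong = [chi2_sf(s, df) for s in raw_strong]
print(p_weak, p_strong)
```

In the first dataset no attribute comes anywhere near significance, in the second all of them do, although the normalised weights are identical.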

I will post an example.

Regards,

Dan

106MavenApologies for not having responded to all queries asked here; it may have been because of a hectic schedule in that period.

I referred, in this topic, to some books that may be useful to anybody who is looking to get more knowledge and better understanding of Data Mining. I promised also to recommend some good Data Mining books to one of the users having posted here, Haddock, who, despite showing an authoritative position on this subject, did not seem to have sufficient knowledge of Statistical Data Mining (in particular regarding Data Mining algorithms using p-values).

These are among the best Data Mining books, providing solid foundations in the field. Here are the titles. Enjoy!

Best,

Dan

Note: All these books describe, among many other popular data mining techniques, also techniques using p-values and/or significance levels (which are threshold values for p-values)

1. Introduction to Data Mining, by Tan, Steinbach and Kumar, (Addison Wesley)

Courses tackling Statistical Aspects of Data Mining and based on the above book (and the use of R) have been taught at Stanford and disseminated through Google Tech Talks- see recorded sessions at http://www.youtube.com/watch?v=zRsMEl6PHhM

-------

2. Data Mining Concepts and Techniques, by Han, Kamber and Pei, (Elsevier)

One of the most popular books in university Computer Science courses, and among researchers

-------

3. Data Mining - Practical Machine Learning Tools and Techniques, by Witten, Frank, and Hall, (Elsevier)

Again, one of the excellent and most popular books in university Computer Science courses, from the authors that produced also Weka

-------

4. The Elements of Statistical Learning: Data Mining, Inference and Prediction, by Hastie, Tibshirani and Friedman, (Springer)

One of the most popular books in university Statistics and Computer Science courses, and one of the most praised by researchers too.

A copy can be downloaded for free (great, indeed!) from the authors' webpages at the Dept of Statistics at Stanford University

-------

5. Data Mining Techniques, by Linoff and Berry, (Wiley)

One of the most popular books among Data Mining professionals (also used in some university courses), written by very respected guys with long hands-on experience in Data Mining

849Mavenhttp://www.graphpad.com/support/faqid/1317/

106MavenBy the way, have you read any of the books indicated above? As a data miner, it's good to read at least one of these. It would be very beneficial for your general expertise in the field. Especially when you don't have a background in Computer Science (as it may be the case with you), you may need to read a good foundation Data Mining textbook.

Anyway, this would be good before expressing yourself authoritatively on this Data Mining forum.

849Maven106MavenHaddock, thanks for finally expressing yourself. I did not make such a statement. This is basic thing. Read my posts again.

And perhaps read at least one introductory book from those I recommended you above, unless you already did so (in which case I am so curious why you do not want to tell us about it). This may help you find out about and understand concepts used in the content of the current topic.

849Maven106MavenIt does not surprise me since I know you need to improve fundamentally your expertise in this area before you are able to talk in depth about such topics. OK, your many stars show you are good at playing with RapidMiner (you are good at clicking on buttons I think), but your comments show you lack fundamental knowledge in Data Mining.

With no degree in Computer Science (I bet) and no book read out of a "must-read" list of Data Mining books, you seem to be a fake Data Mining "guru" on this forum, behaving badly on quite many occasions with newcomers. So I reiterate my advice: read at least one serious Data Mining book and after that come back and criticise users here.

Next lesson for you will follow shortly here http://rapid-i.com/rapidforum/index.php/topic,5823.0.html ;

Dan

849MavenI went on a RapidMiner course, I take it you think that was rubbish as well.

537Maven"Its true what I'm saying because the other guy knows nothing and I know all".

Would be nice if you could make all future posts on topic and interesting to read.

Best regards,

Wessel

849MavenPlease read my very first post in this miserable thread - I pointed out that Wikipedia said this approach had dangers, ever since then this **** has banged on about how little I know. That's absolutely irrelevant. It doesn't address what is written in Wikipedia, and it reflects very badly on RM that nobody else flags up the 1-P fallacies. Please read your comment again - is it on point, or are you just being pompous?

1,993RM Engineering@dan_, haddock:

I've watched your personal feud now for some time in the hope that you two would calm yourselves down and refrain from using personal insults and provocations. Sadly this did not happen so I guess I have to make this very clear: This forum is for people to talk about Rapid-I products and help each other out when questions arise. It is NOT a place for insults, slander, arrogance and other nonsense. Nobody is required to like everybody but everybody is required to behave and control himself. Future open and hidden insults or provocations by either side will not be tolerated.

Regards,

Marco

106MavenThe probability computed by the t-test mentioned above (the so called p-value) is not the probability that the null hypothesis is right or wrong.

According to [Ross, Introductory Statistics, Academic Press, 2010] or any other Stats book the p-value is the probability for the test statistic to be beyond some values (computed using the data sample), assuming the null hypothesis was true. When the test statistic is in the critical region (or equivalently, the p-value is below a threshold called significance level), the null hypothesis is rejected as it is judged to be inconsistent with the data sample.

There is another fallacy in which the expression 1-p appears, where p is a p-value. Since the expression 1-p appears also in my posts, Haddock made a wrong/superficial connection between this fallacy and the idea that I had exposed in this topic. It's wrong to put the label "fallacy" whenever one sees 1-p (just because there exists some fallacy about 1-p). Such confusions are possible when statistical inference and tests are not sufficiently understood (although statistical tests are used in Data Mining - for instance in decision tree algorithms like CHAID and QUEST, etc). My backgrounds in Computer Science and Mathematical Statistics help me to avoid such fallacies: I did not state in my posts that 1-p is the probability that the alternative hypothesis is true, as Haddock suggested. This kind of error is certainly not made by statisticians.

@Haddock I am ready any time to discuss data mining with you. It is regrettable however that you use such language when you run out of data mining arguments. It's good though that you attended a RapidMiner course. Obviously it is hard to make such brief courses comprehensive. If you want more, get one of the books I recommended in the list. In particular this will also show you how p-values are used in Data Mining.