- March 07, 2016
# The American Statistical Association statement on p-values

There are no statistics that inflame the passions of statisticians and scientists as does the *p* value. The *p* value is, informally, a statistic used for assessing whether a "null hypothesis" (e.g., that the difference in performance between two conditions is 0) should be taken seriously. It is simultaneously the most used and most hated statistic in all of science. Use of *p* values has been called bad science, associated with cult-like and ritualistic behaviour, and pegged as a major cause of the so-called "crisis" that many of the sciences find themselves in.

In response to the controversy over *p* values, the American Statistical Association has today taken the unprecedented step of releasing a statement regarding a consensus viewpoint on the use of *p* values. This statement represents the input of the world's top experts on the topic. The whole statement is worth reading, but I'm going to focus on their six "principles" regarding *p* values, first defining the *p* value, then describing the principles and adding my own brief commentary.

Formally, the *p* value is the probability of finding a result at least as "discrepant" as the one observed, assuming that some hypothesis is true. Small values represent more discrepant observations. As an example, consider flipping a coin 10 times and suppose we observe a single heads-up flip out of 10. Under the hypothesis that the coin is fair, we would expect five heads-up flips. One head or no heads at all is as discrepant as (or more discrepant than) what we observed, and the same applies to nine heads and ten heads. The probability of obtaining either 0, 1, 9, or 10 heads-up flips is small, assuming the coin is fair: about 0.02.

Thus, if we observe one of those outcomes, it seems clear that this represents a discrepancy from what we would expect if the coin were fair, and therefore, in some sense, represents evidence relevant to judging whether the coin is fair or not. The question has always been: *how should we use this information*? There is a rich, interesting, and ongoing philosophical debate about this question that I do not have space to engage with here. I will simply say that in much of psychology, common practice is simply to look and see whether the *p* value is less than .05 (as it is for 0, 1, 9, or 10 heads out of 10 flips) and, if so, to declare the null hypothesis false. This *is* bad science, as we’ll discuss in the context of the ASA’s six principles.

**The American Statistical Association’s six principles regarding *p* values**

**P-values can indicate how incompatible the data are with a specified statistical model.**

This principle arises from the typical definition of a *p* value and is relatively uncontroversial, with the possible exception of the phrase "how incompatible". The same *p* value does not necessarily indicate the same incompatibility, particularly for different sample sizes (e.g., a moderate discrepancy with a moderate sample size can yield the same *p* value as a very tiny discrepancy with a large sample size). But *p* values are useful for informally assessing the performance of a model and perhaps finding specific weaknesses in it.

**P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.**

There's not much to say about this except that it is true. Although *p* values are often misunderstood as "the probability of the null hypothesis", this simply isn't correct. This misunderstanding can lead to weakly supported conclusions being published.

**Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.**

One should, in general, beware of arbitrary thresholds. There are two possible issues here. First, the evidence is not substantially different on either side of the threshold. Declaring evidence for an effect when *p* = .0499 but not when *p* = .0501 is self-evidently silly. In practice, this leads to opportunistic interpretations of the evidence around the threshold. If one *wants* to argue for an effect, a value that just misses the threshold seems much more convincing than it would if the existence of the effect were problematic for one's preferred theory.

Moreover, likelihoodists and Bayesians argue that if you *want* a threshold on the evidence, *p* might be the wrong statistic for that threshold.

**Proper inference requires full reporting and transparency.**

The interpretation of statistical results can be substantially affected by information that is not conveyed in the *p* value. Many articles rush to report a *p* value that helps their argument without reporting details such as stopping rules and the extent of exploratory analysis on the data. Good, cumulative science requires that researchers report all the relevant aspects of an analysis, and possibly share the data themselves.

**A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.**

A small *p* value may simply mean that we collected a lot of data and have the resolution to see very small effects. Good scientific reporting requires a numerical, visual, or practical assessment of the effect size found in an experiment. The importance of a finding must be assessed in light of the theoretical context and is largely independent of the size of the *p* value itself. A small *p* value might revolutionize physics or represent uninteresting nuisance variation (Meehl’s “crud” factor). Careful consideration of the research context is essential.

**By itself, a *p* value does not provide a good measure of evidence regarding a model or hypothesis.**

This is likely to be surprising to many users of *p* values, because a common understanding of *p* values is that they *measure the evidence against the null hypothesis*. This is, however, an incorrect interpretation of *p* values. The distinction is rather subtle: a *p* value is evidence, but is not a *measure* of evidence. Getting a sense of the evidence requires bringing something more to the table. A frequentist needs something like the concept of power and an alternative hypothesis, because they want to ensure that the experiment provides a good test of the null hypothesis (e.g., Mayo & Spanos, 2006). A Bayesian requires an alternative hypothesis against which to compare the null (e.g., Bayes factors).

The statement by the American Statistical Association wraps up by suggesting that, given the problems that researchers have interpreting *p* values, other approaches might be entertained, including confidence intervals and Bayesian methods. These alternative approaches, of course, come with their own conceptual challenges, some discussed previously on this blog.

Hopefully, with their six principles, the American Statistical Association has laid the groundwork for a productive conversation in the research community about the future direction of methods in the sciences.
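
As a footnote to the coin example used earlier to define the *p* value, here is a minimal Python sketch of that calculation. The function name `two_sided_binomial_p` and its parameters are illustrative choices of mine, not anything taken from the ASA statement. It also assumes that "discrepant" means "at least as far from the expected number of heads as the observed count", which is one common convention and matches the fair-coin example, but is not the only possible definition.

```python
from math import comb


def two_sided_binomial_p(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Exact p value for a coin-flip count (hypothetical helper).

    Assumption: an outcome is "as or more discrepant" than the observed one
    when its distance from the expected number of heads is at least as large.
    """
    expected = flips * p_heads
    observed_gap = abs(heads - expected)
    total = 0.0
    for k in range(flips + 1):
        if abs(k - expected) >= observed_gap:
            # Binomial probability of exactly k heads under the hypothesis.
            total += comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
    return total


# One head in ten flips of a supposedly fair coin: sums P(0), P(1), P(9), P(10).
p = two_sided_binomial_p(heads=1, flips=10)
print(round(p, 4))  # 0.0215, i.e. 22/1024, the "about 0.02" in the text
```

Note that the function only reproduces the number; it says nothing about what to do with it, which is exactly where the six principles come in.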
