Will P-Value Triumph over Abuses and Attacks?

The null hypothesis significance testing procedure (NHSTP) was devised to guide scientific researchers in deciding whether an observed difference between comparative groups is due to chance or whether there is a significant effect. However, prolific use of NHSTP by non-statisticians in many quantitative studies has resulted in widespread misinterpretations, misuses and abuses. We briefly recall a recent ban on the p-value, and summarize the official statement written by the American Statistical Association (ASA), which explains what the p-value is, what it is not, and how to interpret and use it correctly.


Introduction
In this review, we present in a non-statistician's language the fundamental statistical inferential methodology known as the null hypothesis significance testing procedure (NHSTP). It purports to answer the question: "When we observe some differences between comparative groups, could they have arisen by chance even though there is no real effect, or is there a significant effect?" To guide scientific researchers in answering this question, statisticians have devised with extreme care the NHSTP. Specifically, they quantify the weight of evidence in the entire data against the null hypothesis of no effect in one number, the p-value; but they do so only after they have followed a long list of safeguards to ensure the proper use of NHSTP.
In Section 2, we describe the genesis of the p-value, its definition, correct interpretation and proper use. In Section 3, we address some widespread misinterpretations of the p-value, arising usually out of incomplete knowledge, but sometimes out of deeply held beliefs to the contrary; and we equip the reader to counter such misinterpretations. Next, in Section 4, we mention the pitfalls of misuses and abuses of the p-value. In Section 5, we recall a recent drastic action by one journal to ban the use of the p-value in its publications; and we summarize the reactions of academics to the ban. Section 6 presents a summary of the policy statement written by the American Statistical Association (ASA), a statement that:
a. Expounds the principles that declare what the p-value is and is not,
b. Lists approaches that can serve as alternatives to NHSTP and the p-value, and
c. Highlights some features of good statistical (and scientific) practice.
In Section 7, we conclude the paper by answering the question in its title, and outline what every practicing statistician must do to secure the rightful place of NHSTP and the p-value. Specifically, we should not only report a significant p-value, but also report all related issues such as which model is adopted, what assumptions are made, whether the data support these assumptions, how the data are collected, and the list of all hypotheses tested and p-values computed, including those that are not significant. We must supplement the p-value with descriptive and graphical summaries of the data and interval estimates of parameters; and we must disclose the achieved power of the NHSTP after adjusting for multiple testing, if any.

Genesis of the p-value
The first known use of a p-value was in 1770 by Pierre-Simon Laplace, who studied over half a million births and concluded that there is an unexplained effect leading to an excess of boys compared to girls [1]. The concept of the p-value was formally introduced as a methodology by Karl Pearson [2] in the context of Pearson's chi-squared test, designed to decide whether the observed difference between sets of categorical data can be attributed to chance alone. The method became known as NHSTP. Thereafter, Ronald Fisher popularized the NHSTP in a wide range of contexts in the 1920's and 1930's in his books [3,4]. Ever since its inception, statisticians have debated its proper use and interpretation. See the list of references in [5].
In the Neyman-Pearson formulation, one first bounds the probability of type I error (rejecting the null hypothesis H0 when it is true) by a threshold α; among such tests, the preferred test is the one that maximizes the power, 1 − β, which is the probability of rejecting H0 when a particular alternative hypothesis holds. By choosing the sample size sufficiently large, one can ensure that the probability of type II error, when the effect size is a specified practically important amount, is also reasonably low (or, equivalently, that the power is sufficiently high). However, as the adage says, "There ain't no such thing as a free lunch." For any fixed sample size, as one sets α lower, one simultaneously makes β larger! Therefore, the threshold α ought to depend on the relative costs of making the two types of error. (For example, in medicine the rationale is that it is better to tell a healthy patient "we may have found something; let's test further" than to tell a diseased patient "all is well." By contrast, in criminology it is preferable to release a guilty person than to convict an innocent person.) Nevertheless, in practice, in an overwhelming number of cases, regardless of the sample size, α is taken to be .05.
There is nothing sacrosanct about .05; but it continues to prevail, perhaps because Fisher proposed 1 in 20 as a reasonable threshold, even though he further commented that the threshold could as well be 1 in 50, or 1 in 100.
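The trade-off between α and β described above can be illustrated with a small simulation. The sketch below is a minimal illustration in plain Python, assuming a hypothetical setup: a two-sided one-sample z-test of H0: μ = 0 with known σ = 1, a true effect of 0.4, and n = 25. It estimates the power at two thresholds:

```python
import random
from statistics import NormalDist, mean

random.seed(0)
nd = NormalDist()

def power_mc(alpha, mu=0.4, n=25, sims=20_000):
    """Monte Carlo power of a two-sided z-test of H0: mu = 0 (sigma known, = 1)."""
    z_crit = nd.inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(sims):
        xbar = mean(random.gauss(mu, 1) for _ in range(n))
        z = xbar * n ** 0.5  # (xbar - 0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / sims

power_05 = power_mc(alpha=0.05)
power_01 = power_mc(alpha=0.01)
print(f"power at alpha = .05: {power_05:.2f}")
print(f"power at alpha = .01: {power_01:.2f}")  # smaller alpha, larger beta = 1 - power
```

At a fixed sample size, tightening α from .05 to .01 visibly lowers the estimated power, that is, raises β: exactly the "no free lunch" trade-off.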

Interpretation of the p-value
Other than falling on one side of the threshold α or the other, leading to a decision to reject H0 or not to reject it, the magnitude of the p-value itself conveys how incompatible the data are with H0: the smaller the p-value, the stronger the evidence against the null hypothesis.
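As a concrete illustration of this interpretation, a p-value can be approximated by locating the observed test statistic within a simulated null distribution. The sketch below is a plain-Python illustration with a small hypothetical sample, assuming a two-sided one-sample t-test of H0: μ = 0 on a normal population; it estimates the p-value by simulation rather than from t-tables:

```python
import random
from statistics import mean, stdev

random.seed(1)

def t_stat(xs):
    """One-sample t statistic for H0: mu = 0."""
    n = len(xs)
    return mean(xs) / (stdev(xs) / n ** 0.5)

# Hypothetical sample of n = 10 measurements.
data = [0.8, 1.2, -0.3, 0.5, 1.1, 0.2, 0.9, 1.4, -0.1, 0.7]
t_obs = t_stat(data)

# Approximate the null distribution of t by drawing many samples of the
# same size from N(0, 1), i.e., from a world in which H0 holds.
sims = 20_000
null_ts = [t_stat([random.gauss(0, 1) for _ in range(len(data))])
           for _ in range(sims)]

# Two-sided p-value: fraction of null t values at least as extreme as t_obs.
p_value = sum(abs(t) >= abs(t_obs) for t in null_ts) / sims
print(f"t = {t_obs:.2f}, simulated p-value = {p_value:.4f}")
```

A small p-value here says only that a t statistic this far into the tails is rare when H0 holds; it is not the probability that H0 is true.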

Misinterpretations of p-value and how to dispel them
Over the years, the p-value has become like a litmus test for establishing statistical significance (or departure from H0) in almost all quantitative disciplines such as biology, chemistry, clinical trials, criminology, economics, education, engineering, finance, marketing research, medicine, physics, political science, psychology, and social science. Unfortunately, in the hands of otherwise well-meaning scientists who are statistically untrained, the p-value has often been misinterpreted and misused. Many resources, including internet sites and even some textbooks, give wrong interpretations of the p-value, causing unsuspecting readers to fall prey to misusing it; so much so that it has been a matter of considerable controversy. In 2014, statistician and science writer Regina Nuzzo wrote [6]: "The p-value was never meant to be used the way it's used today."

Biostatistics and Biometrics Open Access Journal
We won't make an exhaustive list of possible misinterpretations and misuses of the p-value. Instead, we mention only the two most common misinterpretations. The second most common misinterpretation is that the p-value is the probability that one will mistakenly reject a true H0. With this misinterpretation, when the p-value is small the misuser is lulled into believing that, the chance of making an error being small, it is highly likely that he is not making an error by rejecting H0.
When H0 holds, the sampling distribution of the t statistic, based on a random sample from a normally distributed population, is the so-called t-distribution with n − 1 degrees of freedom (DF). Its density function, like that of the standard normal density, is symmetric and unimodal around zero; but its peak is less tall and its tails are thicker than the corresponding parts of the standard normal density when the DF is small. Moreover, the t-density approaches the normal density as the DF increases. Therefore, when H0 holds true, most scientists would obtain a t statistic (in absolute value) close to zero; but some would obtain values in the tails because of randomness in the data! A similar explanation exists for the null distribution of any test statistic.
In practice, a scientist obtains only one sample and hence only one value of the test statistic. How can the scientist determine whether the observed value of the test statistic is near the center of the null distribution, or in its tails? We need a measure of incompatibility with H0 and compatibility with the alternative hypothesis; the p-value serves as such a measure.
Some scientists do not even report non-significant findings, for they assume reporting such findings will be in vain. In fact, many journals suffer from publication bias, for they publish only statistically significant results; and they decline to publish non-significant results or results that reproduce a previous finding, arguing that the latter two are not novel in appeal.
Regrettably, published significance turns out to be spurious all too often; and other scientific teams cannot reproduce it.
Some researchers conclude that they have "discovered" significance simply because they have satisfied the bar "p-value ≤ .05." However, they may have done so by cherry-picking promising findings, a practice also known as data dredging, significance chasing and p-hacking. This is an abuse of the NHSTP. Willy-nilly application of multiple testing based on the same data (for example, doing post hoc pairwise comparisons after an analysis of variance) without adjusting the test-wise probability of type I error inflates the overall probability of type I error. In such multiple testing scenarios, individual p-values are misleading, unless the test-wise α is adjusted downwards so that only the highly statistically significant results are caught. In addition, a given study may be sufficiently powered to detect a certain effect size when only one test is to be made; but it may lack sufficient power to detect the same effect size if several tests are to be performed.
To prevent abuse of NHSTP, post hoc discovery of an effect, which was not initially planned for, is not an acceptable method of establishing a scientific truth; at best, it can serve as a basis for designing a follow-up research study. This is why a pharmaceutical company must provide to the U.S. Food & Drug Administration (FDA) a detailed protocol before a clinical trial is carried out. If the data fail to reject the null hypothesis proposed in the protocol, but they point to some other new finding, the FDA will not accept such a finding. The company must conduct another clinical trial to establish their claim.
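The inflation of the overall type I error under multiple testing, and the effect of adjusting the test-wise α downwards, can be checked directly by simulation. The sketch below is a plain-Python illustration of a hypothetical setup: 20 independent z-tests with every null hypothesis true; it estimates the family-wise error rate with and without a Bonferroni adjustment.

```python
import random
from statistics import NormalDist

random.seed(2)
nd = NormalDist()
m, alpha, sims = 20, 0.05, 10_000  # number of tests, nominal level, simulation runs

def any_false_rejection(per_test_alpha):
    """m independent z-tests with all nulls true: is any one (falsely) rejected?"""
    z_crit = nd.inv_cdf(1 - per_test_alpha / 2)
    return any(abs(random.gauss(0, 1)) > z_crit for _ in range(m))

fwer_raw = sum(any_false_rejection(alpha) for _ in range(sims)) / sims
fwer_bonf = sum(any_false_rejection(alpha / m) for _ in range(sims)) / sims
print(f"family-wise type I error, unadjusted: {fwer_raw:.2f}")
print(f"family-wise type I error, Bonferroni: {fwer_bonf:.2f}")
```

With 20 tests at test-wise α = .05, the chance of at least one false "discovery" is roughly 1 − .95^20 ≈ .64; dividing α by the number of tests (the Bonferroni adjustment) restores the overall error rate to about .05.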
Furthermore, a small p-value, by itself, does not indicate the importance of a finding. We give three examples: A drug can have a statistically significant effect on patients' blood cholesterol levels without having any therapeutic effect. A vitamin may have a statistically significant effect on average life expectancy; but the estimated one-day extension of life expectancy is hardly of any practical significance. In a designed experiment involving many factors, some higher-order interactions may turn out to be statistically significant; but a simpler model, which assumes such higher-order interactions are mere noise, may fit the data quite adequately.
In 2005, John Ioannidis wrote [9]: "It is misleading to emphasize the statistically significant findings of any single team. What matters is the totality of the evidence." What also matters is the totality of the choices made by the scientist: the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed.
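The vitamin example above, statistical significance without practical significance, can be made concrete. The sketch below is a plain-Python illustration with hypothetical numbers (a true one-day gain in life expectancy, σ = 400 days, two-sided z-test); it shows the same trivial effect flipping from "non-significant" to "significant" purely because the sample grows:

```python
from statistics import NormalDist

nd = NormalDist()

def z_pvalue(effect, sigma, n):
    """Two-sided p-value of a z-test when the true mean difference is `effect`."""
    z = effect / (sigma / n ** 0.5)
    return 2 * (1 - nd.cdf(abs(z)))

# Hypothetical vitamin study: a one-day extension of life expectancy.
p_small_n = z_pvalue(effect=1.0, sigma=400.0, n=10_000)
p_large_n = z_pvalue(effect=1.0, sigma=400.0, n=1_000_000)
print(f"n = 10,000:    p = {p_small_n:.3f}")   # far from significant
print(f"n = 1,000,000: p = {p_large_n:.3f}")   # "significant", yet still one day
```

The p-value measures evidence against H0, not the size or importance of the effect; the estimated effect and its uncertainty must be reported alongside it.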

A ban on p-value and reactions to the ban
In March 2015, editors David Trafimow and Michael Marks of Basic and Applied Social Psychology (BASP) made an unprecedented, drastic decision to ban the use of p-values, as well as confidence intervals (CIs), in their journal [10]. Instead, BASP requires strong descriptive statistics, including effect sizes, encourages the presentation of frequency or distributional data when feasible, and also encourages the use of larger sample sizes (although they stop short of requiring particular sample sizes). They argue, "The NHSTP has dominated psychology for decades; we hope that by instituting the first NHSTP ban, we demonstrate that psychology does not need the crutch of the NHSTP, and that other journals follow suit."
Although the controversy has been looming since 1960 [11], the BASP ban on the p-value shocked statisticians and created quite a fuss among researchers. The Royal Statistical Society solicited letters from academics to express how they felt about the ban. These letters all tell a similar story: p-values are prone to misuse and misinterpretation; we need to be more careful about how we design and interpret the results of our experiments; but we must not throw out the entire NHSTP.
Within two months of the BASP ban, three British psychologists wrote [12]: "CIs offer an as-yet undeveloped but potentially very valuable tool for psychologists to interpret their data." They point out that the reason for the original development of NHSTP (along with CIs of effect sizes) was to guide researchers on how they should act in the future based on whether they found a real effect or not. What guidelines should researchers follow to make such fundamental decisions if CIs and NHSTP are banned? Furthermore, while supporting BASP's recommendation for large sample sizes to increase the precision of the estimates, they argue that reporting that precision through CIs should be required, rather than forbidden.
The ban was so radical that for the first time in its 175 years of existence, the American Statistical Association (ASA) Board took a position on a specific matter of statistical practice, and developed a policy statement on the p-value and statistical significance. A team of over two dozen prominent statisticians took nearly a year to create this policy statement. With this statement, the ASA hopes to shed light on an aspect of Statistics that is too often misunderstood and misused in the broader research community, and to open a fresh discussion and draw renewed and vigorous attention to changing the practice of science with regard to the use of statistical inference. The full ASA statement is found in [5]. Below we give only a brief summary.

Summary of the ASA Statement
The ASA statement begins with the definition of the p-value as we already gave in Subsection 2.2 above. Then it proposes six principles that can improve the conduct and interpretation of quantitative science; next, it mentions some other approaches as alternatives to the p-value and NHSTP; and finally, it concludes with a list of traits of good statistical practice. For the readers' benefit, we summarize them below.

Other approaches
Approaches other than the p-value and NHSTP include methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence such as likelihood ratios or Bayes factors; and decision-theoretic modeling and false discovery rates. All these measures and approaches rely on further assumptions; but they may more directly address the size of an effect (and its associated uncertainty), or whether a hypothesis is tenable.
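As a minimal illustration of the estimation-oriented alternatives, the sketch below (plain Python, a hypothetical sample, using the large-sample normal approximation rather than the exact t-based interval) computes a 95% confidence interval for a mean:

```python
from statistics import NormalDist, mean, stdev

def ci95(xs):
    """Normal-approximation 95% confidence interval for the population mean."""
    n = len(xs)
    half_width = NormalDist().inv_cdf(0.975) * stdev(xs) / n ** 0.5
    center = mean(xs)
    return center - half_width, center + half_width

# Hypothetical measurements.
sample = [4.9, 5.3, 5.1, 4.7, 5.4, 5.0, 5.2, 4.8, 5.1, 5.0]
lo, hi = ci95(sample)
print(f"estimate = {mean(sample):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Unlike a bare reject/retain decision, the interval reports both the estimated effect and its precision, which is what the letter writers cited in [12] argued should be required rather than forbidden.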

Features of good statistical practice
Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting, and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.

Summary and Conclusion
The concept of NHSTP and the p-value was designed to answer the question: "When we observe some differences in the comparative groups, could they have arisen by chance even though there is no real effect, or is there a significant effect?" It was meant to guide the researcher on how to proceed in future: to continue to subscribe to the null hypothesis of no effect, or to switch allegiance and subscribe to the alternative hypothesis of a significant effect. Whatever the recommended decision, the scientist must acknowledge the potential for committing one type of error or the other, even though the probabilities of committing such errors are held below reasonable bounds. The NHSTP was not meant to establish the truth one way or another, nor is it supposed to substitute for the scientific task of explaining why the effect is there or not there. Therefore, the p-value deserves neither super-glorification nor outright denouncement.
We are optimistic that the answer to the question in the title of this paper is affirmative. When proper safeguards are taken to apply the NHSTP correctly, the p-value performs its designated task just fine. Therefore, banning its use by one journal will not cause its demise. However, to let the p-value secure its rightful place, first we must carefully ensure the following:
a. The sample size is large enough;
b. The sample is random; and
c. There is no bias.
Then we must disclose all choices made during the formulation of hypotheses, based on experts' scientific judgment about their plausibility and the results of similar studies. Next, we must choose an appropriate experimental design to collect relevant data. Finally, we must report a comprehensive set of inferential statistics, including supporting evidence for all assumptions. In the case of multiple testing, we must adjust the test-wise error rates to control the overall probability of type I error and to correctly identify which p-values are statistically significant. When the statistician's work is over, we must let the scientific experts wrestle with the scientific justifications of the statistical findings. Onward with the responsible use of NHSTP!