The P-Value controversy - What's a practitioner to do?
The recent controversy on p-values left many of us who work with data wondering what to do. Should we abandon p-values altogether and switch instead to reporting confidence intervals and effect sizes? Or should we go back to the basics and make sure we fully understand what p-values mean and how they should actually be applied?
There are valid arguments to be made on both sides of the p-value divide. Assuming we still hold some faith in the goodness of the p-values, how do we re-calibrate our approach to using them?
First, we need to understand some of the history behind p-values in order to get some proper context. As clarified by Regina Nuzzo, the concept of the p-value was introduced in the 1920s by Fisher "simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look". However, over the years, p-values became the "bottom line" to a study (to borrow terminology employed by Steven Novella) - the end of the road rather than a promising beginning.
The notion of p-value as the "bottom line" for a study is interesting because it forces us to think about what needs to happen both before and after we draw that line.
Before we draw the "bottom line" for a study, we must remember that the p-value itself is an estimate, so its reliability depends on a variety of factors: an adequate study design, an appropriately selected sample, a set of validated and clean data collected from that sample, a statistical analysis appropriate for the research question of interest, a sufficiently large sample size, etc. This is what prompted Simonsohn to advise scientists to be transparent and "admit everything": how they determined their sample size, all data exclusions (if any), all data manipulations and all outcome measures used in the study. All of this information provides the context others need to judge whether or not they can trust the p-values reported in scientific publications.
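To see why the p-value behaves like an estimate, consider a small simulation. The sketch below (illustrative only, with made-up parameters: a true effect of half a standard deviation, a two-sample t-test, and 1,000 simulated studies) shows how often a real effect of fixed size reaches p < 0.05 at different sample sizes. The same underlying truth produces wildly different p-value behaviour depending on how many observations we collect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def significant_fraction(n, effect=0.5, trials=1000, alpha=0.05):
    """Fraction of simulated studies with p < alpha when a true
    effect of the given size exists (i.e., the empirical power)."""
    hits = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n)          # no effect group
        treated = rng.normal(effect, 1.0, n)       # true effect = 0.5 SD
        p = stats.ttest_ind(control, treated).pvalue
        hits += p < alpha
    return hits / trials

# The same true effect, three different sample sizes per group:
for n in (10, 30, 100):
    print(f"n = {n:3d} per group: "
          f"{significant_fraction(n):.0%} of studies reach p < 0.05")
```

With small samples, most of these simulated studies "fail" despite the effect being real, which is exactly why a single p-value should not be read in isolation from the study's design and size.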
After we draw the "bottom line" for a study, we need to bear in mind that the p-value requires us to make a decision and draw a conclusion: "Based on a p-value of 0.001 for our test, we decide to reject the null hypothesis Ho in favour of the alternative hypothesis Ha and we conclude that the data in our study provide strong evidence against the null hypothesis Ho." If we remember to focus on ourselves as agents responsible for making decisions and drawing conclusions based on evidence provided by the data, we will avoid the trap of believing we can play God and verify which of the null and alternative hypotheses is true. In 1925, Fisher himself claimed that the p-value indicates the strength of evidence against the null hypothesis. Years later, in 1955, he further claimed that "significance tests can falsify but never verify hypotheses".
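The decision-making framing above can be made concrete in a few lines. This is a minimal sketch, assuming hypothetical data from two groups (the numbers are made up) and a conventional significance level of 0.05 chosen in advance; the point is that the analyst, not the p-value, makes the call.

```python
from scipy import stats

# Hypothetical measurements from two groups (made-up numbers):
group_a = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]
group_b = [4.6, 4.8, 4.5, 4.9, 4.7, 4.6, 4.8, 4.7]

alpha = 0.05  # significance level chosen *before* seeing the data
result = stats.ttest_ind(group_a, group_b)

# We, as agents, decide and conclude; the p-value only quantifies
# the strength of evidence against Ho -- it never verifies Ha.
if result.pvalue < alpha:
    print(f"p = {result.pvalue:.4f}: we decide to reject Ho and "
          "conclude the data provide evidence against Ho.")
else:
    print(f"p = {result.pvalue:.4f}: we fail to reject Ho; the data "
          "do not provide strong evidence against Ho.")
```

Note that neither branch claims a hypothesis is true: in keeping with Fisher's 1955 remark, the test can falsify but never verify.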
If our data provide strong evidence against the null hypothesis Ho, we are in that promising situation where our findings are worthy of a second look. Something interesting may be going on and the only way to know whether this is the case is to try and replicate the findings of our study. We may be able to replicate these findings ourselves by conducting a second study or, most likely, others will get intrigued by our findings and proceed to conduct similar studies. In the latter situation, it is imperative for us to make it easy for others to replicate our study by adopting good practices advocated by proponents of reproducible research.
If our data fail to provide strong evidence against the null hypothesis Ho, we need to reflect on what may be at play (e.g., a sample size that was too small, a study design that was inadequate, a research question that needs to be refined, an outcome that needs to be reformulated, a fruitless research direction that needs to be aborted).
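One of those reflections, "was my sample size too small?", can be checked quantitatively with a power analysis. The sketch below assumes statsmodels is available and uses illustrative inputs (a hypothesized medium effect of 0.5 SD, 80% desired power, alpha of 0.05); the effect size in particular is an assumption you would need to justify for your own study.

```python
from statsmodels.stats.power import TTestIndPower

# Illustrative inputs -- the effect size is an assumption:
effect_size = 0.5   # hypothesized standardized difference (Cohen's d)
alpha = 0.05        # significance level
power = 0.80        # desired probability of detecting a real effect

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size,
                                   alpha=alpha, power=power)

print(f"Roughly {n_per_group:.0f} subjects per group are needed "
      f"to detect d = {effect_size} with {power:.0%} power.")
```

If the study enrolled far fewer subjects than this calculation suggests, a non-significant p-value says more about the study's power than about the hypothesis.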
No study should be interpreted in isolation, just as no number (p-value included) should be interpreted in isolation.