For too long, many scientists’ careers have been built around the pursuit of a single statistic: p<.05.
In many scientific disciplines, that’s the threshold below which study results can be declared “statistically significant,” which is often interpreted to mean that it’s unlikely the results were a fluke, a result of random chance.
But that isn’t what it actually means in practice. “Statistical significance” is too often misunderstood — and misused. That’s why a trio of scientists writing in Nature this week are calling “for the entire concept of statistical significance to be abandoned.”
Their biggest argument: “Statistically significant” or “not statistically significant” is too easily misinterpreted to mean either “the study worked” or “the study did not work.” A “true” effect can sometimes yield a p-value of greater than .05. And we know from recent years that science is rife with false-positive studies that achieved p-values of less than .05 (read my explainer on the replication crisis in social science for more).
The Nature commentary authors argue that the math is not the problem. Instead, it’s human psychology. Bucketing results into “statistically significant” and “statistically non-significant,” they write, leads to a too black-and-white approach to scrutinizing science.
More than 800 other scientists and statisticians across the world have signed on to this manifesto. For now, it seems more like a provocative argument than the start of a real sea change. Nature, for one, “is not seeking to change how it considers statistical analysis in evaluation of papers at this time,” the journal noted.
But the tides may be rising against “statistical significance.” This isn’t the first time scientists and statisticians have challenged the status quo. In 2016, I wrote about how a large group of them called for lowering the threshold to .005, making it much harder to call a result “statistically significant.” (Concurrent with the Nature commentary, the journal The American Statistician devoted an entire issue to the problem of “statistical significance.”) There’s a wide recognition that p-values can be problematic.
I suspect this proposal will be heavily debated (as is everything in science). At least this latest call for radical change does highlight an important fact plaguing science: Statistical significance is widely misunderstood. Let me walk you through it. I think it will help you understand this debate better, and help you see that there are a lot more ways to judge the merits of a scientific finding than p-values.
Wait, what is a p-value? What’s statistical significance?
Even the simplest definitions of p-values tend to get complicated, so bear with me as I break it down.
When researchers calculate a p-value, they’re putting to the test what’s known as the null hypothesis. First thing to know: This is not a test of the question the experimenter most desperately wants to answer.
Let’s say the experimenter really wants to know if eating one bar of chocolate a day leads to weight loss. To test that, they assign 50 participants to eat one bar of chocolate a day. Another 50 are commanded to abstain from the delicious stuff. Both groups are weighed before the experiment and then after, and their average weight change is compared.
The null hypothesis is the devil’s advocate argument. It states there is no difference in the weight loss of the chocolate eaters versus the chocolate abstainers.
Rejecting the null is a major hurdle scientists need to clear to prove their hypothesis. If the null stands, it means they haven’t eliminated a major alternative explanation for their results. And what is science if not a process of narrowing down explanations?
So how do they rule out the null? They calculate some statistics.
The researcher basically asks: How ridiculous would it be to believe the null hypothesis is the true answer, given the results we’re seeing?
Rejecting the null is kind of like the “innocent until proven guilty” principle in court cases, Regina Nuzzo, a mathematics professor at Gallaudet University, explained. In court, you start off with the assumption that the defendant is innocent. Then you start looking at the evidence: the bloody knife with his fingerprints on it, his history of violence, eyewitness accounts. As the evidence mounts, that presumption of innocence starts to look naive. At a certain point, jurors get the feeling, beyond a reasonable doubt, that the defendant is not innocent.
Null hypothesis testing follows a similar logic: If there are huge and consistent weight differences between the chocolate eaters and chocolate abstainers, the null hypothesis — that there are no weight differences — starts to look silly and you can reject it.
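In code, a null hypothesis test for this kind of two-group comparison might look like the sketch below. The weight-change numbers are entirely made up (the chocolate study here is hypothetical), and I'm assuming a standard two-sample t-test as the comparison method:

```python
# A minimal sketch of the chocolate experiment's null hypothesis test.
# The data is invented for illustration; the article's study is hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical weight change (kg) for 50 chocolate eaters and 50 abstainers
chocolate = rng.normal(loc=-1.0, scale=2.0, size=50)   # eaters lose ~1 kg on average
abstainers = rng.normal(loc=0.0, scale=2.0, size=50)   # abstainers stay flat on average

# Two-sample t-test: the null hypothesis says both groups have the same mean change
t_stat, p_value = stats.ttest_ind(chocolate, abstainers)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p here would mean the observed difference would be rare in a world where chocolate makes no difference; it says nothing about why the groups differ.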
You might be thinking: Isn’t this a pretty roundabout way to prove an experiment worked?
You are correct!
Rejecting the null hypothesis is indirect evidence of an experimental hypothesis. It says nothing about whether your scientific conclusion is correct.
Sure, the chocolate eaters may lose some weight. But is it because of the chocolate? Maybe. Or maybe they felt extra guilty eating candy every day, and they knew they were going to be weighed by strangers wearing lab coats (weird!), so they skimped on other meals.
Rejecting the null doesn’t tell you anything about the mechanism by which chocolate causes weight loss. It doesn’t tell you if the experiment is well designed, or well controlled for, or if the results have been cherry-picked.
It just helps you understand how rare the results are.
But — and this is a tricky, tricky point — it’s not how rare the results of your experiment are. It’s how rare the results would be in the world where the null hypothesis is true. That is, it’s how rare the results would be if nothing in your experiment worked and the difference in weight was due to random chance alone.
Here’s where the p-value comes in: The p-value quantifies this rareness. It tells you how often you’d see the numerical results of an experiment — or even more extreme results — if the null hypothesis is true and there’s no difference between the groups.
If the p-value is very small, it means the numbers would rarely (but not never!) occur by chance alone. So when the p is small, researchers start to think the null hypothesis looks improbable. And they take a leap to conclude “their [experimental] data are pretty unlikely to be due to random chance,” Nuzzo explains.
Here’s another tricky point: Researchers can never completely rule out the null (just like jurors are not firsthand witnesses to a crime). So scientists instead pick a threshold where they feel pretty confident that they can reject the null. For many disciplines, that’s now set at less than .05.
Roughly, a p of .05 means if you ran the experiment 100 times — again, assuming the null hypothesis is true — you’d see these same numbers (or more extreme results) about five times.
And one last, super-thorny concept that almost everyone gets wrong: A p<.05 does not mean there’s less than a 5 percent chance your experimental results are due to random chance. It does not mean there’s only a 5 percent chance you’ve landed on a false positive. Nope. Not at all.
Again: A p-value of less than .05 means that there is less than a 5 percent chance of seeing these results (or more extreme results), in the world where the null hypothesis is true. This sounds nitpicky, but it’s critical. It’s the misunderstanding that leads people to be unduly confident in p-values. The false-positive rate for experiments at p=.05 can be much higher than 5 percent.
Let’s repeat it: P-values don’t necessarily tell you if an experiment “worked” or not
Psychology PhD student Kristoffer Magnusson has designed a pretty cool interactive calculator that estimates the probability of obtaining a range of p-values for any given true difference between groups. I used it to create the following scenario.
Let’s say there’s a study where the actual difference between two groups is equal to half a standard deviation. (Yes, this is a nerdy way of putting it. But think of it like this: It means 69 percent of those in the experimental group show results higher than the mean of the control group. Researchers call this a “medium-size” effect.) And let’s say there are 50 people each in the experimental group and the control group.
In this scenario, you’d expect to obtain a p-value between .03 and .05 only around 7.62 percent of the time.
If you ran this experiment over and over and over again, you’d actually expect to see a lot more p-values with a much lower number. That’s what the following chart shows. The x-axis shows the specific p-values, and the y-axis shows the frequency with which you’d find them if you repeated this experiment. Look how many p-values you’d find below .001.
This is why many scientists get wary when they see too many results cluster around .05. It shouldn’t happen that often and raises red flags that the results have been cherry-picked, or, in science-speak, “p-hacked.” In science, it can be much too easy to game and tweak statistics to achieve significance.
And from this chart, you’ll see: Yes, you can obtain a p-value of greater than .05 when an experimental hypothesis is true. It just shouldn’t happen as often. In this case, around 9.84 percent of all p-values should fall between .05 and .1.
There are better, more nuanced approaches to evaluating science
Many scientists recognize there are more robust ways to evaluate a scientific finding, and they already engage in them. But somehow, these don’t currently hold as much power as “statistical significance.” They are:
- Concentrating on effect sizes (how big of a difference does an intervention make, and is it practically meaningful?)
- Confidence intervals (what’s the range of doubt built into any given answer?)
- Whether a result is a novel study or a replication (put more weight on a theory that many labs have looked into)
- Whether a study’s design was preregistered (so that authors can’t manipulate their results post-test), and whether the underlying data is freely accessible (so anyone can check the math)
- There are also alternative statistical techniques — like Bayesian analysis — that in some ways more directly evaluate a study’s results. (P-values ask the question “how rare are my results?” Bayes factors ask the question “what is the probability my hypothesis is the best explanation for the results we found?” Both approaches have trade-offs.)
The real problem isn’t with statistical significance; it’s with the culture of science
The authors of the latest Nature commentary aren’t calling for the end of p-values. They’d still like scientists to report them where appropriate, but not necessarily label them “significant” or not.
There’s likely to be debate around this strategy. Some might think it’s useful to have simple rules of thumb, or thresholds, to evaluate science. And we still need phrases in our language to describe scientific results. Erasing “statistical significance” might just confuse things.
In any case, changing the definition of statistical significance, or nixing it entirely, doesn’t address the real problem. And the real problem is the culture of science.
In 2016, Vox sent out a survey to more than 200 scientists asking, “If you could change one thing about how science works today, what would it be and why?” One of the clear themes in the responses: The institutions of science need to get better at rewarding failure.
One young scientist told us, “I feel torn between asking questions that I know will lead to statistical significance and asking questions that matter.”
She felt torn because young scientists need publications to get jobs, and under the status quo, getting publications requires statistically significant results. Statistical significance alone didn’t lead to the replication crisis. The institutions of science incentivized the behaviors that allowed it to fester.