This just in, green jelly beans cause cancer

Seth
Jul 9, 2018

If you’ve taken an introductory statistics course, you have probably learned what P-values are and then forgotten. A quick refresher: a P-value is the probability of getting data at least as extreme as yours purely by chance, assuming the treatment actually had no effect. For example, suppose you grow plants in one room with the standard prescribed light, soil, and water, and grow seeds from the same bag under identical conditions in a second room with Yeezy blasting. You can take certain metrics, such as height or dried weight, and compare them across groups. You feed the data into the appropriate test, and out comes a P-value indicating how surprising your data would be if the music had no effect. The lower the P-value, the more confident you should typically be that the treatment group is actually different, assuming you’ve used the right test.
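To make the mechanics concrete, here is a minimal sketch in Python. The numbers are invented, and scipy’s two-sample t-test stands in for “the appropriate test”:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical dried weights (grams) for plants grown in silence
# versus plants grown with Yeezy blasting. The data are made up
# purely to illustrate how the test is run.
control = rng.normal(loc=10.0, scale=2.0, size=30)
treatment = rng.normal(loc=11.0, scale=2.0, size=30)

# A two-sample t-test returns the test statistic and a P-value:
# the probability of a difference at least this large arising
# by chance if the treatment truly had no effect.
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, P-value = {p_value:.3f}")
```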

The P-value threshold you select is your tolerance for chance creating a false positive. A threshold of .05 is pretty standard, which translates into rejecting the null hypothesis whenever data as extreme as yours would show up by chance no more than 5% of the time in a world where Yeezy has no effect.
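One way to see what that tolerance means: simulate a world where the music does nothing and count how often the test cries “significant” anyway. A rough sketch, again assuming a two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups drawn from the same distribution: the null is true,
    # so any "significant" result here is a false positive.
    a = rng.normal(loc=10.0, scale=2.0, size=30)
    b = rng.normal(loc=10.0, scale=2.0, size=30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

# Prints roughly 0.05: the threshold is exactly the false
# positive rate you signed up for.
print(false_positives / n_experiments)
```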

It sounds, then, as though studies should be pretty reliable: if any single significant result has such a low probability of being caused by chance, we’d like to believe that most studies are valid. Unfortunately, the real story looks more like the following.

Imagine you are a researcher, and you have 100 ideas. Say some fraction w of your ideas are good (true) ideas, and you run your tests with a P-value threshold of p. Assuming every true effect gets detected, you should expect 100w + 100(1 - w)p ideas to come through the study with a stamp of approval: all of the true ones, plus a fraction p of the false ones. For very small w we can round this to 100(w + p), since the difference does not affect the insight. Under this model, a researcher whose threshold p is similar to their actual win rate w can expect about half of their statistically significant findings to be due to random chance. That is to say, if only 5% of their ideas are “correct” and 95% are “incorrect,” then half of their studies with “statistically significant” findings will be bullshit.
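A quick sanity check of that arithmetic in Python (the function name and the perfect-power assumption are mine, for illustration):

```python
def significant_findings(n_ideas, w, p):
    """Expected 'significant' results under the model above:
    every true idea passes (assuming perfect power), plus a
    fraction p of the false ideas slip through by chance."""
    true_hits = n_ideas * w
    false_hits = n_ideas * (1 - w) * p
    return true_hits, false_hits

true_hits, false_hits = significant_findings(100, w=0.05, p=0.05)
print(true_hits, false_hits)                  # 5.0 4.75
print(false_hits / (true_hits + false_hits))  # ~0.49, about half
```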

In academic circles, where there is tremendous pressure to increase paper churn and where “p-hacking” has become a crisis, the proportion of ideas that are “true” presumably declines precipitously. While we cannot know the true values, the recent replicability crisis in the sciences is the field beginning to recognize this serious issue, and hopefully we will remedy it by drastically lowering P-value thresholds and investing the appropriate resources in larger sample sizes to match. In the meantime, take the “statistically significant” studies that you read with a grain of salt: you can “prove” jelly beans give you cancer by testing about 20 varieties.
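That last claim is just multiple-comparisons arithmetic: run 20 independent tests at a .05 threshold on pure noise, and the odds that at least one comes back “significant” are about 64%.

```python
# Chance of at least one false positive across 20 independent
# tests at alpha = 0.05, when no effect exists at all.
alpha, n_tests = 0.05, 20
print(1 - (1 - alpha) ** n_tests)  # ~0.64
```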


Written by Seth

This is a public notepad. My views do not reflect the views of institutions that I'm affiliated with.