I don't think Lindley's paradox supports p-circling

(vilgot-huhn.github.io)

41 points | by speckx 8 hours ago

5 comments

  • CrazyStat 6 hours ago
    Huh.

    This is an interesting post but the author’s usage of Lindley’s paradox seems to be unrelated to the Lindley’s paradox I’m familiar with:

    > If we raise the power even further, we get to “Lindley’s paradox”, the fact that p-values in this bin can be less likely than they are under the null.

    Lindley’s paradox as I know it (and as described by Wikipedia [1]) is about the potential for arbitrarily large disagreements between frequentist and Bayesian analyses of the same data. In particular, you can have an arbitrarily small p-value (p < epsilon) from the frequentist analysis while at the same time having arbitrarily large posterior probabilities for the null hypothesis model (P(M_0|X) > 1-epsilon) from the Bayesian analysis of the same data, without any particularly funky priors or anything like that.

    I don’t see how that relates to the phenomenon the blog post calls Lindley’s paradox.

    [1] https://en.wikipedia.org/wiki/Lindley%27s_paradox
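
    To make the size of that disagreement concrete, here is a minimal sketch (assuming the birth-ratio numbers from the Wikipedia article, a point null at 0.5, a uniform prior under the alternative, and equal prior weight on the two models):

      import math
      from scipy import stats

      # Lindley's classic example: n births, k boys (numbers from the Wikipedia article)
      n, k = 98_451, 49_581

      # Frequentist side: two-sided z-test of theta = 0.5
      z = (k / n - 0.5) / math.sqrt(0.25 / n)
      p_value = 2 * stats.norm.sf(abs(z))            # ~0.024, "significant" at the 5% level

      # Bayesian side: marginal likelihood of the data under each model.
      # H0: theta = 0.5 exactly. H1: theta ~ Uniform(0, 1), whose marginal
      # likelihood integrates to 1 / (n + 1) regardless of k.
      log_m0 = stats.binom.logpmf(k, n, 0.5)
      log_m1 = -math.log(n + 1)

      # With equal prior weight on H0 and H1
      post_h0 = 1 / (1 + math.exp(log_m1 - log_m0))  # ~0.95, i.e. H0 strongly favored

      print(f"p = {p_value:.4f}, P(H0 | data) = {post_h0:.3f}")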

  • jeremysalwen 4 hours ago
    Admittedly I'm not a statistician, but I think the article is missing the point. The reason people circle the p-values is that nobody actually cares about the thing the p-value measures. What they actually care about is whether the null hypothesis is true or some other hypothesis is true. You can wave your hands about how, when you called the result significant, you were really making a technical statement about a hypothetical world where the null hypothesis is factually true, so it's unfair to circle your p-value because that statement about the hypothetical world is still technically true. That's not a good argument against p-value circling; it merely demonstrates that the technical definition of a p-value is not relevant to the real world.

    The fact remains that for things which are claimed to be true but later turn out not to be, the p-values reported in the paper are very often near the significance threshold; not so much for things which are obviously and strongly true. This is direct evidence of something we already know: nobody cares about p-values per se, they only use them to communicate whether something is true or false in the real world, and the technical claim of "well, maybe x or y is true, but when I said p=0.049 I was only talking about a hypothetical world where x is true, and my statement about that world still holds" is no solace.

  • contravariant 6 hours ago
    Ultimately I think the paradox comes from mixing two paradigms that aren't really designed to be mixed.

    That said, you can give a Bayesian argument for p-circling provided you have a prior on the power of the test. The details are almost impossible to work out except on a case-by-case basis, because unless I'm mistaken the shape of the p-value distribution when the null hypothesis does not hold is very ill-defined.

    However, it's quite possible to give some examples where, intuitively, a p-value of just below 0.05 would be highly suspicious. You just need to mix tests with high power with unclear results. Say, for example, you're testing the existence of gravity with various objects and you get a probability of <0.04% that objects just stay in the air indefinitely.
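
    As a rough numerical illustration of that high-power case (assuming a one-sided z-test and a true effect of 5 standard errors, chosen only for illustration), the density of the p-value under the alternative has a closed form and can be compared with the uniform density under the null:

      from scipy import stats

      def p_density_under_alt(p, delta):
          # Density of a one-sided z-test p-value when the true effect is
          # delta standard errors; under the null the density is exactly 1.
          z_p = stats.norm.ppf(1 - p)
          return stats.norm.pdf(z_p - delta) / stats.norm.pdf(z_p)

      delta = 5.0                            # very high power (~0.9996 at alpha = 0.05)
      ratio = p_density_under_alt(0.04, delta)
      print(f"p = 0.04 is {1 / ratio:.0f}x more likely under the null than under this alternative")

    In that regime a p-value just under 0.05 counts as evidence for the null, which is the sense in which the blog post uses the term "Lindley's paradox".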

  • gweinberg 4 hours ago
    I read the page on Lindley's paradox, and it's astonishing bullshit. It's well known that with sufficiently insane priors you can come up with stupid conclusions. The page asserts that a Bayesian would accept as reasonable priors that it's equally likely that the probability of a child being born male is precisely 0.5 as it is that it has some other value, and also that if it has some other value, all values in the interval from zero to one are equally likely. But nobody on God's green earth would accept those as reasonable priors, least of all a Bayesian. A Bayesian would say there's zero chance of it being precisely 0.5, but that it is almost certainly really close to 0.5, just like a normal human being would.
    • Veedrac 1 hour ago
      Wikipedia has a section on this that I thought was presented fine.

      https://en.wikipedia.org/wiki/Lindley%27s_paradox#The_lack_o...

      Indeed, Bayesian approaches need effort to correct bad priors, and indeed the original hypothesis was bad.

      That said: first, in defense of the prior, it is infinitely more likely that the probability is exactly 0.5 than that it is any individual uniformly chosen number to either side. There are causal mechanisms that can explain exactly even splits. I agree that it's much safer to use simpler priors that can at least approximate any precise simple prior, and will learn any 'close enough' match, but some privileged probability mass on 0.5 is not crazy, and can even be nice as a reference to help you check the power of your data.

      One really should separate out the update part of Bayes from the prior part of Bayes. The data fits differently under a lot of hypotheses. Like, it's good to check expected log odds against actual log odds, but Bayes updates are almost never going to tell you that a hypothesis is "true", because whether your log loss is good is relative to the baselines you're comparing it against. Someone might come up with a prior on the basis that particular ratios are evolutionarily selected for. Someone might come up with a model that predicts births sequentially using a genomics-over-time model and get a loss far better than any of the independent random variable hypotheses. The important part is the log-odds of hypotheses under observations, not the posterior.

    • CrazyStat 4 hours ago
      A few points because I actually think Lindley’s paradox is really important and underappreciated.

      (1) You can get the same effect with a prior distribution concentrated around a point instead of a point prior. The null hypothesis prior being a point prior is not what causes Lindley’s paradox.

      (2) Point priors aren’t intrinsically nonsensical. I suspect that you might accept a point prior for an ESP effect, for example (maybe not—I know one prominent statistician who believes ESP is real).

      (3) The prior probability assigned to each of the two models also doesn’t really matter, Lindley’s paradox arises from the marginal likelihoods (which depend on the priors for parameters within each model but not the prior probability of each model).
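
      A quick check of point (1), reusing the birth-ratio numbers from the Wikipedia example and swapping the point null for a Beta(5000, 5000) prior tightly concentrated around 0.5 (the amount of concentration is chosen only for illustration), with equal prior weight on the two models:

        import math
        from scipy import stats

        n, k = 98_451, 49_581

        # "Near-null" model: theta ~ Beta(5000, 5000), sd ~ 0.005 around 0.5
        log_m_near = stats.betabinom.logpmf(k, n, 5000, 5000)
        # Diffuse alternative: theta ~ Uniform(0, 1)
        log_m_diffuse = -math.log(n + 1)

        post_near = 1 / (1 + math.exp(log_m_diffuse - log_m_near))
        print(f"P(near-null model | data) = {post_near:.3f}")   # still ~0.98, despite p ~ 0.02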

    • KK7NIL 2 hours ago
      Wikipedia is infamously bad at teaching math.

      This Veritasium video does a great job at explaining how such skewed priors can easily appear in our current academic system and the paradox in general: https://youtu.be/42QuXLucH3Q?si=c56F7Y3RB5SBeL4m

  • senkora 7 hours ago
    I really like this article.

    > One could specify a smallest effect size of interest and compare the plausibility of seeing the reported p-value under that distribution compared to the null distribution. Maier and Lakens (2022) suggest you could do this exercise when planning a test in order to justify your choice of alpha-level

    Huh, I’d never thought to do that before. You pretty much have to choose a smallest effect size of interest in order to do a power analysis in the first place, to figure out how many samples to collect, so this is a neat next step: base the significance level on it as well.
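
    One rough way to operationalize that (a sketch only, not necessarily Maier and Lakens' exact procedure; the one-sided z-test and the numbers below are assumptions) is to pick the alpha at which the reported p-value stops being more plausible under the smallest effect of interest than under the null:

      from scipy import stats

      def evidential_alpha(delta):
          # For a one-sided z-test with true effect delta (in standard errors),
          # the p-value density ratio pdf(z_p - delta) / pdf(z_p) crosses 1 at
          # z_p = delta / 2; for p below the corresponding threshold, the
          # observed p-value is more plausible under the effect than the null.
          return stats.norm.sf(delta / 2)

      # delta implied by a power analysis: 80% power at one-sided alpha = 0.025
      delta_80 = stats.norm.ppf(0.975) + stats.norm.ppf(0.80)    # ~2.80
      print(evidential_alpha(delta_80))   # ~0.08: marginal p-values still favor the effect
      print(evidential_alpha(5.0))        # ~0.006: at very high power, p = 0.04 favors the null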

    • CrazyStat 6 hours ago
      In a perfect world everybody would be putting careful thought into their desired (acceptable) type I and type II error rates as part of the experimental design process before they ever collected any data.

      Given rampant incentive misalignments (the goal in academic research is often to publish something as much as—or more than—to discover truth), having fixed significance levels as standards across whole fields may be superior in practice.

      • levocardia 3 hours ago
        The real problem is that you very often don't have any idea what your data are going to look like before you collect them; type 1/2 error rates depend a lot on how big the sources of variance in your data are. Even a really simple case -- e.g. do students randomly assigned to AM vs PM sessions of a class score better on exams? -- has a lot of unknown parameters: the variance of exam scores, the variance in baseline student ability, the variance of the rate of change in scores across the semester, whether you can approximate scores as Gaussian or need a beta, ordinal, or some other model, etc.

        Usually you have to go collect data first, then analyze it, then (in an ideal world where science is well-incentivized) replicate your own analysis in a second wave of data collection doing everything exactly the same. Psychology has actually gotten to a point where this is mostly how it works; many other fields have not.
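
        A small simulation makes that sensitivity concrete (every number below is made up for illustration: the assumed score difference, score SD, and group size):

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(0)

          def simulated_power(effect=2.0, sd=10.0, n_per_group=50, alpha=0.05, reps=5000):
              # Power of a two-sample t-test for the AM-vs-PM exam example,
              # under an assumed true score difference and score standard deviation.
              hits = 0
              for _ in range(reps):
                  am = rng.normal(0.0, sd, n_per_group)
                  pm = rng.normal(effect, sd, n_per_group)
                  hits += stats.ttest_ind(am, pm).pvalue < alpha
              return hits / reps

          # The planned power swings wildly with the assumed (unknown) score SD.
          for sd in (5, 10, 20):
              print(sd, simulated_power(sd=sd))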