Different estimators of the joint probability of a word span can be obtained by marginalizing an LLM’s scores from different completion orders. If the model’s predictions are reliable, these estimates should agree. In this paper, we develop a statistical framework to test this consistency and apply it to evaluate several LLMs. We show that both masked language models and autoregressive models exhibit inconsistent predictions.
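
As a minimal illustration of the consistency requirement, consider a two-token span $(x_1, x_2)$ in a context $c$. The notation is assumed here for exposition and is not taken from the paper. Any valid joint distribution factorizes by the chain rule in either order:

$$
p(x_1, x_2 \mid c) \;=\; p(x_1 \mid c)\, p(x_2 \mid x_1, c) \;=\; p(x_2 \mid c)\, p(x_1 \mid x_2, c).
$$

Substituting the model's predicted scores for each factor gives two estimates of the same joint probability, one per completion order. If the predictions were exactly consistent, the log-ratio of the two estimates would be zero, so a systematic nonzero gap is one way to read the inconsistency described above.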