CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models
Recommended citation: Wagner, E., Slavutsky, Y., & Abend, O. (2024). "CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models." The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Different estimators of the joint probability of a word span can be obtained by marginalizing an LLM's scores over different completion orders. If the model's predictions are reliable, these estimates should agree. In this paper, we develop a statistical framework to test this consistency and apply it to evaluate different LLMs. We show that both Masked Language Models and autoregressive models exhibit inconsistent predictions.
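To make the core idea concrete, here is a minimal toy sketch (not the paper's actual framework or models): for a two-word span, the chain rule yields one joint-probability estimate per completion order, and a model whose conditionals all come from a single underlying joint distribution will make the two orders agree, while independently produced conditionals need not. All numbers below are hypothetical.

```python
# For a span (w1, w2), the chain rule gives two order-dependent estimates:
#   p(w1, w2) = p(w1) * p(w2 | w1)   (left-to-right completion)
#   p(w1, w2) = p(w2) * p(w1 | w2)   (right-to-left completion)

def joint_estimates(p_w1, p_w2_given_w1, p_w2, p_w1_given_w2):
    """Return the two order-dependent estimates of p(w1, w2)."""
    left_to_right = p_w1 * p_w2_given_w1
    right_to_left = p_w2 * p_w1_given_w2
    return left_to_right, right_to_left

# Consistent case: conditionals derived from one joint table
# over {a, b} x {c, d} (hypothetical numbers).
joint = {("a", "c"): 0.3, ("a", "d"): 0.2, ("b", "c"): 0.1, ("b", "d"): 0.4}
p_a = joint[("a", "c")] + joint[("a", "d")]   # marginal p(w1 = a) = 0.5
p_c = joint[("a", "c")] + joint[("b", "c")]   # marginal p(w2 = c) = 0.4
lr, rl = joint_estimates(p_a, joint[("a", "c")] / p_a,
                         p_c, joint[("a", "c")] / p_c)
assert abs(lr - rl) < 1e-12   # both recover joint[("a", "c")] = 0.3

# Inconsistent case: conditionals set independently, as a model's scores
# might be, so the two completion orders disagree.
est1, est2 = joint_estimates(0.5, 0.6, 0.4, 0.5)
print(abs(est1 - est2))  # nonzero gap: a signal of inconsistency
```

A real test, as the abstract describes, would extract such estimates from a language model's scores over many spans and apply a statistical test to decide whether the observed gaps exceed what consistency allows.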