“Cake causes herpes?” – promiscuous dichotomisation induces false positives

BMC Med Res Methodol. 2025 Nov 13;25(1):255. doi: 10.1186/s12874-025-02712-0.

ABSTRACT

BACKGROUND: Continuous biomedical data is often dichotomised into two or more groups for analysis, despite long-standing warnings from statisticians that this constitutes bad practice. Dichotomisation is typically discouraged because it reduces statistical power and may obscure important trends. This paper considers another reason to discourage the practice: dichotomisation is an effective tool for manipulating data, as dichotomising at an arbitrary yet flexible threshold (which we term ‘promiscuous dichotomisation’) represents a powerful researcher degree of freedom.

METHODS: The motivating question is: given a set of uniformly distributed data, how probable is it that a threshold can be engineered to produce the illusion of a true effect when none exists? To estimate this, we employed both analytical and Monte Carlo simulation approaches to quantify the expected number of spurious findings that could arise from manipulating a dichotomous threshold for an arbitrary data set. We also illustrate this with an example from NHANES data, showing how a spurious relationship between blood glucose and herpes status could be engineered.
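
The Monte Carlo idea described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: it assumes a fair-coin binary outcome that is independent of a uniform continuous variable, uses Pearson's chi-square test without continuity correction, and treats "low-count exclusion" as a hypothetical rule that skips any threshold whose 2x2 table has a cell below `min_cell`. The function names and the `min_cell` parameter are invented for this example.

```python
import math
import random


def chi2_p_value_2x2(a, b, c, d):
    """P-value of Pearson's chi-square test on a 2x2 table (1 df, no correction)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: treat as no evidence of association
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x/2)).
    return math.erfc(math.sqrt(stat / 2.0))


def any_significant_threshold(x, y, alpha=0.05, min_cell=0):
    """Return True if some cut-point on x yields p < alpha against binary y."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_pos = sum(y)
    below_pos = 0
    for i in range(1, n):  # cut between the i-th and (i+1)-th sorted values
        below_pos += pairs[i - 1][1]
        a = below_pos              # below threshold, outcome 1
        b = i - below_pos          # below threshold, outcome 0
        c = total_pos - below_pos  # above threshold, outcome 1
        d = (n - i) - c            # above threshold, outcome 0
        if min(a, b, c, d) < min_cell:
            continue  # assumed low-count exclusion rule (hypothetical)
        if chi2_p_value_2x2(a, b, c, d) < alpha:
            return True
    return False


def false_positive_rate(n, trials=2000, min_cell=0, seed=0):
    """Fraction of null data sets in which a 'significant' threshold exists."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = [rng.random() for _ in range(n)]      # uniform continuous variable
        y = [rng.randint(0, 1) for _ in range(n)]  # outcome independent of x
        hits += any_significant_threshold(x, y, min_cell=min_cell)
    return hits / trials
```

Calling `false_positive_rate(n, min_cell=0)` and `false_positive_rate(n, min_cell=5)` with the same seed compares the two regimes on identical simulated data. The rates this sketch produces depend on the assumptions above and should not be read as reproducing the paper's figures.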

RESULTS: For even a relatively small sample of [Formula: see text], a false positive rate of [Formula: see text] can be observed, rising to over [Formula: see text] if low-count scenarios are not excluded. With larger samples, even with low-count exclusion, false positive rates in excess of [Formula: see text] for [Formula: see text] and [Formula: see text] for [Formula: see text] are possible, climbing to in excess of [Formula: see text] and [Formula: see text] respectively if low-count scenarios are not excluded. For most configurations, manipulation of thresholds was a highly viable method of crafting a false positive result.

CONCLUSIONS: It is likely that manipulating cut-off points in measured variables represents a significant source of data manipulation in published science, and the ease of access to larger health databases means this issue is likely to grow in severity. We discuss the implications of this, and means of identifying potential promiscuous dichotomisation.

PMID:41233729 | DOI:10.1186/s12874-025-02712-0

By Nevin Manimala