The Associations Between Sensitivity and Specificity With Prevalence in Data Matching

J Public Health Manag Pract. 2026 May 5. doi: 10.1097/PHH.0000000000002355. Online ahead of print.

ABSTRACT

OBJECTIVE: To assess how sensitivity and specificity in data matching are associated with prevalence.

METHODS: Using publicly available data, a synthetic dataset of names (“source population”; 8 million records) was created, with each record randomly assigned a sex, a birth date, and a positive or negative status for a health outcome. All positives were included in file 1 (“disease registry”), and random samples of positives and negatives were selected and merged to create file 2 (“study population”). The prevalence in the source population was defined as the proportion of individuals in the synthetic dataset who were randomly assigned as positive, and the prevalence in the study population as the proportion of individuals in the study population who were positive. Multiple pairs of disease registry and study population files were created across a range of prevalence values in the source and study populations, and each pair was matched. Link Plus 3.0, a probabilistic record linkage program, was used for the data matching.
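
The file construction described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `build_file_pair`, its parameters, and the use of integer IDs in place of names, sex, and birth date are all assumptions made for illustration.

```python
import random

def build_file_pair(n_source, source_prev, n_study, study_prev, seed=0):
    """Hypothetical helper: build one registry/study file pair.

    Records are represented by integer IDs only; the abstract's synthetic
    records also carry a name, sex, and birth date, omitted here.
    """
    rng = random.Random(seed)
    ids = list(range(n_source))

    # Randomly assign positive status so that the source-population
    # prevalence equals source_prev.
    n_pos = round(n_source * source_prev)
    positives = set(rng.sample(ids, n_pos))
    negatives = [i for i in ids if i not in positives]

    # File 1 ("disease registry"): all positives.
    registry = sorted(positives)

    # File 2 ("study population"): random samples of positives and
    # negatives, merged so that the study prevalence equals study_prev.
    n_study_pos = round(n_study * study_prev)
    study = rng.sample(registry, n_study_pos)
    study += rng.sample(negatives, n_study - n_study_pos)
    rng.shuffle(study)

    return registry, study, positives
```

Varying `source_prev` while holding `study_prev` fixed, and vice versa, would give the two series of file pairs that the abstract describes.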

RESULTS: As the prevalence in the source population increased from 0.1% to 10%, sensitivity increased from 80.0% to 94.6% and specificity decreased slightly; as the prevalence in the study population increased from 10% to 99%, sensitivity remained stable at around 95.0% and specificity stayed at about 100.0%.
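
The abstract does not state how sensitivity and specificity were computed. The sketch below assumes the conventional definitions: sensitivity is the share of truly positive study records that the linkage matched to the registry, and specificity is the share of truly negative study records that it left unmatched. The function name and arguments are hypothetical.

```python
def sensitivity_specificity(study_ids, true_positive_ids, linked_ids):
    """Hypothetical helper: score linkage output against known truth.

    study_ids: record IDs in the study population file.
    true_positive_ids: set of IDs that are truly positive.
    linked_ids: set of study IDs the linkage matched to the registry.
    """
    tp = fn = tn = fp = 0
    for rid in study_ids:
        is_positive = rid in true_positive_ids
        is_linked = rid in linked_ids
        if is_positive and is_linked:
            tp += 1  # true match found
        elif is_positive:
            fn += 1  # true match missed
        elif is_linked:
            fp += 1  # negative wrongly matched
        else:
            tn += 1  # negative correctly left unmatched

    # Return NaN when a denominator is empty (no positives or no negatives).
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```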

CONCLUSIONS: In data matching, sensitivity is positively associated and specificity is negatively associated with the prevalence in the source population; neither is associated with the prevalence in the study population.

PMID:42085690 | DOI:10.1097/PHH.0000000000002355

By Nevin Manimala