Measuring the Quality of Datasets: Development of the IDEFIM Indicator Set for Empirical Health Research

J Med Internet Res. 2026 Jun 17;28:e90482. doi: 10.2196/90482.

ABSTRACT

BACKGROUND: To be beneficial for empirical health research, a dataset must be fit for use. The quality of a dataset can only be influenced during data collection, yet it is evaluated multiple times during analysis or secondary use by applying quality indicators.

OBJECTIVE: This study aimed to establish an up-to-date set of indicators measuring the quality of datasets in empirical health research.

METHODS: A total of 3 pillars were combined. First, the 51 indicators of a German guideline from 2014 about the management of data quality were revised. Second, a literature review was performed looking for evidence sources since 2013 that describe, propose, or apply dataset quality indicators. Third, indicators were supplemented by a manual search and other sources. The quality indicators were then integrated into the IDEFIM framework. The IDEFIM framework distinguishes between the categories data, metadata, context, and openness quality. In this work, only the categories data and metadata quality, with their 14 dimensions, were considered.

RESULTS: In total, 69 indicators qualified for the IDEFIM indicator set, 53 related to the category data quality, and 16 to the category metadata quality. A total of 30 indicators originated from the German guideline, 31 from the literature review. Three indicators were added to cover aspects of diversity, equity, and inclusion, and an additional 5 related to specifics of data and metadata quality not addressed so far. Most indicators were found in the dimensions accuracy (data) with 12 measures, completeness (data) with 12 measures, and consistency (data) with 19 measures. According to the number of supporting evidence sources, missing values in data elements (48 evidence sources), contradictions (31), and currentness (26) were the most popular quality indicators. Metadata quality was significantly less frequently addressed.
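
As an illustration of the most frequently supported indicator, missing values in data elements, the following minimal Python sketch computes the share of records in which a data element is absent or empty. The abstract does not give a formula for this indicator, so the definition used here (a simple proportion), the function name missing_value_rate, the choice of missing-value markers, and the sample records are all assumptions made for illustration only.

```python
from typing import Any, Dict, List, Tuple


def missing_value_rate(
    records: List[Dict[str, Any]],
    element: str,
    missing: Tuple[Any, ...] = (None, ""),
) -> float:
    """Return the share of records in which `element` is absent or missing.

    Assumption: a value counts as missing if the key is absent or the
    value equals one of the markers in `missing` (None or empty string).
    """
    if not records:
        raise ValueError("records must not be empty")
    # dict.get returns None for absent keys, which is one of the markers.
    n_missing = sum(1 for record in records if record.get(element) in missing)
    return n_missing / len(records)


# Hypothetical sample data, not taken from the study.
records = [
    {"age": 54, "sex": "f"},
    {"age": None, "sex": "m"},
    {"age": 61, "sex": ""},
    {"sex": "f"},
]

age_rate = missing_value_rate(records, "age")
sex_rate = missing_value_rate(records, "sex")
print(age_rate, sex_rate)
```

In practice, the set of missing-value markers would depend on the coding scheme of the dataset (for example, dedicated codes for "unknown" or "not collected"), which is why it is passed as a parameter in this sketch.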

CONCLUSIONS: The presented IDEFIM indicator set can be used for the management of data collections as well as for the verification of a dataset’s quality for an intended use. The indicator set should also be considered in the design of a study in empirical health research and the development of software tools supporting the visualization of issues related to the quality of a dataset.

PMID:42308504 | DOI:10.2196/90482

By Nevin Manimala