An exploratory application of modern statistical methodology and machine learning techniques in assessment of crystallinity monitoring and control strategy development for high-risk drug manufacturing

J Biopharm Stat. 2026 May 24:1-14. doi: 10.1080/10543406.2026.2670525. Online ahead of print.

ABSTRACT

Pharmaceutical polymorphism and crystallinity changes may have a significant impact on a drug’s quality, efficacy, and safety. Risk mitigation strategies include pharmaceutical development, establishment of monitoring and control strategies during manufacturing, and stability monitoring programs during storage, all of which are resource-intensive. One challenge in data mining and analysis of the risk mitigation strategies in new drug applications (NDAs) and abbreviated new drug applications (ANDAs) that may involve polymorphism and crystallinity changes is the data heterogeneity encountered in submissions. This heterogeneity may leave data not readily available for automated analysis; such data are referred to here as missing data. Modern statistical methodologies are available to handle missing data and the associated analyses; however, very limited deployment of these methods in the pharmaceutical chemistry, manufacturing, and controls (CMC) regulatory domain has been reported. In the big data era, consideration of statistical methodologies in this field will become increasingly important as the amount of data available in regulatory submissions grows. In this study, by data mining NDAs and ANDAs approved by the FDA during 2017-2022 that contained polymorphism and/or crystallinity keywords, we established a dataset of 148 approved NDAs and ANDAs involving crystallinity monitoring and control strategy development for high-risk drug product manufacturing processes. We then applied several advanced machine learning techniques for exploratory pattern recognition and risk classification in the pharmaceutical manufacturing CMC domain. Furthermore, we conducted Monte Carlo simulations to demonstrate the feasibility of risk classification, applying supervised machine learning techniques to the established dataset with generated synthetic outcomes.
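
The abstract does not name the imputation method, the classifiers, or the simulation design, so the following is only a minimal sketch of what a Monte Carlo feasibility check with synthetic outcomes might look like. Everything in it is an assumption made for illustration: the four binary features, the 20% missingness rate, the label-generating weights, mode imputation, and the plain logistic regression are stand-ins, not the authors' methods. Only the record count of 148 comes from the abstract.

```python
import math
import random

N_RECORDS = 148                        # dataset size reported in the abstract
N_FEATURES = 4                         # hypothetical number of binary attributes
MISSING_RATE = 0.2                     # hypothetical share of non-extractable entries
TRUE_WEIGHTS = [1.5, -1.0, 2.0, 0.5]   # hypothetical; only used to generate labels


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_dataset(rng):
    """Draw binary features, synthetic risk labels, then mask some entries."""
    rows, labels = [], []
    for _ in range(N_RECORDS):
        x = [rng.randint(0, 1) for _ in range(N_FEATURES)]
        p_high = sigmoid(sum(w * v for w, v in zip(TRUE_WEIGHTS, x)) - 1.5)
        labels.append(1 if rng.random() < p_high else 0)
        rows.append([None if rng.random() < MISSING_RATE else v for v in x])
    return rows, labels


def column_modes(rows):
    """Most common observed value per column (ties and empty columns give 1)."""
    modes = []
    for j in range(N_FEATURES):
        observed = [r[j] for r in rows if r[j] is not None]
        modes.append(1 if 2 * sum(observed) >= len(observed) else 0)
    return modes


def impute(rows, modes):
    """Fill each missing entry with the supplied column mode."""
    return [[modes[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def predict_proba(w, x):
    # w holds one weight per feature followed by an intercept term.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])


def fit_logistic(X, y, epochs=150, lr=0.5):
    """Logistic regression fitted by batch gradient descent."""
    w = [0.0] * (N_FEATURES + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in zip(X, y):
            err = predict_proba(w, x) - t
            for j, xi in enumerate(x):
                grad[j] += err * xi
            grad[-1] += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def accuracy(w, X, y):
    hits = sum((predict_proba(w, x) >= 0.5) == bool(t) for x, t in zip(X, y))
    return hits / len(y)


def monte_carlo(n_reps=30, seed=0):
    """Repeat simulate -> impute -> fit -> score; return held-out accuracies."""
    rng = random.Random(seed)
    split = int(0.7 * N_RECORDS)
    scores = []
    for _ in range(n_reps):
        rows, labels = simulate_dataset(rng)
        modes = column_modes(rows[:split])   # modes come from training rows only
        X_train = impute(rows[:split], modes)
        X_test = impute(rows[split:], modes)
        w = fit_logistic(X_train, labels[:split])
        scores.append(accuracy(w, X_test, labels[split:]))
    return scores


scores = monte_carlo()
mean_acc = sum(scores) / len(scores)
print(f"replicates: {len(scores)}, mean held-out accuracy: {mean_acc:.3f}")
```

The spread of the held-out accuracies across replicates, rather than any single value, is what would indicate whether classification is feasible at this sample size under the assumed missingness rate.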

PMID:42177643 | DOI:10.1080/10543406.2026.2670525

By Nevin Manimala