Biom J. 2021 Sep 29. doi: 10.1002/bimj.202000336. Online ahead of print.
ABSTRACT
Variable selection and feature extraction are common problems in statistical research. Variable selection is well developed for linear models but has received relatively little attention for longitudinal data. Because a longitudinal study involves within-subject correlations, the likelihood function of discrete longitudinal responses generally cannot be expressed in closed analytical form, so standard variable selection methods cannot be applied directly. The penalized generalized estimating equation (PGEE) is a useful alternative, but it is very likely to select the wrong variables if the working correlation matrix is misspecified. In many circumstances, the within-subject correlations are themselves of interest and need to be modeled together with the mean. This is more challenging for longitudinal binary data because the within-subject correlation coefficients are constrained by the so-called Fréchet-Hoeffding upper bound. In this paper, we propose smoothly clipped absolute deviation (SCAD)-based and least absolute shrinkage and selection operator (LASSO)-based penalized joint generalized estimating equation (PJGEE) methods that simultaneously model the mean and correlations for longitudinal binary data while performing variable selection in the mean model. The estimated correlation coefficients satisfy the upper bound constraints. Simulation studies under different scenarios are conducted to assess the performance of the proposed methods. Compared with existing PGEE methods that specify a working correlation matrix for longitudinal binary data, the proposed PJGEE methods perform considerably better in terms of variable selection consistency and parameter estimation accuracy. A real data set on Clinical Global Impression is analyzed for illustration.
PMID:34587284 | DOI:10.1002/bimj.202000336
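
The abstract refers to two standard ingredients that a short numerical sketch can make concrete: the Fréchet-Hoeffding upper bound on the correlation of two binary responses, and the SCAD and LASSO penalties. The Python sketch below is illustrative only and is not the PJGEE estimation procedure, whose estimating equations are not given in the abstract. The function names are hypothetical, and the SCAD constant a = 3.7 is the conventional default from the SCAD literature, assumed here rather than taken from the paper.

```python
import math


def frechet_upper_bound(p1, p2):
    """Largest correlation two Bernoulli variables with means p1 and p2 can have.

    Since E[Y1*Y2] <= min(p1, p2), the correlation is at most
    min(sqrt(p1*q2 / (p2*q1)), sqrt(p2*q1 / (p1*q2))) with q = 1 - p.
    """
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("marginal means must lie strictly between 0 and 1")
    q1, q2 = 1.0 - p1, 1.0 - p2
    odds_ratio = (p1 * q2) / (p2 * q1)
    return min(math.sqrt(odds_ratio), math.sqrt(1.0 / odds_ratio))


def scad_derivative(theta, lam, a=3.7):
    """First derivative of the SCAD penalty at |theta| (a = 3.7 is an assumed default)."""
    theta = abs(theta)
    if theta <= lam:
        # Small coefficients are penalized at the same rate as LASSO.
        return lam
    # The penalty rate tapers linearly and reaches zero once |theta| >= a * lam.
    return max(a * lam - theta, 0.0) / (a - 1.0)


def lasso_derivative(theta, lam):
    """First derivative of the LASSO penalty: constant in |theta|."""
    return lam


if __name__ == "__main__":
    # Unequal marginal means force the attainable correlation below 1.
    print(frechet_upper_bound(0.2, 0.5))
    # SCAD leaves large coefficients unpenalized, while LASSO keeps shrinking them.
    print(scad_derivative(4.0, lam=1.0), lasso_derivative(4.0, lam=1.0))
```

When the two marginal means differ, the bound falls strictly below 1, which is why a working correlation chosen without regard to the means can be infeasible for binary data. The penalty derivatives show the usual contrast: LASSO shrinks all coefficients at the same rate, whereas SCAD stops penalizing coefficients that are large relative to the tuning parameter.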