ORIGINAL RESEARCH
Year : 2010  Volume : 21  Issue : 4  Page : 480-485

Using zero inflated models to analyze dental caries with many zeroes

Shivalingappa B Javali^{1}, Parameshwar V Pandit^{2}
^{1} Department of Public Health Dentistry, SDM College of Dental Sciences & Hospital, Dharwad, Karnataka, India
^{2} Department of Statistics, Bangalore University, Jnanabharathi, Bangalore, Karnataka, India

Aim: The study aimed to determine the factors associated with dental caries experience, a count outcome that contains many zeros, by using zero inflated models.

Design: A cross-sectional design was employed, using clinical examination and a questionnaire administered by interview.

Materials and Methods: The study was conducted during March-August 2007 in Dharwad, Karnataka, India, and involved a systematic random sample of 1760 individuals aged 18-40 years. Dental caries was assessed using the DMFT index (Decayed (D), Missing (M) and Filled (F) Teeth). The DMFT index data, which contain many zeros, were analyzed with Zero Inflated Poisson (ZIP) and Zero Inflated Negative Binomial (ZINB) models.

Results: Family size, frequency of brushing and duration of change of toothbrush were positively associated with dental caries, whereas frequency of sweet consumption was negatively associated with dental caries experience in both the ZIP and ZINB models.

Conclusions: The ZIP model fits much better than the standard Poisson model, and the ZINB model fits better than the Negative Binomial model. The ZINB model also fits the DMF count data better than the ZIP model.
Materials and Methods

Study sample

The present study was carried out during March-August 2007 in SDM College of Dental Sciences and Hospital, Dharwad, Karnataka, India. Data on 1,760 subjects were obtained from the assessment of dental caries in SDM College of Dental Sciences and Hospital, Dharwad, Karnataka. Ethical approval for the current study was obtained from the SDM Ethical Committee.

Data analysis

The authors are interested in establishing the covariates of dental caries (DMF index). For fitting the Zero Inflated models (ZIP and ZINB), the DMF index data were treated as the response/outcome variable. The independent or explanatory variables considered here are: age (in years) as a continuous variable, family size, frequency of sweet consumption per day, frequency of brushing, use of systemic toothpaste (1 = fluoridated toothpaste), rinsing habit after every meal with water (1 = yes), duration of change of toothbrush (1 = less than or equal to 3 months, 2 = otherwise) and tobacco smoking habit (1 = yes). A multivariate analysis that estimates the effect of these covariates on the occurrence of dental caries is appropriate for this dataset, and such analyses are commonly undertaken.

As can be seen in [Figure 1] and [Figure 2], the distribution of the DMF index data is markedly positively skewed. In other words, a large share of the study subjects have a DMF count of zero and a minority have a high DMF score. About 47.56% of the subjects presented without any sign of dental caries. Therefore, the estimated Poisson and Negative Binomial distributions do not fit the observed distribution of the DMF index well. Generalized linear models (GLMs) are natural candidates for this kind of count data, although the high proportion of zeros violates the normality assumption. Hence, neither traditional multiple linear regression (MLR) models nor GLMs with different link functions are adequate here.
[34] Therefore, the Zero Inflated models provide a good, appropriate and improved statistical fit for DMF counts, and the findings of these models were compared with those of the Poisson and Negative Binomial models.

{Figure 1}{Figure 2}

The DMFT index in dental epidemiology

In dental epidemiology, the DMF index is an important and well-known overall indicator of the dental status of a person. It is a count of the number of Decayed, Missing and Filled Teeth (in which case it is called the DMFT index). As an application, we consider here data from a study sample of 1,760 subjects aged between 15 and 84 years. The mean age of the study subjects was 41.07±15.11 years.

Clinical examination

The selected subjects were called for a free dental examination. The examination for dental caries was carried out by four well-qualified community dentists, using a plane mouth mirror, dental explorer, disposable gloves and sterilized instruments under artificial light for each subject. The findings of the dental caries examination were recorded according to the diagnostic criteria recommended by the World Health Organization. [35] The data on dental caries by DMF index were recorded by the researcher. The inter-examiner calibration on the DMF index was performed on 30 samples. The inter-examiner agreement was calculated by the Kappa statistic and was found to range from 0.6514 to 0.8214. The data on the selected explanatory variables/covariates were collected and recorded by structured questionnaires with an interview method.

Excess zeros and overdispersion in DMFT count data

The Poisson and Binomial random variables are inherently restrictive as models for discrete data because they are determined by only one parameter. Consider, for example, the DMF index count data, i.e.
dental caries, modeled as a Poisson random variable. Because the variance of this random variable equals its mean, it cannot adequately model data where the variance differs from the mean. Typically, this lack of fit shows up as a variance that is greater than the mean, which is known as overdispersion. The mean and the variance can be estimated by the sample mean $\bar{x}$ and the sample variance $S^2$, leading to the overdispersion test

$$O = \sqrt{\frac{n-1}{2}}\;\frac{S^2 - \bar{x}}{\bar{x}}$$

which takes on the value 54.1747 ($S^2$ = 5.2374, $\bar{x}$ = 1.8528, P<0.001). The distribution of the DMF index data therefore clearly indicates that a simple Poisson or Binomial model would not give an adequate fit because of strong overdispersion.

Vuong test for testing the validity of the model

It should be noted that the Poisson (NB) and ZIP (ZINB) models are not mutually exclusive. On the contrary, the features of these two models can be integrated, and the zero inflated model based on the negative binomial distribution is called the ZINB. The Vuong test is available for testing the validity of the ZIP model against the alternative model. [36] More precisely, the Vuong statistic was used for testing the ZIP or ZINB regression model against the Poisson/NB regression models. If the Vuong statistic (z) is greater than 1.96, the ZIP (ZINB) model is preferred. If z is smaller than -1.96, the Poisson (NB) model is preferred. Values of z between -1.96 and 1.96 favor neither model.

Zero-inflated models for count data

A simple and frequently applied statistical model for count distributions is the Poisson model, in which we assume that X (DMFT count data) follows a Poisson density

$$P(X = x_i) = \frac{e^{-\lambda_i}\lambda_i^{x_i}}{x_i!}, \quad x_i = 0, 1, 2, \ldots$$

with $\lambda_i = \exp(y_i\beta)$, where $y_i$ is a covariate vector and $\beta$ is a vector of unknown coefficients to be estimated.
[37] When there is population heterogeneity or overdispersion, a gamma mixture of Poisson variables is often assumed. Let $\nu_i$ represent an unobserved individual frailty factor, with $\exp(\nu_i)$ following a gamma distribution with mean 1 and variance $\alpha$. The counts are now assumed to follow a modification of the Poisson model, with mean $\lambda_i^* = \exp(y_i\beta + \nu_i)$. This leads to the negative binomial regression model

$$P(X = x_i) = \frac{\Gamma(x_i + 1/\alpha)}{\Gamma(x_i + 1)\,\Gamma(1/\alpha)} \left(\frac{1}{1 + \alpha\lambda_i}\right)^{1/\alpha} \left(\frac{\alpha\lambda_i}{1 + \alpha\lambda_i}\right)^{x_i}$$

where $\alpha$ is an ancillary parameter indicating the degree of overdispersion. The model converges to a Poisson model as $\alpha$ approaches 0.

Data with an excess of zeros show strong overdispersion, i.e. a variance greater than the mean. One common source of non-Poisson behavior that leads to overdispersion is when the observed counts exhibit extra zeros. This can be modeled as a mixture of two subpopulations: with probability $\pi$ an observation comes from a Poisson distribution, while with probability $(1 - \pi)$ the observed count is zero. This ZIP model is consistent with a process where 100π% of a population is at risk for the event while 100(1 - π)% has no risk. We can write this ZIP density function as

$$f(x; \pi, \lambda) = \begin{cases} (1 - \pi) + \pi e^{-\lambda}, & x = 0 \\ \pi \dfrac{e^{-\lambda}\lambda^{x}}{x!}, & x = 1, 2, \ldots \end{cases}$$

This equation indicates that the ZIP model is a special mixture model having two classes, where the first class has a fixed value at zero. In the case of the DMFT index, this class could include those subjects with no caries risk at all. Simple calculations give the mean and variance of the zero inflated random variable X as

$$E(X) = \pi\lambda, \qquad \mathrm{Var}(X) = E(X)\{1 + \lambda - E(X)\}$$

which implies that a proportion of the overdispersion would be explained by the ZIP model. For the DMF count data, we find an overdispersion of $S^2 - \bar{x}$ = 3.3846 ($S^2$ = 5.2374, mean = 1.8528).
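The ZIP moments and the overdispersion figures quoted so far can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the function names are hypothetical, `pi` is the weight of the Poisson (at-risk) class as in the ZIP density above, and the form of the overdispersion statistic, sqrt((n - 1)/2) (S² - mean)/mean, is an assumption that happens to reproduce the reported value of 54.1747 with n = 1760.

```python
import math

def zip_pmf(x, pi, lam):
    """ZIP probability of count x; pi is the weight of the Poisson (at-risk) class."""
    poisson = math.exp(-lam) * lam ** x / math.factorial(x)
    return (1.0 - pi) + pi * poisson if x == 0 else pi * poisson

def zip_mean(pi, lam):
    """E(X) = pi * lam."""
    return pi * lam

def zip_variance(pi, lam):
    """Var(X) = E(X) * {1 + lam - E(X)}."""
    mean = zip_mean(pi, lam)
    return mean * (1.0 + lam - mean)

def overdispersion_statistic(n, mean, variance):
    """Assumed form of the test: sqrt((n - 1) / 2) * (S^2 - mean) / mean."""
    return math.sqrt((n - 1) / 2.0) * (variance - mean) / mean

# Summary statistics reported for the DMF counts (n = 1760).
observed_overdispersion = 5.2374 - 1.8528
o_stat = overdispersion_statistic(1760, 1.8528, 5.2374)
print(round(observed_overdispersion, 4), round(o_stat, 4))
```

Summing `zip_pmf` over a long enough range of counts and comparing the numerical mean and variance with `zip_mean` and `zip_variance` is a quick way to confirm that the closed-form moments match the density.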
The maximum likelihood estimates for the ZIP model are $\hat{\lambda}$ = 3.6795 and $\hat{\pi}$ = 0.5035, so that the DMF count data lead to a fitted overdispersion of $E(X)\{\lambda - E(X)\}$ = 3.3845, compared with an observed overdispersion of $S^2 - \bar{x}$ = 3.3846. Thus, 99.98% of the overdispersion would be explained by the ZIP model. We could consider a formal way of testing the hypothesis $H_0: E(S^2) = E(X)\{1 + \lambda - E(X)\}$, i.e. whether the ZIP and ZINB models satisfactorily explain the overdispersion. One way to test this hypothesis would be to compare the likelihood under this null hypothesis with the likelihood of the nonparametric maximum likelihood estimator. The likelihood function of the ZIP model f(x; π, λ) is given by [INLINE:6] The vector of the parameter estimates is given by [INLINE:7]

ZINB model

A ZINB distribution arises as a mixture of a negative binomial distribution and a distribution degenerate at zero, assigning a mass of $\pi$ to extra zeros and a mass of $(1 - \pi)$ to a negative binomial distribution, where $0 \le \pi \le 1$. Note that here $\pi$ is the probability of an extra zero, whereas in the ZIP density above $\pi$ is the probability of the Poisson component. Note also that the negative binomial distribution is a continuous mixture of Poisson distributions, which allows the Poisson mean λ to be gamma distributed. More specifically, the negative binomial distribution (Samuel et al.) is given by

$$P(Y = y) = \frac{\Gamma(y + \tau)}{y!\,\Gamma(\tau)} \left(\frac{\tau}{\tau + \lambda}\right)^{\tau} \left(\frac{\lambda}{\tau + \lambda}\right)^{y}, \quad y = 0, 1, 2, \ldots$$

where $\lambda = E(Y)$, $\tau$ is the shape parameter that quantifies the amount of overdispersion and Y is the response variable of interest. The variance of Y is $\lambda + \lambda^2/\tau$. A negative binomial distribution approaches a Poisson distribution when $\tau \to \infty$ (no overdispersion). The resulting ZINB distribution is given by

$$P(Y = y) = \begin{cases} \pi + (1 - \pi)\left(\dfrac{\tau}{\tau + \lambda}\right)^{\tau}, & y = 0 \\ (1 - \pi)\dfrac{\Gamma(y + \tau)}{y!\,\Gamma(\tau)} \left(\dfrac{\tau}{\tau + \lambda}\right)^{\tau} \left(\dfrac{\lambda}{\tau + \lambda}\right)^{y}, & y = 1, 2, \ldots \end{cases}$$

The mean and the variance of the ZINB distribution are $E(Y) = (1 - \pi)\lambda$ and $\mathrm{Var}(Y) = (1 - \pi)\lambda(1 + \pi\lambda + \lambda/\tau)$, respectively. Observe that this distribution approaches the ZIP and the negative binomial distribution as $\tau \to \infty$ and $\pi \to 0$, respectively. If both $1/\tau \approx 0$ and $\pi \approx 0$, then the ZINB distribution reduces to the Poisson distribution.
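As a numerical check on this section, the sketch below recomputes the fitted ZIP overdispersion from the reported estimates (3.6795 and 0.5035) and verifies the ZINB mean and variance formulas by direct summation. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the negative binomial is written in the standard mean and shape parameterization (mean λ, variance λ + λ²/τ), and the ZINB parameter values are made up for the check.

```python
import math

def zip_fitted_overdispersion(pi, lam):
    """E(X){lam - E(X)} with E(X) = pi * lam (pi = weight of the Poisson class)."""
    mean = pi * lam
    return mean * (lam - mean)

def nb_pmf(y, lam, tau):
    """Negative binomial with mean lam and shape tau (variance lam + lam^2 / tau)."""
    log_p = (math.lgamma(y + tau) - math.lgamma(y + 1) - math.lgamma(tau)
             + tau * math.log(tau / (tau + lam))
             + y * math.log(lam / (tau + lam)))
    return math.exp(log_p)

def zinb_pmf(y, pi, lam, tau):
    """ZINB: mass pi on extra zeros, mass (1 - pi) on the negative binomial."""
    base = (1.0 - pi) * nb_pmf(y, lam, tau)
    return pi + base if y == 0 else base

def zinb_mean(pi, lam):
    return (1.0 - pi) * lam

def zinb_variance(pi, lam, tau):
    return (1.0 - pi) * lam * (1.0 + pi * lam + lam / tau)

# Fitted versus observed overdispersion for the ZIP model (reported values).
fitted = zip_fitted_overdispersion(pi=0.5035, lam=3.6795)
observed = 5.2374 - 1.8528
print(round(fitted, 4), round(observed, 4))

# Hypothetical ZINB parameters, used only to check the moment formulas.
pi, lam, tau = 0.4, 2.0, 1.5
probs = [zinb_pmf(y, pi, lam, tau) for y in range(400)]
num_mean = sum(y * p for y, p in enumerate(probs))
num_var = sum(y * y * p for y, p in enumerate(probs)) - num_mean ** 2
print(num_mean, zinb_mean(pi, lam))
print(num_var, zinb_variance(pi, lam, tau))
```

The fitted and observed overdispersion agree to about four decimal places, which is in line with the statement that the ZIP model explains nearly all of the overdispersion in these data.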
Results

To understand how the different count regression models fit the DMFT data, we examine the fit of various regression models. The results of using the Poisson, NB, ZIP and ZINB regression models are given in [Table 1].

{Table 1}

First, we consider the Poisson regression model. Based on the likelihood ratio chi-square in [Table 1], the Poisson regression model does not provide an adequate fit to the DMFT count data. The observed proportion of zero DMFT counts is 47.56%. The likelihood ratio chi-square value for the Poisson model (321.16) is larger than that for the ZIP model (87.46). In such a situation, it is appropriate to estimate the ZIP regression model.

Is the ZIP regression model statistically preferred over the Poisson regression model? We apply the Vuong test statistic to check whether the ZIP regression model is a significant improvement over the Poisson regression model. For the ZIP regression model, the value of the Vuong test statistic is 19.7200. This value exceeds the critical z value of 1.96 at the 0.05 level and is therefore significant. That is, the Vuong test provides evidence that the ZIP regression model is preferred over the Poisson regression model. Although the ZIP regression model does better in predicting the proportion of zeros, and the Poisson and ZIP regression coefficients are quite different in magnitude, the standard errors for the ZIP regression coefficients tend to be larger than those for the Poisson regression coefficients.

Similarly, for the ZINB regression model, the value of the Vuong test statistic is 10.5100. This value also exceeds the critical z value of 1.96 at the 0.05 level and is therefore significant. That is, the Vuong test provides evidence that the ZINB model is preferred over the NB model.
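The Vuong comparison used above can be sketched as follows. This is a simplified illustration and not the procedure that produced the values 19.72 and 10.51: it uses the unadjusted statistic sqrt(n) · mean(m) / sd(m) with m_i the difference in per-observation log-likelihoods, takes the sample standard deviation (n - 1 divisor) as an implementation choice, ignores covariates, and uses hypothetical toy counts with hand-picked rather than maximum likelihood ZIP parameters. All function names are assumptions.

```python
import math
import statistics

def poisson_logpmf(x, lam):
    """Log of the Poisson probability of count x."""
    return -lam + x * math.log(lam) - math.lgamma(x + 1)

def zip_logpmf(x, pi, lam):
    """Log ZIP probability; pi is the weight of the Poisson (at-risk) class."""
    poisson = math.exp(poisson_logpmf(x, lam))
    prob = (1.0 - pi) + pi * poisson if x == 0 else pi * poisson
    return math.log(prob)

def vuong_statistic(loglik_1, loglik_2):
    """sqrt(n) * mean(m) / sd(m), with m_i = log f1(x_i) - log f2(x_i)."""
    m = [a - b for a, b in zip(loglik_1, loglik_2)]
    return math.sqrt(len(m)) * statistics.mean(m) / statistics.stdev(m)

def vuong_decision(z, critical=1.96):
    """Model 1 if z > critical, model 2 if z < -critical, otherwise neither."""
    if z > critical:
        return "model 1"
    if z < -critical:
        return "model 2"
    return "neither"

# Hypothetical toy counts with many zeros (not the study data).
counts = [0] * 60 + [1] * 5 + [2] * 10 + [3] * 12 + [4] * 8 + [5] * 5

# Hand-picked ZIP parameters (hypothetical); Poisson mean set to the sample mean.
zip_ll = [zip_logpmf(x, pi=0.42, lam=2.8) for x in counts]
pois_ll = [poisson_logpmf(x, lam=statistics.mean(counts)) for x in counts]

z = vuong_statistic(zip_ll, pois_ll)
print(round(z, 2), vuong_decision(z))
```

Here model 1 is the ZIP model and model 2 the Poisson model, so a large positive z points toward zero inflation, mirroring the decision rule with the 1.96 threshold. In practice the log-likelihoods would come from fitted regression models rather than fixed parameters.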
We next consider the NB and ZINB regression models, which include a dispersion parameter. The test of α=0 using the asymptotic Wald statistic showed that α is significantly different from zero for the NB and ZINB regression models in [Table 1]. The Poisson regression model is therefore not appropriate for the DMFT count data, because we reject the hypothesis H0: α=0. The likelihood ratio chi-square values for the Poisson, ZIP, NB and ZINB regression models are 321.16, 87.46, 84.77 and 67.27, respectively. The value for the ZINB regression model is clearly the smallest, lower than those for the Poisson, ZIP and NB regression models. This indicates that the NB and ZINB regression models, which account for overdispersion, fit the DMFT data better than the Poisson and ZIP regression models.

Family size, frequency of brushing and duration of change of toothbrush have a significant negative relationship with the DMF count data in the Poisson and NB regression models, but a significant positive relationship with the DMFT data in the ZIP and ZINB regression models. Age has a positive relationship with the DMFT data only in the Poisson regression model and not in the NB, ZIP and ZINB regression models. However, the other covariates, i.e. frequency of sweet consumption, frequency of brushing, systematic toothpaste, rinsing habit, duration of change of toothbrush and smoking habit, are found to have significant influences on dental caries, although the sign of the influence differs between models [Table 1].

Discussions and Conclusions

In recent years, various studies in dental epidemiology have shown that DMFT count data are frequently right skewed, with a sizable proportion of zero values.
For this type of data, the standard Poisson and standard NB regression models are not acceptable competitors for comparing covariates of DMFT data. Accordingly, the results estimated with the Poisson distribution and the NB distribution displayed some differences. These differences could lead to incorrect interpretations, affecting the general results of the studies.

The application addressed in this paper involves the estimation of the Poisson, NB, ZIP and ZINB regression models to predict dental caries by using the DMFT index. Because count data frequently exhibit overdispersion in addition to possible zero inflation, an obvious methodology is to use a model that can accommodate both overdispersion and zero inflation. We therefore also consider the ZIP and ZINB regression models in terms of both zero inflation and overdispersion. The ZIP and ZINB regression models are alternatives to the Poisson and NB regression models when there is zero inflation. The Poisson regression model did not converge in fitting the DMFT data. For this reason, we apply the ZIP and ZINB regression models, rather than the Poisson and NB models, for modeling overdispersed DMFT data with many zeros.

We apply a Vuong test statistic that tests whether the ZIP regression model is a significant improvement over the Poisson regression model and whether the ZINB regression model is a significant improvement over the NB regression model. The value of the Vuong test statistic is larger than z=1.96 at the 0.05 level of significance for both the ZIP and ZINB models. Based on the results for the DMF count data, the ZIP model fits much better than the standard Poisson model, and the ZINB model fits better than the NB model. The ZINB model also fits the DMF count data better than the ZIP model.

Acknowledgment

The authors thank Dr. V. Tippeswamy, Dr. K.V.V.
Prasad and other subordinate staff, Department of Public Health Dentistry, SDM College of Dental Sciences and Hospital, Dharwad, Karnataka, India, for giving timely suggestions during the preparation of this article.

References


