Indian Journal of Dental Research

Year : 2010  |  Volume : 21  |  Issue : 4  |  Page : 480--485

Using zero inflated models to analyze dental caries with many zeroes

Shivalingappa B Javali1, Parameshwar V Pandit2
1 Department of Public Health Dentistry, SDM College of Dental Sciences & Hospital, Dharwad, Karnataka, India
2 Department of Statistics, Bangalore University, Jnanabharathi, Bangalore, Karnataka, India

Correspondence Address:
Shivalingappa B Javali
Department of Public Health Dentistry, SDM College of Dental Sciences & Hospital, Dharwad, Karnataka


Aim: The study aimed to identify the factors associated with dental caries experience, where the data contain many zeros, by using zero inflated models. Design: A cross-sectional design was employed, using clinical examination and a questionnaire administered by interview. Materials and Methods: The study was conducted during March-August 2007 in Dharwad, Karnataka, India, and involved a systematic random sample of 1760 individuals aged 18-40 years. The dental caries examination was carried out using the DMFT index (i.e. Decayed (D), Missing (M), Filled (F) Teeth). The DMFT index data, which contain many zeros, were analyzed with Zero Inflated Poisson (ZIP) and Zero Inflated Negative Binomial (ZINB) models. Results: Family size, frequency of brushing and duration of change of toothbrush were positively associated with dental caries, whereas frequency of sweet consumption was negatively associated with dental caries experience, in both the ZIP and ZINB models. Conclusions: The ZIP model fits better than the standard Poisson model, and the ZINB model fits better than the Negative Binomial model. The ZINB model also fits the DMF count data better than the ZIP model.

How to cite this article:
Javali SB, Pandit PV. Using zero inflated models to analyze dental caries with many zeroes. Indian J Dent Res 2010;21:480-485


A distinguishing characteristic of epidemiological studies of dental caries is that the dataset invariably uses the DMF index. [1] It measures the degree of caries experience and represents the cumulative severity of dental caries in a subject or population. It is defined as the sum of the number of decayed (D), missing due to caries (M) and filled (F) teeth. The resulting data can be viewed either as a binary outcome, separating subjects with caries (DMF>0) from subjects without caries (DMF=0), or as counts of DMF (or continuous population densities), and they tend to contain a large number of zero values. When the number of zeros is so large that standard distributions (e.g., Normal, Poisson, Binomial, Negative Binomial) do not fit the DMF index data well, the dataset is referred to as zero inflated. [2] Zero inflation is often the result of a large number of true zero observations caused by the real effect of interest, here the occurrence of dental caries. Zero inflation due to excess zeros, a special case of overdispersion, [3],[4],[5] creates problems for sound statistical inference because it violates the basic assumptions implicit in the standard distributions and leads to misinterpretation of the variance-mean relationship of the error structure. [6] Zero inflation may or may not violate the distributional assumptions, but it leads to uncertainty regarding parameter estimates, because it is no longer possible to determine whether a difference in the number of individuals observed over time and space is due to a change in the size of the population or to a change in the probability associated with an individual.

Zero-inflated count data and the application of models that cope with zero inflation are found in a wide range of disciplines, such as medicine, [7],[8],[9] occupational health, [10],[11],[12],[13] ecology, [6],[14],[15],[16],[17] road violence and accidents, [11],[18] apple shoot proportion data, [19] defects in manufacture, [20] econometrics, [21] economics, [22] trip distribution [23] and political science. [24]

In the DMF index, the data consist of counts and have a high proportion of zeros relative to probability distributions such as the Poisson or the Binomial distribution. These models therefore often underestimate the observed dispersion. The distribution of the DMF index count data is overdispersed (overdispersion test value O=54.1747, with S²=5.2374 and mean=1.8528). This phenomenon, called overdispersion, occurs because a single Poisson parameter, λ, is often insufficient to describe the population. [3],[25],[26],[27],[28] In many cases, it can be suspected that population heterogeneity that has not been accounted for is causing the overdispersion. In this paper, the heterogeneity takes a specific non-parametric form, viz. a two-mass distribution that gives mass (1-π) to a class with count zero and mass π to a second class with mean λ. This model is called a Zero Inflation model. The literature on the epidemiology of dental caries has seen a recent upsurge of interest in techniques for dealing with excess zero values. Zero Inflated models without covariates have been discussed by Cohen [29] and Johnson and Kotz, [30] and Zero Inflated models with covariates have been applied in public health scenarios, including datasets with zero inflation caused by true zeros. [27],[28],[31],[32],[33]

In this paper, the Zero Inflated Poisson (ZIP) and Zero Inflated Negative Binomial (ZINB) regression models are discussed as approaches for modeling dental caries data measured by the DMFT index, and the results of these models are compared with those of the standard Poisson Regression Model (PRM) and the Negative Binomial Regression Model (NBRM).

Materials and Methods

Study sample

The present study was carried out during March-August 2007 in SDM College of Dental Sciences and Hospital, Dharwad, Karnataka, India. Data on 1,760 subjects were obtained from the assessment of dental caries in SDM College of Dental Sciences and Hospital, Dharwad, Karnataka. Ethical approval for the current study was obtained from the SDM Ethical Committee.

Data analysis

The authors are interested in establishing the covariates of dental caries (DMF index). For fitting the Zero Inflated models (ZIP and ZINB), the DMF index was treated as the response/outcome variable. The independent or explanatory variables considered here are: age (in years, as a continuous variable), family size, frequency of sweet consumption per day, frequency of brushing, use of systemic toothpaste (1=if fluoridated toothpaste), rinsing habit after every meal with water (if yes=1), duration of change of toothbrush (1=if less than or equal to 3 months, 2=otherwise) and tobacco smoking habit (if yes=1). A multivariate analysis that estimates the parameters associated with the occurrence of dental caries is appropriate for this dataset, and such analyses are usually undertaken.

As can be seen in [Figure 1] and [Figure 2], the distribution of the DMF index data is strongly positively skewed. In other words, a large share of the study subjects have a zero DMF count and a minority have a high DMF score. About 47.56% of the subjects presented without any sign of dental caries. Therefore, the estimated Poisson and Negative Binomial distributions do not fit the observed distribution of the DMF index well. Generalized linear models (GLMs) are natural candidates for this kind of count data, although the high proportion of zeros violates their distributional assumptions. Hence, both traditional multiple linear regression (MLR) models and GLMs with different link functions are inadequate here. [34] Zero Inflated models are therefore expected to provide an appropriate and improved statistical fit for DMF counts, and the findings of these models were compared with those of the Poisson and Negative Binomial models.{Figure 1}{Figure 2}

The DMFT index in dental epidemiology

In dental epidemiology, the DMF index is an important and well-known overall measure of the dental status of a person. It is a count of the number of Decayed, Missing and Filled Teeth (in which case it is called the DMFT index). As an application, we consider here data from a study sample of 1,760 subjects aged between 15 and 84 years. The mean age of the study subjects was 41.07±15.11 years.

Clinical examination

The selected subjects were called for a free dental examination. The examination for dental caries was carried out by four well-qualified community dentists, using a plain mouth mirror, dental explorer, disposable gloves and sterilized instruments under artificial light for each subject. The findings of the dental caries examination were recorded according to the diagnostic criteria recommended by the World Health Organization. [35] The data on dental caries by DMF index were recorded by the researcher. Inter-examiner calibration on the DMF index was performed on 30 samples. The inter-examiner agreement was calculated by the Kappa statistic and was found to range from 0.6514 to 0.8214. The data on the selected explanatory variables/covariates were collected and recorded with structured questionnaires using an interview method.
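The agreement calculation mentioned above can be illustrated with a minimal sketch. It assumes Cohen's kappa for one pair of examiners; the article does not state which kappa variant was used, and the function name `cohen_kappa` and the example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same subjects (assumed variant)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of subjects given the same score by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the marginal proportions, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical caries present (1) / absent (0) calls by two examiners.
examiner_1 = [1, 1, 0, 0]
examiner_2 = [1, 0, 0, 0]
kappa = cohen_kappa(examiner_1, examiner_2)
print(round(kappa, 4))
```

With four examiners, a range of kappa values such as the one reported would arise naturally from applying a pairwise statistic of this kind to each pair of examiners.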

Excess zeros and overdispersion in DMFT count data

The Poisson and Binomial random variables are inherently restrictive as models for discrete data because they are determined by only one parameter. Consider, for example, treating the DMF index count, i.e. dental caries, as a Poisson random variable. Because the variance of a Poisson random variable is equal to its mean, it cannot adequately model data in which the variance differs from the mean. Typically, this lack of fit shows up as a variance that is greater than the mean, which is referred to as overdispersion.

The mean and the variance can be estimated by the sample mean x̄ and the sample variance S², leading to the overdispersion test

$$O = \sqrt{\frac{n-1}{2}}\,\frac{S^2 - \bar{x}}{\bar{x}},$$

which takes the value 54.1747 (S²=5.2374, x̄=1.8528, P<0.001). This clearly indicates that a simple Poisson or Binomial model would not give an adequate fit to the DMF index data because of strong overdispersion.
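As a quick numerical check, the sketch below recomputes the test value from the reported summary statistics. It assumes the statistic has the form O = sqrt((n-1)/2)·(S² - x̄)/x̄ with n = 1760, which reproduces the reported value to the precision of the rounded inputs; the function name is illustrative only.

```python
import math


def overdispersion_test(n, sample_mean, sample_variance):
    """Overdispersion statistic O = sqrt((n - 1) / 2) * (S^2 - mean) / mean (assumed form)."""
    return math.sqrt((n - 1) / 2.0) * (sample_variance - sample_mean) / sample_mean


# Summary statistics reported for the DMF data.
o_stat = overdispersion_test(n=1760, sample_mean=1.8528, sample_variance=5.2374)
print(round(o_stat, 2))  # about 54.17
```

Under the Poisson hypothesis this statistic is approximately standard normal, so a value of this size is far beyond any conventional critical value.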

Vuong test for testing the validity of the model

It should be noted that the Poisson (NB) and ZIP (ZINB) models are not mutually exclusive. On the contrary, their features can be combined: a Zero Inflated model whose count component follows a negative binomial distribution is called the ZINB model. The Vuong test is available for testing the validity of the ZIP model against the alternative model. [36] More precisely, the Vuong statistic was used for testing the ZIP (ZINB) regression model against the Poisson (NB) regression model. If the Vuong statistic (z) is greater than 1.96, the ZIP (ZINB) model is accepted. If it is smaller than -1.96, the Poisson (NB) model is accepted. Values between -1.96 and 1.96 do not favour either model.
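The decision rule can be sketched in code. The example assumes the usual form of the Vuong statistic, z = √n · mean(m) / sd(m), where m_i is the difference between the two models' log-likelihood contributions for subject i and sd uses the 1/n divisor. The function names and the per-subject log-likelihood values are hypothetical; only the values 19.72 and 10.51 used in the check come from the article.

```python
import math


def vuong_statistic(loglik_model_1, loglik_model_2):
    """Vuong z from per-observation log-likelihoods of two non-nested models."""
    n = len(loglik_model_1)
    if n != len(loglik_model_2) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    # Pointwise log-likelihood ratios m_i = log f1(x_i) - log f2(x_i).
    m = [a - b for a, b in zip(loglik_model_1, loglik_model_2)]
    mean_m = sum(m) / n
    var_m = sum((d - mean_m) ** 2 for d in m) / n  # 1/n divisor (assumption)
    if var_m == 0.0:
        raise ValueError("log-likelihood differences have zero variance")
    return math.sqrt(n) * mean_m / math.sqrt(var_m)


def vuong_decision(z, critical=1.96):
    """Apply the two-sided rule described in the text."""
    if z > critical:
        return "model 1"
    if z < -critical:
        return "model 2"
    return "inconclusive"


# Hypothetical per-subject log-likelihoods (model 1 = ZIP, model 2 = Poisson).
ll_zip = [-1.0, -2.0, -1.5, -0.5]
ll_poisson = [-1.5, -2.2, -2.5, -1.0]
z = vuong_statistic(ll_zip, ll_poisson)
print(round(z, 3), vuong_decision(z))
```

Swapping the two arguments flips the sign of z, which is why the same statistic can support either model.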

Zero-inflated models for count data

A simple and frequently applied statistical model for count data is the Poisson model, in which we assume that X (the DMFT count) follows a Poisson density

$$P(X = x_i) = \frac{e^{-\lambda_i}\lambda_i^{x_i}}{x_i!}, \qquad x_i = 0, 1, 2, \ldots,$$

with λ_i = exp(y_i β), where y_i is a covariate vector and β is a vector of unknown coefficients to be estimated. [37] When there is population heterogeneity or overdispersion, a gamma mixture of Poisson variables is often assumed. Let ν_i represent an unobserved individual frailty factor, with exp(ν_i) following a gamma distribution with mean 1 and variance α. The counts are now assumed to follow a modification of the Poisson model, with mean λ*_i = exp(y_i β + ν_i).

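This leads to the negative binomial regression model

$$P(X = x_i) = \frac{\Gamma(x_i + 1/\alpha)}{\Gamma(x_i + 1)\,\Gamma(1/\alpha)} \left(\frac{1}{1 + \alpha\lambda_i}\right)^{1/\alpha} \left(\frac{\alpha\lambda_i}{1 + \alpha\lambda_i}\right)^{x_i},$$

where α is an ancillary parameter indicating the degree of overdispersion. The model converges to the Poisson model as α approaches 0.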
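The role of α can be checked numerically. The sketch below assumes the standard gamma-mixture (NB2) parameterisation, with mean λ and variance λ + αλ², and shows that the negative binomial probabilities approach the Poisson probabilities as α shrinks. The function names and the value α = 1 are illustrative assumptions, not estimates from the study.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability of count x with mean lam."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def negbin_pmf(x, lam, alpha):
    """Negative binomial probability with mean lam and variance lam + alpha * lam**2."""
    r = 1.0 / alpha  # gamma shape parameter
    log_p = (math.lgamma(x + r) - math.lgamma(x + 1) - math.lgamma(r)
             + r * math.log(r / (r + lam)) + x * math.log(lam / (r + lam)))
    return math.exp(log_p)


lam, alpha = 1.8528, 1.0  # lam: reported sample mean; alpha: hypothetical
support = range(0, 400)   # tail mass beyond 400 is negligible here
nb_mean = sum(x * negbin_pmf(x, lam, alpha) for x in support)
nb_var = sum((x - nb_mean) ** 2 * negbin_pmf(x, lam, alpha) for x in support)
print(round(nb_mean, 4), round(nb_var, 4))

# With a very small alpha the two models give nearly the same probability.
gap = abs(negbin_pmf(2, lam, 1e-4) - poisson_pmf(2, lam))
print(gap < 1e-3)
```

The variance exceeding the mean by αλ² is what lets the negative binomial model absorb overdispersion that a single Poisson parameter cannot.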

The data show strong overdispersion, i.e. a variance greater than the mean, and the model must account for the extra zeros. One common source of non-Poisson behavior that leads to overdispersion is an excess of zeros in the observed counts. This can be modeled as a mixture of two subpopulations: with probability π an observation comes from a Poisson distribution, and with probability (1-π) the observed count is zero. This ZIP model is consistent with a process in which 100π% of a population is at risk for the event while 100(1-π)% has no risk. We can write the ZIP density function as

$$f(x; \pi, \lambda) = \begin{cases} (1-\pi) + \pi e^{-\lambda}, & x = 0, \\ \pi\,\dfrac{e^{-\lambda}\lambda^{x}}{x!}, & x = 1, 2, \ldots \end{cases}$$

This equation indicates that the ZIP model is a special mixture model with two classes, where the first class has a fixed value at zero. In the case of the DMFT index, this class could include those subjects with no caries risk at all. Simple calculations give the mean and variance of the Zero Inflated random variable X as

$$E(X) = \pi\lambda, \qquad \mathrm{Var}(X) = \pi\lambda(1 + \lambda - \pi\lambda) = E(X)\{1 + \lambda - E(X)\},$$

which implies that a proportion of the overdispersion is explained by the ZIP model. For the DMF count data, the observed overdispersion is S² - x̄ = 3.3846 (S²=5.2374, mean=1.8528). The maximum likelihood estimates for the ZIP model are λ̂ = 3.6795 and π̂ = 0.5035, so the ZIP model leads to a fitted overdispersion of E(X){λ - E(X)}=3.3845, corresponding to an explained overdispersion of E(X){λ - E(X)}/(S² - x̄).
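These moment calculations can be verified with a short sketch. It assumes that π is the weight of the Poisson (at-risk) class, so that E(X) = πλ and Var(X) = E(X){1 + λ - E(X)}, and it plugs in the reported estimates 3.6795 and 0.5035 together with the reported sample mean and variance. The function and variable names are illustrative.

```python
import math


def zip_pmf(x, pi, lam):
    """ZIP probability; pi is the Poisson-class weight, 1 - pi the structural-zero weight."""
    poisson = math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))
    if x == 0:
        return (1.0 - pi) + pi * poisson
    return pi * poisson


pi_hat, lam_hat = 0.5035, 3.6795        # reported ZIP estimates
sample_mean, sample_var = 1.8528, 5.2374  # reported sample moments

# Moments computed directly from the probability function.
support = range(0, 200)
zip_mean = sum(x * zip_pmf(x, pi_hat, lam_hat) for x in support)
zip_var = sum((x - zip_mean) ** 2 * zip_pmf(x, pi_hat, lam_hat) for x in support)

# Overdispersion (variance minus mean): fitted under ZIP versus observed.
fitted_overdispersion = sample_mean * (lam_hat - sample_mean)
observed_overdispersion = sample_var - sample_mean
explained_share = fitted_overdispersion / observed_overdispersion

print(round(zip_mean, 4), round(zip_var, 4))
print(round(fitted_overdispersion, 4), round(observed_overdispersion, 4))
```

The fitted mean πλ lands very close to the sample mean, which is why nearly all of the observed excess of variance over mean is attributed to the zero class.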

Thus, 99.98% of the overdispersion would be explained by the ZIP model. We could consider a formal way of testing the hypothesis H0: E(S²) = E(X){1 + λ - E(X)}, i.e. whether the ZIP and ZINB models satisfactorily explain the overdispersion. One way to test this hypothesis would be to compare the likelihood under this null hypothesis with the likelihood of the non-parametric maximum likelihood estimator. The likelihood function of the ZIP model f(x; π, λ) for observations x_1, ..., x_n, of which n_0 are zeros, is given by

$$L(\pi, \lambda) = \prod_{i=1}^{n} f(x_i; \pi, \lambda) = \{(1-\pi) + \pi e^{-\lambda}\}^{n_0} \prod_{i:\,x_i>0} \pi\,\frac{e^{-\lambda}\lambda^{x_i}}{x_i!}.$$

The vector of parameter estimates (π̂, λ̂) is the value of (π, λ) that maximizes this likelihood.
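One common way to maximize a ZIP likelihood without covariates is the EM algorithm, sketched below on simulated data. This is an illustrative substitute, not the estimation procedure used in the article, which is not specified in the surviving text. It assumes π is the Poisson-class weight; the simulated parameters, seed and function names are hypothetical.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson variate by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def zip_log_likelihood(counts, pi, lam):
    """Log-likelihood of a covariate-free ZIP model (pi = Poisson-class weight)."""
    n_zero = sum(1 for x in counts if x == 0)
    ll = n_zero * math.log((1.0 - pi) + pi * math.exp(-lam))
    for x in counts:
        if x > 0:
            ll += math.log(pi) - lam + x * math.log(lam) - math.lgamma(x + 1)
    return ll


def fit_zip_em(counts, iterations=200):
    """EM iterations for (pi, lam); returns the final estimates."""
    n = len(counts)
    n_zero = sum(1 for x in counts if x == 0)
    total = sum(counts)
    if n_zero == n:
        raise ValueError("all counts are zero; lambda is not identifiable")
    pi, lam = 0.5, total / (n - n_zero)  # crude starting values
    for _ in range(iterations):
        # E-step: probability that an observed zero came from the Poisson class.
        p_zero_poisson = pi * math.exp(-lam)
        w = p_zero_poisson / ((1.0 - pi) + p_zero_poisson)
        at_risk = (n - n_zero) + n_zero * w  # expected size of the Poisson class
        # M-step: update the class weight and the Poisson mean.
        pi, lam = at_risk / n, total / at_risk
    return pi, lam


rng = random.Random(0)
true_pi, true_lam = 0.5, 3.5  # hypothetical simulation settings
counts = [sample_poisson(rng, true_lam) if rng.random() < true_pi else 0
          for _ in range(5000)]
pi_hat, lam_hat = fit_zip_em(counts)
print(round(pi_hat, 3), round(lam_hat, 3))
```

EM is convenient here because both update steps have closed forms; a general-purpose numerical optimizer applied to the log-likelihood would be an equally valid choice.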


ZINB model

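A ZINB distribution arises as a mixture of a negative binomial distribution and a distribution degenerate at zero, assigning a mass of π to extra zeros and a mass of (1-π) to the negative binomial distribution, where 0≤π≤1. Here π denotes the probability of an extra zero, the reverse of its role in the ZIP mixture described above, where π is the weight of the Poisson class. Note that the negative binomial distribution is a continuous mixture of Poisson distributions, in which the Poisson mean λ is gamma distributed. More specifically, the negative binomial distribution (Samuel et al.) is given by

$$P(Y = y) = \frac{\Gamma(y + \tau)}{y!\,\Gamma(\tau)} \left(1 + \frac{\lambda}{\tau}\right)^{-\tau} \left(1 + \frac{\tau}{\lambda}\right)^{-y}, \qquad y = 0, 1, 2, \ldots,$$
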
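where λ=E(Y), τ is the shape parameter that quantifies the amount of overdispersion and Y is the response variable of interest. The variance of Y is λ+λ²/τ. The negative binomial distribution approaches a Poisson distribution as τ→∞ (no overdispersion). The ZINB distribution, the mixture of this negative binomial distribution and a distribution degenerate at zero, is given by

$$P(Y = y) = \begin{cases} \pi + (1-\pi)\left(1 + \dfrac{\lambda}{\tau}\right)^{-\tau}, & y = 0, \\ (1-\pi)\,\dfrac{\Gamma(y + \tau)}{y!\,\Gamma(\tau)} \left(1 + \dfrac{\lambda}{\tau}\right)^{-\tau} \left(1 + \dfrac{\tau}{\lambda}\right)^{-y}, & y = 1, 2, \ldots \end{cases}$$
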
The mean and the variance of the ZINB distribution are E(Y)=(1-π)λ and Var(Y)=(1-π)λ(1+πλ+λ/τ), respectively. Observe that this distribution approaches the ZIP distribution as τ→∞ and the negative binomial distribution as π→0. If both 1/τ and π are approximately 0, the ZINB distribution reduces to the Poisson distribution.
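The stated mean and variance can be confirmed numerically. The sketch below assumes the parameterisation used in this section, with π the probability of an extra zero, λ the negative binomial mean and τ the shape parameter; the parameter values and function names are hypothetical and chosen only for illustration.

```python
import math


def negbin_pmf(y, lam, tau):
    """Negative binomial probability with mean lam and shape tau (variance lam + lam**2 / tau)."""
    log_p = (math.lgamma(y + tau) - math.lgamma(y + 1) - math.lgamma(tau)
             + tau * math.log(tau / (tau + lam)) + y * math.log(lam / (tau + lam)))
    return math.exp(log_p)


def zinb_pmf(y, pi, lam, tau):
    """ZINB probability; pi is the probability of an extra (structural) zero."""
    base = (1.0 - pi) * negbin_pmf(y, lam, tau)
    if y == 0:
        return pi + base
    return base


pi, lam, tau = 0.3, 2.0, 1.5  # hypothetical parameter values
support = range(0, 600)       # tail mass beyond 600 is negligible here
zinb_mean = sum(y * zinb_pmf(y, pi, lam, tau) for y in support)
zinb_var = sum((y - zinb_mean) ** 2 * zinb_pmf(y, pi, lam, tau) for y in support)

# Closed-form moments quoted in the text.
expected_mean = (1.0 - pi) * lam
expected_var = (1.0 - pi) * lam * (1.0 + pi * lam + lam / tau)
print(round(zinb_mean, 4), round(expected_mean, 4))
print(round(zinb_var, 4), round(expected_var, 4))
```

Setting `pi` to 0 in this sketch recovers the plain negative binomial moments, and a very large `tau` gives the ZIP case.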


To understand how the different count regression models fit the DMFT data, we examine the fit of various regression models. The results of using the Poisson, NB, ZIP and ZINB regression models are given in [Table 1].{Table 1}

First, we consider the Poisson regression model. Based on the likelihood ratio chi-square in [Table 1], the Poisson regression model does not provide an adequate fit to the DMFT count data. The observed proportion of zero DMFT counts is 47.56%. The likelihood ratio chi-square value is larger for the Poisson model (321.16) than for the ZIP model (87.46). In such a situation, it is appropriate to estimate the ZIP regression model. Is the ZIP regression model statistically preferred over the Poisson regression model? We apply the Vuong test statistic to check whether the ZIP regression model is a significant improvement over the Poisson regression model. The value of the Vuong test statistic is 19.7200, which exceeds the critical z-value of 1.96 at the 0.05 level. That is, the Vuong test provides evidence that the ZIP regression model is preferred over the Poisson regression model. The ZIP regression model does better in predicting the proportion of zeros, and the Poisson and ZIP regression coefficients are quite different in magnitude; however, the standard errors for the ZIP regression coefficients tend to be larger than those for the Poisson regression coefficients.

Similarly, to test the ZINB regression model against the NB regression model, the value of the Vuong test statistic is calculated as 10.5100, which exceeds the critical z-value of 1.96 at the 0.05 level. That is, the Vuong test provides evidence that the ZINB model is preferred over the NB model.

We next consider the NB and ZINB regression models, which include a dispersion parameter. The test of α=0 using the asymptotic Wald statistic showed that α is significantly different from zero for the NB and ZINB regression models in [Table 1]. Because the hypothesis H0: α=0 is rejected, the Poisson regression model is not appropriate for the DMFT count data. The likelihood ratio chi-square values for the Poisson, ZIP, NB and ZINB regression models are, respectively, 321.16, 87.46, 84.77 and 67.27. The likelihood ratio chi-square is thus smallest for the ZINB regression model, compared with the Poisson, ZIP and NB regression models. This indicates that the NB and ZINB regression models, which model the overdispersion, fit the DMFT data better than the Poisson and ZIP regression models.

Family size, frequency of brushing and duration of change of toothbrush have a significant negative relationship with the DMF count data in the Poisson and NB regression models, but a significant positive relationship with the DMFT data in the ZIP and ZINB regression models. Age has a positive relationship with the DMFT data only in the Poisson regression model, and not in the NB, ZIP and ZINB regression models. The other covariates, i.e. frequency of sweet consumption, frequency of brushing, use of systemic toothpaste, rinsing habit, duration of change of toothbrush and smoking habit, are found to have significant influences on dental caries, but the sign of their influence differs between models [Table 1].

Discussions and Conclusions

In recent years, various studies in dental epidemiology have shown that DMFT count data are frequently right skewed, with a sizable proportion of zero values. For this type of data, the standard Poisson and standard NB regression models are not acceptable competitors for assessing the covariates of DMFT data. Accordingly, the results estimated with the Poisson and NB distributions displayed some differences when compared. These differences could lead to incorrect interpretations, affecting the general results of studies.

The application addressed in this paper involves the estimation of the Poisson, NB, ZIP and ZINB regression models to predict dental caries by using the DMFT index. Because count data frequently exhibit overdispersion in addition to possible zero inflation, an obvious approach is to use a model that can accommodate both overdispersion and zero inflation. We therefore consider the ZIP and ZINB regression models in terms of both zero inflation and overdispersion. The ZIP and ZINB regression models are alternatives to the Poisson and NB regression models when zero inflation is present. The Poisson regression model did not converge in fitting the DMFT data.

For this reason, we apply the ZIP and ZINB regression models, in preference to the Poisson and NB models, for modeling overdispersed DMFT data with many zeros. We apply the Vuong test statistic to test whether the ZIP regression model is a significant improvement over the Poisson regression model and whether the ZINB regression model is a significant improvement over the NB regression model. The value of the Vuong test statistic is larger than z=1.96 at the 0.05 level of significance for both the ZIP and ZINB models.

Based on the results for the DMF count data, the ZIP model fits better than the standard Poisson model, and the ZINB model fits better than the NB model. The ZINB model also fits the DMF count data better than the ZIP model.


The authors thank Dr. V. Tippeswamy, Dr. K.V.V. Prasad and other subordinate staff, Department of Public Health Dentistry, SDM College of Dental Sciences and Hospital, Dharwad, Karnataka, India, for giving timely suggestions during the preparation of this article.


1. Klein H, Palmer C, Knutson JW. Studies of dental caries I: Dental status and Dental needs of elementary school children. Public Health Rep 1938;53:751-65.
2. Heilbron DC. Zero-altered and other regression models for count data with added zeros. Biom J 1994;36:531-47.
3. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman and Hall; 1989. p. 511.
4. Hinde J, Demetrio CGB. Overdispersion: models and estimation. Comput Stat Data Anal 1998;27:151-70.
5. Poortema K. Modeling overdispersion of counts. Stat Neerl 1999;53:5-20.
6. Barry SC, Welsh AH. Generalized additive modelling and zero inflated count data. Ecol Model 2002;157:179-88.
7. Campbell MJ, Machin D, D'Arcangues C. Coping with extra Poisson variability in the analysis of factors influencing vaginal ring expulsions. Stat Med 1991;10:241-54.
8. Ghahramani M, Dean CB, Spinelli JJ. Simultaneous modelling of operative mortality and long-term survival after coronary artery bypass surgery. Stat Med 2001;20:1931-45.
9. Cheung YB. Zero-inflated models for regression analysis of count data: a study of growth and development. Stat Med 2002;21:1461-9.
10. Lee AH, Stevenson MR, Wang K, Yau KKW. Modeling young driver motor vehicle crashes: data with extra zeros. Accid Anal Prev 2002;34:515-21.
11. Carrivick PJW, Lee AH, Yau KKW. Zero-inflated Poisson modeling to evaluate occupational safety interventions. Saf Sci 2003;41:53-63.
12. Wang K, Lee AH, Yau KKW, Carrivick PJW. Bivariate zero-inflated Poisson regression model to analyze occupational injuries. Accid Anal Prev 2003;35:625-9.
13. Yau KK, Lee AH, Carrivick PJ. Modeling zero inflated count series with application to occupational health. Comput Methods Programs Biomed 2004;74:47-52.
14. Welsh AH, Cunningham RB, Donnelly CF, Lindenmayer DB. Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecol Model 1996;88:297-308.
15. Podlich HM, Faddy MJ, Smyth GK. A general approach to modeling and analysis of species abundance data with extra zeros. J Agric Biol Environ Stat 2002;7:324-34.
16. Kuhnert PM, Martin TG, Mengersen K, Possingham HP. Assessing the impacts of grazing levels on bird density in woodland habitat: a Bayesian approach using expert opinion. Environmetrics 2005;16:1-31.
17. Martin TG, Kuhnert PM, Mengersen K, Possingham HP. The power of expert opinion in ecological models using Bayesian methods: impact of grazing on birds. Ecol Appl 2005;15:266-80.
18. Hall DB. Zero-inflated Poisson binomial regression with random effects: a case study. Biometrics 2000;56:1030-9.
19. Ridout M, Demetrio CGB, Hinde J. Models for count data with many zeros. Invited paper presented at the Nineteenth International Biometric Conference, Cape Town, South Africa 1998;179-190.
20. Lambert D. Zero-inflated Poisson regression with an application to defects in manufacturing. Technometrics 1992;34:1-14.
21. Freund DA, Kniesner TJ, LoSasso AT. Dealing with the common econometric problems of count data with excess zeros, endogenous treatment effects and attrition bias. Econ Lett 1999;62:7-12.
22. Green W. Accounting for excess and simple selection in Poisson and Negative Binomial Regression Models. Working Paper EC-94-10, Department of Economics, Stern School of Business, New York University, New York, N.Y.; 1994.
23. Terza JV, Wilson P. Analyzing frequencies of several types of events: A mixed Multinomial-Poisson Approach. Rev of Economics and Statistics 1990;72:108-15.
24. Zorn CJW. Evaluating zero-inflated and hurdle Poisson Specifications. Working paper. Department of Political Science, Ohio State University, Columbus, OH; 1996.
25. Aitkin M, Anderson D, Francis B, Hinde J. Statistical modeling in GLIM. Oxford: Clarendon; 1989.
26. Breslow NE. Extra Poisson variation in log linear models. Appl Statist 1984;33:38-44.
27. Böhning D, Dietz E, Schlattmann P. The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. J R Stat Soc 1999;162:195-209.
28. Lewsey JD, Thomson WM. The utility of the zero-inflated Poisson and zero-inflated negative binomial models: a case study of cross-sectional and longitudinal DMF data examining the effect of socio-economic status. Comm Dent Oral Epidemiol 2004;32:183-9.
29. Cohen AC. Estimation of mixtures of Discrete Distributions. In: Proceedings of the International Symposium on Discrete Distributions, Montreal, Quebec. 1963.
30. Johnson NL, Kotz S. Discrete distributions: Distributions in Statistics. New York: John Wiley and Sons; 1968.
31. Yip P. Conditional inference on a mixture model for the analysis of count data. Communs Statist Theory Meth 1991;20:2045-57.
32. Johnson N, Kotz S, Kemp AW. Univariate Discrete Distributions. 2nd ed. New York: Wiley; 1992.
33. Fong DWT, Yip P. An algorithm for a mixture model of count data. Statist Probab Lett 1993;17:53-60.
34. Javali SB, Pandit PV. Use of the Generalized Linear Models in Data Related to Dental Caries Index. Indian J Dent Res 2007;18:163-7.
35. World Health Organization. Oral health surveys. Basic Methods. Geneva: WHO; 1997.
36. Vuong QH. Likelihood ratio tests for model selection and non-nested Hypothesis. Econometrica 1989;57:307-33.
37. Böhning D. A note on test for Poisson overdispersion. Biometrika 1994;81:418-9.