AIC and BIC model fit tests on weighted DHS data

Jungseok Lee

Join Date: Mar 2015

Posts: 3
#1

31 Mar 2015, 20:37

Dear all,

I am currently working on a count model (negative binomial) and trying to choose the best model based on AIC or BIC model fit tests.
My concern is that I have read some articles saying that these criteria should not be used with clustered or weighted data, which are common in surveys. Because my data were constructed somewhat differently from the situation those articles describe, I would appreciate any help in judging whether AIC and BIC are appropriate here.

The model combines two data sources. The dependent variable was obtained from a surveillance database, so neither weighting nor sampling design matters for it. The independent variables, however, were obtained from Demographic and Health Surveys (DHS) data, for which sample weights must be used. The goal of this analysis is to identify the statistically significant independent variables that explain variation in the dependent variable (as usual).
Prior to running a regression, the dataset for the independent variables was prepared by collapsing (by region) using the sample weights provided with the DHS datasets. Thus, I do not use the "[iweight=weight]" option when running the regression (the independent variables were already weighted when collapsing, and no weight is required for the dependent variable). The regression and test outputs for one of the models are shown below.

I was wondering if it would be okay to use AIC or BIC tests for model comparison in this context.
Thank you.

Jungseok Lee

. xi: glm inc1000 i.q3RF1 i.age_grp*inc_type, fam(nb)
i.q3RF1 _Iq3RF1_1-3 (naturally coded; _Iq3RF1_1 omitted)
i.age_grp _Iage_grp_1-5 (naturally coded; _Iage_grp_5 omitted)
i.age_~p*inc~pe _IageXinc_t_# (coded as above)
note: _IageXinc_t_1 omitted because of collinearity

Iteration 0: log likelihood = -228.99003
Iteration 1: log likelihood = -225.47151
Iteration 2: log likelihood = -225.43393
Iteration 3: log likelihood = -225.43391

Generalized linear models No. of obs = 84
Optimization : ML Residual df = 73
Scale parameter = 1
Deviance = 80.20500562 (1/df) Deviance = 1.098699
Pearson = 71.00797014 (1/df) Pearson = .9727119

Variance function: V(u) = u+(1)u^2 [Neg. Binomial]
Link function : g(u) = ln(u) [Log]

AIC = 5.629379
Log likelihood = -225.4339137 BIC = -243.2446

------------------------------------------------------------------------------
| OIM
inc1000 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iq3RF1_2 | -.5920396 .3099836 -1.91 0.056 -1.199596 .0155171
_Iq3RF1_3 | .3788479 .3493202 1.08 0.278 -.305807 1.063503
_Iage_grp_1 | .9523037 .4453947 2.14 0.033 .079346 1.825261
_Iage_grp_2 | -4.379099 1.337118 -3.28 0.001 -6.999803 -1.758395
_Iage_grp_3 | -1.639636 .6600511 -2.48 0.013 -2.933312 -.3459597
_Iage_grp_4 | -3.685952 1.512576 -2.44 0.015 -6.650546 -.721358
inc_type | -3.278245 .6011499 -5.45 0.000 -4.456478 -2.100013
_IageXinc_~1 | (omitted)
_IageXinc_~2 | 5.701549 1.390779 4.10 0.000 2.975672 8.427426
_IageXinc_~3 | 2.357005 .7785509 3.03 0.002 .831073 3.882936
_IageXinc_~4 | 3.255374 1.596704 2.04 0.041 .1258912 6.384857
_cons | 4.277992 .5850281 7.31 0.000 3.131358 5.424626
------------------------------------------------------------------------------

. estat ic

-----------------------------------------------------------------------------
Model | Obs ll(null) ll(model) df AIC BIC
-------------+---------------------------------------------------------------
. | 84 . -225.4339 11 472.8678 499.6068
-----------------------------------------------------------------------------
Note: N=Obs used in calculating BIC; see [R] BIC note

Last edited by Jungseok Lee; 31 Mar 2015, 20:41.
Richard Williams

Join Date: Apr 2014

Posts: 4987
#2

31 Mar 2015, 21:14

Two side points:

Your code and output would be much easier to read if you used code tags. See pt. 12 of the FAQ.

Unless you are using an ancient version of Stata, do not use xi. Just use factor variable notation. See -help fvvarlist-
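
For illustration, the xi command in post #1 might be rewritten with factor-variable notation roughly as follows. This is only a sketch: ib5. is used because the xi output shows age group 5 as the omitted category, and c.inc_type assumes inc_type is meant to enter as a continuous (or 0/1) variable, which is how xi treats it in the original command.

```stata
* Factor-variable sketch of:
*   xi: glm inc1000 i.q3RF1 i.age_grp*inc_type, fam(nb)
* ib5.age_grp  : age group 5 is the base category, as in the xi output
* c.inc_type   : assumption, inc_type is treated as continuous (as xi does);
*                use i.inc_type instead if it is really categorical
* ##           : main effects plus the age_grp x inc_type interaction
glm inc1000 i.q3RF1 ib5.age_grp##c.inc_type, family(nbinomial) link(log)

* Information criteria for the fitted model
estat ic
```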

I don't really understand what the units of analysis are here. Do you have 84 regions, or what? What are the independent variables? You say you did collapsing -- so are these mean values or what? If you did the analysis right I have a feeling BIC and AIC are ok but I don't really understand how you came up with your variables or what they are.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam

Jungseok Lee

Join Date: Mar 2015
Posts: 3
#3

31 Mar 2015, 22:37

Dear Richard Williams,

Thank you very much for your response.
I edited the regression outputs below following your advice.
The dependent variable is the incidence rate (per 1,000) at 84 surveillance locations (some are in the same country, some in different countries). The raw DHS data for the independent variables are the numbers of people who responded "yes" to certain categories (e.g., using tap water as their drinking water source). I am interested in the proportion of respondents who answered "yes" to each category out of all respondents. To get the proportions, I first created dummy variables for each category of the variable and summed them by collapsing, using the DHS sample weight (hv005). The proportions were then calculated by dividing the summed frequency of each dummy variable by the total frequency. These proportions were used as independent variables in the count model.
I hope this clarifies things.
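
The collapsing step described above can be sketched as follows. The names water_source, tap_water, region, and p_tap_water are hypothetical placeholders, not the variables actually used; only hv005 comes from the description. The sketch relies on the fact that the weighted mean of a 0/1 indicator is the weighted "yes" total divided by the weighted total, so it should give the same proportion as summing and then dividing.

```stata
* hv005 is stored with six implied decimals, so rescale it first
generate double wt = hv005/1000000

* Hypothetical 0/1 indicator for one category of a survey item
* (water_source and the code 1 are placeholders)
generate byte tap_water = (water_source == 1) if !missing(water_source)

* Weighted proportion per location: with a 0/1 variable the weighted mean
* equals sum(wt*tap_water)/sum(wt) within each region
collapse (mean) p_tap_water = tap_water [pweight = wt], by(region)
```

The resulting file has one row per region, which could then be merged with the surveillance incidence data before fitting the count model.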

Code:

. xi: glm inc1000 i.q3RF1 i.age_grp*inc_type, fam(nb)
i.q3RF1           _Iq3RF1_1-3         (naturally coded; _Iq3RF1_1 omitted)
i.age_grp         _Iage_grp_1-5       (naturally coded; _Iage_grp_5 omitted)
i.age_~p*inc~pe   _IageXinc_t_#       (coded as above)
note: _IageXinc_t_1 omitted because of collinearity

Iteration 0:   log likelihood = -228.99003 
Iteration 1:   log likelihood = -225.47151 
Iteration 2:   log likelihood = -225.43393 
Iteration 3:   log likelihood = -225.43391 

Generalized linear models                          No. of obs      =        84
Optimization     : ML                              Residual df     =        73
                                                   Scale parameter =         1
Deviance         =  80.20500562                    (1/df) Deviance =  1.098699
Pearson          =  71.00797014                    (1/df) Pearson  =  .9727119

Variance function: V(u) = u+(1)u^2                 [Neg. Binomial]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  5.629379
Log likelihood   = -225.4339137                    BIC             = -243.2446

------------------------------------------------------------------------------
             |                 OIM
     inc1000 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Iq3RF1_2 |  -.5920396   .3099836    -1.91   0.056    -1.199596    .0155171
   _Iq3RF1_3 |   .3788479   .3493202     1.08   0.278     -.305807    1.063503
 _Iage_grp_1 |   .9523037   .4453947     2.14   0.033      .079346    1.825261
 _Iage_grp_2 |  -4.379099   1.337118    -3.28   0.001    -6.999803   -1.758395
 _Iage_grp_3 |  -1.639636   .6600511    -2.48   0.013    -2.933312   -.3459597
 _Iage_grp_4 |  -3.685952   1.512576    -2.44   0.015    -6.650546    -.721358
    inc_type |  -3.278245   .6011499    -5.45   0.000    -4.456478   -2.100013
_IageXinc_~1 |  (omitted)
_IageXinc_~2 |   5.701549   1.390779     4.10   0.000     2.975672    8.427426
_IageXinc_~3 |   2.357005   .7785509     3.03   0.002      .831073    3.882936
_IageXinc_~4 |   3.255374   1.596704     2.04   0.041     .1258912    6.384857
       _cons |   4.277992   .5850281     7.31   0.000     3.131358    5.424626
------------------------------------------------------------------------------

. estat ic

-----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
-------------+---------------------------------------------------------------
           . |     84           .   -225.4339     11     472.8678    499.6068
-----------------------------------------------------------------------------
               Note:  N=Obs used in calculating BIC; see [R] BIC note
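
The glm header and estat ic report different AIC and BIC values because they use different definitions, not because the fit differs. As a quick check, the display commands below recompute each figure from the log likelihood (-225.4339137), the 11 estimated parameters, N = 84, the deviance (80.20500562), and the 73 residual df shown above; the results should match the output to the printed precision.

```stata
* estat ic: AIC = -2*ll + 2*k
display %9.4f -2*(-225.4339137) + 2*11

* estat ic: BIC = -2*ll + k*ln(N)
display %9.4f -2*(-225.4339137) + 11*ln(84)

* glm header: AIC = (-2*ll + 2*k)/N
display %9.6f (-2*(-225.4339137) + 2*11)/84

* glm header: BIC = deviance - (residual df)*ln(N)
display %9.4f 80.20500562 - 73*ln(84)
```

When comparing models, use one definition consistently; the estat ic versions are the ones on the usual "smaller is better" scale.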

Last edited by Jungseok Lee; 31 Mar 2015, 22:40.


Richard Williams

Join Date: Apr 2014

Posts: 4987
#4

02 Apr 2015, 19:12

So, it sounds like the units of analysis are the 84 surveillance units. Values for the units were obtained in different ways -- some were obtained from the surveillance data set, others were computed using other data sets -- but in any event you have variable values for each surveillance unit. My guess is that BIC and AIC are ok, but somebody who has worked with these data may have much more expertise on these matters.
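
If AIC and BIC are used, a comparison between candidate models might look like the sketch below. The reduced model is a hypothetical example (the same predictors without the interaction), and the factor-variable syntax assumes inc_type is entered as continuous; neither is taken from the original analysis. Both models need to be fit to the same 84 observations for the criteria to be comparable.

```stata
* Model from the thread, written with factor-variable notation
glm inc1000 i.q3RF1 ib5.age_grp##c.inc_type, family(nbinomial)
estimates store full

* Hypothetical competitor: same predictors, no interaction
glm inc1000 i.q3RF1 ib5.age_grp c.inc_type, family(nbinomial)
estimates store reduced

* AIC and BIC side by side; smaller values indicate the preferred model
estimates stats full reduced
```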

