  • AIC and BIC model fit tests on weighted DHS data

    Dear all,

    I am currently fitting a count model (negative binomial) and trying to choose the best model using the AIC or BIC.
    My concern is that I have read some articles saying these criteria should not be used with clustered or weighted data, which are common in survey research. Because the way I constructed my data differs somewhat from that situation, I would appreciate help in judging whether AIC and BIC are appropriate here.

    The model draws on two types of datasets. The dependent variable was obtained from the surveillance database, so neither weighting nor sampling matters for it. However, the independent variables were obtained from Demographic and Health Survey (DHS) data, where sample weights must be used. The goal of this analysis is to identify statistically significant independent variables that explain variation in the dependent variable (as usual).
    Before running the regression, I prepared the dataset for the independent variables by collapsing (by region) with the sample weights provided in the DHS datasets. Thus, I do not have to use the "[iweight=weight]" option when running the regression (because the final dataset for the independent variables was already weighted when collapsing, and no weight was required for the dependent variable). The regression and test output for one of the models is shown below.

    I was wondering if it would be okay to use AIC or BIC tests for model comparison in this context.
    Thank you.


    Jungseok Lee


    . xi: glm inc1000 i.q3RF1 i.age_grp*inc_type, fam(nb)
    i.q3RF1 _Iq3RF1_1-3 (naturally coded; _Iq3RF1_1 omitted)
    i.age_grp _Iage_grp_1-5 (naturally coded; _Iage_grp_5 omitted)
    i.age_~p*inc~pe _IageXinc_t_# (coded as above)
    note: _IageXinc_t_1 omitted because of collinearity

    Iteration 0: log likelihood = -228.99003
    Iteration 1: log likelihood = -225.47151
    Iteration 2: log likelihood = -225.43393
    Iteration 3: log likelihood = -225.43391

    Generalized linear models No. of obs = 84
    Optimization : ML Residual df = 73
    Scale parameter = 1
    Deviance = 80.20500562 (1/df) Deviance = 1.098699
    Pearson = 71.00797014 (1/df) Pearson = .9727119

    Variance function: V(u) = u+(1)u^2 [Neg. Binomial]
    Link function : g(u) = ln(u) [Log]

    AIC = 5.629379
    Log likelihood = -225.4339137 BIC = -243.2446

    ------------------------------------------------------------------------------
    | OIM
    inc1000 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    _Iq3RF1_2 | -.5920396 .3099836 -1.91 0.056 -1.199596 .0155171
    _Iq3RF1_3 | .3788479 .3493202 1.08 0.278 -.305807 1.063503
    _Iage_grp_1 | .9523037 .4453947 2.14 0.033 .079346 1.825261
    _Iage_grp_2 | -4.379099 1.337118 -3.28 0.001 -6.999803 -1.758395
    _Iage_grp_3 | -1.639636 .6600511 -2.48 0.013 -2.933312 -.3459597
    _Iage_grp_4 | -3.685952 1.512576 -2.44 0.015 -6.650546 -.721358
    inc_type | -3.278245 .6011499 -5.45 0.000 -4.456478 -2.100013
    _IageXinc_~1 | (omitted)
    _IageXinc_~2 | 5.701549 1.390779 4.10 0.000 2.975672 8.427426
    _IageXinc_~3 | 2.357005 .7785509 3.03 0.002 .831073 3.882936
    _IageXinc_~4 | 3.255374 1.596704 2.04 0.041 .1258912 6.384857
    _cons | 4.277992 .5850281 7.31 0.000 3.131358 5.424626
    ------------------------------------------------------------------------------

    . estat ic

    -----------------------------------------------------------------------------
    Model | Obs ll(null) ll(model) df AIC BIC
    -------------+---------------------------------------------------------------
    . | 84 . -225.4339 11 472.8678 499.6068
    -----------------------------------------------------------------------------
    Note: N=Obs used in calculating BIC; see [R] BIC note

    Last edited by Jungseok Lee; 31 Mar 2015, 20:41.

  • #2
    Two side points:

    Your code and output would be much easier to read if you used code tags. See pt. 12 of the FAQ.

    Unless you are using an ancient version of Stata, do not use xi. Just use factor variable notation. See -help fvvarlist-
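
    As a rough sketch of what the factor-variable version of the command in #1 might look like: the base category for age_grp (5) is taken from the xi output above, and treating inc_type as a continuous (or 0/1) regressor is an assumption, since the thread does not say how it is coded. Which collinear terms Stata drops may differ from the xi run.

    ```stata
    * Hypothetical factor-variable equivalent of:
    *   xi: glm inc1000 i.q3RF1 i.age_grp*inc_type, fam(nb)
    * ib5.age_grp keeps age group 5 as the base, as in the xi output.
    * c.inc_type assumes inc_type enters as a single numeric regressor.
    glm inc1000 i.q3RF1 ib5.age_grp##c.inc_type, family(nbinomial) link(log)

    * Information criteria for the fitted model
    estat ic
    ```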

    I don't really understand what the units of analysis are here. Do you have 84 regions, or what? What are the independent variables? You say you did collapsing -- so are these mean values or what? If you did the analysis right I have a feeling BIC and AIC are ok but I don't really understand how you came up with your variables or what they are.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: rwilliam@ND.Edu
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      Dear Richard Williams,

      Thank you very much for your response.
      I edited the regression outputs below following your advice.
      The dependent variable is the incidence rate (per 1,000) for 84 surveillance locations (some are in the same country, some in different countries). The raw DHS data for the independent variables are the numbers of people who responded "yes" to certain categories (e.g. using tap water as their drinking water source). I am interested in the proportion of respondents who answered "yes" to each category out of all respondents. To get the proportions, I first created dummy variables for each category of the variable and summed them by collapsing. When collapsing, the DHS sample weight (hv005) was used. The proportions were then calculated by dividing the summed frequency of each dummy variable by the total frequency. These proportions were used as independent variables in the count model.
      I hope this clarifies things.
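
      For readers following along, the collapsing step described above could be sketched roughly as follows. Only hv005 comes from this thread; the variable names hv201 and region, the new names wt, tap and p_tap, and the category code 11 are illustrative assumptions. A weighted mean of a 0/1 dummy gives the same proportion as dividing the weighted sum by the weighted total.

      ```stata
      * Sketch only: hv201, region, and the code 11 are assumed names/values.
      * DHS sample weights are stored multiplied by 1,000,000.
      generate double wt = hv005/1000000

      * 0/1 indicator for the category of interest (missing stays missing)
      generate byte tap = (hv201 == 11) if !missing(hv201)

      * Weighted proportion answering "yes", one row per region
      collapse (mean) p_tap = tap [pweight = wt], by(region)
      ```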

      Code:
      . xi: glm inc1000 i.q3RF1 i.age_grp*inc_type, fam(nb)
      i.q3RF1           _Iq3RF1_1-3         (naturally coded; _Iq3RF1_1 omitted)
      i.age_grp         _Iage_grp_1-5       (naturally coded; _Iage_grp_5 omitted)
      i.age_~p*inc~pe   _IageXinc_t_#       (coded as above)
      note: _IageXinc_t_1 omitted because of collinearity
      
      Iteration 0:   log likelihood = -228.99003 
      Iteration 1:   log likelihood = -225.47151 
      Iteration 2:   log likelihood = -225.43393 
      Iteration 3:   log likelihood = -225.43391 
      
      Generalized linear models                          No. of obs      =        84
      Optimization     : ML                              Residual df     =        73
                                                         Scale parameter =         1
      Deviance         =  80.20500562                    (1/df) Deviance =  1.098699
      Pearson          =  71.00797014                    (1/df) Pearson  =  .9727119
      
      Variance function: V(u) = u+(1)u^2                 [Neg. Binomial]
      Link function    : g(u) = ln(u)                    [Log]
      
                                                         AIC             =  5.629379
      Log likelihood   = -225.4339137                    BIC             = -243.2446
      
      ------------------------------------------------------------------------------
                   |                 OIM
           inc1000 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
         _Iq3RF1_2 |  -.5920396   .3099836    -1.91   0.056    -1.199596    .0155171
         _Iq3RF1_3 |   .3788479   .3493202     1.08   0.278     -.305807    1.063503
       _Iage_grp_1 |   .9523037   .4453947     2.14   0.033      .079346    1.825261
       _Iage_grp_2 |  -4.379099   1.337118    -3.28   0.001    -6.999803   -1.758395
       _Iage_grp_3 |  -1.639636   .6600511    -2.48   0.013    -2.933312   -.3459597
       _Iage_grp_4 |  -3.685952   1.512576    -2.44   0.015    -6.650546    -.721358
          inc_type |  -3.278245   .6011499    -5.45   0.000    -4.456478   -2.100013
      _IageXinc_~1 |  (omitted)
      _IageXinc_~2 |   5.701549   1.390779     4.10   0.000     2.975672    8.427426
      _IageXinc_~3 |   2.357005   .7785509     3.03   0.002      .831073    3.882936
      _IageXinc_~4 |   3.255374   1.596704     2.04   0.041     .1258912    6.384857
             _cons |   4.277992   .5850281     7.31   0.000     3.131358    5.424626
      ------------------------------------------------------------------------------
      
      . estat ic
      
      -----------------------------------------------------------------------------
             Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
      -------------+---------------------------------------------------------------
                 . |     84           .   -225.4339     11     472.8678    499.6068
      -----------------------------------------------------------------------------
                     Note:  N=Obs used in calculating BIC; see [R] BIC note
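
      Note that glm and estat ic report different AIC and BIC values for the same fit. As far as I can tell from the numbers above, estat ic uses the usual -2*ll + 2*k and -2*ll + k*ln(N), while glm reports AIC divided by N and a deviance-based BIC. The following check reproduces all four figures from the reported log likelihood, deviance, N = 84, and k = 11; the scalar names are arbitrary.

      ```stata
      * Values copied from the output above
      scalar s_ll  = -225.4339137
      scalar s_dev = 80.20500562
      scalar s_n   = 84
      scalar s_k   = 11

      * estat ic definitions (about 472.87 and 499.61)
      display "estat ic AIC = " -2*scalar(s_ll) + 2*scalar(s_k)
      display "estat ic BIC = " -2*scalar(s_ll) + scalar(s_k)*ln(scalar(s_n))

      * glm header definitions (about 5.629 and -243.24)
      display "glm AIC      = " (-2*scalar(s_ll) + 2*scalar(s_k))/scalar(s_n)
      display "glm BIC      = " scalar(s_dev) - (scalar(s_n) - scalar(s_k))*ln(scalar(s_n))
      ```

      When comparing models, use the same definition (and the same 84 observations) for every candidate model.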
      Last edited by Jungseok Lee; 31 Mar 2015, 22:40.



      • #4
        So, it sounds like the units of analysis are the 84 surveillance units. Values for the units were obtained in different ways -- some were obtained from the surveillance data set, others were computed using other data sets -- but in any event you have variable values for each surveillance unit. My guess is that BIC and AIC are ok, but somebody who has worked with these data may have much more expertise on these matters.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        Stata Version: 17.0 MP (2 processor)

        EMAIL: rwilliam@ND.Edu
        WWW: https://www3.nd.edu/~rwilliam
