  • Data mining - Choosing the right model

    I am currently trying to choose the right set of independent variables and the right model.

    My model is

    ologit happiness i.external i.income i.d i.male i.GNIcap age age2 i.nchildren i.empl refincome i.marital i.health i.year, nolog

    Variables used are:

    happiness - categorical, 1 not happy at all to 4 very happy

    external - external religiosity, dummy: 1 yes, 0 no

    income - income scale, categorical: 1 lowest step to 11 highest step

    d - religious denomination, categorical: 11 main denominations

    male - dummy

    GNIcap - categorical: low-, medium- and high-income countries

    age age2

    nchildren - number of children: 1, 2, 3, 4, 5, 6, 7, 8 and more

    empl - employment status, categorical: codes 1 (full time) to 10 (not employed)

    refincome - reference income

    marital - marital status: 0 not married, 1 married

    health - health status: 0 very poor to 4 very good

    year



    I did a sensitivity check: I started with the following baseline model and added one variable at a time until I ended up with the full model.

    ologit happiness, nolog

    ologit happiness i.external, nolog

    ologit happiness i.external i.income, nolog

    ..........
    ......
    ...........
    ologit happiness i.external i.income i.d i.male i.GNIcap age age2 i.nchildren i.empl refincome i.marital i.health i.year, nolog

    As I add more variables to the baseline model, BIC and AIC decrease and pseudo R2 increases.

    This seems suspicious to me. I have the feeling that even if I added 10 more variables to the model, BIC and AIC would still decrease and pseudo R2 would still increase, suggesting that the best model is the one with all variables.

    My question is: am I right, or is there something suspicious in these results (AIC, BIC and pseudo R2)?
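
    The stepwise comparison described above can be sketched as follows. One thing worth checking is that every model is fitted on the same observations, because AIC and BIC are not comparable across models whose N differs (for example, when an added variable has missing values). The names insample, m0, m1, m2 and full are hypothetical and chosen only for this illustration, and the /// line continuation assumes the code is run from a do-file.

    ```stata
    * Fit the full model first and flag the observations it actually used
    quietly ologit happiness i.external i.income i.d i.male i.GNIcap age age2 ///
        i.nchildren i.empl refincome i.marital i.health i.year
    generate byte insample = e(sample)   // hypothetical marker variable
    estimates store full

    * Refit the smaller models on exactly that sample
    quietly ologit happiness if insample
    estimates store m0

    quietly ologit happiness i.external if insample
    estimates store m1

    quietly ologit happiness i.external i.income if insample
    estimates store m2

    * One table with N, log likelihoods, df, AIC and BIC for each stored model
    estimates stats m0 m1 m2 full
    ```

    If N differs between rows when the "if insample" restriction is dropped, part of the change in AIC and BIC comes from the changing sample rather than from the added variable.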



  • #2
    Emin,
    rather than hunting for the "best" model, I would look at the literature in your research field to see what others have done when addressing the same research topic.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      "This seems suspicious to me. I have the feeling that even if I added 10 more variables to the model, BIC and AIC would still decrease and pseudo R2 would still increase, suggesting that the best model is the one with all variables.
      My question is: am I right, or is there something suspicious in these results (AIC, BIC and pseudo R2)?"
      Theoretically speaking, given that "all" explanatory variables are available, we should expect that the best model may (setting aside collinearity and other issues) eventually be the one that includes them all.

      That said, BIC will keep decreasing only as long as adding predictors still "improves" the model. Once the added predictors are not explanatory, BIC will stop decreasing, or at least will not decrease appreciably, and will typically start to rise because each extra parameter is penalized. Moreover, there is always the principle of parsimony to consider.
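
      For reference, the usual definitions behind this behavior are given here for a model with maximized log likelihood, k estimated parameters and N observations. The pseudo R2 shown is McFadden's, which is the one ologit reports; the intercept-only log likelihood is written with a subscript 0.

      ```latex
      \begin{align}
      \mathrm{AIC} &= -2\ln\hat{L} + 2k \\
      \mathrm{BIC} &= -2\ln\hat{L} + k\ln N \\
      R^2_{\text{pseudo}} &= 1 - \frac{\ln\hat{L}}{\ln\hat{L}_0}
      \end{align}
      ```

      On a fixed sample, adding a predictor to a nested model cannot lower the maximized log likelihood, so pseudo R2 can only stay the same or increase. AIC falls only when the deviance drops by more than 2 per added parameter, and BIC only when it drops by more than ln N per added parameter.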

      Best regards,

      Marcos



      • #4
        What is your sample size? Especially since everything is specified as categorical and your predictors are powerful demographics that impact just about every aspect of life, your findings aren't too surprising. But you need to be able to answer the question "why?" for the effects to be truly interesting.

        Not to worship at the altar of .05, but are very many of the effects significant in terms of the z-scores? Which ones? Significance isn't the end of the story, but it is a crude indicator that something is interesting, and its absence points to weak information, even if BIC goes down.
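
        Because most of the predictors are multi-category factors, a single z-score per level says little about the variable as a whole. A minimal sketch of joint Wald tests after the full model from the first post (which terms to test is only an illustration):

        ```stata
        * Full model as posted; /// continuation assumes a do-file
        quietly ologit happiness i.external i.income i.d i.male i.GNIcap age age2 ///
            i.nchildren i.empl refincome i.marital i.health i.year

        * Joint test of all denomination indicators
        testparm i.d

        * Joint test of all number-of-children indicators
        testparm i.nchildren

        * Joint test of the age terms
        testparm age age2
        ```

        A factor whose joint test is far from significant is carrying weak information, whatever happens to BIC.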

        Why are six kids optimal, not five, not seven? Why are Mormons the happiest?

        Do some of the variables gain or lose significance (or even more interesting, flip signs) in the presence or absence of other variables (suppression, mediation)? How happy are people with lots of kids and no money or vice versa (interaction effects)?
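
        Both checks can be sketched roughly as below. The choice of health as the variable to drop, the shortened covariate list, the names touse, with_health and no_health, and the treatment of income as continuous (c.income) in the interaction are all assumptions made for illustration only.

        ```stata
        * Suppression/mediation: compare coefficients with and without one covariate,
        * holding the estimation sample fixed
        quietly ologit happiness i.external i.income i.health i.year
        generate byte touse = e(sample)      // hypothetical marker variable
        estimates store with_health

        quietly ologit happiness i.external i.income i.year if touse
        estimates store no_health

        * Side-by-side coefficients with significance stars
        estimates table no_health with_health, star

        * Interaction: number of children by income (income treated as continuous here)
        ologit happiness i.nchildren##c.income i.year, nolog
        testparm i.nchildren#c.income
        ```

        Coefficients that change sign or lose their stars between the two columns are the ones worth explaining.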

        Does the model vary by country? If you have that GNIcap variable in there, you might be able to do something with a two-level model, and probably should.
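
        A minimal sketch of such a two-level model, with respondents nested in countries. It assumes the data contain a country identifier, called country here; no such variable appears in the list in the first post, so the name is hypothetical. The covariate list is shortened only to keep the example readable.

        ```stata
        * Random-intercept ordered logit: level 1 = respondent, level 2 = country
        * "country" is a hypothetical identifier variable
        meologit happiness i.external i.income i.male age age2 i.health i.year ///
            i.GNIcap || country:
        ```

        GNIcap is then a country-level predictor, and the random intercept absorbs the remaining between-country differences in happiness.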

        Having a large pseudo R^2 is nice, but understanding the results is probably more important.
