Data mining - Choosing the right model

Emin Gurbanov

Join Date: Jan 2017

Posts: 14
#1

16 Apr 2017, 13:17

I am currently trying to choose the right set of independent variables and the right model.

My model is

ologit happiness i.external i.income i.d i.male i.GNIcap age age2 i.nchildren i.empl refincome i.marital i.health i.year, nolog

Variables used are:

happiness - categorical, from 1 (not happy at all) to 4 (very happy)

external - external religiosity, 1 yes, 0 no - dummy

income - income scale, from 1 (lowest step) to 11 (highest step) - categorical

d - religious denomination - categorical, 11 main denominations

male - dummy

GNIcap - categorical: low-income, medium-income and high-income countries

age, age2

nchildren - number of children: 1, 2, 3, 4, 5, 6, 7, 8 or more

empl - employment status, from 1 (full time) to 10 (not employed) - categorical

refincome - reference income

marital - marital status, 0 not married, 1 married

health - health status, from 0 (very poor) to 4 (very good)

year

I did a sensitivity check: I started with the following baseline model and added one variable at a time, finally ending up with the full model.

ologit happiness, nolog

ologit happiness i.external, nolog

ologit happiness i.external i.income, nolog

...
ologit happiness i.external i.income i.d i.male i.GNIcap age age2 i.nchildren i.empl refincome i.marital i.health i.year, nolog

As I add more variables to the baseline model, BIC and AIC decrease and pseudo R2 increases.

This seems suspicious to me, and I have a feeling that even if I added 10 more variables to the model, BIC and AIC would decrease and pseudo R2 would increase, suggesting that the best model is the one with all variables.

The question is: am I right, or is there something suspicious in these results (AIC, BIC and pseudo R2)?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17680
#2

17 Apr 2017, 03:18

Emin,
rather than hunting for the "best" model, I would look at the literature in your research field to see what others have done when addressing the same research topic.

Kind regards,
Carlo
(Stata 19.0)
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

17 Apr 2017, 07:44

"This seems suspicious to me, and I have a feeling that even if I added 10 more variables to the model, BIC and AIC would decrease and pseudo R2 would increase, suggesting that the best model is the one with all variables.
The question is: am I right, or is there something suspicious in these results (AIC, BIC and pseudo R2)?"

Theoretically speaking, if "all" explanatory variables are available, we should expect that the best model may (leaving aside collinearity and other issues) eventually be the one that includes them all.

That said, BIC will keep decreasing only as long as adding "explanatory" predictors still "improves" the model. Once the added predictors are not explanatory, BIC won't decrease further, at least not appreciably. What is more, there is always the principle of parsimony to reflect on.
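A minimal sketch of how this could be checked in Stata, using the variable names from the first post. Stata reports AIC = -2 lnL + 2k and BIC = -2 lnL + k ln(N), so both only fall when the gain in log likelihood outweighs the penalty for the k estimated parameters. The variable `insample` and the stored names `m0`, `m1`, `m2` and `full` are made up for this illustration. The sketch assumes it is run from a do-file (the `///` line continuation does not work interactively) and restricts every model to the full model's estimation sample, since information criteria are only comparable across models fitted to the same observations.

```stata
* Fit the full model first and flag the observations it actually uses
quietly ologit happiness i.external i.income i.d i.male i.GNIcap age age2 ///
    i.nchildren i.empl refincome i.marital i.health i.year
generate byte insample = e(sample)   // hypothetical marker variable
estimates store full

* Refit the smaller models on exactly the same observations
quietly ologit happiness if insample
estimates store m0

quietly ologit happiness i.external if insample
estimates store m1

quietly ologit happiness i.external i.income if insample
estimates store m2

* Side-by-side table of N, log likelihood, df, AIC and BIC
estimates stats m0 m1 m2 full

* Likelihood-ratio test of a nested model against the full model
lrtest m2 full
```

If the number of observations drops as variables with missing values are added, comparing AIC and BIC across the original, unrestricted fits can be misleading, which is why the common sample is imposed here.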

Best regards,

Marcos
ben earnhart

Join Date: May 2014

Posts: 1027
#4

17 Apr 2017, 16:56

What is your sample size? Especially since everything is specified as categorical and your predictors are powerful demographics that impact just about every aspect of life, your findings aren't too surprising. But you need to be able to answer the question "why?" for the effects to be truly interesting.

Not to worship at the altar of .05, but are very many of the effects significant in terms of the z-scores? Which ones? Significance isn't the end of the story, but it is a crude indicator that something is interesting, and its absence points to weak information, even if BIC goes down.
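Because most predictors enter as sets of indicators, individual z-scores only compare each level with the base category. A rough sketch of joint Wald tests for whole blocks of terms, assuming the full model from the first post and a do-file (for the `///` continuation):

```stata
* Refit the full model from the first post
quietly ologit happiness i.external i.income i.d i.male i.GNIcap age age2 ///
    i.nchildren i.empl refincome i.marital i.health i.year

* Joint tests: are all levels of a factor (or both age terms) zero together?
testparm i.nchildren
testparm i.d
testparm age age2
```

Which blocks to test is a judgment call; the three shown are only examples.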

Why are six kids optimal, not five, not seven? Why are Mormons the happiest?

Do some of the variables gain or lose significance (or even more interesting, flip signs) in the presence or absence of other variables (suppression, mediation)? How happy are people with lots of kids and no money or vice versa (interaction effects)?
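As one possible way to look at the kids-and-money question, here is a sketch that interacts `nchildren` with `income`. Using `income` as the measure of "money" is an assumption made for this illustration, and with 8 by 11 categories some cells may be empty or sparse, so collapsing categories first may be necessary.

```stata
* Full factorial interaction between number of children and income step
ologit happiness i.external i.nchildren##i.income i.d i.male i.GNIcap ///
    age age2 i.empl refincome i.marital i.health i.year, nolog

* Joint test of all interaction terms
testparm i.nchildren#i.income

* Predicted probability of the top category (happiness == 4) in each cell
margins nchildren#income, predict(outcome(4))
```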

Does the model vary by country? If you have that GNIcap variable in there, you might be able to do something with a two-level model, and probably should.
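A sketch of what a two-level specification might look like, with a random intercept for country. The identifier `country` is hypothetical (no such variable is listed in the first post), and `meologit` requires Stata 13 or later. With many observations and factor levels, estimation may be slow.

```stata
* country is a hypothetical country identifier, not listed in the original post
meologit happiness i.external i.income i.d i.male i.GNIcap age age2 ///
    i.nchildren i.empl refincome i.marital i.health i.year || country:
```

Here GNIcap stays in the fixed part as a country-level predictor, while the random intercept absorbs remaining between-country differences in happiness.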

Having a large pseudo R^2 is nice, but understanding the results is probably more important.