
  • Time dummies in logistic model

    Why can adding time dummy variables to a logistic model lead to completely different results? Without them, the coefficient on my variable of interest is significant (-0.31, std. err. 0.18); with them, it is not significant (-0.09, std. err. 0.59 — see the full output below). A test of the joint significance of the time dummies fails to reject the null hypothesis that they are all equal to zero. Can I then drop them?

    Thanks in advance!

  • #2
    Helen:
    as per FAQ, please report what you typed and what Stata gave you back. This approach is likely to increase your chances of getting helpful replies.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      What these results are telling you is that the main effect you are interested in is confounded with time. In other words, part of what you find in the analysis that excludes the time indicators arises because both your predictor of interest and your outcome are changing over time, and this "parallel" change in those variables accounts for some part of the effect you see in the analysis excluding time.

      The decision of what variables to include in a model is always a difficult one. There are many ways of thinking about it. But relying on the statistical significance of the variables' effects is one of the least helpful approaches. Basically what you have here is a situation where your predictor of interest is correlated with time in some way. If you were to do a joint significance test on the time indicators plus the effect of interest, I am pretty confident you will find it is statistically significant, even though neither the time dummies themselves nor the effect of interest is. This counter-intuitive situation arises frequently when there are confounded variables in a model. It does not justify a decision to include, nor to exclude, the variables of interest.

      The considerations sometimes involve whether one of the variables might be viewed as causing the other. In that case, the causal variable stays and the caused variable is usually removed. It is almost never the case that any variable can be regarded as causing the passage of time, so time is seldom removed. When there is no causality involved, other issues such as importance in the knowledge domain are considered. And often, the best decision is to simply leave all the confounded variables in the model and concede that you cannot distinguish their effects in the data at hand.

      Presumably your main effect of interest stays in the model no matter what: it is what you are trying to test, so is, by definition, important. You have to consider what the role of time is and why it is confounded with your main effect. This is a substantive, not a statistical question. I will say this: in general when a variable's effect is confounded with that of time, it usually means that time is doing at least some "heavy lifting" and the real effect of interest is what remains after adjustment for time. There may be some exotic situations where that is not true, but they are rare. If you are going to exclude time from your model, you usually need a very convincing, non-statistical reason for doing so.
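      As a sketch of what that joint test might look like in Stata (the names timedummy* and predictor_of_interest are placeholders, not from the thread — substitute your own variables):

      Code:
      testparm timedummy* predictor_of_interest

      -testparm- performs a Wald test that all the listed coefficients are jointly zero.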



      • #4
        Hello Carlo, thank you for your reply. Sure.
        In brief, I am estimating whether a person's occupation type (health professional) is associated with better health (obviously I won't be able to establish causality, just an interesting association). My dependent variable is binary (whether a person has assessed their health as good), and my main independent variable is binary as well (whether the person is a health professional). I control for age, gender, income, region, marital status, education and mental health status, and I compare health professionals not to the whole population but to comparable occupation groups.

        I use xtlogit, and I can't use FE, since the variable of interest is time-invariant. A Hausman test also points to RE.
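        For reference, the setup was roughly as follows (a sketch reconstructed from the output below; the variable names are taken from that output, but the exact command and options are assumed):

        Code:
        xtset pid wave
        xtlogit goodphyshealth doctor age age2 male doctor_male married loginc mincome degree qfeduc goodmentalhealth londonse wales scotland restofengland, re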

        When I don't include time dummies, the result is like this:

        Random-effects logistic regression Number of obs = 10532
        Group variable: pid Number of groups = 2414

        Random effects u_i ~ Gaussian Obs per group: min = 1
        avg = 4.4
        max = 18

        Integration method: mvaghermite Integration points = 12

        Wald chi2(15) = 86.61
        Log likelihood = -3251.8641 Prob > chi2 = 0.0000

        ----------------------------------------------------------------------------------
        goodphyshealth | Coef. Std. Err. z P>|z| [95% Conf. Interval]
        -----------------+----------------------------------------------------------------
        doctor | -.3173799 .1835699 -1.73 0.084 -.6771702 .0424104
        age | -.0280063 .0230962 -1.21 0.225 -.0732741 .0172615
        age2 | .000361 .0002709 1.33 0.183 -.00017 .000892
        male | .0848254 .0754586 1.12 0.261 -.0630708 .2327217
        doctor_male | .3238354 .2742086 1.18 0.238 -.2136037 .8612745
        married | -.086733 .0793738 -1.09 0.275 -.2423027 .0688368
        loginc | .0933487 .0568998 1.64 0.101 -.0181729 .2048703
        mincome | -.0326042 .0894684 -0.36 0.716 -.2079589 .1427506
        degree | .2956007 .1226476 2.41 0.016 .0552159 .5359855
        qfeduc | .1441525 .1287766 1.12 0.263 -.1082451 .3965501
        goodmentalhealth | .5580173 .0767535 7.27 0.000 .4075832 .7084514
        londonse | -.1954298 .3413358 -0.57 0.567 -.8644356 .473576
        wales | -.5247227 .3483207 -1.51 0.132 -1.207419 .1579734
        scotland | -.4253596 .3451509 -1.23 0.218 -1.101843 .2511237
        restofengland | -.2227374 .3392737 -0.66 0.511 -.8877017 .4422268
        _cons | 1.979097 .7258415 2.73 0.006 .5564735 3.40172
        -----------------+----------------------------------------------------------------
        /lnsig2u | -2.263095 .6090566 -3.456824 -1.069366
        -----------------+----------------------------------------------------------------
        sigma_u | .3225337 .0982207 .1775662 .585855
        rho | .0306515 .0180963 .0094929 .0944721
        ----------------------------------------------------------------------------------
        Likelihood-ratio test of rho=0: chibar2(01) = 2.90 Prob >= chibar2 = 0.044




        When I include them, the result is this:

        Random-effects logistic regression Number of obs = 10532
        Group variable: pid Number of groups = 2414

        Random effects u_i ~ Gaussian Obs per group: min = 1
        avg = 4.4
        max = 18

        Integration method: mvaghermite Integration points = 12

        Wald chi2(32) = 144.49
        Log likelihood = -1089.6235 Prob > chi2 = 0.0000

        ----------------------------------------------------------------------------------
        goodphyshealth | Coef. Std. Err. z P>|z| [95% Conf. Interval]
        -----------------+----------------------------------------------------------------
        doctor | -.092866 .591121 -0.16 0.875 -1.251442 1.06571
        age | -.0356555 .0582591 -0.61 0.541 -.1498412 .0785303
        age2 | .0001519 .0006734 0.23 0.822 -.001168 .0014717
        male | .3358208 .2187452 1.54 0.125 -.0929119 .7645534
        doctor_male | 1.010405 .9741065 1.04 0.300 -.8988089 2.919619
        married | -.2351332 .2139954 -1.10 0.272 -.6545565 .18429
        loginc | -.2233043 .1672629 -1.34 0.182 -.5511335 .1045249
        mincome | .5340018 .2298316 2.32 0.020 .08354 .9844635
        degree | .2413499 .3291523 0.73 0.463 -.4037768 .8864765
        qfeduc | .2480461 .3427532 0.72 0.469 -.4237378 .91983
        goodmentalhealth | 1.659469 .1597083 10.39 0.000 1.346446 1.972491
        londonse | -1.089049 .936742 -1.16 0.245 -2.92503 .7469316
        wales | -.3802509 .9701431 -0.39 0.695 -2.281696 1.521195
        scotland | -.427174 .9566449 -0.45 0.655 -2.302164 1.447816
        restofengland | -1.188125 .9302398 -1.28 0.202 -3.011361 .6351117
        |
        wave |
        2 | .1415519 .4239992 0.33 0.738 -.6894712 .972575
        3 | -.0541978 .4326877 -0.13 0.900 -.9022501 .7938544
        4 | .1854903 .4467809 0.42 0.678 -.6901843 1.061165
        5 | .5422539 .4683151 1.16 0.247 -.3756269 1.460135
        6 | -.1483319 .424937 -0.35 0.727 -.9811931 .6845293
        7 | .3008613 .4477111 0.67 0.502 -.5766365 1.178359
        8 | .2976796 .435437 0.68 0.494 -.5557612 1.15112
        9 | -31.75022 5736.376 -0.01 0.996 -11274.84 11211.34
        10 | -.315892 .3962065 -0.80 0.425 -1.092442 .4606584
        11 | .1624341 .4189889 0.39 0.698 -.6587691 .9836374
        12 | -.4498536 .4011242 -1.12 0.262 -1.236043 .3363354
        13 | .232706 .4407363 0.53 0.598 -.6311212 1.096533
        14 | -.3832072 .4034376 -0.95 0.342 -1.17393 .407516
        15 | .4780337 .4578339 1.04 0.296 -.4193042 1.375372
        16 | .7552409 .4882707 1.55 0.122 -.2017521 1.712234
        17 | .346514 .4583948 0.76 0.450 -.5519233 1.244951
        18 | -.061003 .4323137 -0.14 0.888 -.9083222 .7863163
        |
        _cons | 3.397752 2.00877 1.69 0.091 -.5393643 7.334869
        -----------------+----------------------------------------------------------------
        /lnsig2u | 1.432102 .1723007 1.094399 1.769805
        -----------------+----------------------------------------------------------------
        sigma_u | 2.046336 .1762926 1.728406 2.422748
        rho | .5600228 .0424544 .4759065 .6408275
        ----------------------------------------------------------------------------------
        Likelihood-ratio test of rho=0: chibar2(01) = 213.75 Prob >= chibar2 = 0.000


        The joint significance test of the time dummies (-testparm- reports a Wald chi2 test, not an F-test):
        . testparm dwav*

        ( 1) [goodphyshealth]dwav1 = 0
        ( 2) [goodphyshealth]dwav2 = 0
        ( 3) [goodphyshealth]dwav3 = 0
        ( 4) [goodphyshealth]dwav4 = 0
        ( 5) [goodphyshealth]dwav5 = 0
        ( 6) [goodphyshealth]dwav6 = 0
        ( 7) [goodphyshealth]dwav7 = 0
        ( 8) [goodphyshealth]dwav8 = 0
        ( 9) [goodphyshealth]dwav9 = 0
        (10) [goodphyshealth]dwav10 = 0
        (11) [goodphyshealth]dwav11 = 0
        (12) [goodphyshealth]dwav12 = 0
        (13) [goodphyshealth]dwav13 = 0
        (14) [goodphyshealth]dwav14 = 0
        (15) [goodphyshealth]dwav15 = 0
        (16) [goodphyshealth]dwav16 = 0
        (17) [goodphyshealth]dwav17 = 0

        chi2( 17) = 20.10
        Prob > chi2 = 0.2690



        • #5
          Dear Clyde, thank you very much for this explanation. I agree: my main goal is not to rely on statistical significance, but rather to find an appropriate theory that could explain the link between my variables of interest, and then to build an appropriate model on that basis. The joint significance test does indeed show a statistically significant result.

          Can I confirm that I have understood the recommendation in your last paragraph correctly? Given the model I described above, would I need to search for reasons why time might influence the change in occupation status, i.e. becoming a healthcare professional?

          Thank you very much!



          • #6
            Helen:
            time doesn't seem to play a relevant role in your model, but this may be an interesting result in itself.
            Moreover, the fact that -goodmentalhealth- is significant makes sense: it is often the case that mens sana in corpore sano (or the other way round) holds true.
            As an aside, for the future please post what you typed and what Stata gave you back using Code delimiters (the # button among the Advanced editor options): it dramatically improves the formatting of the reported Stata session. Thanks.
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              Thank you very much, Carlo. Sorry for the mess in presenting the data; I didn't know such an option existed.
              I am not sure I understood your point regarding "it is often the case that mens sana in corpore sano (or the other way round) holds true". Did you mean reverse causality?



              • #8
                Helen:
                no, I simply meant correlation between physical fitness and good mental health.
                Kind regards,
                Carlo
                (Stata 19.0)



                • #9
                  So, looking at #1 and #4, there are some other problems.

                  You don't actually have a variable that designates health professionals (unless you are using the term "doctor" loosely). Actually, even if you are using doctor to cover other health professionals, the model still doesn't work the way you claim in #1, because you also have a variable called doctor_male, which I am guessing is an interaction term between doctor and male. If that's the case, then the effect for the doctor variable does not describe the effect of being a "doctor," it describes the effect of being a female "doctor." More on this below.

                  You also have variables called age and age2 (is that the square of age?). Within any given subject there will be a strong correlation between age and wave of the survey. While there are sometimes reasons to enter both age and period (wave, calendar time) effects into a model, it makes the interpretation more complicated. You need to think about this. The outcome being self-reported physical health, age would seem to be a necessary variable as it is a major determinant of risk for many kinds of illness. On the other hand, when you are not dealing with specific diseases but just overall health, the role of time period (wave) is questionable here. I'm not saying it's wrong. I'm just saying you need to think about it and justify it. Overall health may in fact experience secular trends, particularly during times of economic upheaval, or intercurrent epidemics, wars, etc. But does any of that apply to England during the era covered by your survey? In the absence of some specific reason to think that there are effects like that, my inclination would be to include age but not time in the model.

                  Now to the issue of Stata coding. It appears that you have created your own indicator ("dummy") variables for a number of things, as well as calculating an interaction and a quadratic term. Not only is that unnecessary effort, it also precludes the possibility of using the -margins- command to interpret your regression results. So, I'm going to assume here that doctor, male, married and goodmentalhealth are all indicator variables, that doctor_male is the interaction of doctor and male, and that you have a variable called region, from which you created separate indicators londonse, wales, scotland, and restofengland. I also assume there is a variable called wave, from which you created dwav1-dwav17. So then, your model can be coded much more simply using factor variable notation (-help fvvarlist-) as:

                  Code:
                  xtlogit goodphyshealth i.doctor##i.male c.age##c.age i.married loginc mincome degree qfeduc i.goodmentalhealth i.region i.wave
                  If you decide to omit the wave variable, just leave out i.wave. If you decide you don't need a quadratic term in age, just use c.age instead of c.age##c.age.

                  Then, post estimation you can get things like correctly-adjusted predicted probabilities (conditional on the random effect being 0) of good health for all four combinations of "doctor"/non-"doctor" and sex:

                  Code:
                  margins doctor#male, predict(pu0)
                  If you need a significance test of the effect of being a "doctor" (regardless of sex), it is a joint test:
                  Code:
                  test 1.doctor 1.doctor#1.male
                  See the margins section of [R] for a full explanation of all the things you can look at following a regression when you have made use of factor-variables.
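                  For example, to see how the adjusted probabilities vary across the "doctor"/sex combinations at a few ages (the age values here are illustrative, not from the thread), you could run:

                  Code:
                  margins doctor#male, predict(pu0) at(age=(30 40 50 60))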



                  • #10
                    Dear Clyde,
                    Thank you so much for this useful explanation. Yes, all assumptions you've made are true about my model.
                    I was wrongfully interpreting the doctor_male interaction term before and was not familiar with factor variables. Thank you very much!



                    • #11
                      I was using the term doctor to refer to physicians; I have a separate model in which I test the same thing for nurses, with another variable created for them.



                      • #12
                        Dear Clyde and Carlo,

                        I wanted to ask a small question about the -margins- command, regarding the RE xtlogit regression above.

                        After my RE xtlogit regression I run -margins, dydx(*) post-, because I am interested not in the predicted probabilities but rather in the marginal effects of the variables in the model and their significance. The command runs, but it produces one awkward result: the marginal effect of good mental health status is >1. This cannot be true, since it is a dummy and the dependent variable is a dummy too. What could be the problem with the command I use?

                        Thank you very much in advance!



                        • #13
                          Well, you did not show us the exact command that led to this, but I have a theory about what the problem is.

                          If you wrote
                          Code:
                          margins, dydx(whatever)
                          you got the marginal effect in the logit metric, not in the probability metric. That's because the default predicted outcome for -margins- after -xtlogit- is xb, not p. And there is no reason why the marginal effect of an indicator variable on xb can't be greater than one. (Note: This applies to all of your predictor variables, not just goodmentalhealth. So you are fortunate that you got this result for that variable, or you probably would never have recognized that all of the others are wrong, too.)

                          If you want the marginal effect in the probability metric, then you have a problem. Stata does not -predict- the expected outcome probability after -xtlogit-. The expected outcome probability, p, is a complicated integral over the distribution of random effects that has to be evaluated numerically and it is not built into the -xtlogit- postestimation command suite. Instead, you can get pu0, which is the expected outcome probability [and, with -dydx()- the marginal effect on that] conditional on the random effect being zero. For most purposes that is good enough. The syntax would be:
                          Code:
                          margins, dydx(whatever) predict(pu0)
                          If you really need the marginal effect on the expected outcome probability, you will have to estimate your model with -melogit- instead of -xtlogit-. Note that the estimation calculations for -melogit- and -xtlogit- are done somewhat differently and you may get slightly different results (but not different enough to matter for practical purposes). If you decide to go this route, the -predict- option for -margins- would be -predict(mu)-. The syntax of -melogit- is rather different from that of the -xt- commands, so be sure to consult the online help and [ME] manual when you code this if you are not familiar with it already. Also, depending on the size of your data set and the complexity of your model, the calculations may be time-consuming.
                          Last edited by Clyde Schechter; 03 May 2015, 06:44.



                          • #14
                            I think Clyde is probably right. As a sidelight, if you have Stata 14, the help for margins is much better than it used to be, because it tells you for each command what the possible margins options are. So, if you have 14, type

                            help xtlogit_postestimation##margins

                            If you do xtlogit with random effects, the default option is xb, with pu0 being the alternate option you can specify.
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology
                            StataNow Version: 19.5 MP (2 processor)

                            EMAIL: [email protected]
                            WWW: https://www3.nd.edu/~rwilliam

