Comparing Logit Regression Marginal Effects - Different Years, Same Variables

Kim Veloso

Join Date: Jun 2018

Posts: 19
#1

01 Jul 2018, 21:08

Dear everyone,

I am using three labor force surveys (1994, 2000, and 2012) from one country and running logit regressions with the same variables in all three datasets.

Code:

logit y i.sex i.age_grp i.sector i.education i.marital i.urban if working==1 [pw=weight], cluster(fsu)

I am now in the middle of interpreting the results (specifically the marginal effects). However, I am unsure whether I can draw conclusions about whether the effect of a certain variable X on the probability of y=1 has increased or decreased over the years.

I read in Mood (2010 p.73) (reference below), that it is problematic to compare coefficients/odds ratios across different logit models (even when using the same independent variables) due to potential differences in the predictions of effects of the models and unobserved heterogeneity.

"Even if the models include the same variables, they need not predict the outcome equally well in all the compared categories, so different ORs or LnORs in groups, samples, or points in time can reflect differences in effects, but also differences in unobserved heterogeneity. This is an important point because sociologists frequently compare effects across, e.g. sexes, ethnic groups, nations, surveys, or years."

Does this problem extend to the comparisons of marginal effects across the 3 survey years as well?
For example, would it be wrong to make the following statement when interpreting the Average Marginal Effects:

"In 1994, the effect of being male relative to being female increases the probability of y=1 by 15 percentage points. However, this has declined by 2012 where the effect of being male relative to being female only increases the probability of y=1 by 7 percentage points. This may imply that gender equality has improved over the years."

Is it more appropriate to only say:

"In 1994, the effect of being male relative to being female increases the probability of y=1 by 15 percentage points, whereas in 2012, the effect of being male relative to being female increases the probability of y=1 by 7 percentage points."

Thank you very much for your help!

Best,
Kim

Reference:
Mood, C. 2010. Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It. European Sociological Review, 26(1), 67-82.
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

02 Jul 2018, 12:07

You didn't get a quick answer. You'll increase your chances of a helpful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex.

I don't see a particular problem in either statement. They seem pretty close except for the conjecture about the interpretation of the difference. You might consider suest, or stacking the three datasets and using factor variable notation to allow different parameters for the different surveys, which would allow tests on parameters (but I have not read Mood, so I may be missing her point). You should also see Maarten Buis's presentation at http://www.maartenbuis.nl/presentations/london15a.pdf and his working paper at http://www.maartenbuis.nl/wp/oddsratio.html
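A minimal sketch of the stacking idea, assuming Stata's bundled nlsw88 data as a stand-in: the three-level variable marst is a hypothetical substitute for survey year, and union, grade, and south stand in for the outcome and covariates. With the actual surveys one would presumably add the pweights and cluster(fsu).

```stata
// stand-in data; marst plays the role of the survey-year variable
sysuse nlsw88, clear
gen byte marst = !never_married + married if !missing(never_married, married)

// one model for all groups, with group-specific parameters via interactions
logit union i.marst##(c.grade i.south), vce(robust)

// joint Wald tests: do the coefficients differ across groups?
testparm i.marst#i.south
testparm i.marst#c.grade
```

Each testparm line tests whether the interaction terms for one covariate are jointly zero, that is, whether that covariate's coefficient is the same in all groups.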
Kim Veloso

Join Date: Jun 2018

Posts: 19
#3

03 Jul 2018, 19:30

Thank you so much for your pointers, Mr. Bromiley! Also, I apologize for the long post in advance...

I just thought that perhaps I am not allowed to draw a conclusion about the effects of the variables from two different years (and two separate datasets) because of problems with unobserved heterogeneity (and also because the logit models may not predict the effects of the variables equally well). For example, if I added another variable to both logit regressions, the results might show that the effect of being male relative to female on the probability of y=1 had actually increased by 2012 instead of decreased...

I have taken a look at Maarten Buis's paper and presentation, and they are very helpful. I like that he provides a different perspective regarding Carina Mood's (2010) paper.

Regarding the issues of comparability across years and models (with the same variables) and unobserved heterogeneity:
I am thinking of following Buis (2017)'s approach of interpreting my dependent variable as a chance or a "degree of plausibility" i.e. an assessment of how likely it is that an event occurs conditional on the information (explanatory variables) included in the model. Therefore, the effects of the variables on the probability of y=1 would be simply defined by the choice of explanatory variables included in the model rather than something that exists outside the model.

Basically, I would argue that I am not uncertain about the outcomes of the workers in the datasets because I already know whether y=1 is a success or a failure. Instead, I will predict the probabilities of hypothetical workers with similar characteristics to the workers in the datasets. Ultimately, all persons with the same characteristics share the same chance of y=1, such that the logistic regression model interpreted in terms of chance is, as Buis puts it, a “population averaged model”.

Also, it seems that Average Marginal Effects are indeed comparable across models, groups, samples, etc., because, as Mood (2010) states, AMEs are not (or no more than marginally) affected by unobserved heterogeneity.

I wonder if the same can be said about Predictive Margins....
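To make the two quantities concrete, here is a minimal sketch of how group-specific AMEs and predictive margins could be obtained. It assumes Stata's bundled nlsw88 data as a stand-in, with the three-level variable marst as a hypothetical substitute for survey year; it is an illustration, not the analysis of the actual surveys.

```stata
// stand-in data; marst plays the role of the survey-year variable
sysuse nlsw88, clear
gen byte marst = !never_married + married if !missing(never_married, married)

logit union i.marst##(c.grade i.south)

// average marginal effect of south, computed within each group
margins, dydx(south) over(marst)

// predictive margins for each combination of group and south
margins marst#south
```

The first margins call reports one AME per level of marst (in probability units); the second reports average predicted probabilities rather than differences.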

Thank you very much once again!
Maarten Buis

Join Date: Mar 2014

Posts: 3459
#4

04 Jul 2018, 01:09

If you want to follow my approach, you should compare odds ratios and not marginal effects. In that case I would do all three years in one model and add interaction terms for year. That way you also have a test of the differences (well, ratios) between years. My paper is still under review, but a paper making a similar argument has been accepted for publication, so if you need a reference see: https://doi.org/10.1177/0049124117747306

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Kim Veloso

Join Date: Jun 2018

Posts: 19
#5

04 Jul 2018, 23:40

Originally posted by Maarten Buis:

If you want to follow my approach, you should compare odds ratios and not marginal effects. In that case I would do all three years in one model and add interaction terms for year. That way you also have a test of the differences (well, ratios) between years. My paper is still under review, but a paper making a similar argument has been accepted for publication, so if you need a reference see: https://doi.org/10.1177/0049124117747306

Thank you very much for your pointers, Mr. Buis! (Sorry once again for the long post and the many questions...)

I have stacked the 3 datasets and now have a variable called year taking values 1994, 2000, and 2012.

However, can you please kindly clarify what you mean by adding "interaction terms for year"?

My logit regressions (odds ratio option added) look like this for now (as I am unsure how to go about including "interaction terms for year" at the moment):

em_occ is the binary dependent variable referring to a favorable labor market outcome and takes value 1 if successful and 0 otherwise. My goal is to study how worker characteristics influence the likelihood of em_occ=1.

Code:

logit em_occ i.age_grp i.sex i.education i.sector i.urbrur i.marital i.sector##i.education if working==1 & year==1994 [iw=weight], or
// I run the same regressions for 2000 and 2012 and store the respective estimates.
// I must use iweights if I want to use suest later.
estimates store In1994

I also did as Mr. Bromiley recommended and ran suest, but this seems only to compare the coefficients.

Code:

suest In1994 In2000 In2012, cluster(fsu) // the or option is not allowed
test [In1994_em_occ = In2000_em_occ = In2012_em_occ]

Stata Output: The test is significant.

Code:

chi2( 32)   =  441.08
Prob > chi2 =  0.0000

I have two more follow-up questions, if I may:

1. Does it not make sense for me to show and interpret Average Marginal Effects anymore?

2. Can I still show Predictive Margins? For example:

Code:

margins sector#sex
margins sector#education, at(age_grp==3 education==3 urbrur==1 marital==1)

Thank you so much once again! I highly appreciate your help.

Last edited by Kim Veloso; 05 Jul 2018, 00:39.
Maarten Buis

Join Date: Mar 2014

Posts: 3459
#6

05 Jul 2018, 01:26

You already know how to add interactions, as you have already added one in your model, so I don't understand your question.

The fact that you have to use iweights in order to use suest should make you very worried. iweights are primarily there for programmers who want to trick a program into doing something for which it was not initially intended. So the programmer, in this case you, is now responsible for determining that this is appropriate. So this "method" does not work for you, and you really have to use interactions instead.

As to AMEs versus odds ratios, this is an open discussion. You can choose either, but if you follow my argument, then AMEs are not that attractive anymore, though predicted probabilities are fine. If you find the argument of Mood et al. more compelling, then you don't want to use odds ratios, and AMEs are one possibility. However, for such arguments I actually find a linear probability model more honest. In short, there is no consensus that you can rely on, so this is really up to you.
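For readers unfamiliar with the linear probability model mentioned here, a minimal sketch follows. It assumes Stata's bundled nlsw88 data as a stand-in, with the three-level variable marst as a hypothetical substitute for survey year; the choice of robust standard errors is also an assumption of this sketch.

```stata
// stand-in data; marst plays the role of the survey-year variable
sysuse nlsw88, clear
gen byte marst = !never_married + married if !missing(never_married, married)

// linear probability model: OLS on the binary outcome
regress union i.marst##(c.grade i.south), vce(robust)

// effect of south for the married group, as a difference in probability
lincom 1.south + 2.marst#1.south
```

The coefficients are read directly as differences in the probability of the outcome, so no transformation into odds ratios or marginal effects is needed.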

Kim Veloso

Join Date: Jun 2018

Posts: 19
#7

05 Jul 2018, 17:18

Thank you very much for your prompt response, Mr. Buis.

I apologize for not being clear enough regarding "interaction terms for year" as I had simply never done interaction terms with a time variable before; I have now just followed what has been done in this post: (https://www.statalist.org/forums/for...ssion-analyses). This allows me to use pweights.

Code:

gen t1994=0 if year!=1994
replace t1994=1 if year==1994
gen t2000=0 if year!=2000
replace t2000=1 if year==2000
gen t2012=0 if year!=2012
replace t2012=1 if year==2012
logit em_occ i.t1994##i.sector i.t2000##i.sector i.t2012##i.sector i.t1994##i.education i.t2000##i.education i.t2012##i.education ///
    i.t1994##i.age_grp i.t2000##i.age_grp i.t2012##i.age_grp i.t1994##i.sex i.t2000##i.sex i.t2012##i.sex ///
    i.t1994##i.urbrur i.t2000##i.urbrur i.t2012##i.urbrur i.t1994##i.marital i.t2000##i.marital i.t2012##i.marital ///
    if working==1 [pw=weight], cluster(fsu) or

However, all of the t2012 variables end up being omitted because of collinearity.

Do you think it is okay for me to run 3 separate logit regressions for the 3 labor force survey years instead and then compare the odds ratios of the variables across the 3 models/years?

Thank you once again.

Last edited by Kim Veloso; 05 Jul 2018, 17:27.

Maarten Buis

Join Date: Mar 2014
Posts: 3459
#8

06 Jul 2018, 01:37

You don't have to create indicator (dummy) variables when you use factor variables, so easier would be:

Code:

logit em_occ i.year##(i.age_grp i.sex i.education i.sector i.urbrur i.marital i.sector##i.education) ///
    if working==1 [pw=weight], cluster(fsu) or

With three years, the model can include only two year indicator variables alongside the constant, because the three indicators sum to one. So one will be dropped, in your case t2012.

You seem to want the effects (odds ratios) of your explanatory variables for each year rather than the changes in effects. That is easy enough, see this example (marital status in this example is year in your project):

Code:

// open example data
sysuse nlsw88, clear

// prepare the data
gen byte marst = !never_married + married if !missing(never_married, married)
label variable marst "marital status"
label define marst 0 "never married"    ///
                   1 "widowed/divorced" ///
                   2 "married"
label value marst marst

//the regular interaction model
logit union i.marst##(c.grade i.south), or

// we can use that to create the odds ratio of south for married people:
di .5844216 * .7012213

// alternatively we can do so like this
lincom 1.south + 1.south#2.marst, eform

// We can get that directly for all variables using this alternative specification of the model
logit union ibn.marst i.marst#(c.grade i.south), or nocons

(For more on examples I sent to the Statalist see: http://www.maartenbuis.nl/example_faq )

Also see this Stata tip: https://www.stata-journal.com/articl...article=st0250

Translating this to your project:

Code:

logit em_occ ibn.year i.year#(i.age_grp i.sex i.education i.sector i.urbrur i.marital i.sector##i.education) ///
    if working==1 [pw=weight], cluster(fsu) or nocons
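Whether the group-specific odds ratios differ from one another can be checked after this alternative specification. The following is a minimal sketch on the example data, with marst standing in for year; the particular contrasts tested are illustrative assumptions.

```stata
// stand-in data; marst plays the role of the survey-year variable
sysuse nlsw88, clear
gen byte marst = !never_married + married if !missing(never_married, married)

// one odds ratio per group for each explanatory variable
logit union ibn.marst i.marst#(c.grade i.south), or nocons

// ratio of the odds ratio of south for married to that for never married
lincom 2.marst#1.south - 0.marst#1.south, eform

// joint test: is the odds ratio of south the same in all three groups?
test 0.marst#1.south = 1.marst#1.south = 2.marst#1.south
```

The lincom result should reproduce the interaction odds ratio for married and south from the regular interaction model, since the two specifications are reparameterizations of each other. In the project, the analogous contrast would presumably look like `lincom 2012.year#2.sector - 1994.year#2.sector, eform`, assuming year and sector are coded as described in the thread.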


Kim Veloso

Join Date: Jun 2018

Posts: 19
#9

06 Jul 2018, 02:53

Thank you very much, Mr. Buis, for all your help and patience.

Here's an example output from Stata when I ran the second command you gave me.

I would like to know if I am interpreting this correctly....

For example:

"In year 1994, the odds of em_occ are about 34 times greater for those in sector 2 than for those in sector 1.
These odds have decreased in years 2000 and 2012, where the odds of em_occ are only about 15 and 2 times greater, respectively, for those in sector 2 than for those in sector 1."

Thank you very much!