Add control variables for states instead of state fixed effects

George Tim

Join Date: Mar 2025

Posts: 17
#16

19 Mar 2025, 09:08

Originally posted by Jeff Wooldridge

With daily data and 10 stores, I took this to mean N = 10 and T is pretty large. The five states don't directly factor in. If so, the Driscoll-Kraay standard errors are entirely appropriate. I think you should include state fixed effects. You're interested in the coefficient on the treatment dummy, which happens to be an interaction. Without controls, this is the usual difference-in-differences estimator using the sample averages of the treated and control, before and after. D-K gives a convenient way of obtaining a standard error.

Thank you! Since I don't have pre-treatment data, I can't do diff-in-diff, only a two-way fixed effects model. In this case, I will have to add control variables. One possible control is fuel prices. The policy dummy also affects weekly fuel prices (not the other way around), and fuel prices in turn affect the dependent variable (weekly retail food price). When choosing controls, do they need to be correlated with the main interaction variable, or is it enough for a control to be correlated with only one of the dummies in the interaction? Can I use fuel price as a control?
Pintu Batra

Join Date: May 2025

Posts: 4
#17

09 May 2025, 01:29

I am using a cross-sectional dataset of 1,00,000 individuals with information on their income (in Indian rupees), education (in years), sex (male = 1 if male and 0 otherwise), and current age (in years). I am trying to estimate the relationship between income (dependent variable) and education (independent variable). I am unsure which of two strategies to use.

1. Include sex as a binary variable and current age as a control. In this case, my regression command is as follows:

Code:

reg income education i.male age

(1)

2. As an alternative to this, I am told to use current age as fixed effects. In a nutshell, I would create age dummies and then include them in my regression model.

Code:

areg income education i.male i.age

(2)

Here are my doubts:

1. I am finding it difficult to understand the interpretation of the coefficients of education and male dummy in the two models.
2. How are they different from each other?
3. Which model should be preferred between the two?

I am also attaching a picture of the outputs.

Thanks in advance !!!

Mukesh Punia

Join Date: May 2020

Posts: 101
#18

09 May 2025, 04:11

1. If education is measured in years (a ratio-scale variable), its coefficient is the expected change in income associated with one additional year of education, holding the other covariates fixed. But check carefully how your education variable is coded: years of schooling (i.e. 0-20 ...) or level of education (i.e. no schooling, primary, secondary, etc.).

2. With age fixed effects, each age group has its own intercept, whereas entering age as a continuous variable means that each additional year of age shifts expected income by the same amount (up or down).

3. Someone more experienced may suggest something better.

* When using areg, you would write:

Code:

areg income education i.male, ab(age)

Best regards,
Mukesh
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#19

09 May 2025, 08:32

Elaborating a bit on what Mukesh Punia says regarding age vs i.age, when you use age your model constrains the relationship between age and income to be linear (after taking into account education and sex). By contrast, when you use i.age the relationship between age and income can be anything: income at any given age is independent of income at any other age, as each year of age gets its own intercept.

Fortunately, you did not use -absorb(age)- in your code,* so we can actually see the age coefficients. Just looking at them casually, while there is plenty of noise involved, they do appear to increase more or less linearly with age up to about age 50 and then seemingly plateau. This suggests that a model constraining the relationship to be linear would be mis-specified. Now, looking at the R² statistics for both models, we see that the change is small. And we also notice that the coefficients for sex and education don't differ much across the two models. So the difference may not be very important. One could quantify that by getting AIC and BIC statistics for each model (-estat ic-).

*I do not mean that it would be wrong to do so. I'm simply pointing out that by not using -absorb(age)- we were able to actually see the single-year age coefficients which made it possible to judge whether a linear age-income relationship seems right. If you used -absorb(age)- you would get the same results, except that the age coefficients would not have been shown and it would be more of a guess.
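A minimal sketch of the comparison described above, run on simulated data because the original dataset is not available. Every number, the seed, and the shape of the simulated age profile (rising to age 50, then flat) are assumptions made purely for illustration; only the commands mirror the ones discussed in this thread.

```stata
* Simulated stand-in for the real data; all values here are hypothetical.
clear
set seed 12345
set obs 5000
generate byte age       = 20 + floor(41*runiform())   // ages 20 to 60
generate byte male      = runiform() < 0.5
generate byte education = floor(21*runiform())        // 0 to 20 years
* Income rises with age up to 50 and is flat afterwards (nonlinear in age).
generate double income = 2800*education + 5000*male ///
    + 1000*min(age, 50) + rnormal(0, 20000)

* Model 1: age constrained to a linear effect.
regress income education i.male age
estimates store linear
estat ic

* Model 2: one indicator per single year of age, coefficients displayed.
regress income education i.male i.age
estimates store dummies
scalar b_reg = _b[education]
estat ic

* Same model as Model 2, but the age indicators are absorbed, not shown.
areg income education i.male, absorb(age)
scalar b_areg = _b[education]

* The education coefficient should agree between regress and areg.
display "regress: " b_reg "   areg: " b_areg
estimates table linear dummies, keep(education) b se
```

The lower AIC/BIC from -estat ic- points to the better-fitting specification, and the last two lines make it easy to see how little (or how much) the education coefficient moves between the two models.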
Pintu Batra

Join Date: May 2025

Posts: 4
#20

19 May 2025, 01:41

Thank you Mukesh Punia and Clyde Schechter for your responses.

I understand the difference between including age as a control and as a fixed effect.
If I believe that age is linearly related to income, then I should use age as a control. However, that is likely to be a strong assumption in the current context, as the results from the second model (based on fixed effects) don't show that.

Moreover, I am unsure about the interpretation of the coefficient of the education variable (2875.5 vs 2851.39). Clyde Schechter, you are indeed right that there is no major difference in terms of their magnitude. However, I want to understand if the interpretation of the education variable in the two models is as follows.

Model 1:
An extra year of education seems to increase predicted income by INR 2875.5 keeping age and sex constant. Moreover, the relationship between age and income is assumed to be linear here.

Model 2 (Fixed effects model):
An extra year of education seems to increase predicted income by INR 2851.39, keeping sex constant and adjusting for time-invariant differences between age groups. Since we are including age in the form of dummies, this model compares education and income between individuals within the same age cohort and gives an average estimate of the association (2851.39). In other words, it will also control for time-invariant differences between the different age cohorts. However, we will not be able to claim that "time invariant differences between the different age cohorts are controlled for" in the first model, as there we have added age as a continuous control instead of as dummies.
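The "comparison within the same age" reading of Model 2 can be checked numerically. The sketch below uses simulated data (all values and the seed are assumptions, not the poster's data): it subtracts age-specific means from income, education, and the male dummy, and the regression on those deviations should reproduce the education coefficient from the model with i.age. The standard errors will differ slightly, because the second regression does not account for the degrees of freedom used by the age means.

```stata
* Simulated stand-in for the real data; all values here are hypothetical.
clear
set seed 2025
set obs 5000
generate byte   age       = 20 + floor(41*runiform())
generate byte   male      = runiform() < 0.5
generate double education = floor(21*runiform())
generate double income    = 2800*education + 5000*male ///
    + 1000*min(age, 50) + rnormal(0, 20000)

* Model 2 from the thread: single-year age indicators.
regress income education i.male i.age
scalar b_fe = _b[education]

* Express each variable as a deviation from its mean within the same age.
foreach v in income education male {
    bysort age: egen double mean_`v' = mean(`v')
    generate double dm_`v' = `v' - mean_`v'
}

* Regression on within-age deviations; only within-age variation remains.
regress dm_income dm_education dm_male, noconstant
scalar b_within = _b[dm_education]

display "i.age model: " b_fe "   within-age deviations: " b_within
```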
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#21

19 May 2025, 08:31

Your interpretations are substantially correct.

I don't really like the final sentence, however. First, generically, although it is very widespread and I have given up all hope of stamping it out, this use of the word "control" is just plain wrong. This is observational data: you are adjusting for age, not controlling it. In fact, even in experimental data, since there is no way to experimentally assign age, it is never possible to "control" for age. Its effects are adjusted for, but not controlled.

Using a fine-grained age specification, such as indicator ("dummy") variables for single-year age groups, produces a more accurate adjustment for age effects because it is a more correct specification of the age-outcome relationship than introducing age as a continuous variable with an assumption of a linear relationship that the data itself appears to reject. You can still say of the first model that it adjusts for age effects; it's just that it does so in a less accurate way.
Pintu Batra

Join Date: May 2025

Posts: 4
#22

19 May 2025, 22:48

Thank you @Clyde for highlighting that we should use the word 'adjust' instead of 'control', given that we are not able to control for age in a literal sense.

I would like to take this conversation one step further and ask you about adjusting for time trends. I have seen some papers that take a district (or province) level variable from a past census and interact it with the survey years of the current dataset. The objective is to adjust for initial district-level characteristics that can affect income levels (the outcome variable) over time. For instance, the current dataset was surveyed over 2019-21 (a few districts in 2019, a few in 2020, and the remaining districts in 2021). If I get district-level educational infrastructure from the 2011 census and merge it with the current dataset (surveyed in 2019-21), how do I adjust for the district characteristic from the 2011 census interacted with survey year?

Should I use the following equation?

Code:

reg income education i.male i.age c.infrastructure#i.survey_year

or

Code:

reg income education i.male i.age c.infrastructure#c.survey_year

The infrastructure variable is a continuous variable capturing the number of schools in each district in 2011. 'survey_year' is a categorical variable taking three values: 2019, 2020, and 2021.

Please let me know if you need more information.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#23

20 May 2025, 08:59

While there are statistical considerations that go into this decision, there are also substantive ones. I can only advise you on the statistical part of it.

There are two statistical issues. The first is whether it is plausible to model the effect of the infrastructure on income as varying linearly over the period of your study. Since your data set only covers 3 years, this is, at least, not far-fetched. After all, most relationships we study in the real world are at least approximately linear over short time intervals. So I would not immediately react negatively to seeing c.infrastructure#c.year. Whether it is plausible gets into substantive questions about how infrastructure relates to income, which are outside the scope of my knowledge. The other statistical issue is about degrees of freedom. c.infrastructure#i.year will soak up two degrees of freedom (2 indicators for 3 years), and c.infrastructure#c.year only one. Do you have enough sample size to support the extra degree of freedom? Now, again, because there are only 3 years in question, the difference between the two approaches is only a single degree of freedom, so your sample size does not have to be especially large for that. You don't say anything about your sample size, but the rule of thumb is that you would like to have 50 or more observations for each numerator degree of freedom in your model to reduce the extent of overfitting the noise. (100 is even better, and maybe 25 will do in a severe pinch; these are just rules of thumb, remember.)

You will notice that both of these problems are mitigated by the fact that you are talking about only three years. (The fact that the baseline data is from 2011 doesn't matter here: the relationship may be highly nonlinear between 2011 and 2018. But as long as it is a reasonable approximation to linear in 2019-2021, you are OK to use c.infrastructure#c.year.) Three years is a short enough period that approximate linearity may well hold, and three years do not add an enormous number of degrees of freedom to the model, so sample size considerations will only bite if your sample size is borderline to begin with.

So from a statistical point of view, unless your sample size is insufficient, I would expect either approach to be OK. So other considerations would probably determine your decision. For example, even if c.infrastructure#c.year gives a reasonable approximation, with i.year there is no approximation involved and no assumption about linearity required. On the other hand, with c.infrastructure#c.year, you get a single interaction term, and explaining a single interaction coefficient to a non-technical audience (if you will need to do that) is easier than explaining a pair of interaction terms (one for each of 2020 and 2021).

The other substantive question that comes to my mind is whether 2011 is too far in the past to serve your purpose. Perhaps by 2019 whatever effect 2011 educational infrastructure has in modifying the subsequent chronological trend in income will have already worn off? This is an economic/sociological/psychological question that I am totally unqualified to answer.

By the way, I would probably use ##, not # in either case, and I would probably center the infrastructure variable.
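A rough sketch of what the two specifications might look like with ## and a centered infrastructure variable. The data are simulated, so the district count, the infrastructure values, the seed, and the names infra_c and year_c are all hypothetical; measuring year from 2019 in the linear version is likewise just one possible choice, made so that the main effect of infra_c refers to the first survey year.

```stata
* Simulated stand-in for the merged data; all values here are hypothetical.
clear
set seed 42
set obs 6000
generate int    district       = ceil(60*runiform())
generate int    survey_year    = 2019 + mod(district, 3)     // one year per district
generate double infrastructure = 20 + mod(37*district, 181)  // 2011 school count
generate byte   age            = 20 + floor(41*runiform())
generate byte   male           = runiform() < 0.5
generate double education      = floor(21*runiform())
generate double income = 2800*education + 5000*male + 1000*min(age, 50) ///
    + 30*infrastructure + 15*infrastructure*(survey_year - 2019)        ///
    + rnormal(0, 20000)

* Center infrastructure at its sample mean.
summarize infrastructure, meanonly
generate double infra_c = infrastructure - r(mean)

* Year-specific infrastructure slopes (no linearity assumption).
regress income education i.male i.age c.infra_c##i.survey_year
testparm i.survey_year#c.infra_c    // joint test of the two interaction terms

* Infrastructure slope changing linearly across survey years.
generate int year_c = survey_year - 2019
regress income education i.male i.age c.infra_c##c.year_c
```

With ##, Stata includes the main effects of both variables along with the interaction, so the interaction coefficients are read as changes in the infrastructure slope relative to the reference year (or per year, in the linear version).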
Pintu Batra

Join Date: May 2025

Posts: 4
#24

21 May 2025, 22:30

Thank you for your inputs.
As mentioned in #17, we have more than 1,00,000 observations. Degrees of freedom are not a major concern in our dataset.
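For anyone who wants to check the rule of thumb from #23 directly, here is a small sketch on simulated data of the same size (the values and the seed are assumptions, not the actual dataset). After -regress-, e(N) holds the number of observations and e(df_m) the model degrees of freedom, so their ratio gives observations per degree of freedom.

```stata
* Simulated stand-in of the same size as the real data; values are hypothetical.
clear
set seed 7
set obs 100000
generate byte   age       = 20 + floor(41*runiform())
generate byte   male      = runiform() < 0.5
generate double education = floor(21*runiform())
generate double income    = 2800*education + 5000*male ///
    + 1000*min(age, 50) + rnormal(0, 20000)

regress income education i.male i.age

* Observations per model degree of freedom; the rule of thumb asks for 50+.
scalar obs_per_df = e(N)/e(df_m)
display "observations per model degree of freedom: " obs_per_df
```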