i.year vs c.year

Le Ng

Join Date: Sep 2020

Posts: 6
#1

22 Mar 2023, 17:19

Good day everyone,
I am studying the relationship between the unemployment rate and wage rates, and I am running into a problem where the sign of a coefficient does not come out as expected (i.e., the estimated coefficient of the unemployment rate should be negative, but I got a positive one). Specifically, the dependent variable is the log of wages, and the independent variables are the unemployment rate and, of course, other variables as well. The code that gave the wrong sign is as follows:

Code:

xtset ID year
xtreg ln_wage unemp_rate x1 x2 x3 i.year, re vce(cluster ID)

However, when I changed from -i.year- to -c.year-:

Code:

xtset ID year
xtreg ln_wage unemp_rate x1 x2 x3 c.year, re vce(cluster ID)

the estimated coefficient of the unemployment rate is negative, which it should be.

I know that my current model is not flawless, and that may cause the wrong sign. However, let's say we put everything else aside and focus on -i.year- vs -c.year-. Which one should I go for? Should I go for the conventional -i.year- and accept the wrong sign? Or should I go with -c.year-, which, to be honest, I have never used in my life; however, in return, I have the correct sign. And I do know what the differences are between -i.year- and -c.year-, just in case you are wondering.

I would much appreciate it if you skilled and experienced people could give me some advice.
Thank you. I hope you have a good day.
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#2

22 Mar 2023, 18:01

Well, first let me make the general point that you should never select a model on the basis of its providing the results that you want or expect. Indeed, ideally, one chooses the model without even knowing what results it gives.

So, what is the difference between i.year and c.year? When you use c.year, you are stipulating that there is a linear relationship between the outcome variable and year. In particular, the yearly variation in the outcome cannot oscillate up and down: it must steadily increase (or steadily decrease) at all times, and always at the same rate. By contrast, if you use i.year, each year's shock to the outcome is estimated independently of every other year's. The shocks can change arbitrarily in both amount and direction from one year to another.

Now, the interesting thing is, if you run a model with i.year you can get a pretty good sense of whether c.year is reasonable in your model. Because you can look at the coefficients of successive years. If the changes from year to year are always (or nearly always, with minor occasional exceptions) increasing, or always (...) decreasing, and if the difference between coefficients of consecutive years is more or less the same from one year to the next, then c.year can be a reasonable component of your model. But if the coefficients of the i.year variables jump around haphazardly, then, no, a c.year model makes no sense.
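The inspection described above might be sketched as follows. This is only an illustration, assuming the panel from #1 is in memory with the same variable names; the second regression, which adds c.year alongside i.year, is one possible way to formalize the check and is not something the original poster ran. A large p-value from that joint test does not prove linearity, so the plot remains the more informative part.

```stata
* Sketch only: assumes the data from #1 are in memory with these variable names.
xtset ID year

* Unrestricted year effects, as in the i.year model from #1
xtreg ln_wage unemp_rate x1 x2 x3 i.year, re vce(cluster ID)

* Adjusted predictions by year: a roughly straight line is consistent with c.year,
* while haphazard jumps argue against it
margins year
marginsplot

* Linear trend plus year indicators: Stata omits the indicators that are
* collinear with the trend, and the joint test asks whether the remaining
* ones add anything beyond a straight line
xtreg ln_wage unemp_rate x1 x2 x3 c.year i.year, re vce(cluster ID)
testparm i.year
```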

Finally, it would have been helpful if you had shown the outputs of both regressions. If the coefficient of the unemp_rate went from a small positive to a small negative number, then perhaps that is really no change at all. After all, these coefficients are just noisy estimates of the "true" values and are subject to variation from many sources, just as the raw data themselves are.
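One way to put the two sets of results side by side is sketched below, again assuming the variable names from #1. The stored-estimate names m_iyear and m_cyear are hypothetical labels chosen for this illustration.

```stata
* Sketch only: assumes the data from #1 are in memory.
xtset ID year

xtreg ln_wage unemp_rate x1 x2 x3 i.year, re vce(cluster ID)
estimates store m_iyear      // hypothetical name for the i.year results

xtreg ln_wage unemp_rate x1 x2 x3 c.year, re vce(cluster ID)
estimates store m_cyear      // hypothetical name for the c.year results

* Coefficient and standard error of unemp_rate under each specification
estimates table m_iyear m_cyear, keep(unemp_rate) b(%9.4f) se(%9.4f) stats(N)
```

If the two coefficients differ by less than a standard error or so, the sign flip is probably not telling you much.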
Le Ng

Join Date: Sep 2020

Posts: 6
#3

22 Mar 2023, 20:15

Thank you for your wonderful insight, Clyde!

I know we should never select a model on the basis of its providing the results we expect; however, there are some common relationships between variables that should go a certain way (for example, wage and work hours, or wage and the level of schooling, etc.). Getting something that we do not expect certainly will catch us off guard; and therefore, make us question our work.
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#4

22 Mar 2023, 21:33

"Getting something that we do not expect certainly will catch us off guard; and therefore, make us question our work."

Yes, and if your expectation is well founded based on previous good-quality research, then getting an unexpected finding should, indeed, grab your attention. And questioning whether you have done the modeling correctly is part of that process. My only point is that the fact that you got an unexpected result and that some other model gives you a more expected one is not a basis for choosing between those two models. If the inspection of the i.year coefficients is consistent with a linear time trend, then using c.year is OK and is not a problem. If it isn't, then c.year should be off the table. But the decision about which model to use on a point like this should be based on things like that, not on whether the coefficient of a different variable is in line with your expectations. Especially when the difference is in this direction: an i.year model is never a mis-specification. At worst it is inefficient and unnecessary. A c.year model has to be justified by good evidence that time is, in fact, linearly related to the outcome.

In evaluating your analysis there are a number of points you should be thinking about:
Is there something about the sample in which you are working that is different from the usual? Perhaps it is a selected group for which the usual results would not apply.

Do you have good measures of the constructs? Do the data come from reliable sources of information?

Even if the source data are good, have the data been correctly handled in the management that created the analytic data set? It is very easy to make mistakes along the way. Does the data set contain every observation it should, and none that it shouldn't? Are derived variables correctly calculated? Are the overall and group-level N's in the regression output what they are supposed to be?

In observational studies, it is always important to consider whether omitted variable bias may be a problem here.

Less commonly, it is also possible that you have inappropriately included variables that must be omitted: variables on the causal path between unemployment rate and ln_wage, or colliders of that relationship.

Could observations with missing values in model variables, which are necessarily omitted from the regression, be leaving behind a biased sample?

Why are you using a log-transformed outcome? Is it expected that the unemployment rate should be linearly related to the log of the wage rate and not to the wage rate itself? If not, then it is entirely possible that an "unexpected" positive coefficient on the unemployment rate, in combination with the coefficient of one of the x's in your model, might in fact be part of the best fitting linear model of a relationship that is inherently non-linear. I know that people commonly log-transform variables to reduce the range of a variable or to remove heteroscedasticity, but the single most important requirement for a linear regression to give correct results is that the relationship actually be linear. If you fit a linear model to a non-linear relationship, there is no assurance that the resulting coefficients are even decent estimates of the corresponding effects. They can be completely wrong, even having the wrong sign.
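As a rough probe of the linearity question, one could add a squared term for the unemployment rate and see whether it matters. This is only a sketch under the assumption that the variable names from #1 apply; a quadratic is just one convenient departure from linearity, chosen here for illustration, and it will not detect every kind of non-linear relationship.

```stata
* Sketch only: assumes the data from #1 are in memory.
xtset ID year

* c.unemp_rate##c.unemp_rate enters both the linear and the squared term
xtreg ln_wage c.unemp_rate##c.unemp_rate x1 x2 x3 i.year, re vce(cluster ID)

* Test of the squared term: a clearly non-zero coefficient suggests that
* ln_wage is not linear in unemp_rate
testparm c.unemp_rate#c.unemp_rate
```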

If I sat here long enough, I could rattle off many more considerations to ponder when you get an unexpected result. But these are probably the commonest sources of major errors.
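Some of the data-handling and missing-data checks listed above could be started along these lines. This is a minimal sketch assuming the variable names from #1; in_model is a hypothetical name for a flag marking the estimation sample.

```stata
* Sketch only: assumes the data from #1 are in memory.

* Stops with an error if ID and year do not uniquely identify observations
* (or if either contains missing values)
isid ID year

xtset ID year
xtdescribe                   // which panels are observed in which years

* How many values are missing in each model variable
misstable summarize ln_wage unemp_rate x1 x2 x3

* Flag the observations that actually entered the regression
quietly xtreg ln_wage unemp_rate x1 x2 x3 i.year, re vce(cluster ID)
generate byte in_model = e(sample)   // in_model is a hypothetical name

* Compare included and excluded observations
tabulate year in_model
tabstat ln_wage unemp_rate, by(in_model) statistics(n mean sd)
```

If the excluded observations look systematically different from the included ones, the estimation sample may not represent the population you have in mind.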