
  • Two questions about marginal effects (xtreg)

    Hello!

    I want to evaluate how sales of a specific product change as its price increases (due to a totally exogenous input reason). In total, I have monthly data from a few hundred stores, spread across the country. I've used xtreg to estimate the parameters of interest:

    Code:
    xtreg y  x  x^2  x^3  z  x*z  x^2*z  x^3*z  i.month, re vce(cluster id)
    I also test (and plot) marginal effects, using the following command:
    Code:
    xtreg y c.x##c.x##c.x c.z##c.x##c.x##c.x i.month, re vce(cluster id)

    test c.x c.x#c.x c.x#c.x#c.x

    margins, at(z=1 x=(1.5(0.1)3)) dydx(x)

    marginsplot


    I have looked around but haven't found a good answer to the following questions:
    1. Can I run any test to find out if two patterns of marginal effects, for different levels of z, are significantly different from each other? (not only at a given value of x, but for all x)
    2. Can I include two margin plots (with different levels of z) in the same graph?

    Help would be greatly appreciated!

    /Adam


  • #2
    Code:
    margins, at(z = (1 2) x = (1.5(0.1)3)) dydx(x)
    marginsplot, xdimension(x)
    will plot two curves of marginal effect vs x, one for z = 1 and the other for z = 2, on the same graph.

    I don't quite get your first question. Taken literally, you are asking for an infinite number of hypothesis tests in a single statistic. A tall order, and I don't think it can be done. (If somebody else out there knows a way to do this, please do speak up!) What you can do is see whether the average marginal effects of x at different levels of z differ:

    Code:
    margins, dydx(x) at(z = (1 2)) pwcompare(effects)



    • #3
      Originally posted by Clyde Schechter View Post
      Code:
      margins, dydx(x) at(z = (1 2)) pwcompare(effects)
      Thank you for your reply Clyde, your suggestion for two plots in the same graph worked perfectly! Regarding the other question, you are correct, I was hoping there was a way to test the entire model at the same time. That would help me answer the question: "Do these groups, with different levels of z, respond differently to changes in price?" They do differ at some price points (using the code you suggested), but not at others, so I find it a bit tricky to interpret what that really means (it might just be random fluctuation rather than a systematic difference).

      If you don't mind, I had two additional questions:
      1. I was wondering if you know how there can be (pretty large) changes in significance in some variables when I use "i.date" instead of "i.month" & "i.year" (together). I thought "i.month" and "i.year" together controlled for the same thing as "i.date". If not, what is the main difference, and what is the best way to decide which one to use?
      2. I was also wondering if you know why some independent variables become significant when transforming y to ln(y)? I thought the ln transformation only changed the interpretation from levels to percentages.

      Sincerely
      /Adam
      Last edited by Adam Bergal; 15 Nov 2020, 09:37.



      • #4
        First, it is better to avoid the "s-word." The American Statistical Association has recommended that the concept of statistical significance be abandoned. See https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr. There are many reasons for this. But among them is that the concept leads people to misunderstandings that provoke the kinds of questions you are posing here. In particular, it is insufficiently taught that the difference between different levels of statistical significance is, itself, not statistically significant. In fact it isn't even really meaningful in any useful way. If you want to see if a different model is producing different results, you need to look at how the models' predicted results differ.

        Turning to your specific questions:

        1. No, those are not doing the same thing. When you use i.month and i.year you are modeling the occurrence of year-on-year shocks and also a seasonal (monthly) variation superimposed on that. So the January effect is the same in every year, and the 2019 effect is the same in every month of 2019. When you use i.date, you are modeling fully independent shocks at every separate date. These are entirely different things. (A minimal sketch of both specifications follows point 2 below.)

        2. No, using a log-transformed variable is a radical change to the model. While it does change the interpretation from additive to multiplicative differences, it can profoundly alter whether or not the relationship between the predictors and the outcome is linear. If you look at the graph of y = ln(x) over a wide range of x, it is a highly non-linear function. One of the insufficiently taught principles of regression modeling is that the most important assumption that must be met to use linear regression is that the relationship has to be linear. If y is linearly related to x, then ln(y) cannot be, and vice versa. (The exception is over very small ranges of x, where ln, like all differentiable functions, is approximately linear for short distances.) So the commonly taught practice of choosing y or ln(y) as the outcome based on a preference for absolute or relative differences is just plain wrong (unless the range of values of x is narrow). At most one of those models can be correct. We obsess about things like normality of residuals, which doesn't actually matter except in unusual circumstances, but routinely overlook the one key requirement: linearity. (A quick graphical check of linearity is sketched below as well.)
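
        To make the distinction in point 1 concrete, here is a minimal sketch of the two specifications, assuming a monthly Stata date variable (called mdate here, with %tm format) and placeholder names otherwise:
        Code:
        * derive calendar month and year from the monthly date variable
        generate month = month(dofm(mdate))
        generate year  = year(dofm(mdate))

        * common seasonal pattern plus common year shocks
        xtreg y x i.month i.year, re vce(cluster id)

        * a separate, unrestricted shock at every date
        xtreg y x i.mdate, re vce(cluster id)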
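
        And on point 2, a quick graphical check of which scale looks more nearly linear is often more useful than any rule of thumb. A minimal sketch, again with placeholder variable names:
        Code:
        * compare how linear the y-x relationship looks on the raw and the log scale
        generate lny = ln(y)
        lowess y x, name(raw, replace)
        lowess lny x, name(logged, replace)
        graph combine raw logged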



        • #5
          Originally posted by Clyde Schechter View Post
          First, it is better to avoid the "s-word."...
          Thank you Clyde, the Nature article was very interesting and the "executive summary" is now on my reading list as well.

          The point about ln(y) is now perfectly obvious after your explanation. The difference between i.date and i.month/i.year was also interesting; I can't help but wonder why practitioners use i.month/i.year if i.date controls for "everything" they do (and more). Do you have any thoughts on that?

          I found myself having two additional questions, if you please:
          1. I have data for one month in 2014, complete data for 2015-17, then six months in 2019. If I use yearly dummies, should I only remove the dummy for 2014, or should I remove 2014 (one month) and 2015 (twelve months) and keep the rest to control for year-on-year shocks? I can only find examples that assume you have data for a full (or almost full) starting year. The reason I ask is that I get some non-trivial differences in my estimates if I remove the dummies for 2014 and 2015 compared to only removing the dummy for 2014.
          2. Should one always control for time (seasonality) in panel data, or are there some (non-extreme) situations where the inclusion of time dummies should be avoided (or maybe one should include month but not year, or vice versa)?
          Sincerely
          /Adam
          Last edited by Adam Bergal; 16 Nov 2020, 03:18.



          • #6
            The difference between i.date and i.month/i.year was also interesting; I can't help but wonder why practitioners use i.month/i.year if i.date controls for "everything" they do (and more). Do you have any thoughts on that?
            The commonest reason for using i.month and i.year instead of i.date is to explicitly model the month-to-month "seasonal" pattern. In my own field, epidemiology, this comes up a lot. For example, we will often want to adjust our results to reflect the seasonal pattern of influenza if we are studying an outcome that is related to or in some way affected by the annual flu epidemics. The pattern of uptick in December through March (or some variant on that depending on the geographic location) would not be captured by a series of i.date shocks. I don't know if there are seasonal things like that in your domain, but I suspect there are. I would imagine, for example, that many goods are largely seasonal in demand and the prices would also vary accordingly, no?

            I have data for one month in 2014, complete data for 2015-17, then six months in 2019. If I use yearly dummies, should I only remove the dummy for 2014, or should I remove 2014 (one month) and 2015 (twelve months) and keep the rest to control for year-on-year shocks? I can only find examples that assume you have data for a full (or almost full) starting year. The reason I ask is that I get some non-trivial differences in my estimates if I remove the dummies for 2014 and 2015 compared to only removing the dummy for 2014.
            If you are using i.date shocks you should omit one and only one date. If you are using i.month i.year shocks, you should omit exactly one of the twelve months, and exactly one year. Actually, the safest thing to do is not to do this explicitly but to just let Stata do it automatically when you use factor-variable notation. If, due to collinearity problems, it is necessary to omit more than that, Stata will handle that for you (and will give you a message that it has done so). Bear in mind, also, that it makes no substantive difference whatsoever which month and year (or which date, if you are doing i.date) gets omitted. The coefficients for the time variables will change, but nothing else will. And the coefficients for the time variables are, in any case, not meaningful in their own right, so there is no reason to prefer one way of doing it over another.

            So, in particular, since it appears you are using i.date's here, you should remove exactly one of them. Removing both a 2014 month and a 2015 month is an error (unless necessary to resolve collinearity, which does not appear to be the case here), and that will give you wrong estimates.
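
            As a minimal illustration of the point about base levels, with placeholder variable names (the particular bases in the second command are arbitrary):
            Code:
            * Stata drops one base level of each factor automatically (and more if collinearity requires it)
            xtreg y x i.month i.year, re vce(cluster id)

            * choosing different base levels changes only the time coefficients, nothing else
            xtreg y x ib6.month ib2016.year, re vce(cluster id)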

            Should one always control for time (seasonality) in panel data, or are there some (non-extreme) situations where the inclusion of time dummies should be avoided (or maybe one should include month but not year, or vice versa)?
            Again, I can't speak for the situation in finance, or marketing, or economics. But in epidemiology, situations with all manner of different adjustments for time (including none at all) arise. What you do depends on your scientific understanding of the real-world data generating process. Infectious diseases are mostly seasonal and failure to adjust accordingly would be a blunder. But cancers are not seasonal and including a seasonal adjustment would risk overfitting the model and reducing the precision of model estimates. Most diseases exhibit either upward or downward trends in incidence and mortality over the years, but some are stable long term (e.g. schizophrenia), and when modeling those diseases including time variables would result in overfitting the model and packing additional noise into the estimates.



            • #7
              Originally posted by Clyde Schechter View Post
              The commonest reason for using i.month and i.year instead of i.date is to explicitly model the month-to-month "seasonal" pattern. In my own field, epidemiology, this comes up a lot. For example, we will often want to adjust our results to reflect the seasonal pattern of influenza if we are studying an outcome that is related to or in some way affected by the annual flu epidemics. The pattern of uptick in December through March (or some variant on that depending on the geographic location) would not be captured by a series of i.date shocks. I don't know if there are seasonal things like that in your domain, but I suspect there are. I would imagine, for example, that many goods are largely seasonal in demand and the prices would also vary accordingly, no?

              ...

              Most diseases exhibit either upward or downward trends in incidence and mortality over the years, but some are stable long term (e.g. schizophrenia) and when modeling those diseases including time variables would result in overfitting the model and packing additional noise into the estimates.
              Thank you again for a great explanation, and you are correct, there are clearly seasonal patterns, and I believe that monthly dummies are the correct way to model them. Regarding years, there is no obvious (ex post) reason to include yearly dummies, and I'm a bit worried about overfitting the model (when I include yearly dummies I get less precise estimates). Is there any test for that, for example, a model specification test (for xtreg)? With regular regression one can run some form of likelihood-ratio test, but that doesn't work for xtreg.

              I was also trying to use bootstrapped standard errors (for robustness), but for some reason Stata just stops at a random iteration during the process (I think it is because an iteration fails to converge). I've tried to include iterate(50), but I just get an error message. Do you know how to force the process to continue? This is the code I'm using (the year variables are yearly dummies):

              Code:
              xtreg y  x  x^2  x^3  z  x*z  x^2*z  x^3*z  i.month 2015 2016 2017 2018, re vce(bootstrap, reps(50000))



              • #8
                I'm not aware of any model specification test for -xtreg- that would be suitable for this purpose--but if somebody else is, I hope he or she will speak up. In terms of spotting overfitting, I would first consider the number of observations you have, the number of panels, and the number of variables in the model when you include the year indicators. Overfitting is a problem when the number of observations and panels isn't large enough for the number of predictors. If you're in doubt, though, you could switch from -xtreg, re- to -mixed-. That fits the same model, but using a different estimation process. The estimates will usually be essentially identical to what you get from -xtreg, re-. And then you could look at -estat ic- after estimating the model both with and without the year indicators. The AIC or BIC that -estat ic- puts out will give you some idea whether including the year indicators is a good idea.
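
                Something along these lines, with placeholder variable names, is what I have in mind:
                Code:
                * same random-effects model fit by -mixed-, with and without the year indicators
                mixed y x z i.month || id:
                estat ic

                mixed y x z i.month i.year || id:
                estat ic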

                With regard to the bootstrapped standard errors, I don't think there is a convergence issue here because -xtreg, re- is not an iterative estimation procedure. It may be, however, that some iteration of the bootstrap itself is pulling a sample that has a collinearity problem or is otherwise unsuitable for your -xtreg-. One thing that might help: 50,000 reps for a bootstrap seems excessive. Usually several hundred is enough. And you might then avoid stumbling on an unusable sample. I do not know any way to force it to keep going if it hits a snag. Generally that's an undesirable thing to do anyway--if there's a problem it's better to know about it than to sweep it under the rug, unless it's a problem whose occurrence is quite foreseeable and does not in any way invalidate the final results if it is skipped over.
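
                If you do rerun it, a sketch along these lines (a few hundred replications, a seed set for reproducibility, and your model rewritten in factor-variable notation) would be a reasonable starting point:
                Code:
                set seed 12345
                xtreg y c.x##c.x##c.x c.z##c.x##c.x##c.x i.month i.year, re vce(bootstrap, reps(500))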



                • #9
                  Originally posted by Clyde Schechter View Post
                  I'm not aware of any model specification test for -xtreg- that would be suitable for this purpose...
                  Thank you Clyde. It seems that overfitting is highly unlikely, given that I have more than 10,000 observations from a few hundred stores. And for some reason the bootstrap worked on my laptop but not on my desktop computer, even though it is about ten times slower. No clue why that was.

                  I copied my regression coefficients to Excel, so I could plug in values for the covariates and get the predicted values. Two questions:
                  1. I was just wondering what the correct way is to incorporate both monthly and yearly dummies. Do I just take the sum of all dummies (both monthly and yearly) and divide it by twelve before I include it as a constant?
                  2. Is there a way to predict a specific value in Stata (xtreg), e.g. if I want to know what the value of y is when, for example, the covariates are x=1, z=2 & w=50?



                  • #10
                    For question 1, you don't treat monthly and yearly dummies any differently from the way you treat any other variable. You have to incorporate all those coefficients in the prediction model. When applying it to a particular date, the coefficients for that month and year get added in.

                    For question 2, since x, z, and w are not a complete set of variables in your model, you cannot get "the predicted value" from them; prediction requires values for all model variables. You might, however, be interested in the average predicted value (averaging over the distribution of all the other model variables) when x = 1, z = 2 and w = 50. -margins, at(x = (1) z = (2) w = (50))- will give you that. Another type of prediction people are sometimes interested in is the predicted value for the specified values of x, z, and w with all other variables set to their sample means: -margins, at(x = (1) z = (2) w = (50)) atmeans- will do that.
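
                    In code form (the first line is only a placeholder for whatever model you actually fit):
                    Code:
                    xtreg y c.x c.z c.w i.month i.year, re vce(cluster id)
                    margins, at(x = (1) z = (2) w = (50))            // averaged over the other covariates
                    margins, at(x = (1) z = (2) w = (50)) atmeans    // other covariates held at their sample means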



                    • #11
                      Originally posted by Clyde Schechter View Post
                      For question 1, you don't treat monthly and yearly dummies any differently from the way you treat any other variable...
                      Regarding Q1, I'm not exactly sure how one would treat monthly and yearly dummies in that regard. Do I just add all their coefficients together and include the sum as a constant when I try to predict values? Or should I use an average of all the yearly and monthly dummies, respectively? I'm not trying to predict the value at a specific date, just the predicted value when the covariates have a certain value (after controlling for time).

                      Regarding Q2: Perfect!

                      A final question: For some reason I lose some observations when I include yearly dummies. I have no missing values anywhere. I do, however, use a one-month lag of my y-variable, and the missing values correspond exactly to one month's worth of observations, so I would guess (hope) that the lagged variable is the cause of the missing values. Do you think that is the case, and if so, why is there not the same number of missing observations when I (only) use monthly dummies?



                      • #12
                        I don't know how to answer your "Regarding Q1" response other than to refer you to "Regarding Q2," which you felt was satisfactory. You don't add anything up, unless you have some masochistic desire to do this all by hand. You can do that, but it is tedious and error prone. I recommend strongly against it, strongly enough that I don't want to encourage you by showing you in any detail how it would be done. We have the -margins- command for this purpose. You include the year and month indicators in your regression, and then in the -margins- command you specify the particular values of month and year you are interested in (or specify nothing and they will be covered by -atmeans-.)
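
                        For example, assuming the month and year indicators are factor variables named month and year, something like this (the particular values are arbitrary):
                        Code:
                        margins, at(x = (1) z = (2) w = (50) month = (6) year = (2016))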

                        Yes, the lag operator is causing the sample size to shrink. First principle of regression: only those observations with no missing values in any of the variables mentioned in the regression command are used. Now think about the observations for the earliest month in your data. A lagged value would refer to the observations from the month before that. But since this is the first month, there is no month before that, so the lagged value is missing. And that missing value causes the observation to be excluded from the regression's estimation sample. That said, I cannot think of any reason that the same thing would not happen with yearly variables. So I think for more concrete advice you need to fire up the -dataex- command to show an example of the data, and then run the same regressions on that example and post both the regression commands themselves and the outputs.
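
                        You can see the mechanics of the lag directly, assuming the panel has been declared with -xtset- and the lag is written with the L. operator:
                        Code:
                        xtset id mdate
                        list id mdate y L.y if id == 1   // L.y is missing in the first month of the panel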

                        If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



                        • #13
                          Originally posted by Clyde Schechter View Post
                          I don't know how to answer your "Regarding Q1" response other than to refer you to "Regarding Q2," which you felt was satisfactory...

                          Hi again Clyde, and thank you so much for your help. The reason we lost some observations when using i.year instead of yearly dummies turned out to be a slight coding error when the data were created. Now everything checks out perfectly. Regarding Q1, we used the appropriate Stata command (per your suggestion) and also used Stata to create the appropriate graphs, for ease of replication and to minimize the risk of coding errors.

                          A final question: is there a way to do a "model comparison test" between this panel data regression using a squared independent variable (and its interactions) and the same regression that also includes a cubic independent variable (and its interactions)? That is, is there a way of determining whether the model with the cubic term is "better" than the model that only includes the squared variable (and the linear term, of course) using panel data? We're looking for an objective criterion for choosing which model to use for further analysis.

                          Sincerely
                          /Adam



                          • #14
                            Well, it is a core point of my philosophy of statistics that there is no such thing as an objective criterion for choosing anything. Assuming that what you mean is that you want a cut-and-dried test statistic for comparing models, my advice for this situation, where the models are nested, is a likelihood ratio test.

                            Now, there is a slight problem. -xtreg- does not calculate or save log likelihoods. But since you are using the random effects model, you can just rerun both the quadratic and cubic models using -mixed- instead. -mixed- estimates the same model as -xtreg, re- and the results will be nearly identical to what you got with -xtreg, re-, the slight differences being due to numerical and rounding issues. You store the estimates after each of the models is run, and then use the -lrtest- command specifying the estimates you stored. So, something like this:

                            Code:
                            mixed outcome c.term##c.term other_variables || id:
                            estimates store quadratic
                            mixed outcome c.term##c.term##c.term other_variables || id:
                            estimates store cubic
                            lrtest quadratic cubic
                            Note: You cannot use clustered standard errors for this, since an actual likelihood calculation is required.

                            If you are highly averse to not using clustered standard errors, you have three options. One is to add the -force- option to the -lrtest- command. The second is a different approach. Again, using -mixed-, rerun each model and run -estat ic- after each one. Then look at the change in the AIC or BIC. The model with the lowest value of AIC or BIC is preferred. (Usually you will reach the same conclusion regardless of your choice of AIC or BIC. Occasionally they give results in opposite directions, and there is no uniform agreement as to how to handle that situation. Generally that situation only arises when the differences between the models are small anyway.) And the third approach is just to look at the test statistic for the coefficient of the cubic term in the output of the cubic regression.
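
                            In the same placeholder notation as the -lrtest- example above, the AIC/BIC route would look like this (shown without the clustered VCE for simplicity):
                            Code:
                            mixed outcome c.term##c.term other_variables || id:
                            estat ic
                            mixed outcome c.term##c.term##c.term other_variables || id:
                            estat ic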



                            • #15
                              Originally posted by Clyde Schechter View Post
                              Well, it is a core point of my philosophy of statistics that there is no such thing as an objective criterion for choosing anything. Assuming that what you mean is that you want a cut-and-dried test statistic for comparing models, my advice for this situation, where the models are nested, is a likelihood ratio test.

                              Now, there is a slight problem. -xtreg- does not calculate or save log likelihoods. But since you are using the random effects model, you can just rerun both the quadratic and cubic models using -mixed- instead. -mixed- estimates the same model as -xtreg, re- and the results will be nearly identical to what you got with -xtreg, re-, the slight differences being due to numerical and rounding issues. You store the estimates after each of the models is run, and then use the -lrtest- command specifying the estimates you stored. So, something like this:...
                              Thank you, Clyde, your comments have been truly helpful. I just have some final questions about the choice between a fixed-effects and a random-effects model. Initially I used the Hausman test (without clustered errors) and got a non-significant but negative chi-square. I then ran "hausman fe re, sigmamore", which gave me a positive chi-square value that was a little lower but still not significant (indicating that I could use a random-effects model).

                              Hausman does, however, not work with clustered errors, so I had to use the "xtoverid" command (it took me quite some time to figure out that I could not use i.month but had to generate and include separate monthly dummies for it to work). Now I get a chi-square of over 400 (p=0.000), indicating that I should use a fixed-effects model (looking at the coefficients, however, they are virtually identical whether I use fixed or random effects). The problem is that I would like to use a random-effects model to test the effect of some time-invariant control variables. So my question is whether that is informative (possible?) given the above information (stated another way: what is the main problem with using random effects if xtoverid suggests using fixed effects)?
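
                              For reference, this is roughly the sequence I ran (variable names simplified; the monthly dummies are generated by hand because xtoverid would not accept i.month):
                              Code:
                              * Hausman test, no clustering
                              xtreg y x z i.month, fe
                              estimates store fe
                              xtreg y x z i.month, re
                              estimates store re
                              hausman fe re, sigmamore

                              * cluster-robust overidentification test via -xtoverid- (from SSC)
                              tab month, generate(m_)
                              xtreg y x z m_2-m_12, re vce(cluster id)
                              xtoverid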

                              Sincerely
                              /Adam

