First differencing of categorical predictors; differences between "xi: reg D.(y x1*x2)" and "reg D.y x1##x2"

Erik Mikael Johnson

#1


17 Jun 2023, 09:36

Hi everyone!

I'm estimating the effect of 3 phases of the Covid-pandemic on the natural log of reported crime in a first differenced regression, using monthly measures and accounting for seasonality, with 290 panels and 59 time points (strongly balanced).
Hence, my two predictors are indicators and my dependent variable is continuous. Initially, I also had one time-invariant indicator but removed it as I understand that they are unnecessary in a first differenced regression.
In addition, I included an interaction term as I found evidence for a functional form problem without it.

IVs:
pc = 0 1 2 3

se = 1 2 3 4 5 6 7 8 9 10 11 12

I specified two similar regressions that, as I understand it, amount to essentially the same specification (see https://www.stata.com/statalist/arch.../msg00606.html ). They nevertheless produced slightly different coefficients: the signs agree, but some coefficients differ in significance.

Code:
xi: reg D.(lnrc i.pc*i.se), nocons tsscons vce(cl id)

Code:
reg D.lnrc i.pc##i.se, nocons tsscons vce(cl id)

I assume that I must have some kind of knowledge gap here or I've made some other mistake.

The slightly different values of the coefficients I can grasp to some extent but it is the difference in significance that baffles me.

Succinctly: are these two specifications essentially the same in principle and, if not, what are the consequential differences between the models?

Best regards
Erik Johnson
Tags: None
Clyde Schechter

#2

17 Jun 2023, 10:16

With -xi: reg D.(lnrc i.pc*i.se), nocons tsscons vce(cl id)- you are taking first differences of all of the variables in the model. With -reg D.lnrc i.pc##i.se, nocons tsscons vce(cl id)-, you are taking the first difference of the dependent variable, but not the independent variables. So no reason to think those will turn out the same.
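To see what differencing does to an indicator, here is a minimal sketch on made-up data. The single panel, the eight-period layout, and the names id, t, and d_Ipc_1 are assumptions for illustration only, not the poster's data.

```stata
* Minimal sketch with made-up data: one panel, eight periods
clear
set obs 8
gen id = 1
gen t  = _n
xtset id t

* Hypothetical phase indicator: 0 in periods 1-4, 1 afterward
gen byte pc = t > 4

* xi creates the level dummy _Ipc_1 (the lowest category, 0, is omitted)
xi i.pc

* First difference of the dummy: nonzero only where pc changes
gen d_Ipc_1 = D._Ipc_1

list t pc _Ipc_1 d_Ipc_1, noobs
```

The level dummy stays at 1 throughout the later phase, whereas the differenced dummy equals 1 only in the period where the phase begins. Regressing D.y on one or the other therefore asks different questions.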
Comment
Erik Mikael Johnson

#3

17 Jun 2023, 11:28

Okay, thanks Clyde for your rapid answer.

A better formulation of my ignorance:
I was under the impression that taking first differences (FD) of the indicator variables using - xi: reg D.() - would result in time dummies that compare the average log difference in each time phase with that of the reference category, similar to - reg D.y i.x1 i.x2 -, since the indicators mark the span of observations of D.y to be compared. This was how I interpreted the post at https://www.stata.com/statalist/arch.../msg00606.html.

As my ambition is to estimate the effect of the Covid-pandemic on reported crime while accounting for seasonality, I wonder how the interpretation changes between the two models.
I have a bit of trouble grasping the difference between them.
Comment
Clyde Schechter

#4

17 Jun 2023, 11:50

With -xi: reg D.(everything)- you are modeling the change since preceding period in log reported crime as a function of the change in x1 and x2 over the same interval.

With reg D.lnrc i.x1 i.x2 you are modeling the change since preceding period in log reported crime as a function of the current levels of x1 and x2.
Comment
Erik Mikael Johnson

#5

17 Jun 2023, 12:27

Perfect, thanks!

So essentially it would make more sense to specify the first model, - xi: reg D.(everything) - (i.e., change as a function of change)?
Comment
Clyde Schechter

#6

17 Jun 2023, 14:03

I can't help you with this question. It's a substantive question in your field, not a statistical one. My field is epidemiology, and I have no knowledge about factors that influence reported crime rates. There may be others from your discipline who follow this thread and respond. If not, take this to a colleague in your discipline for an answer.
Comment
Erik Mikael Johnson

#7

17 Jun 2023, 14:26

Thanks anyway for your suggestions and time.
Comment

Erik Mikael Johnson

#8

18 Jun 2023, 05:19

Okay, it seems like I fell into my rabbit hole again:

It seems to me that specifying the model with - xi: reg D.(everything) - creates binary time dummies according to the specification of the original categorical variables.
As such, since my predictors only indicate a span of time points of D.y and I only have categorical predictors, I think the "current levels" can't be anything other than D.y with - reg D.y i.x1 i.x2 -.

The difference between the model specifications would make sense to me if the predictors indicated a span of observations for another metric (for instance age or unemployment rate). In a case such as that, yes, I believe the - xi: reg D.(everything) - model is appropriate.
However, in my case, where my predictors indicate a span of the same metric (i.e., a first differenced distribution), the models should essentially be equivalent. Evidently, the results do differ significantly.

As the Stata manual recommends not using the xi: prefix when it is not necessary (https://www.stata.com/manuals/rxi.pdf), I am inclined to believe that the difference between the models is an artefact of the xi: prefix.

Is my logic sound or have I forgotten something?

Here is an example of the variables; Float refers to variables of the model - reg D.y i.x1 i.x2 - and Byte to the time dummies created by the xi command.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(fd_lnrc pc se) byte(_Ipc_1 _Ipc_2 _Ipc_3 _Ise_2 _Ise_3 _Ise_4 _Ise_5)
         . 0  1 0 0 0 0 0 0 0
-1.3862944 0  2 0 0 0 1 0 0 0
 1.0986123 0  3 0 0 0 0 1 0 0
 .28768206 0  4 0 0 0 0 0 1 0
 .55961573 0  5 0 0 0 0 0 0 1
 -1.252763 0  6 0 0 0 0 0 0 0
         0 0  7 0 0 0 0 0 0 0
  .4054651 0  8 0 0 0 0 0 0 0
 .28768206 0  9 0 0 0 0 0 0 0
 .22314358 0 10 0 0 0 0 0 0 0
-.22314358 0 11 0 0 0 0 0 0 0
 .22314358 0 12 0 0 0 0 0 0 0
-.51082563 0  1 0 0 0 0 0 0 0
         0 0  2 0 0 0 1 0 0 0
 .51082563 0  3 0 0 0 0 1 0 0
-.51082563 0  4 0 0 0 0 0 1 0
 -.4054651 0  5 0 0 0 0 0 0 1
  .6931472 0  6 0 0 0 0 0 0 0
-.28768206 0  7 0 0 0 0 0 0 0
 -.4054651 0  8 0 0 0 0 0 0 0
end
label values pc pclable
label def pclable 0 "Pre_Covid", modify
label values se selabel
label def selabel 1 "January", modify
label def selabel 2 "February", modify
label def selabel 3 "March", modify
label def selabel 4 "April", modify
label def selabel 5 "May", modify
label def selabel 6 "June", modify
label def selabel 7 "July", modify
label def selabel 8 "August", modify
label def selabel 9 "September", modify
label def selabel 10 "October", modify
label def selabel 11 "November", modify
label def selabel 12 "December", modify
label var _Ipc_1 "pc==1" 
label var _Ipc_2 "pc==2" 
label var _Ipc_3 "pc==3" 
label var _Ise_2 "se==2" 
label var _Ise_3 "se==3" 
label var _Ise_4 "se==4" 
label var _Ise_5 "se==5"

Comment

Clyde Schechter

#9

18 Jun 2023, 10:05

In -xi: reg D.(y i.x1*i.x2)- you are regressing the change over one time period in y against the change in x1 and the change in x2 over the same time period (and their interaction).

In -reg D.y x1##x2-, you are regressing the change over one time period in y against the value of x1 at the end of that time period and the value of x2 at the end of that time period (and their interaction).

These are by no means equivalent. Now, if you try to use factor variable notation and try to do -reg D.(y x1##x2)-, Stata will refuse, because the D operator is not compatible with factor variable notation. (The differences in question can be negative, and therefore cannot be represented in factor variable notation.) Omitting -xi- here and going to factor-variable notation forces you to, willy nilly, change the model altogether. So if what you want is to regress the change in y against the changes in x1 and x2 and their interaction, then this is one of those unusual situations where you must use -xi:- to do it. The salient difference between xi and factor variable notation for your problem is that the former will work happily with the D. operator but the latter will not. It's not that they do something different: it's that you've stumbled on one of the situations that factor variable notation simply won't do at all.
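The remark that the differences can be negative can be checked on a made-up monthly series. In this sketch the single panel, the 2018 start date, and the name d_feb are assumptions for illustration only.

```stata
* Minimal sketch with made-up data: one panel, 24 consecutive months
clear
set obs 24
gen id = 1
gen mdate = ym(2018, 1) + _n - 1
format mdate %tm
xtset id mdate

* Month of the year, 1-12
gen byte se = month(dofm(mdate))

* xi creates _Ise_2 ... _Ise_12 (January is the omitted category)
xi i.se

* First difference of the February dummy
gen d_feb = D._Ise_2

* The differenced dummy takes the values -1, 0 and 1
tabulate d_feb, missing
```

The differenced February dummy is 1 in February, -1 in March, and 0 otherwise, which is why it cannot be expressed as a factor variable.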

Last edited by Clyde Schechter; 18 Jun 2023, 10:16.
Comment
Erik Mikael Johnson

#10

18 Jun 2023, 12:29

Okay, so let me see if I understand and thank you for your patience.

Suppose x1 is a factor variable indicating 4 consecutive years (e.g. 2019 - 2022) and x2 is a factor variable representing the months of the year, with monthly time points.

In xi: reg D.(y i.x1 i.x2) I would be estimating the average difference in first-order change of y between the years (2020, 2021 & 2022) and their reference year (2019) while accounting for the average difference of first-order change of y of the months of the year and their reference category (January) across all time points.

In reg D.y i.x1 i.x2 I would be estimating the difference between the average first-order change of the reference year (2019) and the last observation of y (December) within the other years (2020, 2021 & 2022) while accounting for the difference of the average first-order change of the reference month (January) and the last observation of the other months (i.e., the months within 2022)

Is this a correct interpretation?

Did not see your edit until after I posted. Perfect! Now I know I need to use xi prefix. Thank you very much Clyde.

Last edited by Erik Mikael Johnson; 18 Jun 2023, 12:51.
Comment
Clyde Schechter

#11

18 Jun 2023, 12:49

I'm not sure if that's correct or not. I had not contemplated that x1 and x2 would be year and month variables. And using x1##x2 is an odd way of creating just a monthly date. What is particularly confusing to me here is that the D operator requires that you previously -xtset- or -tsset- your data with a time variable. Neither x1 nor x2 can be the time variable for that purpose, because each will have repeated observations. So I don't know what your time variable is, which means I can't know what the D operator is actually doing here.

In a more typical situation, there would be a time variable, perhaps a monthly date variable (-gen mdate = ym(year, month)-) used in -xtset- and x1 and x2 would be other things. In that situation the -xi: reg D.(y i.x1*i.x2)- command regresses the change in y from, say, November 2019 to December 2019, against the changes in x1, x2, and their interaction, over that same time period. And -reg D.y x1##x2- regresses the change in y from November 2019 to December 2019 against the December 2019 values of x1, x2, and their interaction.
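A rough sketch of that typical setup, using simulated data, is given below. The 30 panels, the 2019 start date, the binary x1, and the random y are all assumptions chosen to keep the example short. The xi dummy is created first and then differenced inside D.(), which should be what the xi: prefix does behind the scenes.

```stata
* Sketch on simulated data: 30 panels, 24 months each
clear
set seed 2023
set obs 30
gen id = _n
expand 24
bysort id: gen mdate = ym(2019, 1) + _n - 1
format mdate %tm
xtset id mdate

* Hypothetical binary predictor that switches on in January 2020
gen byte x1 = mdate >= ym(2020, 1)
gen y = rnormal()

* Level dummy for x1, created the way the xi prefix would create it
xi i.x1

* Change in y regressed on the CHANGE in x1
regress D.(y _Ix1_1), nocons
local names_fd : colnames e(b)

* Change in y regressed on the CURRENT LEVEL of x1
regress D.y i.x1, nocons
local names_lv : colnames e(b)

* Compare the names of the regressors in the two models
display "Differenced regressors: `names_fd'"
display "Level regressors:       `names_lv'"
```

In the first model the regressor is nonzero only in the month where x1 changes. In the second it is 1 in every month of the later phase, so the two coefficients answer different questions.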
Comment
Erik Mikael Johnson

#12

18 Jun 2023, 13:58

The example I provided was hypothetical, sorry for the confusion I caused.
In the actual model, I used xtset with the panel id and the monthly date.

The specification of x1 refers to (2018-2020m3 | 2020m4-2020m12 | 2021 | 2022), and x2 refers to the month of the year. I included the interaction term because I found evidence of a functional form problem in the model - reg D.y i.x1 i.x2 -. Using the margins command, it also allowed me to examine whether there was an effect at the date when I hypothesized an effect of the Covid-pandemic would be observable, as well as at other monthly dates that were important with regard to restrictions. But now I know that was not really a fruitful enterprise.

Thank you again for your responses, you were very helpful.

By the way, do you know of a way of examining the interaction effect without using the margins command, as it does not work with the xi prefix?
Comment
Clyde Schechter

#13

18 Jun 2023, 14:25

As this is a linear model, the interaction effect itself is just the coefficient of the interaction term. Well, actually you have a bunch of interaction terms because your x1 and x2 have many levels. But their coefficients, collectively are the interaction effect. Although with so many levels it seems almost silly, you can "test for the presence of any interaction" by running -test- on all of them simultaneously.

Then, you would also want to know about the marginal effects of the various levels of x1 and each level of x2 and vice versa. So for a given level of x1 and a given level of x2, the marginal effect of that level of x1 conditional on that level of x2 is the coefficient of that level of x1 + the coefficient of the interaction term with those two levels. Use -lincom- to do this so you also get standard errors, confidence intervals, and test statistics.
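As a hedged sketch of both steps, the example below uses simulated data with a two-level pc and a twelve-level se. The panel dimensions, the start date, and the effect size are made up. It also assumes that, after differencing xi dummies with D.(), the stored coefficient names carry the D. prefix (for example D._IpcXse_1_2), so that is how they are referenced in -testparm- and -lincom-.

```stata
* Sketch on simulated data: 50 panels, 24 months each
clear
set seed 12345
set obs 50
gen id = _n
expand 24
bysort id: gen mdate = ym(2019, 1) + _n - 1
format mdate %tm
xtset id mdate

gen byte se = month(dofm(mdate))        // month of year, 1-12
gen byte pc = mdate >= ym(2020, 1)      // hypothetical two-level phase
gen y = rnormal() + 0.5*pc

* Main-effect and interaction dummies: _Ipc_1, _Ise_*, _IpcXse_1_*
xi i.pc*i.se

* Change in y on changes in all dummies
regress D.(y _Ipc_* _Ise_* _IpcXse_*), nocons vce(cluster id)

* Joint test of all interaction terms
testparm D.(_IpcXse_*)

* Effect of pc==1 conditional on se==2:
* main-effect coefficient plus the matching interaction coefficient
lincom D._Ipc_1 + D._IpcXse_1_2
```

With more levels of pc, the same -lincom- pattern would be repeated for each combination of interest.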
Comment
Erik Mikael Johnson

#14

19 Jun 2023, 11:16

Super! Thanks again Clyde.
Comment
Erik Mikael Johnson

#15

19 Jun 2023, 13:42

Okay, so I'm running into some problems using the test, testparm and lincom commands.

When I specify the dummy variables created by the xi prefix (i.e., xi: reg D.(lnab i.sc*i.se)) to be tested (e.g. test _IscXse_1_2-_IscXse_3_11), Stata returns "_IscXse_1_2 not found" r(111).

The only way to make the commands work is when I use - xi: testparm D.(i.sc*i.se) - but that tests the joint significance of the whole model.

According to the Stata manual entry for the - xi - command, I should be able to run - testparm _IscXse* - to test the joint significance of the interaction, but Stata can't find that interaction term either (i.e., r(111)).

Using the describe command, I can see that the dummies do in fact exist.

I really do not know what I'm doing wrong here.

I do not know if this is pertinent to my issue, but I'm using Stata/SE 17.0.
Comment
