Indicator variable for a difference-in-difference regression

Helen Chang

Join Date: Apr 2018

Posts: 104
#1

14 Oct 2018, 15:20

Hi,

I need to create an indicator variable (Post_reduction) that equals one once the industry has experienced a reduction in tax by year t and remains one afterwards. I want to use it in a difference-in-difference regression (reg DepVar Post_reduction ControlVars).

I already have the industry-year level variable, Tax_reduction (continuous variable). All other control variables are at firm-year level. But I am not sure how to create the dummy variable I described above.

Thanks.
Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#2

14 Oct 2018, 16:11

Neither am I, because even the best descriptions of data in words do not adequately convey how the data are laid out or reveal other aspects of the data that can be relevant to getting the correct code. So rather than writing code for imaginary data that may or may not actually look like yours, I recommend that you instead post an example of your data. Use the -dataex- command to do that.

If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
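
As a minimal sketch of what that looks like in practice, the lines below run -dataex- on a few variables and observations. The auto dataset that ships with Stata is used only as a stand-in, and the choice of variables and of 10 observations is arbitrary; in your own data you would list the variables relevant to your question.

```stata
* Stand-in data: the auto dataset shipped with Stata (an assumption for illustration)
sysuse auto, clear

* Show the first 10 observations of the variables relevant to the question
dataex make price mpg foreign in 1/10
```

The output in the Results window can then be copied and pasted into the forum post as is.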

Helen Chang

Join Date: Apr 2018
Posts: 104
#3

14 Oct 2018, 22:29

Question:
How do I create an indicator variable (Post_reduction) that equals one once the firm has experienced a reduction in tax (tax_reduction < 0, industry-level data) by year t, and remains one afterwards, so that I can run a difference-in-difference regression (reg DepVar Post_reduction ControlVars)?

Here is an example of my data, generated by -dataex-:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long gvkey double fyear float(Pres tax_reduction)
  7486 2002 14.823856   -.3263651
  4036 2002 14.978984   -.3263651
 65399 2002 13.138913   -.3263651
 10195 2002 14.883848   -.3263651
 10195 2002 13.371557   -.3263651
136648 2002 12.855524   -.3263651
 10301 2002  15.15642   -.3263651
  4036 2002 13.070777   -.3263651
  6845 2002 14.439482   -.3263651
 10195 2002  11.66701   -.3263651
136648 2002 12.927713   -.3263651
136648 2002  13.61676   -.3263651
 10443 2002 16.167473   -.3263651
 10443 2002 14.590302   -.3263651
  4036 2002  14.70596   -.3263651
117861 2002 14.634058   -.3263651
 10195 2002 14.595104   -.3263651
 64690 2002 16.554222   -.3263651
 65399 2002 15.749013   -.3263651
 28742 2002 12.529348   -.3263651
  4036 2002   14.7102   -.3263651
136648 2002 13.493553   -.3263651
117861 2002 14.634058   -.3263651
 65399 2002 13.102103   -.3263651
 64690 2002  11.46107   -.3263651
 10443 2002 15.766868   -.3263651
 65399 2002 13.078642   -.3263651
117861 2002 14.302656   -.3263651
 28742 2002 12.900052   -.3263651
 10301 2002 13.841702   -.3263651
 10301 2002 14.549408   -.3263651
117861 2002  11.80875   -.3263651
  5959 2003  15.22802 -.014016276
  4058 2003  14.64683 -.014016276
  6268 2003 13.322983 -.014016276
122380 2003  16.30509 -.014016276
  5959 2003 14.915737 -.014016276
  5087 2003 15.074863 -.014016276
  5959 2003 14.800774 -.014016276
122380 2003 14.875957 -.014016276
 21542 2003 14.440063 -.014016276
 10008 2003 16.563625 -.014016276
  4058 2003  14.64683 -.014016276
 13135 2003 15.171614 -.014016276
 10008 2003 16.563625 -.014016276
122380 2003 15.757184 -.014016276
  6268 2003 14.368305 -.014016276
 21542 2003 13.406506 -.014016276
  6268 2003 14.325147 -.014016276
 10386 2003 17.130293 -.014016276
 10386 2003 16.808203 -.014016276
 21542 2003 13.406506 -.014016276
 21542 2003 14.440063 -.014016276
  4058 2003  15.21278 -.014016276
 13135 2003 15.452063 -.014016276
 13135 2003 14.138277 -.014016276
 13135 2003  15.31574 -.014016276
  5087 2003 15.074863 -.014016276
122380 2003  15.64309 -.014016276
  4058 2003 14.539025 -.014016276
 21542 2003  13.96282 -.014016276
 21542 2003  13.96282 -.014016276
  5959 2003 14.446494 -.014016276
  4058 2003  14.90928 -.014016276
 10008 2003 16.436855 -.014016276
 10008 2003 16.774778 -.014016276
 10386 2003   16.9077 -.014016276
  6268 2003 14.686067 -.014016276
  6529 2005 16.054104           0
 27760 2005 14.875098           0
 15331 2004 15.349134           0
  1327 2004 16.362663           0
133246 2002  18.17169           0
 12215 2005  11.46107           0
137434 2005  13.12832           0
 15708 2005 15.662613           0
  4194 2004 15.752105           0
 15267 2005   15.0349           0
  7985 2005  14.28027           0
 13619 2003 17.399347           0
 11315 2003 15.119536           0
 28027 2003 17.197975           0
  5791 2003 15.444143           0
 28742 2005 16.605608           0
 62977 2004 17.072285           0
  1686 2005 14.928443           0
 62730 2004 16.302052           0
 24040 2002  17.74148           0
138707 2003  15.28439           0
  6532 2005 14.471322           0
  3144 2004 14.072618           0
  3647 2003 13.972542           0
 30170 2005 16.227709           0
 65026 2003 14.684546           0
 27920 2004 15.200225           0
  2817 2003  14.59948           0
264414 2004 16.280807           0
 65552 2003 17.889587           0
 60801 2004 17.470547           0
 10499 2005 13.955663           0
end

Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#4

15 Oct 2018, 09:10

Well, if this is representative of your data, you have a serious problem. In this data, you never have a single firm (I assume gvkey identifies firms here) that has both pre-tax-reduction and post-tax-reduction observations. You won't be able to do a DID analysis with that kind of data. You must have both pre- and post-intervention observations on the same firms to do a DID analysis.
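
One way to check this in the full data is to flag, for each firm, whether it has any observation before its first tax reduction and any observation from that year onward. This is only a sketch: it assumes that gvkey identifies firms and that tax_reduction < 0 marks a reduction, and the names first_cut, has_pre, has_post, and one_per_firm are made up for illustration.

```stata
* First fiscal year in which each firm shows a tax reduction
* (missing if the firm never has one)
by gvkey, sort: egen first_cut = min(cond(tax_reduction < 0, fyear, .))

* Flag firms with observations before, and from, the first reduction year.
* Firms that never have a reduction get has_pre = 1 and has_post = 0.
by gvkey: egen byte has_pre  = max(fyear < first_cut)
by gvkey: egen byte has_post = max(fyear >= first_cut & !missing(first_cut))

* Count firms (one tagged observation each) that have both kinds of observations
egen byte one_per_firm = tag(gvkey)
count if one_per_firm & has_pre & has_post
```

If that count is zero, as it would be in the posted example, no firm is observed on both sides of its tax reduction.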

It is also peculiar that for some gvkeys you have multiple observations in a single year. That's not fatal to your plans, nor even necessarily a problem at all, but it's odd and I wonder if that reflects some error in the data management that created this data set.

All of that said, assuming that this is just a very unrepresentative sample of your data and that you really have a suitable data set, here's how you would go about it. I'm interpreting your tax_reduction variable as: 0 means no tax reduction, and a negative number means that there was a tax reduction. There are no examples of a positive number in that variable, so I'm just going to assume that a positive number there would not represent a tax reduction. With all that in mind:

Code:

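* Running count of tax-reduction observations within firm, in fiscal-year order
by gvkey (fyear), sort: gen post = sum(tax_reduction < 0)
* Collapse any positive count to 1, leaving a 0/1 indicator
replace post = !!post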
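
To see what these two commands do, here is a small worked example on made-up data (the six observations below are invented for illustration and are not from the posted data set). Firm 1 has a reduction in 2003 and firm 2 never has one.

```stata
* Toy data (hypothetical): firm 1 gets a reduction in 2003, firm 2 never does
clear
input long gvkey double fyear float tax_reduction
1 2002  0
1 2003 -.2
1 2004  0
2 2002  0
2 2003  0
2 2004  0
end

* Running count of reduction observations within firm, in fiscal-year order
by gvkey (fyear), sort: gen post = sum(tax_reduction < 0)
* Turn any positive count into 1
replace post = !!post

list, sepby(gvkey) noobs
```

For firm 1, post is 0 in 2002 and 1 in 2003 and 2004, so the indicator stays switched on after the reduction year even though tax_reduction returns to 0. For firm 2, post is 0 throughout.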
Helen Chang

Join Date: Apr 2018

Posts: 104
#5

16 Oct 2018, 00:10

Thanks for providing the code! I will double-check my data management process to see whether I made a mistake. I merged multiple data sets to create this data, so the merging may have produced multiple observations for some gvkeys/firms in a single year. Thank you for pointing it out. If this is the case, should I just delete the duplicates (duplicates drop gvkey fyear, force) and declare the data as a panel (xtset gvkey fyear)?

Regarding the DID design, I followed a prior paper, and I have attached a screenshot of that paper. Please let me know if I misunderstood its research design.
Again, thank you very much for your help!!
Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#6

16 Oct 2018, 07:26

If this is the case, should I just delete the duplicates (duplicates drop gvkey fyear, force) and declare the data as a panel (xtset gvkey fyear)?

I would be very reluctant to do that. The multiple observations for gvkey fyear are not complete duplicates. They do all have the same value of tax reduction (at least in your example), but the variable Pres differs among them. And perhaps your real data has yet other variables like Pres. If Pres, and all other similar variables, are irrelevant to your problem, then -duplicates drop gvkey fyear, force- would be acceptable. But if those variables are relevant, then you can't just pick one to keep from each group arbitrarily! (And that's what -duplicates drop whatever, force- does.) You need to understand why you have multiple observations for (some) combinations of gvkey and fyear. You need to understand whether that is appropriate, or represents a problem. If it is appropriate, just leave it alone. If it's a problem, then you need to figure out which of the observations are correct and which are not; or perhaps the correct solution is to combine them into single observations by averaging them, or taking the first, or the largest, or something like that. You need to understand your data before taking any actions with it.
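
A possible way to start that investigation is sketched below, using the variable names from the posted example. The names dup, tr_min, and tr_max are made up for illustration, and the commented-out -collapse- at the end is only one of the options mentioned above; whether averaging Pres is appropriate is an assumption that depends on what Pres measures.

```stata
* How many gvkey-fyear combinations occur more than once?
duplicates report gvkey fyear

* Tag surplus observations so they can be inspected side by side
duplicates tag gvkey fyear, generate(dup)
sort gvkey fyear
list gvkey fyear Pres tax_reduction if dup > 0, sepby(gvkey fyear)

* Does tax_reduction ever vary within a gvkey-fyear group?
by gvkey fyear: egen tr_min = min(tax_reduction)
by gvkey fyear: egen tr_max = max(tax_reduction)
count if tr_min != tr_max

* ONLY if averaging is substantively right for Pres (an assumption):
* collapse (mean) Pres (first) tax_reduction, by(gvkey fyear)
* xtset gvkey fyear
```

Tracing a few of the listed groups back to the source files used in the merge is usually the quickest way to learn whether the extra observations are legitimate.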

The screenshot you posted is not readable on my computer, so I cannot advise you about it.