What fixed effects and cluster group to use on "three dimensional" panel data?

Anton Kraft

Join Date: Nov 2017

Posts: 10
#1

What fixed effects and cluster group to use on "three dimensional" panel data?

12 Feb 2019, 03:23

Hello everyone,

I have an (unbalacend) panel data set of German companies with investments in multiple countries over ten years and want to analyze the effect of certain country-level variables on the investment amount of German firms. The following gives you an example of the structure of the data.

Please note that it is just example data. Investor_id refers to the ID of the German investor, invest_amount is the invested amount held by the German company in that country, GDP refers to the gross domestic product of the country in which this investment is seated, GDP_capita is the GDP per capita and EU_dummy is a dummy for the country being part of the EU (=1) or not (=0). My actual data set contains 8 more country-level variables.

Code:

.list, sepby(year) noobs abbrev(16) +---------------------------------------------------------------------------------+ | year country investor_id invest_amount GDP GDP_capita EU_dummy | |---------------------------------------------------------------------------------| | 2017 China 45 1300 1.00e+07 4320 0 | | 2017 France 45 100 400000 5675 1 | | 2017 France 86 670 400000 5675 1 | |---------------------------------------------------------------------------------| | 2018 China 45 1500 1.10e+07 4520 0 | | 2018 France 45 105 390000 5575 1 | | 2018 France 86 660 390000 5575 1 | +---------------------------------------------------------------------------------+

I want to use a regression with fixed effects as is done in the related literature. However, I am struggling with the data due to its "three dimensions": the observations are not only on a year-level and on a company-level but also on a country-level, since a German company can have investments in more than one country at a time. Because of this I cannot simply use xtset and xtreg as in

Code:

xtset investor_id year

because those two variables are no unique identifiers, since a German investor (investor_id) can have multiple investments in one year (that is because I look at multiple countries). However, it works to use xtset only on the investors and use fixed effects for the years.

Code:

xtset investor_id xtreg log_invest_amount GDP GDP_capita EU_dummy i.year, fe

But my question now is if this would really be the correct approach for this kind of data? In the literature to related research projects I have seen many different approaches of how to use fixed effects on this kind of data set. Unfortunately, most of the time there is no explanation at all and the papers simply state what they did without explaining why. I could also not find an explanation that I understood in statistic books (mostly looking in chapters on fixed effects and clustering).

Here are the alternatives to the above fixed effects for investor_id and year that I could also use and that I encountered in the literature:

1) Create a new country-investor-level variabel

Code:

egen ic_id = group(investor_id country) xtset ic_id year xtreg log_invest_amount GDP GDP_capita EU_dummy, fe

2) Create a new country-time-level variable

Code:

egen it_id = group(investor_id year) xtset it_id xtreg log_invest_amount GDP GDP_capita EU_dummy, fe

3) Create a new country-time-level variable

Code:

egen ct_id = group(country year) xtset ct_id xtreg log_invest_amount GDP GDP_capita EU_dummy, fe

Does anyone know and can also reason which of the above approaches would be most appropriate from a theoretical perspective for my data?

Furthermore, I believe that the standard errors have to be made cluster robust. Here I encounter the exact same problem as with the fixed effects and don't know what to define as the group that should be clustered after (is it the investors? or the countries? or the investors in a given country?). And in the literature it seems again very ambigious what to use and most often it comes without an explanation.

So my second question is what is your thought on which cluster variable to use? Country, investor_id, ic_id, it_id or ct_id from above?

I am looking forward to your helpful and interesting input

For the sake of completeness I want to mention that I posted a somewhat related question in 2017 on Statalist when I was for the first time working on a related project, but it was a different approach/model and question and my Stata and statistics knowledge was on a different level back then: https://www.statalist.org/forums/for...led-panel-data

Best regards,
Anton
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10213

12 Feb 2019, 15:44

I have an (unbalacend) panel data set of German companies with investments in multiple countries over ten years and want to analyze the effect of certain country-level variables on the investment amount of German firms.

I do not see what role the investment destination plays in terms of answering your research question. Notice that in your example data, all variables are invariant for a given investor_id and year, except for the investment amount. Therefore, you should simply aggregate the investment amount and proceed with standard panel data analysis. Here is one approach that enables you to see which investors had invested in more than one country.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float year str29 country float(investor_id invest_amount GDP GDP_capita EU_dummy)
2017 "China"  45 1300 1.00e+07 4320 0
2017 "France" 45  100   400000 5675 1
2017 "France" 86  670   400000 5675 1
2018 "China"  45 1500 1.10e+07 4520 0
2018 "France" 45  105   390000 5575 1
2018 "France" 86  660   390000 5575 1
end

*PLACE MULTIPLE COUNTRY NAMES FOR THE SAME INVESTOR AND YEAR IN 1 OBSERVATION
bys investor_id year (country): replace country= country[_n-1] + " " + country if _n > 1
bys investor_id year (country): replace country = country[_N]

*COLLAPSE SUMMING INVESTMENT AMOUNTS
collapse (mean) investor_id GDP GDP_capita (max) EU_dummy (sum) invest_amount, by(country year)

*XTSET DATA
xtset investor_id year

Result:

Code:

. l

     +---------------------------------------------------------------------------+
     | year        country   invest~d       GDP   GDP_ca~a   EU_dummy   invest~t |
     |---------------------------------------------------------------------------|
  1. | 2017   China France         45   5200000     4997.5          1       1400 |
  2. | 2018   China France         45   5695000     5047.5          1       1605 |
  3. | 2017         France         86    400000       5675          1        670 |
  4. | 2018         France         86    390000       5575          1        660 |
     +---------------------------------------------------------------------------+

ADDED IN EDIT: My mistake, GDP refers to the destination country, so you need a new id (firm-destination country). You may need to cluster your standard errors at the investor_id level as investments made by the same firm are likely not independent.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float year str29 country float(investor_id invest_amount GDP GDP_capita EU_dummy)
2017 "China"  45 1300 1.00e+07 4320 0
2017 "France" 45  100   400000 5675 1
2017 "France" 86  670   400000 5675 1
2018 "China"  45 1500 1.10e+07 4520 0
2018 "France" 45  105   390000 5575 1
2018 "France" 86  660   390000 5575 1
end

egen newid= group(investor_id country), label
xtset newid year

Last edited by Andrew Musau; 12 Feb 2019, 16:31.

Comment

Anton Kraft

Join Date: Nov 2017

Posts: 10
#3

13 Feb 2019, 03:56

Hi Andrew,

thank you for your reply. Yes the country-level variables all refer to the destination countries. The idea is to see what country characteristics are likely to attract investments. So for example, I expect GDP (market size) of a country to have a positive influence on investments from Germany.

The approach you recommend in your edit is the first one of the possible approaches that I mentioned in my original post. As with the other two approaches I have read this one before in papers, however, I am looking for an explanation why I should do this and not one of the other approaches. Could you please explain why it is theoretically justified or appropriate to create a new variable on the investor-destination country level and use that for a fixed effects transformation?
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10213

13 Feb 2019, 04:55

When you xtset your data in Stata, the first variable refers to the panel identifier and the second variable refers to the time identifier.

Code:

xtset panelvar timevar

So in your first approach, a panel is identified by firm and destination country whereas the time variable is year. The only problem that arises is that the same firm can invest in more than one country in the same year. To capture this dependence, we need to cluster the standard errors not at the firm and destination country level but at the firm (investor_id) level. Fixed effects exploits the within variation in the data (for the same firm, how does investment amount vary over time). Your other suggestions do not correctly identify the time variable.

To elaborate further, for panel data, you need at most one observation for each year per id.

Code:

. input float year str29 country float(investor_id invest_amount GDP GDP_capita EU_dummy)

          year                        country  investo~d  invest_~t        GDP  GDP_cap~a   EU_dummy
  1.
. 2017 "China"  45 1300 1.00e+07 4320 0
  2.
. 2017 "France" 45  100   400000 5675 1
  3.
. 2017 "France" 86  670   400000 5675 1
  4.
. 2018 "China"  45 1500 1.10e+07 4520 0
  5.
. 2018 "France" 45  105   390000 5575 1
  6.
. 2018 "France" 86  660   390000 5575 1
  7.
. end


. egen ic_id = group(investor_id country)

.
. egen it_id = group(investor_id year)

.
. egen ct_id = group(country year)

. l ic_id it_id ct_id year, sep(6)

     +------------------------------+
     | ic_id   it_id   ct_id   year |
     |------------------------------|
  1. |     1       1       1   2017 |
  2. |     2       1       3   2017 |
  3. |     3       3       3   2017 |
  4. |     1       2       2   2018 |
  5. |     2       2       4   2018 |
  6. |     3       4       4   2018 |
     +------------------------------+

Only the first approach satisfies this.

Last edited by Andrew Musau; 13 Feb 2019, 05:09.

Announcement