  • Advice on changing data to split sample into size categories

    Hello everyone,

    I have a panel data set on the investments of German firms in other countries. The data are mainly about the target investment enterprises, but they also contain some information on the German investor company and, where the German investor is part of a corporate group, on that corporate group. The data set has the following structure:
    year | country  | investor | investment  | corporate group of investor | equity of investment | sales of investor | sales of corporate group
    2017 | country1 | inv_ID01 | target_ID01 | group_ID01 | x | x | x
    2017 | country1 | inv_ID01 | target_ID02 | group_ID01 | x | x | x
    2017 | country2 | inv_ID02 | target_ID03 | .          | x | x | .
    2017 | country2 | inv_ID03 | target_ID04 | group_ID02 | x | x | x
    2018 | country1 | inv_ID01 | target_ID01 | group_ID01 | x | x | x
    ...  | ...      | ...      | ...         | ...        | ... | ... | ...
    I want to collapse the data so that I only see the aggregated equity that a German investor holds in a given year and country. I already know that I can do that by using

    Code:
    collapse (sum) equity, by(year country investor)
    This gives me the following new table:
    year | country  | investor | equity of investment
    2017 | country1 | inv_ID01 | x
    2017 | country2 | inv_ID02 | x
    2017 | country2 | inv_ID03 | x
    2018 | country1 | inv_ID01 | x
    ...  | ...      | ...      | ...
    But now comes the problem I cannot solve. Rather than looking at the whole sample, I would like to split the collapsed sample above into three different size categories. The classification criterion should be the sales of the German investor, or the sales of the corporate group the German investor belongs to, in case it is part of one.

    The idea behind this is to check whether the results of my main analysis change depending on the size of the investor. However, looking only at the investor's sales will not give a true picture of its size, because some of these investors are just holding companies with zero sales, so it is necessary to look at the whole corporate group. I have tried many approaches so far but couldn't figure out a solution.

    Does anyone have an idea how to solve this problem?

    Best regards,
    Anton




  • #2
    Well, you won't actually split your dataset. What you will do is build a second dataset of German firms (the parent companies, for example group_ID01) and their sales by year, and then merge that (by firm_id and year) into your collapsed dataset (your second table above). Then run your regressions or do your other analysis.
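
    For illustration, a minimal sketch of that merge step, assuming the collapsed data carry a firm identifier firm_id that also appears in a saved firm-year sales file parent_sales.dta (both names are placeholders, not anything from this thread):

    Code:
    * collapsed investment data in memory; parent_sales.dta is a hypothetical
    * firm-year file holding the variables firm_id, year, and sales
    merge m:1 firm_id year using parent_sales, keep(match master) nogenerate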

    As a side note, I would not collapse sales into size categories (e.g. small, medium, and large) in your analysis, because you are essentially throwing away information. I would use the actual sales or ln_sales (adjusted for inflation) in your regression analysis. In conveying the magnitude of your results you may want to pick a convenient comparison (e.g. "a firm with $1 billion in annual sales is 3x as likely to make equity investments as a firm with $500M in sales..."), but I wouldn't recommend it for your raw analysis.
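
    A short sketch of that continuous-measure alternative, assuming a price index variable cpi (normalised to 100 in the base year) has already been added to the data; both variable names are placeholders:

    Code:
    * keep sales continuous: deflate and take logs instead of binning into size classes
    * cpi is a hypothetical price index variable, base year = 100
    gen sales_real = sales * 100 / cpi
    gen ln_sales = ln(sales_real)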
    Last edited by David Benson; 22 Jan 2019, 11:32.



    • #3
      Hi David, thank you for the quick reply. That's a good point; I should base this on a more meaningful measure than coarse size categories, as you mentioned above.

      I was also thinking that I will have to aggregate the sales first and merge them with my second table afterwards. However, the problem is more how to get to that point. I cannot just collapse the sales data of, e.g., the corporate group, because not every German investor belongs to a corporate group. As I mentioned, I need the sales of the corporate group if the German investor is part of one, and the sales of the German investor alone if it is a stand-alone company. So I always have an "inv_ID", but not always a "group_ID" (I tried to show that by putting a dot for "missing" in the fourth row :D ).

      If I have a group_ID, I would like to use the group sales; if I only have an inv_ID, I would like to use the investor sales. But this gets complicated because multiple investors often share the same group_ID, since there is a German ultimate parent with multiple affiliates that have their separate investments in one or several countries. I think I have to create some additional variables, and I have already tried a couple of approaches, but none worked, or I stopped because it got too complicated and confusing.



      • #4
        To add to David's helpful comment, you'll increase your chances of a useful answer by following the FAQ on asking questions: provide Stata code in code delimiters, readable Stata output, and sample data using dataex. It would also help if the sample included the problems you're dealing with. The zero-sales investors and the group structure are substantive issues more than Stata issues; the right solution will depend a lot on what you're trying to accomplish with the analysis and what procedures you intend to use.

        You might consider creating a new variable that equals the group if a group exists and the inv_ID if the group doesn't exist. While you can collapse if that works best (i.e., the observation unit for the analysis is the collapsed data), often you would use egen with a bysort prefix to calculate totals or means that are then part of the full data set.
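
        A minimal sketch of that idea, assuming the corporate-group identifier is a string variable called group_id and the investor identifier is called investor (both placeholder names, following the layout in #1):

        Code:
        * use the corporate group where one exists, otherwise the investor itself
        * (group_id and investor are hypothetical variable names)
        gen entity = cond(missing(group_id), investor, group_id)

        * yearly totals by that combined identifier, kept alongside the full
        * data set rather than collapsing
        bysort entity year: egen entity_equity = total(equity)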



        • #5
          Hi Anton,

          It sounds like there are two issues here (correct me if I am wrong):
          1) How to aggregate sales to the correct level (group vs standalone company)
          2) How to merge that into your portfolio-level dataset.

          From this table (copied from post #1), do you already have the correct sales numbers where there are x's? If so, it would be pretty straightforward to use Phil's approach in #4.
          year | country  | investor | investment  | corporate group of investor | equity of investment | sales of investor | sales of corporate group
          2017 | country1 | inv_ID01 | target_ID01 | group_ID01 | x | x | x
          2017 | country1 | inv_ID01 | target_ID02 | group_ID01 | x | x | x
          2017 | country2 | inv_ID02 | target_ID03 | .          | x | x | .
          2017 | country2 | inv_ID03 | target_ID04 | group_ID02 | x | x | x
          2018 | country1 | inv_ID01 | target_ID01 | group_ID01 | x | x | x
          ...  | ...      | ...      | ...         | ...        | ... | ... | ...

          Code:
          * I added a couple more observations in 2018 to make sure this would work
          dataex year country investor_id investment investor_corp_group invest_amt sales_investor sales_corp_group
          // Data shared via  -dataex-. To install: ssc install dataex
          clear
          input int year str7 country str8 investor_id str11 investment str10 investor_corp_group float invest_amt byte sales_investor int sales_corp_group
          2017 "Germany" "inv_ID01" "target_ID01" "group_ID01"   5 60 150
          2017 "Germany" "inv_ID01" "target_ID02" "group_ID01" 7.5 60 150
          2017 "Austria" "inv_ID02" "target_ID03" ""             2 50   .
          2017 "Austria" "inv_ID03" "target_ID04" "group_ID02"   3 85 130
          2018 "Germany" "inv_ID01" "target_ID01" "group_ID01"   6 60 150
          2018 "Austria" "inv_ID05" "target_ID03" ""           4.5 55   .
          2018 "Austria" "inv_ID03" "target_ID04" "group_ID02"   2 90 170
          end
          ------------------ copy up to and including the previous line ------------------
          Code:
          sort investor_id country investor_corp_group year
          list, sepby(investor_id) noobs abbrev(16)
          
            +----------------------------------------------------------------------------------------------------------------+
            | year   country   investor_id    investment   investor_corp_~p   invest_amt   sales_investor   sales_corp_group |
            |----------------------------------------------------------------------------------------------------------------|
            | 2017   Germany      inv_ID01   target_ID02         group_ID01          7.5               60                150 |
            | 2017   Germany      inv_ID01   target_ID01         group_ID01            5               60                150 |
            | 2018   Germany      inv_ID01   target_ID01         group_ID01            6               60                150 |
            |----------------------------------------------------------------------------------------------------------------|
            | 2017   Austria      inv_ID02   target_ID03                               2               50                  . |
            |----------------------------------------------------------------------------------------------------------------|
            | 2017   Austria      inv_ID03   target_ID04         group_ID02            3               85                130 |
            | 2018   Austria      inv_ID03   target_ID04         group_ID02            2               90                170 |
            |----------------------------------------------------------------------------------------------------------------|
            | 2018   Austria      inv_ID05   target_ID03                             4.5               55                  . |
            +----------------------------------------------------------------------------------------------------------------+
          
          * Creating sales_to_use and investor_to_use to make it easier to merge
          gen has_parent = (investor_corp_group != "")
          gen sales_to_use = sales_corp_group if sales_corp_group!=.
          replace sales_to_use = sales_investor if sales_corp_group==.
          gen investor_to_use = investor_corp_group if has_parent==1
          replace investor_to_use = investor_id if has_parent==0
          
          * I didn't include country, investment, or invest_amt in this list so it could fit on a single page
          list year investor_id investor_corp_group sales_investor sales_corp_group has_parent sales_to_use investor_to_use, noobs sepby(investor_id) abbrev(12)
          
            +------------------------------------------------------------------------------------------------------------+
            | year   investor_id   investor_c~p   sales_inve~r   sales_corp~p   has_parent   sales_to_use   investor_t~e |
            |------------------------------------------------------------------------------------------------------------|
            | 2017      inv_ID01     group_ID01             60            150            1            150     group_ID01 |
            | 2017      inv_ID01     group_ID01             60            150            1            150     group_ID01 |
            | 2018      inv_ID01     group_ID01             60            150            1            150     group_ID01 |
            |------------------------------------------------------------------------------------------------------------|
            | 2017      inv_ID02                            50              .            0             50       inv_ID02 |
            |------------------------------------------------------------------------------------------------------------|
            | 2017      inv_ID03     group_ID02             85            130            1            130     group_ID02 |
            | 2018      inv_ID03     group_ID02             90            170            1            170     group_ID02 |
            |------------------------------------------------------------------------------------------------------------|
            | 2018      inv_ID05                            55              .            0             55       inv_ID05 |
            +------------------------------------------------------------------------------------------------------------+
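
          From there, one possible last step toward the split asked about in #1, only as a sketch building on sales_to_use from above (and bearing in mind the advice in #2 against binning; the tercile cutoffs below are arbitrary):

          Code:
          * aggregate equity per year, country, and investor, carrying the size measure along
          collapse (sum) invest_amt (firstnm) sales_to_use, by(year country investor_id)

          * three size categories defined as terciles of the chosen sales measure
          xtile size_cat = sales_to_use, nquantiles(3)
          label define size_lbl 1 "small" 2 "medium" 3 "large"
          label values size_cat size_lbl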



          • #6
            Hello David,

            sorry for the late reply, I didn't have time to continue with this part of the project earlier. Today I read through your last post and implemented your suggestion. It worked! I also struggled for a while with the subsequent merge of the new table with the data I will later use for the analysis, because some sales figures were counted multiple times due to the collapsing/aggregating, but I solved it.

            Thank you very much for your advice and for putting so much effort into helping me out.

            Best regards,
            Anton

