Using values from parent industry for extrapolation

satya otil

Join Date: Sep 2019
Posts: 76

Using values from parent industry for extrapolation

17 Jun 2021, 09:22

Dear all,

I am currently working with a country-industry-year dataset, and have the following problem. For some countries, there is missing data for some industries. My solution to this problem is to use data from their parent industries, for example if industries within manufacturing have missing data, I want to use data for the aggregate manufacturing industry. This approach works as typically the more aggregate parent industries have data. An example of the dataset is added to hopefully make my problem clearer:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str3 cou str7 ind int year str6(ind_par ind_par2 ind_par3) float x_share
"ABC" "D01T99" 1990 ""       ""       ""                1
"ABC" "D01T03" 1990 "D01T99" ""       ""        .05566159
"ABC" "D90T99" 1990 "D01T99" ""       ""        .05768238
"ABC" "D69T82" 1990 "D01T99" ""       ""        .07298723
"ABC" "D05T09" 1990 "D01T99" ""       ""        .01224463
"ABC" "D84T88" 1990 "D01T99" ""       ""        .21284407
"ABC" "D28"    1990 "D10T33" "D01T99" ""                .
"ABC" "D41T43" 1990 "D01T99" ""       ""        .07382749
"ABC" "D45T47" 1990 "D45T56" "D01T99" ""        .16468546
"ABC" "D23"    1990 "D19T23" "D10T33" "D01T99"  .00676179
"ABC" "D68"    1990 "D01T99" ""       ""       .009778853
"ABC" "D10T12" 1990 "D10T33" "D01T99" ""        .02407338
"ABC" "D16T18" 1990 "D10T33" "D01T99" ""       .017090254
"ABC" "D10T33" 1990 "D01T99" ""       ""         .1407735
"ABC" "D35T39" 1990 "D01T99" ""       ""       .014827263
"ABC" "D49T53" 1990 "D45T56" "D01T99" ""        .05526453
"ABC" "D31T33" 1990 "D10T33" "D01T99" ""       .009405102
"ABC" "D24T25" 1990 "D10T33" "D01T99" ""        .02544084
"ABC" "D19T23" 1990 "D10T33" "D01T99" ""       .019596357
"ABC" "D64T66" 1990 "D01T99" ""       ""        .04571036
"ABC" "D29T30" 1990 "D10T33" "D01T99" ""       .015155792
"ABC" "D58T63" 1990 "D01T99" ""       ""        .02584345
"ABC" "D26T27" 1990 "D10T33" "D01T99" ""                .
"ABC" "D45T56" 1990 "D01T99" ""       ""         .2778193
"ABC" "D55T56" 1990 "D45T56" "D01T99" ""        .05786932
"ABC" "D13T15" 1990 "D10T33" "D01T99" ""       .013555143

end

The variable ind_par denotes the parent industry of the respective industry, which is denoted by ind. So the parent industry of D01T03 is D01T99. Then ind_par2 goes one parent level higher if possible, etc. D01T99 is the highest level. As you can see, the industry "D28" has missing data, but parent industry D10T33 does have data. So what I would like to do is use the annual growth rate in the share of their parent industry to plug in values for the industries for the years where this data is missing, i.e. extrapolation. I unfortunately do not know how I can do this. If anyone has an idea on how I can do this, it would be much appreciated.

Best,

Satya

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30061
#2

17 Jun 2021, 12:43

Code:

// CREATE A REFERENCE FRAME TO SERVE AS A SOURCE FOR REPLACEMENT VALUES frame put ind year x_share if !missing(x_share), into(source) frame change source by ind year (x_share), sort: assert x_share[1] == x_share[_N] duplicates drop frame change default rename ind_par ind_par1 gen imputed_share = x_share forvalues i = 1/3 { frlink m:1 ind_par`i' year, frame(source ind year) replace imputed_share = frval(source, x_share) if missing(imputed_share) drop source }

This code will, in principle, climb all the way up the hierarchy through ind_par3 until it finds a value of x_share to substitute for the missing value. But in the example data, this process always ends at ind_par1--there is never any observation that requires going higher than that level.

The code requires version 16 or later because it uses frames.

The code requires that if there are multiple observations for a given ind and year, the value of x_share is the same in all of them (including the possibility of being missing in all of them). If that is not true, then it wouldn't be possible to know which value of x_share to use. Because this assumption is critical, it is verified in the code. If the code breaks at the -assert- command, that assumption is violated, and you should then chase down the violations in the data and fix them.

Finally, I'm not sure this kind of imputation of missing values is really appropriate to your data and analyses, but I'll leave it to others to suggest alternative approaches.
2 likes
Comment
satya otil

Join Date: Sep 2019

Posts: 76
#3

18 Jun 2021, 05:33

Dear Clyde,

Thank you (once again) for a helpful answer to my problem. I understand your concern for the approach of imputing missing values, it is noted. The code is clear as well, only issue is that since I have more than one country, there are multiple observations for a given ind and year, such that the value of x_share is not the same in all of them since different countries have different values. Alternatively, it may be the case that it is missing for some countries but not all of them. Does this code not work then anymore?

Best,

Satya
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30061

18 Jun 2021, 12:00

Well, if the combination of country, ind, and year leads to either uniquely identified observations or at least consistent values of x_share, then you just need to slightly modify the code:

Code:

//  CREATE A REFERENCE FRAME TO SERVE AS A SOURCE FOR REPLACEMENT VALUES
frame put cou ind year x_share if !missing(x_share), into(source)
frame change source
by cou ind year (x_share), sort: assert x_share[1] == x_share[_N]
duplicates drop

frame change default
rename ind_par ind_par1
gen imputed_share = x_share
forvalues i = 1/3 {
    frlink m:1 cou ind_par`i' year, frame(source cou ind year)
    replace imputed_share = frval(source, x_share) if missing(imputed_share)
    drop source
}

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#5

18 Jun 2021, 15:44

The assumption on the consistency of x_share by industry-year-country as well as the concern on the imputation method (instructed by Professor Clyde in #2) are 2 critical issues that you need to pay attention to.

In ackowledging that, I just contribute another approach, which manually and specifically serves to "interprete" your imputation idea in #1. This code should work properly if (to emphasise again) the value of x_share is the same for each country-year-industry, even there are duplicated observations. The usage of only basic build-in commands allow the code to work in previous versions of Stata 16.

Code:

ren ind_par ind_par1 forval i = 1/3 { gen ind`i'Temp = cond(x_share ==., ind_par`i', ind) egen x_share`i'Temp = max(x_share), by(cou year ind`i'Temp) } egen Wanted = rowfirst(x_share*) drop *Temp
1 like
Comment
satya otil

Join Date: Sep 2019

Posts: 76
#6

20 Jun 2021, 07:02

Dear Clyde and Romalpa,

Thank you both for your help. The code works, and I will also keep your comments in mind.

Best,

Satya
Comment

Announcement

Using values from parent industry for extrapolation

Comment

Comment

Comment

Comment

Comment