Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using values from parent industry for extrapolation

    Dear all,

    I am currently working with a country-industry-year dataset, and have the following problem. For some countries, there is missing data for some industries. My solution to this problem is to use data from their parent industries, for example if industries within manufacturing have missing data, I want to use data for the aggregate manufacturing industry. This approach works as typically the more aggregate parent industries have data. An example of the dataset is added to hopefully make my problem clearer:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 cou str7 ind int year str6(ind_par ind_par2 ind_par3) float x_share
    "ABC" "D01T99" 1990 ""       ""       ""                1
    "ABC" "D01T03" 1990 "D01T99" ""       ""        .05566159
    "ABC" "D90T99" 1990 "D01T99" ""       ""        .05768238
    "ABC" "D69T82" 1990 "D01T99" ""       ""        .07298723
    "ABC" "D05T09" 1990 "D01T99" ""       ""        .01224463
    "ABC" "D84T88" 1990 "D01T99" ""       ""        .21284407
    "ABC" "D28"    1990 "D10T33" "D01T99" ""                .
    "ABC" "D41T43" 1990 "D01T99" ""       ""        .07382749
    "ABC" "D45T47" 1990 "D45T56" "D01T99" ""        .16468546
    "ABC" "D23"    1990 "D19T23" "D10T33" "D01T99"  .00676179
    "ABC" "D68"    1990 "D01T99" ""       ""       .009778853
    "ABC" "D10T12" 1990 "D10T33" "D01T99" ""        .02407338
    "ABC" "D16T18" 1990 "D10T33" "D01T99" ""       .017090254
    "ABC" "D10T33" 1990 "D01T99" ""       ""         .1407735
    "ABC" "D35T39" 1990 "D01T99" ""       ""       .014827263
    "ABC" "D49T53" 1990 "D45T56" "D01T99" ""        .05526453
    "ABC" "D31T33" 1990 "D10T33" "D01T99" ""       .009405102
    "ABC" "D24T25" 1990 "D10T33" "D01T99" ""        .02544084
    "ABC" "D19T23" 1990 "D10T33" "D01T99" ""       .019596357
    "ABC" "D64T66" 1990 "D01T99" ""       ""        .04571036
    "ABC" "D29T30" 1990 "D10T33" "D01T99" ""       .015155792
    "ABC" "D58T63" 1990 "D01T99" ""       ""        .02584345
    "ABC" "D26T27" 1990 "D10T33" "D01T99" ""                .
    "ABC" "D45T56" 1990 "D01T99" ""       ""         .2778193
    "ABC" "D55T56" 1990 "D45T56" "D01T99" ""        .05786932
    "ABC" "D13T15" 1990 "D10T33" "D01T99" ""       .013555143
    
    end
    The variable ind_par denotes the parent industry of the respective industry, which is denoted by ind. So the parent industry of D01T03 is D01T99. Then ind_par2 goes one parent level higher if possible, etc. D01T99 is the highest level. As you can see, the industry "D28" has missing data, but parent industry D10T33 does have data. So what I would like to do is use the annual growth rate in the share of their parent industry to plug in values for the industries for the years where this data is missing, i.e. extrapolation. I unfortunately do not know how I can do this. If anyone has an idea on how I can do this, it would be much appreciated.

    Best,

    Satya

  • #2
    Code:
    //  CREATE A REFERENCE FRAME TO SERVE AS A SOURCE FOR REPLACEMENT VALUES
    frame put ind year x_share if !missing(x_share), into(source)
    frame change source
    by ind year (x_share), sort: assert x_share[1] == x_share[_N]
    duplicates drop
    
    frame change default
    rename ind_par ind_par1
    gen imputed_share = x_share
    forvalues i = 1/3 {
        frlink m:1 ind_par`i' year, frame(source ind year)
        replace imputed_share = frval(source, x_share) if missing(imputed_share)
        drop source
    }
    This code will, in principle, climb all the way up the hierarchy through ind_par3 until it finds a value of x_share to substitute for the missing value. But in the example data, this process always ends at ind_par1--there is never any observation that requires going higher than that level.

    The code requires version 16 or later because it uses frames.

    The code requires that if there are multiple observations for a given ind and year, the value of x_share is the same in all of them (including the possibility of being missing in all of them). If that is not true, then it wouldn't be possible to know which value of x_share to use. Because this assumption is critical, it is verified in the code. If the code breaks at the -assert- command, that assumption is violated, and you should then chase down the violations in the data and fix them.

    Finally, I'm not sure this kind of imputation of missing values is really appropriate to your data and analyses, but I'll leave it to others to suggest alternative approaches.

    Comment


    • #3
      Dear Clyde,

      Thank you (once again) for a helpful answer to my problem. I understand your concern for the approach of imputing missing values, it is noted. The code is clear as well, only issue is that since I have more than one country, there are multiple observations for a given ind and year, such that the value of x_share is not the same in all of them since different countries have different values. Alternatively, it may be the case that it is missing for some countries but not all of them. Does this code not work then anymore?

      Best,

      Satya

      Comment


      • #4
        Well, if the combination of country, ind, and year leads to either uniquely identified observations or at least consistent values of x_share, then you just need to slightly modify the code:

        Code:
        //  CREATE A REFERENCE FRAME TO SERVE AS A SOURCE FOR REPLACEMENT VALUES
        frame put cou ind year x_share if !missing(x_share), into(source)
        frame change source
        by cou ind year (x_share), sort: assert x_share[1] == x_share[_N]
        duplicates drop
        
        frame change default
        rename ind_par ind_par1
        gen imputed_share = x_share
        forvalues i = 1/3 {
            frlink m:1 cou ind_par`i' year, frame(source cou ind year)
            replace imputed_share = frval(source, x_share) if missing(imputed_share)
            drop source
        }

        Comment


        • #5
          The assumption on the consistency of x_share by industry-year-country as well as the concern on the imputation method (instructed by Professor Clyde in #2) are 2 critical issues that you need to pay attention to.

          In ackowledging that, I just contribute another approach, which manually and specifically serves to "interprete" your imputation idea in #1. This code should work properly if (to emphasise again) the value of x_share is the same for each country-year-industry, even there are duplicated observations. The usage of only basic build-in commands allow the code to work in previous versions of Stata 16.
          Code:
          ren ind_par ind_par1
          
          forval i = 1/3 {
              gen ind`i'Temp = cond(x_share ==., ind_par`i', ind)
              egen x_share`i'Temp = max(x_share), by(cou year ind`i'Temp)
          }
          
          egen Wanted = rowfirst(x_share*)
          
          drop *Temp

          Comment


          • #6
            Dear Clyde and Romalpa,

            Thank you both for your help. The code works, and I will also keep your comments in mind.

            Best,

            Satya

            Comment

            Working...
            X