Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Expanding an unbalanced dataset based on regions

    Dear Statalist Users,

    Below is an example of an unbalanced panel dataset (please see Input example) where I observe countries at region (variable name is gadm) and year level (variable name is year). My goal is to come up with a balanced panel dataset as seen in the Output example. In my specific example, I have 5 distinct regions. However, each region does not necessarily have an observation in each year between 2000 and 2005.

    Thanks a lot.

    Best Regards,
    Cansu

    INPUT:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int year str8 country str9 gadm int grandtotal
    2000 "country1" "region1_1" 100
    2000 "country1" "region1_2" 120
    2000 "country1" "region1_3" 198
    2005 "country1" "region1_1" 200
    2005 "country1" "region1_2" 197
    2002 "country2" "region2_1" 209
    2002 "country2" "region2_2" 201
    end

    OUTPUT:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int year str8 country str9 gadm int grandtotal
    2000 "country1" "region1_1" 100
    2001 "country1" "region1_1"   0
    2002 "country1" "region1_1"   0
    2003 "country1" "region1_1"   0
    2004 "country1" "region1_1"   0
    2005 "country1" "region1_1" 200
    2000 "country1" "region1_2" 120
    2001 "country1" "region1_2"   0
    2002 "country1" "region1_2"   0
    2003 "country1" "region1_2"   0
    2004 "country1" "region1_2"   0
    2005 "country1" "region1_2" 197
    2000 "country1" "region1_3" 198
    2001 "country1" "region1_3"   0
    2002 "country1" "region1_3"   0
    2003 "country1" "region1_3"   0
    2004 "country1" "region1_3"   0
    2005 "country1" "region1_3"   0
    2000 "country2" "region2_1"   0
    2001 "country2" "region2_1"   0
    2002 "country2" "region2_1" 209
    2003 "country2" "region2_1"   0
    2004 "country2" "region2_1"   0
    2005 "country2" "region2_1"   0
    2000 "country2" "region2_2"   0
    2001 "country2" "region2_2"   0
    2002 "country2" "region2_2" 201
    2003 "country2" "region2_2"   0
    2004 "country2" "region2_2"   0
    2005 "country2" "region2_2"   0
    end

  • #2
    Code:
    egen pair = group(country gadm)
    xtset pair year
    tsfill, full
    
    foreach v of varlist country gadm {
        by pair (`v'), sort: replace `v' = `v'[_N]
    }
    sort pair year
    replace grandtotal = 0 if missing(grandtotal)
    will do this.

    But two questions.
    1. Why do you want to do this? There are few commands that require a balanced panel in Stata. Are you sure you need to do this?
    2. Are you sure it is appropriate to fill in grandtotal = 0 for the unobserved values? If the observations for those countries, regions, and years were missing in the data set because the data collection collected all and only those data where grandtotal exceeds 0, then this is fine. But if they are missing because they were, for extraneous reasons, unobserved, then the real values could be anything, and setting them to zero is a serious distortion of the data.

    Comment

    Working...
    X