Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Weighted average and percentiles of subpopulations

    Hey Statalists,

    Thank you for looking at my posts and providing such a great resource through contributions to this forum (it has helped a lot!). I am pretty new to stata and am having trouble calculating the weighted mean and percentiles for subpopulations in my data set.

    About my data:
    Census data from 1970 - 2015

    What I would like to do:
    Calculate the weighted mean, p10, p50, p90 of "wages1999" and using "newwt" as weights for each industry and year. So that in the end I will have e.g. the 10th percentile of wages1991 for industry X in year Y.


    Where I am having trouble:
    1. egen does not seem to allow for weights
    2. pctile does not seem to allow for by's
    3. collapse does not seem to allow for a variety of statistical functions (e.g. collapse (mean) wages1999 (p10) wages1999 (p50) wages1999 (p90) wages1999 (p99) wages1999 [aw = new
    > wt], by(year ind1990code))


    A random sample of 10 observation from my dataset with the relevant variables:
    Code:
    input int year float(ind1990code wages1999 newwt)
    1970  60     12031 194000
    1970 251     13847 261000
    1970 682      2497  36375
    1980 712 25118.775 204000
    1980 732 36616.727 204000
    1980 831  7217.775  81600
    1990 831     38976 274176
    2000 162     26900 125280
    2000 351     57000 361080
    2015 842     55200 118320
    end
    Thank you!

  • #2

    Code:
    clear
    
    input int year float(ind1990code wages1999 newwt)
    1970  60     12031 194000
    1970 251     13847 261000
    1970 682      2497  36375
    1980 712 25118.775 204000
    1980 732 36616.727 204000
    1980 831  7217.775  81600
    1990 831     38976 274176
    2000 162     26900 125280
    2000 351     57000 361080
    2015 842     55200 118320
    end
    
    collapse (mean)wages1999mean=wages1999 (p10)wages1999p10=wages1999 (p50)wages1999p50= wages1999 (p90)wages1999p90=wages1999 (p99)wages1999p99= wages1999 [aw = newwt], by(year ind1990code)
    list
    describe

    Produces the following output:

    Code:
    . clear
    
    . 
    . input int year float(ind1990code wages1999 newwt)
    
             year  ind1990~e  wages1999      newwt
      1. 1970  60     12031 194000
      2. 1970 251     13847 261000
      3. 1970 682      2497  36375
      4. 1980 712 25118.775 204000
      5. 1980 732 36616.727 204000
      6. 1980 831  7217.775  81600
      7. 1990 831     38976 274176
      8. 2000 162     26900 125280
      9. 2000 351     57000 361080
     10. 2015 842     55200 118320
     11. end
    
    . 
    . collapse (mean)wages1999mean=wages1999 (p10)wages1999p10=wages1999 (p50)wages1999p50= wages1999 (p90)wages1999p90=wages1999
    >  (p99)wages1999p99= wages1999 [aw = newwt], by(year ind1990code)
    
    . list
    
         +------------------------------------------------------------------------+
         | year   ind199~e   wages1~n   wages~10   wages~50   wages~90   wages~99 |
         |------------------------------------------------------------------------|
      1. | 1970         60      12031      12031      12031      12031      12031 |
      2. | 1970        251      13847      13847      13847      13847      13847 |
      3. | 1970        682       2497       2497       2497       2497       2497 |
      4. | 1980        712   25118.78   25118.78   25118.78   25118.78   25118.78 |
      5. | 1980        732   36616.73   36616.73   36616.73   36616.73   36616.73 |
         |------------------------------------------------------------------------|
      6. | 1980        831   7217.775   7217.775   7217.775   7217.775   7217.775 |
      7. | 1990        831      38976      38976      38976      38976      38976 |
      8. | 2000        162      26900      26900      26900      26900      26900 |
      9. | 2000        351      57000      57000      57000      57000      57000 |
     10. | 2015        842      55200      55200      55200      55200      55200 |
         +------------------------------------------------------------------------+
    
    . describe
    
    Contains data
      obs:            10                          
     vars:             7                          
     size:           260                          
    -----------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -----------------------------------------------------------------------------------------------------------------------------
    year            int     %8.0g                 
    ind1990code     float   %9.0g                 
    wages1999mean   float   %9.0g                 (mean) wages1999
    wages1999p10    float   %9.0g                 (p 10) wages1999
    wages1999p50    float   %9.0g                 (p 50) wages1999
    wages1999p90    float   %9.0g                 (p 90) wages1999
    wages1999p99    float   %9.0g                 (p 99) wages1999
    -----------------------------------------------------------------------------------------------------------------------------
    Sorted by: year  ind1990code
         Note: Dataset has changed since last saved.
    
    . 
    end of do-file

    Comment


    • #3
      Thank you so much, that worked. I wish collapse didn't remove all the observations and would just create new variables. I am sure that can be done, but this suffices for my needs. Thanks again!

      Comment


      • #4
        -collapse- always overwrites the previous data, and there is no option to override that. But you could save your data in a -tempfile- before you -collapse- and then -append- the original data back afterwards if that is what you want.

        That said, if you are planning to do further analysis on this data, having the aggregated data mixed in with the original disaggregated data is likely to get you into trouble. That sort of melange is fine for a spreadsheet display of results intended for human eyes but not for analysis, and should generally be avoided when working in analysis-oriented programs like Stata.

        Comment


        • #5
          I think merge would be better in this case then append. The attempts 1 and 2 in the first message indicate that Anon was trying to supplement observations with the stats for the groups where the observations belong to. So something like this will do:


          Code:
          clear
          
          input int year float(ind1990code wages1999 newwt)
          1970  60     12031 194000
          1970 251     13847 261000
          1970 682      2497  36375
          1980 712 25118.775 204000
          1980 732 36616.727 204000
          1980 831  7217.775  81600
          1990 831     38976 274176
          2000 162     26900 125280
          2000 351     57000 361080
          2015 842     55200 118320
          end
          
          preserve
          
          collapse (mean)wages1999mean=wages1999 (p10)wages1999p10=wages1999 (p50)wages1999p50= wages1999 (p90)wages1999p90=wages1999 (p99)wages1999p99= wages1999 [aw = newwt], by(year ind1990code)
          tempfile tmp
          save `"`tmp'"'
          restore
          
          sort year ind1990code
          merge year ind1990code using `"`tmp'"'
          drop _merge
          
          list
          describe

          Comment


          • #6
            Yes, that would be better, and would also avoid problems for further analysis. It would also be consistent with his wording in #3 that he wanted it to create new variables (-append- would produce additional observations, not new variables.)

            Comment


            • #7
              You guys are great, thanks again!

              Comment

              Working...
              X