Weighted average and percentiles of subpopulations

Anon Guenter

Join Date: Jan 2017

Posts: 3
#1

Weighted average and percentiles of subpopulations

23 Jan 2017, 06:37

Hey Statalists,

Thank you for looking at my posts and providing such a great resource through contributions to this forum (it has helped a lot!). I am pretty new to stata and am having trouble calculating the weighted mean and percentiles for subpopulations in my data set.

About my data:
Census data from 1970 - 2015

What I would like to do:
Calculate the weighted mean, p10, p50, p90 of "wages1999" and using "newwt" as weights for each industry and year. So that in the end I will have e.g. the 10th percentile of wages1991 for industry X in year Y.

Where I am having trouble:
1. egen does not seem to allow for weights
2. pctile does not seem to allow for by's
3. collapse does not seem to allow for a variety of statistical functions (e.g. collapse (mean) wages1999 (p10) wages1999 (p50) wages1999 (p90) wages1999 (p99) wages1999 [aw = new
> wt], by(year ind1990code))

A random sample of 10 observation from my dataset with the relevant variables:

Code:

input int year float(ind1990code wages1999 newwt) 1970 60 12031 194000 1970 251 13847 261000 1970 682 2497 36375 1980 712 25118.775 204000 1980 732 36616.727 204000 1980 831 7217.775 81600 1990 831 38976 274176 2000 162 26900 125280 2000 351 57000 361080 2015 842 55200 118320 end

Thank you!
Tags: None

Sergiy Radyakin

Join Date: Apr 2014
Posts: 1867

23 Jan 2017, 06:58

Code:

clear

input int year float(ind1990code wages1999 newwt)
1970  60     12031 194000
1970 251     13847 261000
1970 682      2497  36375
1980 712 25118.775 204000
1980 732 36616.727 204000
1980 831  7217.775  81600
1990 831     38976 274176
2000 162     26900 125280
2000 351     57000 361080
2015 842     55200 118320
end

collapse (mean)wages1999mean=wages1999 (p10)wages1999p10=wages1999 (p50)wages1999p50= wages1999 (p90)wages1999p90=wages1999 (p99)wages1999p99= wages1999 [aw = newwt], by(year ind1990code)
list
describe

Produces the following output:

Code:

. clear

. 
. input int year float(ind1990code wages1999 newwt)

         year  ind1990~e  wages1999      newwt
  1. 1970  60     12031 194000
  2. 1970 251     13847 261000
  3. 1970 682      2497  36375
  4. 1980 712 25118.775 204000
  5. 1980 732 36616.727 204000
  6. 1980 831  7217.775  81600
  7. 1990 831     38976 274176
  8. 2000 162     26900 125280
  9. 2000 351     57000 361080
 10. 2015 842     55200 118320
 11. end

. 
. collapse (mean)wages1999mean=wages1999 (p10)wages1999p10=wages1999 (p50)wages1999p50= wages1999 (p90)wages1999p90=wages1999
>  (p99)wages1999p99= wages1999 [aw = newwt], by(year ind1990code)

. list

     +------------------------------------------------------------------------+
     | year   ind199~e   wages1~n   wages~10   wages~50   wages~90   wages~99 |
     |------------------------------------------------------------------------|
  1. | 1970         60      12031      12031      12031      12031      12031 |
  2. | 1970        251      13847      13847      13847      13847      13847 |
  3. | 1970        682       2497       2497       2497       2497       2497 |
  4. | 1980        712   25118.78   25118.78   25118.78   25118.78   25118.78 |
  5. | 1980        732   36616.73   36616.73   36616.73   36616.73   36616.73 |
     |------------------------------------------------------------------------|
  6. | 1980        831   7217.775   7217.775   7217.775   7217.775   7217.775 |
  7. | 1990        831      38976      38976      38976      38976      38976 |
  8. | 2000        162      26900      26900      26900      26900      26900 |
  9. | 2000        351      57000      57000      57000      57000      57000 |
 10. | 2015        842      55200      55200      55200      55200      55200 |
     +------------------------------------------------------------------------+

. describe

Contains data
  obs:            10                          
 vars:             7                          
 size:           260                          
-----------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------------------------------------------------
year            int     %8.0g                 
ind1990code     float   %9.0g                 
wages1999mean   float   %9.0g                 (mean) wages1999
wages1999p10    float   %9.0g                 (p 10) wages1999
wages1999p50    float   %9.0g                 (p 50) wages1999
wages1999p90    float   %9.0g                 (p 90) wages1999
wages1999p99    float   %9.0g                 (p 99) wages1999
-----------------------------------------------------------------------------------------------------------------------------
Sorted by: year  ind1990code
     Note: Dataset has changed since last saved.

. 
end of do-file

Comment

Anon Guenter

Join Date: Jan 2017

Posts: 3
#3

23 Jan 2017, 07:44

Thank you so much, that worked. I wish collapse didn't remove all the observations and would just create new variables. I am sure that can be done, but this suffices for my needs. Thanks again!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#4

23 Jan 2017, 09:26

-collapse- always overwrites the previous data, and there is no option to override that. But you could save your data in a -tempfile- before you -collapse- and then -append- the original data back afterwards if that is what you want.

That said, if you are planning to do further analysis on this data, having the aggregated data mixed in with the original disaggregated data is likely to get you into trouble. That sort of melange is fine for a spreadsheet display of results intended for human eyes but not for analysis, and should generally be avoided when working in analysis-oriented programs like Stata.
Comment

Sergiy Radyakin

Join Date: Apr 2014
Posts: 1867

23 Jan 2017, 09:36

I think merge would be better in this case then append. The attempts 1 and 2 in the first message indicate that Anon was trying to supplement observations with the stats for the groups where the observations belong to. So something like this will do:

Code:

clear

input int year float(ind1990code wages1999 newwt)
1970  60     12031 194000
1970 251     13847 261000
1970 682      2497  36375
1980 712 25118.775 204000
1980 732 36616.727 204000
1980 831  7217.775  81600
1990 831     38976 274176
2000 162     26900 125280
2000 351     57000 361080
2015 842     55200 118320
end

preserve

collapse (mean)wages1999mean=wages1999 (p10)wages1999p10=wages1999 (p50)wages1999p50= wages1999 (p90)wages1999p90=wages1999 (p99)wages1999p99= wages1999 [aw = newwt], by(year ind1990code)
tempfile tmp
save `"`tmp'"'
restore

sort year ind1990code
merge year ind1990code using `"`tmp'"'
drop _merge

list
describe

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#6

23 Jan 2017, 09:55

Yes, that would be better, and would also avoid problems for further analysis. It would also be consistent with his wording in #3 that he wanted it to create new variables (-append- would produce additional observations, not new variables.)
Comment
Anon Guenter

Join Date: Jan 2017

Posts: 3
#7

23 Jan 2017, 19:43

You guys are great, thanks again!
Comment

Announcement

Weighted average and percentiles of subpopulations

Comment

Comment

Comment

Comment

Comment

Comment