
  • Exporting Mean/Median of Variable X by variables Y and Z

    I am using data from the IPUMS-CPS ASEC series. I am looking to calculate the median income for each age and each education category. A sample of the data for 16-year-olds only is as follows:
    Code:
    year    age   educ    inc    weight
    1992    16    "HS"    1000    20
    1992    16    "CG"    1200    23
    1993    16    "CG"    1400    24
    1994    16    "HS"    1500    19
    In short, there are data from multiple years with different educational categories. What I want to do is calculate the median income for 16-year-olds (and subsequently for all ages from 16 to 65) by age and education.

    I have tried the following:
    Code:
    bysort age educ: egen inc_median_age_educ = median(inc)
    I think this works well, but there are two issues:

    1. I cannot include the weight variable in this calculation, which makes the result wrong. Going by this FAQ, I am also not sure I can include two grouping variables: https://www.stata.com/support/faqs/d...ry-statistics/. Kindly correct me if I'm wrong.

    2. Would there be a way to export this to an external txt/xls file rather than adding an additional column? For now I am using the following:

    Code:
    #delimit ;
    tabulate age [aw = weight], summ(inc_median_age_educ) means;
    /*or*/
    
    /*doesn't work, creates N and mean for age and education separately*/
    tabout age educ [aw = weight] using try.xls, replace format(8) ptotal(none) lines(none)
    sum cells(N inc_median_age_educ mean inc_median_age_educ);
    Would there be a way to compute this mean/median with tabout, since it allows exporting without creating additional columns?

    I am using Stata/SE 16.1 on Windows 10.

  • #2
    I think you may want to look at the collapse command. A starting point would be
    Code:
    collapse (median) median_inc_age_educ=inc [aw = weight], by(age educ)
    which replaces the dataset in memory with one observation for each combination of age and educ. You can then use -export delimited- or -export excel- to create a text file or Excel worksheet containing this data.
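    For example, a minimal sketch of the export step (the filename medians.xlsx is just a placeholder):
    Code:
    * assumes the -collapse- command above has already been run
    export excel using "medians.xlsx", firstrow(variables) replace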



    • #3
      You don't say what kind of weights the weight variable is. If it is an fweight or aweight, you can do this:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input int year byte age str4 educ int inc byte weight
      1992 16 `""HS""' 1000 20
      1992 16 `""CG""' 1200 23
      1993 16 `""CG""' 1400 24
      1994 16 `""HS""' 1500 19
      end
      
      capture program drop one_group
      program define one_group
          summ inc [aweight = weight], detail // OR fweight ALSO ALLOWED
          gen weighted_median_inc = `r(p50)'
          exit
      end
      
      runby one_group, by(age educ)
      -runby- is written by Robert Picard and me, and is available from SSC.

      If these are iweights or pweights, then I think you would instead have to use -qreg- to estimate the weighted median. But that won't work in your example data because there aren't enough observations for -qreg- to run.
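      A minimal sketch of that approach, assuming pweights and enough observations in each group (the program name one_group_pw is just illustrative):

      Code:
      capture program drop one_group_pw
      program define one_group_pw
          // constant-only median regression: the estimated constant is the weighted median
          qreg inc [pweight = weight]
          gen weighted_median_inc = _b[_cons]
          exit
      end
      
      runby one_group_pw, by(age educ)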

      As for exporting the results somewhere else, after you have created the weighted median variable you can do:

      Code:
      collapse (first) weighted_median_inc, by(age educ)
      and then you can export that in the usual ways (-export delimited-, -export excel-, whatever.)
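      For example, a minimal sketch (the filename medians.csv is just a placeholder):
      Code:
      * assumes the -collapse (first)- step above has already been run
      export delimited using "medians.csv", replace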

      In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
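
      A minimal sketch of that workflow (the variable list and observation range here are just illustrative):
      Code:
      ssc install dataex        // only needed if -dataex- is not already installed
      help dataex               // read the instructions
      dataex year age educ inc weight in 1/10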

      Added: Crossed with #2. William Lisowski's solution is better.



      • #4
        Thanks William and Clyde for your replies.
        I have two follow-up questions; please look into them if you can.

        This is a sample dataset generated using -dataex- and -randomtag-, which I hadn't used previously, sorry about that! If you require more observations, please let me know.
        I think the weights are aweights.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int year byte age int educ_new float incwage_defl double asecwt
        1971 52 4        0 1266.69
        1971 52 1    47234 1419.43
        1980 32 4 41101.16  245.14
        1980 31 3  26117.1  720.98
        1982 44 2  28961.4  365.43
        1987 23 2 18337.28 1774.19
        1997 50 2        0 3563.42
        1998 33 3        0    2435
        1999 37 2 26306.28   331.7
        2002 39 4    65870 2512.95
        2002 41 2    10351  806.42
        2008 47 4        0 1707.23
        2009 64 1        0 1195.99
        2011 19 1        0 1191.23
        2012 58 2    66690 1219.54
        end
        label values age age_lbl
        label def age_lbl 19 "19", modify
        label def age_lbl 23 "23", modify
        label def age_lbl 31 "31", modify
        label def age_lbl 32 "32", modify
        label def age_lbl 33 "33", modify
        label def age_lbl 37 "37", modify
        label def age_lbl 39 "39", modify
        label def age_lbl 41 "41", modify
        label def age_lbl 44 "44", modify
        label def age_lbl 47 "47", modify
        label def age_lbl 50 "50", modify
        label def age_lbl 52 "52", modify
        label def age_lbl 58 "58", modify
        label def age_lbl 64 "64", modify
        label values educ_new neweduc
        label def neweduc 1 "Less than high school", modify
        label def neweduc 2 "High school", modify
        label def neweduc 3 "Some college", modify
        label def neweduc 4 "College degree and above", modify
        1. I looked into the collapse command; however, I would prefer not to change the dataset itself, but rather just generate tables if possible. -tabulate- and -tabstat- were the options I knew of, so if you have any suggestions for something I can look into that keeps the dataset intact, please let me know!

        2. Since median cannot be used with gen, should I be using something like this (egen)?

        Code:
        capture program drop one_group
        program define one_group
            summ incwage_defl [aweight = asecwt], detail // OR fweight ALSO ALLOWED
            egen weighted_median_inc = `median()'
            exit
        end
        
        runby one_group, by(age educ)



        • #5
          2. Since median cannot be used with gen, should I be using something like this (egen)?
          No. The code I showed in #3 correctly generates a weighted median variable. `r(p50)' is the way you access the median after -summ, detail-, and after most commands that return a median. The code you propose in #4 will not work: it will just give you a syntax error.
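
          For example, a quick way to see this, using Stata's shipped auto dataset:
          Code:
          sysuse auto, clear
          summarize price, detail
          display r(p50)    // the median reported by -summarize, detail-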

          To create the tables you want without modifying the data set (much) you can do this:
          Code:
          egen age_educ = group(age educ_new), label
          tabstat incwage_defl [aweight = asecwt], by(age_educ) statistics(p50)
          provided your weight is an aweight or fweight.

          If the new variable age_educ really irritates you, you can always -drop- it after the tabulation.



          • #6
            Got it Clyde, thank you!
            Thanks for explaining the median syntax, and -tabstat- does the trick.



            • #7
              Dear Clyde Schechter

              I want to create a remoteness index between two countries. That is:

              [Image: Remoteness.PNG — formula for the remoteness index]

              The first term is the natural logarithm of the mean distance of country i from all of its trading partners except j, the second term is the natural logarithm of the mean distance of country j from all of its trading partners except i, and the index is the simple average of the two terms. The data in my case look like:
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str3(iso_i iso_j) double distance
              "ABW" "AFG"   13257.814453125
              "ABW" "AGO"   9516.9130859375
              "ABW" "AIA" 983.2682495117188
              "ABW" "ALB"      9091.7421875
              "AFG" "ABW"   13257.814453125
              "AFG" "AGO"   7607.7998046875
              "AFG" "AIA"   12292.158203125
              "AFG" "ALB"  4342.11669921875
              "AGO" "ABW"   9516.9130859375
              "AGO" "AFG"   7607.7998046875
              "AGO" "AIA"   8901.4638671875
              "AGO" "ALB"  5623.82666015625
              end
              The distance measures are symmetric; the distance between ABW-AFG is the same as the distance between AFG-ABW and so on.

              Thanks,
              (Ridwan)






              • #8
                This question has nothing to do with the topic of this thread. While it is easy to think of Forum threads as dialogs between a questioner and a responder, they are also read, and searched for, by other Forum members and visitors to the site. Those people will generally choose what to read based on the thread title. By including extraneous material, you make those who came looking for the topic in the thread title waste their time, and you prevent those who are searching for the off-topic material from finding it. In the future, please always start a new thread when you change topic.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str3(iso_i iso_j) double distance
                "ABW" "AFG"   13257.814453125
                "ABW" "AGO"   9516.9130859375
                "ABW" "AIA" 983.2682495117188
                "ABW" "ALB"      9091.7421875
                "AFG" "ABW"   13257.814453125
                "AFG" "AGO"   7607.7998046875
                "AFG" "AIA"   12292.158203125
                "AFG" "ALB"  4342.11669921875
                "AGO" "ABW"   9516.9130859375
                "AGO" "AFG"   7607.7998046875
                "AGO" "AIA"   8901.4638671875
                "AGO" "ALB"  5623.82666015625
                end
                
                //  VERIFY NECESSARY ASSUMPTIONS
                //  ASSUMPTION 1: ISO_I AND ISO_J ARE ALWAYS DIFFERENT
                assert iso_i != iso_j
                
                //  ASSUMPTION 2: EVERY ISO_I ISO_J PAIR HAS SAME DISTANCE AS ISO_J ISO_i
                gen pair = cond(iso_i < iso_j, iso_i+iso_j, iso_j+iso_i)
                by pair (distance), sort: assert distance[1] == distance[_N]
                
                tempfile copy
                save `copy'
                
                //  FOR PAIRS ISO_I ISO_J THAT ONLY OCCUR IN ONE ORDER, CREATE
                //  A REVERSED ORDER PAIR
                by pair: gen expander = cond(_N == 1, 2, 1)
                expand expander
                by pair, sort: gen temp = iso_i if expander == 2 & _n == 2
                by pair: replace iso_i = iso_j if expander == 2 & _n == 2
                by pair: replace iso_j = temp if expander == 2 & _n == 2
                
                drop expander temp
                isid iso_i iso_j
                
                rangestat (mean) t1 = distance, by(iso_i) excludeself interval(distance . .)
                rangestat (mean) t2 = distance, by(iso_j) excludeself interval(distance . .)
                gen wanted = 0.5*(log(t1) + log(t2))
                
                merge 1:1 iso_i iso_j using `copy', keep(match using) nogenerate
                There are key assumptions required for this code to work properly, and they are verified at the top of the code.
                Note: -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.

                Added: The formula you cite in #7 is incorrect. In both terms, N-1 should be (N-1).
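
                In other words, the intended index is presumably something like this (my notation; d denotes bilateral distance, and each sum runs over the N-1 distances being averaged):

                \[
                \mathrm{Rem}_{ij} = \frac{1}{2}\left[\ln\!\left(\frac{1}{N-1}\sum_{k \neq j} d_{ik}\right) + \ln\!\left(\frac{1}{N-1}\sum_{k \neq i} d_{jk}\right)\right]
                \]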
                Last edited by Clyde Schechter; 25 Mar 2023, 10:17.



                • #9
                  Thank you, Clyde Schechter. I am sorry for posting this query here. I shall be very careful in the future.


                  Added: I am sorry for missing it before, but in the original data, some observations give a country's distance from itself (a non-negative number). That is, there exist pairs where iso_i == iso_j (ABW-ABW, AFG-AFG, etc.). Does this make any difference in creating the remoteness index? And what do you mean by "iso_i and iso_j are always different" (assumption 1)? I am not clearly getting it.

                  Thanks,
                  (Ridwan)
                  Last edited by Ridwan Sheikh; 25 Mar 2023, 13:15.



                  • #10
                    What I meant when I said iso_i and iso_j must be different is precisely what you now say is not true of your data. You have said that you want to compute the remoteness of a country from itself with this formula, which my code in #8 would not permit.

                    Here is modified code that will allow iso_i == iso_j:
                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str3(iso_i iso_j) double distance
                    "ABW" "AFG"   13257.814453125
                    "ABW" "AGO"   9516.9130859375
                    "ABW" "AIA" 983.2682495117188
                    "ABW" "ALB"      9091.7421875
                    "AFG" "ABW"   13257.814453125
                    "AFG" "AGO"   7607.7998046875
                    "AFG" "AIA"   12292.158203125
                    "AFG" "ALB"  4342.11669921875
                    "AGO" "ABW"   9516.9130859375
                    "AGO" "AFG"   7607.7998046875
                    "AGO" "AIA"   8901.4638671875
                    "AGO" "ALB"  5623.82666015625
                    "AGO" "AGO"  1234.567890
                    end
                    
                    //  VERIFY NECESSARY ASSUMPTIONS
                    
                    //  ASSUMPTION: EVERY ISO_I ISO_J PAIR HAS SAME DISTANCE AS ISO_J ISO_i
                    gen pair = cond(iso_i < iso_j, iso_i+iso_j, iso_j+iso_i)
                    by pair (distance), sort: assert distance[1] == distance[_N]
                    
                    // ASSUMPTION: NO PAIR APPEARS TWICE IN THE DATA IN THE SAME ORDER
                    isid iso_i iso_j
                    
                    tempfile copy
                    save `copy'
                    
                    //  FOR PAIRS ISO_I ISO_J THAT ONLY OCCUR IN ONE ORDER, CREATE
                    //  A REVERSED ORDER PAIR
                    by pair: gen expander = cond(_N == 1, 2, 1)
                    expand expander
                    by pair, sort: gen temp = iso_i if expander == 2 & _n == 2
                    by pair: replace iso_i = iso_j if expander == 2 & _n == 2
                    by pair: replace iso_j = temp if expander == 2 & _n == 2
                    
                    drop expander temp
                    encode iso_i, gen(n_iso_i)
                    encode iso_j, gen(n_iso_j)
                    
                    rangestat (count) N1 = n_iso_j (mean) t1 = distance, by(iso_i) excludeself interval(distance . .)
                    rangestat (count) N2 = n_iso_i (mean) t2 = distance, by(iso_j) excludeself interval(distance . .)
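                    //  FOR SELF-PAIRS (ISO_I == ISO_J), REMOVE THE SELF-DISTANCE FROM THE COMPUTED MEANS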
                    replace t1 = (N1*t1 - distance)/(N1-1) if iso_i == iso_j
                    replace t2 = (N2*t2 - distance)/(N2-1) if iso_i == iso_j
                    gen wanted = 0.5*(log(t1) + log(t2))
                    by pair, sort: keep if _n == 1
                    drop N1 N2 t1 t2 n_iso_*
                    
                    merge 1:m pair using `copy', keep(match using) nogenerate



                    • #11
                      Thanks Clyde Schechter !
                      When I run
                      Code:
                      gen pair = cond(iso_i < iso_j, iso_i+iso_j, iso_j+iso_i)
                      
                      by pair (distance), sort: assert distance[1] == distance[_N]
                      It throws an error message: (23,526 contradictions in 25,200 by-groups) assertion is false.
                      Symmetric distance is the more plausible assumption: for most pairs, the (iso_i, iso_j) distance is the same as the (iso_j, iso_i) distance. But possibly because of missing values and rounding errors (2 decimal places), the assertion fails. I have briefly checked the data (some observations), and it seems that dist_i,j = dist_j,i (symmetric distance) in most cases.

                      Please get back to me on how we should proceed in this case.

                      I am attaching my original data file here. If you can have a look at it with this code, that would be of great help.


                      Thanks and regards,
                      (Ridwan)


                      Attached Files
                      Last edited by Ridwan Sheikh; 27 Mar 2023, 04:42.



                      • #12
                        Well, on your assurance that the differences between the distances are due to missing values and small rounding errors, if this were my project I would verify that these are the only problems, and then use the average of the distances reported for the same pair. To verify that these are the only problems with distance asymmetry, I would set a tolerance level for the distance and -assert- that the difference between them is at most that much. In the code below, I have set the tolerance at 0.05, which means that the two values would round to the same value to 1 decimal place. If you want assurance that they would round to the same value to 2 decimal places, you would use 0.005 instead.

                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input str3(iso_i iso_j) double distance
                        "ABW" "AFG"   13257.814453125
                        "ABW" "AGO"   9516.9130859375
                        "ABW" "AIA" 983.2682495117188
                        "ABW" "ALB"      9091.7421875
                        "AFG" "ABW"   13257.814453125
                        "AFG" "AGO"   7607.7998046875
                        "AFG" "AIA"   12292.158203125
                        "AFG" "ALB"  4342.11669921875
                        "AGO" "ABW"   9516.9130859375
                        "AGO" "AFG"   7607.7998046875
                        "AGO" "AIA"   8901.4638671875
                        "AGO" "ALB"  5623.82666015625
                        "AGO" "AGO"  1234.567890
                        end
                        
                        //  VERIFY NECESSARY ASSUMPTIONS
                        
                        //  ASSUMPTION: EVERY ISO_I ISO_J PAIR HAS SAME DISTANCE AS ISO_J ISO_i
                        gen pair = cond(iso_i < iso_j, iso_i+iso_j, iso_j+iso_i)
                        by pair (distance), sort: egen max_distance = max(distance)
                        by pair (distance): egen min_distance = min(distance)
                        assert max_distance - min_distance <= 0.05
                        by pair (distance): egen mean_distance = mean(distance)
                        
                        // ASSUMPTION: NO PAIR APPEARS TWICE IN THE DATA IN THE SAME ORDER
                        isid iso_i iso_j
                        
                        tempfile copy
                        save `copy'
                        
                        //  FOR PAIRS ISO_I ISO_J THAT ONLY OCCUR IN ONE ORDER, CREATE
                        //  A REVERSED ORDER PAIR
                        by pair: gen expander = cond(_N == 1, 2, 1)
                        expand expander
                        by pair, sort: gen temp = iso_i if expander == 2 & _n == 2
                        by pair: replace iso_i = iso_j if expander == 2 & _n == 2
                        by pair: replace iso_j = temp if expander == 2 & _n == 2
                        
                        drop expander temp
                        encode iso_i, gen(n_iso_i)
                        encode iso_j, gen(n_iso_j)
                        
                        rangestat (count) N1 = n_iso_j (mean) t1 = mean_distance, by(iso_i) excludeself interval(distance . .)
                        rangestat (count) N2 = n_iso_i (mean) t2 = mean_distance, by(iso_j) excludeself interval(distance . .)
                        replace t1 = (N1*t1 - mean_distance)/(N1-1) if iso_i == iso_j
                        replace t2 = (N2*t2 - mean_distance)/(N2-1) if iso_i == iso_j
                        gen wanted = 0.5*(log(t1) + log(t2))
                        by pair, sort: keep if _n == 1
                        drop N1 N2 t1 t2 n_iso_*
                        
                        merge 1:m pair using `copy', keep(match using) nogenerate
                        By the way, I never download attachments from people I do not know, so I have not looked at your full data set. I appreciate your willingness to make the full data set available. But I think working off the same -dataex- example as before does the trick here. If this code proves unsuitable for your data, please post back with another -dataex- example that exhibits whatever problem arises.



                        • #13
                          Thank you very much, Clyde Schechter.

                          I tried running your earlier code from #10 using another measure of distance, and the assertion we set about symmetric distance now holds. But I am facing a different issue.
                          1) When I run the following lines, they do not make any real changes to the data.

                          Code:
                          by pair: replace iso_i = iso_j if expander == 2 & _n == 2
                          (0 real changes made)
                          
                          by pair: replace iso_j = temp if expander == 2 & _n == 2
                          (0 real changes made)
                          2) The following code
                          Code:
                           rangestat (count) N1 = n_iso_j (mean) t1 = distcap, by(iso_i) excludeself interval(distcap. .)
                          displayed an error message: factor variables and time-series operators not allowed.

                          When I run this code on the following dataex example
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input str3(iso_i iso_j) double distance
                          "ABW" "ABW"  5.225314617156982
                          "ABW" "AFG"    13257.814453125
                          "ABW" "AGO"    9516.9130859375
                          "ABW" "AIA"  983.2682495117188
                          "AFG" "ABW"    13257.814453125
                          "AFG" "AFG"  303.7613525390625
                          "AFG" "AGO"    7607.7998046875
                          "AFG" "AIA"    12292.158203125
                          "AGO" "ABW"    9516.9130859375
                          "AGO" "AFG"    7607.7998046875
                          "AGO" "AGO"  419.9666442871094
                          "AGO" "AIA"    8901.4638671875
                          "AIA" "ABW"  983.2682495117188
                          "AIA" "AFG"    12292.158203125
                          "AIA" "AGO"    8901.4638671875
                          "AIA" "AIA" 3.7986903190612793
                          end

                          everything works fine. All the variables get calculated as coded. My original data (which is also attached in #11) is an exact replica of the dataex example above. I am clueless as to why things work out with the dataex example but not with the original file. That is why I thought it would be of great help if you could replicate this code (as in #10). Sorry for asking for something that is not your style of work.


                          Regards,
                          (Ridwan)

