calculating mean of subgroup of observations.

Emiel Stuckens

Join Date: Apr 2018

Posts: 10
#1


23 Apr 2018, 06:39

Hello everyone,

I got quite far in Stata without help, but I just can't figure this one out.

I am trying to calculate the mean of the wage growth for people. However, because I have data on multiple years, I am only interested in the wage growth before people go into self-employment, which all of them do. The variable that indicates this is pre_se (it is 1 when they enter, otherwise it is missing). I am also not interested in the first year, because the wage growth will obviously be 0 then. Variables:
first_wage: the subject's wage during the first wave that they take part in.
wave: indicates the wave number. The survey (BHPS) conducted one questionnaire (wave) per year. A person does not have to take part from wave 1.
paynu_dv: the net wage that the subject earns per year.
pid: personal identifier; identifies the subject (stays the same over the waves).

My idea, which I cannot manage to put into Stata:

bys pidp: gen delta_first_wage = (ln(paynu_dv) - ln(first_wage))/(ln(first_wage))

mean delta_first_wage if wave != min(wave) & wave < wave(pre_se)

Could you adapt it so I can implement it?

Thanks in advance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#2

23 Apr 2018, 11:25

No answers here, but some questions.

1. Is your -bys pidp: gen delta_first_wage = (ln(paynu_dv) - ln(first_wage))/(ln(first_wage))- code doing what you want? It doesn't really make sense to me, but as I don't understand what you are trying to calculate, I can't really make a suggestion.

2. What is the variable pre_se? "The variable to indicate this is pre_se (this is a 1 when they enter, otherwise its missing)." also does not make sense to me. It is a 1 when they enter what? The name pre_se suggests that it would be 1 in any wave before self-employment, but apparently that is not what you have done. Also, in general, 1/. variables are difficult to work with in Stata. 1/0 coding is much better and I suggest you change it to that. (If you don't, I almost certainly will in coding a solution to the problem.)

3. Are you looking for a single mean value from the entire data set (restricted to the observations of interest)? Or do you want a mean for each person?

4. If you want help with code, it is always best to show example data. Please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.

Emiel Stuckens

Join Date: Apr 2018
Posts: 10

24 Apr 2018, 02:10

Hi Clyde,

Thank you for taking the time to help me!

1. bys pidp: gen delta_first_wage = (ln(paynu_dv) - ln(first_wage))/(ln(first_wage)) calculates the growth percentage of every year in comparison to the first wage they earned. I got this code from a friend but it should be adapted so it calculates the growth percentage of every year in comparison to the year (wave) before...

2. pre_se is 1 only for the year before they enter self-employment (se). I will indeed change the . to a 0 if that makes the programming easier.

3. I would like a new variable that holds a mean for every person, since it could come in handy for further research. However, if it is significantly easier to just calculate an overall mean, that will also do.

4.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long pid float(wave first_wage pre_se paynu_dv)
10243011  1    536.25 1    536.25
10243011  2    536.25 0 1301.0007
10243011  3    536.25 0 1301.0007
10243011  4    536.25 0         .
10273956  1 1184.9115 1 1184.9115
10273956  4 1184.9115 0 1151.8861
10273956  5 1184.9115 0 1551.1932
10273956  6 1184.9115 0 1651.2703
10273956  7 1184.9115 0 1801.3857
10273956  8 1184.9115 0 1901.4626
10273956  9 1184.9115 0  1899.461
10300082  1  850.6544 0  850.6544
10300082  4  850.6544 0  700.5389
10300082  5  850.6544 0  797.6136
10300082  6  850.6544 0  619.6667
10300082  8  850.6544 0  950.7313
10300082  9  850.6544 1  646.4973
10300082 10  850.6544 0  873.6721
10300082 11  850.6544 0  893.6874
10300082 12  850.6544 0  987.7598
10300082 13  850.6544 0  1269.946
10300082 14  850.6544 0 1607.2363
10300082 15  850.6544 0 1701.3087
10300082 16  850.6544 0  1725.327
10309071  1  495.9345 0  495.9345
end
label values pid pid
label values paynu_dv ba_paynu_dv


Clyde Schechter

Join Date: Apr 2014
Posts: 30119

24 Apr 2018, 08:31

So I think you want this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long pid float(wave first_wage pre_se paynu_dv)
10243011  1    536.25 1    536.25
10243011  2    536.25 0 1301.0007
10243011  3    536.25 0 1301.0007
10243011  4    536.25 0         .
10273956  1 1184.9115 1 1184.9115
10273956  4 1184.9115 0 1151.8861
10273956  5 1184.9115 0 1551.1932
10273956  6 1184.9115 0 1651.2703
10273956  7 1184.9115 0 1801.3857
10273956  8 1184.9115 0 1901.4626
10273956  9 1184.9115 0  1899.461
10300082  1  850.6544 0  850.6544
10300082  4  850.6544 0  700.5389
10300082  5  850.6544 0  797.6136
10300082  6  850.6544 0  619.6667
10300082  8  850.6544 0  950.7313
10300082  9  850.6544 1  646.4973
10300082 10  850.6544 0  873.6721
10300082 11  850.6544 0  893.6874
10300082 12  850.6544 0  987.7598
10300082 13  850.6544 0  1269.946
10300082 14  850.6544 0 1607.2363
10300082 15  850.6544 0 1701.3087
10300082 16  850.6544 0  1725.327
10309071  1  495.9345 0  495.9345
end
label values pid pid
label values paynu_dv ba_paynu_dv

xtset pid wave

//    CALCULATE GROWTH RATE OVER PRECEDING WAGE
//    NOTE THAT THIS IS ALWAYS MISSING IN FIRST WAVE OF A PERSON'S DATA
by pid, sort: gen delta_wage = (ln(paynu_dv)-ln(L1.paynu_dv))/ln(L1.paynu_dv)

//    CREATE AN INDICATOR FOR SELF-EMPLOYMENT
by pid (wave): gen self_employed = sum(L1.pre_se)
by pid (wave): replace self_employed = 0 if _n == 1

//    CALCULATE MEAN OF DELTA FOR NON-SELF-EMPLOYED PERIODS, BY PERSON
by pid: egen mean_delta = mean(cond(self_employed, ., delta_wage))

//    AND OVERALL
summ mean_delta if !self_employed

Notes: By -xtset-ing the data we are able to use the L1 operator to refer to the previous wave's values of variables. Note, by the way, that because the formula for delta_wage refers to the lag of paynu_dv, delta_wage is always missing for the first observation in any person's data. Therefore, after that point, it is not necessary to make an explicit exclusion of the first observation in the calculations.

I note that in your example data, many people have pre_se = 1 in their very first observation, which means that they have no non-self-employed observations of delta_wage to take a mean of. (The first observation has a missing value for delta_wage itself, and all subsequent observations are self-employed.)


Emiel Stuckens

Join Date: Apr 2018
Posts: 10

24 Apr 2018, 09:32

I executed the code and it is, unfortunately, not quite right. I want to be able to compare the wage growth of the subjects before their entrepreneurial venture and when they come back. At the moment it looks like not all wage growth before pre_se is included. You are right that in the previous dataex most of the subjects went straight into self-employment from the wave they entered, so I will include a new dataex. It is very important that only the wage growth for the years before pre_se is included (except, indeed, for the first wave). I don't really understand the L1 variable, so I don't know where it went wrong. Does this L1 refer to the first wave that the subject attended? My guess is that the first step was right but the second one was not.

Thank you again for helping me!

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long pid float(wave first pre_se paynu_dv)
10017992 12 0 .  900.6929
10017992 13 0 1         .
10017992 14 0 . 1456.6676
10017992 15 0 . 1501.1548
10017992 16 0 .  1598.235
10017992 17 0 .      1950
10048308  8 0 . 238.33333
10048308 14 0 .         .
10048308 15 0 .         .
10048308 16 0 1         .
10048308 17 0 .         .
10048308 18 0 .         .
10060111  1 0 . 1019.7845
10060111  2 0 . 1050.8083
10060111  3 0 . 1183.9108
10060111  4 0 . 1135.8738
10060111  5 0 . 1181.9092
10060111  6 0 .   1221.94
10060111  7 0 .  1261.971
10060111  8 0 . 1307.0054
10060111  9 0 1  433.3333
10060111 10 0 . 1429.0994
10060111 11 0 . 1538.1832
10060111 12 0 . 1602.2325
10060111 13 0 . 1671.2856
10060111 14 0 .  1491.147
10060111 15 0 . 1775.3657
10060111 16 0 .      84.5
10079653  6 0 . 238.33333
10079653  8 0 .  850.6544
10079653 10 0 .  600.4619
10079653 11 0 .  676.5204
10079653 12 0 .  967.7444
10079653 16 0 1 217.16705
10079653 17 0 .  514.5833
end
label values pid pid
label values paynu_dv ba_paynu_dv


Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#6

24 Apr 2018, 10:27

"I want to be able to compare the wage growth of the subjects before their entrepreneurial venture and when they come back."

And what in the data identifies "when they come back?" You have said nothing about this up to now.

"At the moment it looks like not all wage growth before pre_se is included."

Why not? I ran the same code with both the old and the new data examples, and checked the results with hand calculations of the means of all the pre self-employment wage growth and they are all correct. Can you give an example that comes out wrong?

"I don't really understand the L1 variable so I don't know where it went wrong. Does this L1 refer to the first wave that the subject attended? My guess is that the first step was right but the second one not."

L1 is an operator, not a variable. It is used with data that has been -xtset- or -tsset- and L1.whatever refers to the value of whatever in the immediately preceding time period. Read -help xtset- and -help tsvarlist- for more information.
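
As a rough illustration, here is a minimal sketch on a made-up toy panel. The variables id, y and prev_y are hypothetical and have nothing to do with your data. It shows that L1 looks one wave back in time, not one row back, so the result is missing in a person's first wave and also right after a gap in the waves.

Code:

* Toy panel for illustration only (hypothetical data)
clear
input long id float(wave y)
1 1 10
1 2 12
1 4 15
2 3 20
2 4 18
end
xtset id wave

* prev_y holds the value of y in the immediately preceding wave of the same id
gen prev_y = L1.y
list, sepby(id)

* Checks: wave 2 of id 1 sees wave 1; wave 4 of id 1 has no wave 3 to look back to
assert prev_y == 10 if id == 1 & wave == 2
assert missing(prev_y) if id == 1 & wave == 4
assert missing(prev_y) if id == 2 & wave == 3

Here prev_y is missing in the first wave of each id and in wave 4 of id 1, because wave 3 does not exist for that id.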
Emiel Stuckens

Join Date: Apr 2018

Posts: 10
#7

25 Apr 2018, 01:55

I already have a growth variable for when they return to employment. Unfortunately, I didn't understand the formula, so I cannot adapt it to solve my problem for the growth before self-employment.
I think I got a little bit confused by the mean_delta variable, because the total mean does indeed seem to be right.
However, I just noticed that most of the subjects' wages drop significantly at pre_se. I think this is because they already enter self-employment in the pre_se wave and not the year after. It seems my coach gave me false information. I am really sorry, but could you adapt the code so that it no longer includes the wave for which pre_se = 1?

Clyde Schechter

Join Date: Apr 2014
Posts: 30119

25 Apr 2018, 08:28

Just change

Code:

by pid: egen mean_delta = mean(cond(self_employed, ., delta_wage))
// TO
by pid: egen mean_delta = mean(cond(self_employed | pre_se, ., delta_wage))


Emiel Stuckens

Join Date: Apr 2018

Posts: 10
#9

25 Apr 2018, 08:53

I do not know why but then I get 0 observations.
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#10

25 Apr 2018, 09:15

Oh, that's a perfect example of the trouble you get into when you code a variable as 1/. instead of 1/0. Change the coding of pre_se to 1/0 and the code will perform correctly.
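
A minimal sketch of why this happens, using made-up toy data (excluded_before and excluded_after are hypothetical names used only here): Stata treats a missing value as true in a logical expression, so self_employed | pre_se evaluates to 1 wherever pre_se is missing, and every observation ends up excluded from the mean.

Code:

* Toy data for illustration only (hypothetical values)
clear
input float(self_employed pre_se)
0 .
0 1
1 .
end

* With 1/. coding, the missing pre_se counts as true, so every row is flagged
gen excluded_before = self_employed | pre_se

* Recode to 1/0 and evaluate the same condition again
replace pre_se = 0 if missing(pre_se)
gen excluded_after = self_employed | pre_se
list

assert excluded_before == 1
assert excluded_after == 0 in 1
assert excluded_after == 1 in 2/3

In your own data, the only line you should need is the -replace pre_se = 0 if missing(pre_se)- step, run before the -egen- command.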
Emiel Stuckens

Join Date: Apr 2018

Posts: 10
#11

25 Apr 2018, 09:28

Okay, I think the code is now correct! I have a deadline tomorrow so I will know soon enough. Thank you very much for helping me!
Emiel Stuckens

Join Date: Apr 2018

Posts: 10
#12

25 Apr 2018, 09:41

Aaaand I am already back. I also have a small problem with the code my coach gave me. He set up a formula to calculate the wage growth from the moment someone entered entrepreneurship (pre_se) until they left and went back to employment (post_se). This gives a biased result, because when someone enters self-employment they will probably earn a lot less, so the wage growth will be huge when they return to employment. To fix this, it is better to compare the last wage (from the wave before pre_se) with the one when they return to employment (one wave after post_se)... Can this be coded?
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#13

25 Apr 2018, 09:53

You have never referred to a post_se variable before, although you have alluded to its existence. And it does not appear in any of your sample data. Please post a new example that includes this and illustrates the problem, using -dataex-, of course.

Emiel Stuckens

Join Date: Apr 2018
Posts: 10

#14

25 Apr 2018, 11:56

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long pid float(wave pre_se post_se paynu_dv)
10017992 12 0 .  900.6929
10017992 13 1 .         .
10017992 14 0 1 1456.6676
10017992 15 0 . 1501.1548
10017992 16 0 .  1598.235
10017992 17 0 .      1950
10048308  8 0 . 238.33333
10048308 14 0 .         .
10048308 15 0 .         .
10048308 16 1 .         .
10048308 17 0 1         .
10048308 18 0 .         .
10060111  1 0 . 1019.7845
10060111  2 0 . 1050.8083
10060111  3 0 . 1183.9108
10060111  4 0 . 1135.8738
10060111  5 0 . 1181.9092
10060111  6 0 .   1221.94
10060111  7 0 .  1261.971
10060111  8 0 . 1307.0054
10060111  9 1 .  433.3333
10060111 10 0 1 1429.0994
10060111 11 0 . 1538.1832
10060111 12 0 . 1602.2325
10060111 13 0 . 1671.2856
10060111 14 0 .  1491.147
10060111 15 0 . 1775.3657
10060111 16 0 .      84.5
10079653  6 0 . 238.33333
10079653  8 0 .  850.6544
10079653 10 0 .  600.4619
10079653 11 0 .  676.5204
10079653 12 0 .  967.7444
10079653 16 1 . 217.16705
10079653 17 0 1  514.5833
end
label values pid pid
label values paynu_dv ba_paynu_dv

Sorry, I totally forgot it!

Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#15

25 Apr 2018, 12:08

So, as with pre_se, it is better if post_se is coded 1/0 instead of 1/. The code below makes that change.

Also, this way of doing it will only be correct if each pid has at most one observation with pre_se = 1 and at most one observation with post_se = 1. This assumption is also verified in the code below.

Code:

replace post_se = 0 if missing(post_se)
xtset pid wave

//    VERIFY EACH PID HAS AT MOST ONE PRE-SE AND AT MOST ONE POST-SE OBSERVATION
foreach v of varlist pre_se post_se {
    assert inlist(`v', 0, 1)
    by pid, sort: gen sum_`v' = sum(`v')
    by pid: assert sum_`v' <= 1
    drop sum_`v'
}

//    THE == 1 TESTS KEEP A MISSING LEAD OR LAG (FIRST/LAST WAVE, OR A GAP)
//    FROM BEING TREATED AS TRUE BY cond()
by pid (wave), sort: egen pre_se_pay = max(cond(F1.pre_se == 1, paynu_dv, .))
by pid (wave), sort: egen post_se_pay = max(cond(L.post_se == 1, paynu_dv, .))
label var pre_se_pay "Wage before self-employment"
label var post_se_pay "Wage after return from self_employment"

This calculates two new variables, pre_se_pay and post_se_pay. For each pid, pre_se_pay will contain the value of paynu_dv from the wave immediately before the observation with pre_se = 1. post_se_pay is defined analogously, taking the value of paynu_dv that appears in the wave immediately after the observation with post_se = 1. You indicated that you want to compare these, which could mean a lot of different things. So I'll let you take it from here to calculate whatever comparison between these is appropriate for your needs.
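
Purely as an illustration of what one such comparison could look like, here is a sketch that takes the difference in logs between the two wages. This choice is an assumption, not something specified in this thread, and both the toy values and the name se_log_change are hypothetical.

Code:

* Toy values for illustration only (hypothetical data)
clear
input long pid float(pre_se_pay post_se_pay)
1 1000 1100
2  800    .
end

* Log change from the last pre-self-employment wage to the first wage after returning
gen se_log_change = ln(post_se_pay) - ln(pre_se_pay)
list

* A person with either wage missing gets a missing comparison
assert missing(se_log_change) if pid == 2
assert se_log_change > 0 if pid == 1

Any person for whom either wage is missing gets a missing value, so it is worth checking how many people are actually left in the comparison.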