  • calculating mean of subgroup of observations.

    Hello everyone,

    I have come quite far in Stata without help, but I just can't figure this one out.

    I am trying to calculate the mean of the wage growth for people. However, because I have data on multiple years, I am only interested in the wage growth before people go into self-employment, which all of them do. The variable to indicate this is pre_se (this is a 1 when they enter, otherwise its missing). I am also not interested in the first year, because the wage growth will obviously be 0 then. Variables:
    first_wage: the subject's wage during the first wave in which they take part.
    wave: the wave number. The survey (BHPS) conducted one questionnaire (wave) per year. A person does not have to take part from wave 1.
    paynu_dv: the net wage that the subject earns per year.
    pid: personal identifier; identifies the subject (stays the same across waves).

    My idea, which I cannot manage to put into Stata:

    bys pidp: gen delta_first_wage = (ln(paynu_dv) - ln(first_wage))/(ln(first_wage))

    mean delta_first_wage if wave != min(wave) & wave < wave(pre_se)

    Could you adapt it so I can implement it?

    thanks in advance!

  • #2
    No answers here, but some questions.

    1. Is your -bys pidp: gen delta_first_wage = (ln(paynu_dv) - ln(first_wage))/(ln(first_wage))- code doing what you want? It doesn't really make sense to me, but as I don't understand what you are trying to calculate, I can't really make a suggestion.

    2. What is the variable pre_se? "The variable to indicate this is pre_se (this is a 1 when they enter, otherwise its missing)." also does not make sense to me. It is a 1 when they enter what? The name pre_se suggests that it would be 1 in any wave before self-employment, but apparently that is not what you have done. Also, in general, 1/. variables are difficult to work with in Stata. 1/0 coding is much better and I suggest you change it to that. (If you don't, I almost certainly will in coding a solution to the problem.)

    3. Are you looking for a single mean value from the entire data set (restricted to the observations of interest)? Or do you want a mean for each person?

    4. If you want help with code, it is always best to show example data. Please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.


    • #3
      Hi Clyde,

      Thank you for taking the time to help me!

      1. bys pidp: gen delta_first_wage = (ln(paynu_dv) - ln(first_wage))/(ln(first_wage)) calculates the growth percentage of every year in comparison to the first wage they earned. I got this code from a friend but it should be adapted so it calculates the growth percentage of every year in comparison to the year (wave) before...

      2. pre_se is 1 only for the year before they enter self-employment (se). I will indeed change the . to a 0 if that makes the programming easier.

      3. I would like a new variable that holds the mean for every person, since it could come in handy for further research. However, if it is significantly easier to just calculate an overall mean, that will also do.

      4.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long pid float(wave first_wage pre_se paynu_dv)
      10243011  1    536.25 1    536.25
      10243011  2    536.25 0 1301.0007
      10243011  3    536.25 0 1301.0007
      10243011  4    536.25 0         .
      10273956  1 1184.9115 1 1184.9115
      10273956  4 1184.9115 0 1151.8861
      10273956  5 1184.9115 0 1551.1932
      10273956  6 1184.9115 0 1651.2703
      10273956  7 1184.9115 0 1801.3857
      10273956  8 1184.9115 0 1901.4626
      10273956  9 1184.9115 0  1899.461
      10300082  1  850.6544 0  850.6544
      10300082  4  850.6544 0  700.5389
      10300082  5  850.6544 0  797.6136
      10300082  6  850.6544 0  619.6667
      10300082  8  850.6544 0  950.7313
      10300082  9  850.6544 1  646.4973
      10300082 10  850.6544 0  873.6721
      10300082 11  850.6544 0  893.6874
      10300082 12  850.6544 0  987.7598
      10300082 13  850.6544 0  1269.946
      10300082 14  850.6544 0 1607.2363
      10300082 15  850.6544 0 1701.3087
      10300082 16  850.6544 0  1725.327
      10309071  1  495.9345 0  495.9345
      end
      label values pid pid
      label values paynu_dv ba_paynu_dv


      • #4
        So I think you want this:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long pid float(wave first_wage pre_se paynu_dv)
        10243011  1    536.25 1    536.25
        10243011  2    536.25 0 1301.0007
        10243011  3    536.25 0 1301.0007
        10243011  4    536.25 0         .
        10273956  1 1184.9115 1 1184.9115
        10273956  4 1184.9115 0 1151.8861
        10273956  5 1184.9115 0 1551.1932
        10273956  6 1184.9115 0 1651.2703
        10273956  7 1184.9115 0 1801.3857
        10273956  8 1184.9115 0 1901.4626
        10273956  9 1184.9115 0  1899.461
        10300082  1  850.6544 0  850.6544
        10300082  4  850.6544 0  700.5389
        10300082  5  850.6544 0  797.6136
        10300082  6  850.6544 0  619.6667
        10300082  8  850.6544 0  950.7313
        10300082  9  850.6544 1  646.4973
        10300082 10  850.6544 0  873.6721
        10300082 11  850.6544 0  893.6874
        10300082 12  850.6544 0  987.7598
        10300082 13  850.6544 0  1269.946
        10300082 14  850.6544 0 1607.2363
        10300082 15  850.6544 0 1701.3087
        10300082 16  850.6544 0  1725.327
        10309071  1  495.9345 0  495.9345
        end
        label values pid pid
        label values paynu_dv ba_paynu_dv
        
        xtset pid wave
        
        //    CALCULATE GROWTH RATE OVER PRECEDING WAGE
        //    NOTE THAT THIS IS ALWAYS MISSING IN FIRST WAVE OF A PERSON'S DATA
        by pid, sort: gen delta_wage = (ln(paynu_dv)-ln(L1.paynu_dv))/ln(L1.paynu_dv)
        
        //    CREATE AN INDICATOR FOR SELF-EMPLOYMENT
        by pid (wave): gen self_employed = sum(L1.pre_se)
        by pid (wave): replace self_employed = 0 if _n == 1
        
        //    CALCULATE MEAN OF DELTA FOR NON-SELF-EMPLOYED PERIODS, BY PERSON
        by pid: egen mean_delta = mean(cond(self_employed, ., delta_wage))
        
        //    AND OVERALL
        summ mean_delta if !self_employed

        Notes: By -xtset-ing the data we are able to use the L1 operator to refer to the previous wave's values of variables. Note, by the way, that because the formula for delta_wage refers to the lag of paynu_dv, it always has a missing value for the first observation in any person's data. Therefore, after that point, it is not necessary to make an explicit exclusion of the first observation in the calculations.

        I note that in your example data, many people have pre_se = 1 in the very first observation, which means that they have no non-self-employed observations of delta_wage to take a mean of. (The first observation has a missing value for delta_wage itself, and all subsequent observations are self-employed.)
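
        One caveat about waves with gaps, offered as a sketch and not as part of the original answer: L1.pre_se looks at wave - 1 exactly, so it is missing when that wave is absent, and sum() counts a missing value as zero. A person whose pre_se = 1 wave is followed by a gap would then never be flagged as self-employed. The toy data below reuse three rows of pid 10273956 from the example; the names se_lag and se_prev are hypothetical, and se_prev uses the previous observation instead of the previous wave.

        ```stata
        clear
        input long pid float(wave pre_se)
        10273956 1 1
        10273956 4 0
        10273956 5 0
        end
        xtset pid wave

        * Lag-based running sum: wave 3 does not exist, so L1.pre_se is missing at wave 4
        bysort pid (wave): gen se_lag = sum(L1.pre_se)

        * Previous-observation version: compares with the row before, whatever its wave
        bysort pid (wave): gen se_prev = sum(pre_se[_n-1] == 1)

        * se_lag stays 0 in every row; se_prev is 0 at wave 1 and 1 from wave 4 onward
        list, sepby(pid)
        ```

        Whether "the previous observation" or "the previous wave" is the right notion depends on how gaps in participation should be treated, which is a substantive choice.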


        • #5
          I executed the code and it is, unfortunately, not quite right. I want to be able to compare the wage growth of the subjects before their entrepreneurial venture and when they come back. At the moment it looks like not all wage growth before pre_se is included. You are right, in the previous dataex, most of the subjects went straight into self-employment from the wave they entered. That is why I will include a new dataex. Thus it is very important that only the wage growth (except indeed for the first wave) for the years before pre_se are included. I dont really understand the L1 variable so I dont know where it went wrong. Does this L1 refer to the first wave that the subject attended? My guess is that the first step was right but the second one not.

          Thank you again for helping me!

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long pid float(wave first pre_se paynu_dv)
          10017992 12 0 .  900.6929
          10017992 13 0 1         .
          10017992 14 0 . 1456.6676
          10017992 15 0 . 1501.1548
          10017992 16 0 .  1598.235
          10017992 17 0 .      1950
          10048308  8 0 . 238.33333
          10048308 14 0 .         .
          10048308 15 0 .         .
          10048308 16 0 1         .
          10048308 17 0 .         .
          10048308 18 0 .         .
          10060111  1 0 . 1019.7845
          10060111  2 0 . 1050.8083
          10060111  3 0 . 1183.9108
          10060111  4 0 . 1135.8738
          10060111  5 0 . 1181.9092
          10060111  6 0 .   1221.94
          10060111  7 0 .  1261.971
          10060111  8 0 . 1307.0054
          10060111  9 0 1  433.3333
          10060111 10 0 . 1429.0994
          10060111 11 0 . 1538.1832
          10060111 12 0 . 1602.2325
          10060111 13 0 . 1671.2856
          10060111 14 0 .  1491.147
          10060111 15 0 . 1775.3657
          10060111 16 0 .      84.5
          10079653  6 0 . 238.33333
          10079653  8 0 .  850.6544
          10079653 10 0 .  600.4619
          10079653 11 0 .  676.5204
          10079653 12 0 .  967.7444
          10079653 16 0 1 217.16705
          10079653 17 0 .  514.5833
          end
          label values pid pid
          label values paynu_dv ba_paynu_dv


          • #6
            I want to be able to compare the wage growth of the subjects before their entrepreneurial venture and when they come back.
            And what in the data identifies "when they come back?" You have said nothing about this up to now.

            At the moment it looks like not all wage growth before pre_se is included.
            Why not? I ran the same code with both the old and the new data examples, and checked the results with hand calculations of the means of all the pre self-employment wage growth and they are all correct. Can you give an example that comes out wrong?

            I dont really understand the L1 variable so I dont know where it went wrong. Does this L1 refer to the first wave that the subject attended? My guess is that the first step was right but the second one not.
            L1 is an operator, not a variable. It is used with data that has been -xtset- or -tsset- and L1.whatever refers to the value of whatever in the immediately preceding time period. Read -help xtset- and -help tsvarlist- for more information.
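
            To make that concrete, here is a minimal sketch on made-up data (the variables pay and prev_pay and all values are hypothetical, chosen only to show the behavior):

            ```stata
            clear
            input long pid float(wave pay)
            1 1 100
            1 2 110
            1 4 130
            2 1 200
            2 2 210
            end
            xtset pid wave

            * L1.pay is the value of pay one wave earlier for the same pid
            gen prev_pay = L1.pay

            * prev_pay is missing in each person's first wave and also at pid 1, wave 4,
            * because wave 3 is not in the data; it is 100 at pid 1 wave 2 and 200 at pid 2 wave 2
            list, sepby(pid)
            ```

            So L1 never refers to a person's first wave as such; it always means "one time period back" relative to the current observation.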


            • #7
              I already have got a growth variable for when they return to employment. Unfortunately, I didn't get the formula so I cannot adapt it to solve my problem for the growth before self-employment.
              I think I got a little bit confused by the mean_delta variable because indeed the total mean seems to be right.
              However, I just noticed that most of the subjects' wages drop significantly at pre_se. I think this is because they already enter self-employment in the pre_se wave and not the year after. It seems my coach gave me incorrect information. I am really sorry, but could you adapt the code so that it no longer includes the wave for which pre_se = 1?


              • #8
                Just change
                Code:
                by pid: egen mean_delta = mean(cond(self_employed, ., delta_wage))
                // TO
                by pid: egen mean_delta = mean(cond(self_employed | pre_se, ., delta_wage))


                • #9
                  I do not know why but then I get 0 observations.


                  • #10
                    Oh, that's a perfect example of the trouble you get into when you code a variable as 1/. instead of 1/0. Change the coding of pre_se to 1/0 and the code will perform correctly.
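
                    The reason is that Stata treats a missing value as "true" (nonzero) in a logical expression, so with 1/. coding the condition self_employed | pre_se holds in every observation and every delta_wage gets set to missing. A minimal sketch on made-up data (the three rows are hypothetical):

                    ```stata
                    clear
                    input float pre_se
                    .
                    1
                    .
                    end

                    * With 1/. coding, missing counts as true: all 3 observations pass the test
                    count if pre_se

                    * Recode to 1/0, then the same test picks out only the pre_se == 1 row
                    replace pre_se = 0 if missing(pre_se)
                    assert inlist(pre_se, 0, 1)
                    count if pre_se
                    ```

                    The first count reports 3 and the second reports 1. An alternative that leaves the coding alone is to write the test explicitly as pre_se == 1, which is false for missing values.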


                    • #11
                      Okay, I think the code is now correct! I have a deadline tomorrow so I will know soon enough. Thank you very much for helping me!


                      • #12
                        Aaaand I am already back. I also have a small problem with the code my coach gave me. He had set up a formula to calculate the wage growth from the moment someone entered entrepreneurship (pre_se) until he left and went back to employment (post_se). Now this gives a biased result, because when someone enters self-employment he will probably earn a lot less, so the wage growth will be huge when he returns to employment. To fix this, it is better to compare the last wage (from the wave before pre_se) with the one when he returns to employment (one wave after post_se)... Can this be coded?


                        • #13
                          You have never referred to a post_se variable before, although you have alluded to its existence. And it does not appear in any of your sample data. Please post a new example that includes this and illustrates the problem, using -dataex-, of course.


                          • #14
                            Code:
                            * Example generated by -dataex-. To install: ssc install dataex
                            clear
                            input long pid float(wave pre_se post_se paynu_dv)
                            10017992 12 0 .  900.6929
                            10017992 13 1 .         .
                            10017992 14 0 1 1456.6676
                            10017992 15 0 . 1501.1548
                            10017992 16 0 .  1598.235
                            10017992 17 0 .      1950
                            10048308  8 0 . 238.33333
                            10048308 14 0 .         .
                            10048308 15 0 .         .
                            10048308 16 1 .         .
                            10048308 17 0 1         .
                            10048308 18 0 .         .
                            10060111  1 0 . 1019.7845
                            10060111  2 0 . 1050.8083
                            10060111  3 0 . 1183.9108
                            10060111  4 0 . 1135.8738
                            10060111  5 0 . 1181.9092
                            10060111  6 0 .   1221.94
                            10060111  7 0 .  1261.971
                            10060111  8 0 . 1307.0054
                            10060111  9 1 .  433.3333
                            10060111 10 0 1 1429.0994
                            10060111 11 0 . 1538.1832
                            10060111 12 0 . 1602.2325
                            10060111 13 0 . 1671.2856
                            10060111 14 0 .  1491.147
                            10060111 15 0 . 1775.3657
                            10060111 16 0 .      84.5
                            10079653  6 0 . 238.33333
                            10079653  8 0 .  850.6544
                            10079653 10 0 .  600.4619
                            10079653 11 0 .  676.5204
                            10079653 12 0 .  967.7444
                            10079653 16 1 . 217.16705
                            10079653 17 0 1  514.5833
                            end
                            label values pid pid
                            label values paynu_dv ba_paynu_dv

                            sorry totally forgot it!


                            • #15
                              So, as with pre_se, it is better if post_se is coded 1/0 instead of 1/. The code below makes that change.

                              Also, this way of doing it will only be correct if each pid has at most one observation with pre_se = 1 and at most one observation with post_se = 1. This assumption is also verified in the code below.

                              Code:
                              replace post_se = 0 if missing(post_se)
                              
                              xtset pid wave
                              
                              //    VERIFY EACH PID HAS AT MOST ONE PRE-SE AND AT MOST ONE POST-SE OBSERVATION
                              foreach v of varlist pre_se post_se {
                                  assert inlist(`v', 0, 1)
                                  by pid, sort: gen sum_`v' = sum(`v')
                                  by pid: assert sum_`v' <= 1
                                  drop sum_`v'
                              }
                              
                              //    TEST "== 1" EXPLICITLY: F1./L1. VALUES ARE MISSING AT PANEL EDGES AND GAPS,
                              //    AND cond() TREATS A MISSING CONDITION AS TRUE
                              by pid (wave), sort: egen pre_se_pay = max(cond(F1.pre_se == 1, paynu_dv, .))
                              by pid (wave), sort: egen post_se_pay = max(cond(L1.post_se == 1, paynu_dv, .))
                              label var pre_se_pay "Wage before self-employment"
                              label var post_se_pay "Wage after return from self-employment"

                              This calculates two new variables, pre_se_pay and post_se_pay. For each pid, pre_se_pay will contain the value of paynu_dv from the wave immediately before the observation with pre_se = 1. post_se_pay is defined analogously, taking the value of paynu_dv that appears in the wave immediately after the observation with post_se = 1. Either variable is missing if that adjacent wave is absent from the data or its paynu_dv is missing. You indicated that you want to compare these, which could mean a lot of different things. So I'll let you take it from here to calculate whatever comparison between these is appropriate for your needs.
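
                              As one illustration of such a comparison (an assumption about what might be wanted, not something settled in this thread), the log difference between the two wages could be computed once the variables above exist. The names se_spell_growth and one_per_pid are hypothetical.

                              ```stata
                              * Assumes pre_se_pay and post_se_pay were created by the code above
                              gen se_spell_growth = ln(post_se_pay) - ln(pre_se_pay)

                              * Both wages are constant within pid, so summarize one observation per person
                              egen byte one_per_pid = tag(pid)
                              summarize se_spell_growth if one_per_pid
                              ```

                              A simple ratio or a percentage change would work the same way; the log difference is used here only because the earlier growth code in the thread is also based on logs.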
