  • Using a sampling weight to correct for unbalanced panel data

    Hi,

    I am conducting regressions on panel data. My y variable is wage.

    As the panel is unbalanced, the individuals contributing more wage observations carry more weight in the pooled sample, and the fact that they have more wage observations may be correlated with other variables.

    Is it easy to use a sampling weight equal to the inverse of the probability that the individual is included in the sample for all years? I have noticed the literature tends to do this to overcome the problem.

    If so, how is this done in Stata?

    Many thanks,

    Ella



  • #2
    Ella:
    see -help weight-.
    However, a preliminary step would be to investigate whether the missingness for some individuals is informative or not.
    Kind regards,
    Carlo
    (Stata 18.0 SE)
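
A first look at whether the number of observed waves is related to fixed characteristics could be sketched as follows. This uses Stata's nlswork example data rather than the data from the question, the names n_obs and tag are made up for this illustration, and a pattern found this way says nothing about unobserved reasons for missingness.

```stata
* illustration only: nlswork stands in for the real panel
webuse nlswork, clear

* number of observed waves per person (n_obs is a hypothetical name)
bysort idcode: gen n_obs = _N

* flag one row per person so that each woman counts once
egen byte tag = tag(idcode)

* do the observed waves differ by race and birth cohort?
regress n_obs i.race birth_yr if tag, vce(robust)
```

Clear differences across groups would suggest that the unbalancedness is not purely random with respect to those variables.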



    • #3
      Weights can be created using variables that are fully observed. In the case of panel attrition these could be variables that can reasonably be assumed to remain constant over time, like gender, race and birth year. In that case the weights adjust for differences in response probabilities between the genders, the races and the cohorts, but nothing else. So weights can be useful, but they are no magic solution to all problems.

      The example data below is only for women, so I can't use gender to create the weights.

      Code:
      . // open and look at example data
      . webuse nlswork, clear
      (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
      
      . xtset idcode year
             panel variable:  idcode (unbalanced)
              time variable:  year, 68 to 88, but with gaps
                      delta:  1 unit
      
      . xtdescribe
      
        idcode:  1, 2, ..., 5159                                   n =       4711
          year:  68, 69, ..., 88                                   T =         15
                 Delta(year) = 1 unit
                 Span(year)  = 21 periods
                 (idcode*year uniquely identifies each observation)
      
      Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                               1       1       3         5         9      13      15
      
           Freq.  Percent    Cum. |  Pattern
       ---------------------------+-----------------------
            136      2.89    2.89 |  1....................
            114      2.42    5.31 |  ....................1
             89      1.89    7.20 |  .................1.11
             87      1.85    9.04 |  ...................11
             86      1.83   10.87 |  111111.1.11.1.11.1.11
             61      1.29   12.16 |  ..............11.1.11
             56      1.19   13.35 |  11...................
             54      1.15   14.50 |  ...............1.1.11
             54      1.15   15.64 |  .......1.11.1.11.1.11
           3974     84.36  100.00 | (other patterns)
       ---------------------------+-----------------------
           4711    100.00         |  XXXXXX.X.XX.X.XX.X.XX
      
      .
      . // number of observed years per person
      . bys idcode : gen n = _N
      
      .
      . // look at the average number by birth_yr and race
      . table birth_yr race, c(mean n)
      
      ----------------------------------------
      birth     |             race            
      year      |    white     black     other
      ----------+-----------------------------
             41 | 13.47059         9          
             42 | 8.432494  9.248176          
             43 | 8.744643    9.6514         9
             44 | 9.061225  9.836489  12.72727
             45 | 9.015494  8.800344  9.692307
             46 |  8.95058  9.416667         2
             47 | 8.813744  9.242945  3.307692
             48 | 9.242283  9.464953  6.217391
             49 | 8.712466  8.680653  6.641026
             50 | 7.518686  8.045553  8.866667
             51 | 7.645431  7.824108  7.594594
             52 |     7.64   6.96837      5.24
             53 | 7.280087  6.877095  6.703704
             54 |                  7          
      ----------------------------------------
      
      .
      . // create a variable with those average numbers
      . bys race birth_yr : egen double mean_n = mean(n)
      
      .
      . // there are 15 waves, so the probability is mean_n/15
      . // the weight is 1/probability,
      . // so the weight is 15/mean_n
      . gen double w = 15/mean_n
      
      .
      . // look at the weights
      . table birth_yr race, c(mean w)
      
      -------------------------------------------
      birth     |              race              
      year      |     white      black      other
      ----------+--------------------------------
             41 | 1.1135371  1.6666667           
             42 | 1.7788331  1.6219416           
             43 | 1.7153359  1.5541788  1.6666667
             44 | 1.6554054  1.5249344  1.1785714
             45 | 1.6638022  1.7044788   1.547619
             46 | 1.6758691  1.5929204        7.5
             47 | 1.7018876  1.6228594  4.5348837
             48 | 1.6229757  1.5847939  2.4125874
             49 |  1.721671  1.7279807  2.2586873
             50 | 1.9950294  1.8643839  1.6917293
             51 | 1.9619562  1.9171514   1.975089
             52 | 1.9633508  2.1525838  2.8625954
             53 | 2.0604148  2.1811535  2.2375691
             54 |            2.1428571           
      -------------------------------------------
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------
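
Once a weight such as w exists, it can be passed to an estimation command as a probability weight (see -help weight-). The following is a minimal sketch that rebuilds the weights from #3 and uses them in an arbitrary wage equation. The regressors age and tenure are chosen only for illustration and are not a model proposed in this thread.

```stata
* rebuild the weights from #3 on the nlswork example data
webuse nlswork, clear
bysort idcode: gen n = _N
bysort race birth_yr: egen double mean_n = mean(n)
gen double w = 15/mean_n

* pooled regression with w as a probability weight;
* the choice of regressors is an assumption made for this sketch
regress ln_wage age tenure [pweight = w], vce(cluster idcode)
```

The standard errors are clustered on idcode because each woman contributes several rows to the pooled sample.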



      • #4
        Thanks for your response Maarten.

        I am trying to see the effect of being a mother on women's wages, and am considering panel attrition problems as non-response (i.e. no value for wage) might be related to the woman not working in a given year due to childcare responsibilities.

        I would like to test if this is the case and, if it is, to create a weight equal to the inverse of the probability that the individual is included in the data.

        So would I use your code to do this?

        And do I do this only once to the data, save it, and then run regressions, or would I have to do it every time?



        • #5
          I consider panel attrition as a special case of missing values where a respondent is absent from an entire wave, not just missing values on one particular variable. So I would not refer to your problem as panel attrition.

          If you want to know whether missing values on wage are due to not working, then your first step would be to consult the data manual or codebook that comes with your data, and check whether it describes how wage was coded for non-working respondents. If that is inconclusive, then you can look for a variable in your data that says whether or not the respondent works. If you have such a variable, then I would start with a simple cross tabulation of that variable with an indicator variable for whether or not the wage is missing. If that variable does not exist, then you just don't have the empirical information necessary to check that.
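
A cross tabulation of this kind might look like the sketch below. The data are a made-up toy panel, and the variable names works, wage and wage_missing are hypothetical stand-ins for whatever the real data set contains.

```stata
* toy data, invented purely for illustration
clear
input id year works wage
1 1 1 12.5
1 2 0 .
2 1 1 9.0
2 2 1 .
3 1 0 .
3 2 1 15.0
end

* indicator for a missing wage
gen byte wage_missing = missing(wage)

* cross tabulate employment status with the missing-wage indicator
tabulate works wage_missing, row
```

If nearly all missing wages fall in the works == 0 row, the missing values mostly reflect not working rather than refusal to answer.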

          If you want to ascertain whether the respondent is not working because of childcare responsibilities, then that needs to be asked of the respondent and the answers need to be recorded in the data as a variable. If that was done in your survey, great; if not, then there is nothing you can do.

          Creating weights is typically something you only do once.

          Weights can only adjust for the distribution of fully observed variables, so they are not a solution to your problem. Women not having wages due to not working is the classic example used for introducing the Heckman selection model. So that would be something to look into.
          Maarten L. Buis
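
For orientation, the Heckman selection model mentioned in #5 is available in Stata as -heckman-. The sketch below uses Stata's womenwk example data, not the data from the question. The selection variables married and children are an assumption of this example, and the choice of such variables has to be justified separately in any real application.

```stata
* example data in which wage is missing for women who do not work
webuse womenwk, clear

* outcome equation: wage on education and age
* selection equation: whether a wage is observed at all
heckman wage educ age, select(married children educ age)
```

Observations with a missing wage are treated as not selected and enter only the selection equation, so they are not dropped from the estimation.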



          • #6
            Okay, but if just one variable is missing, the whole row wouldn't be included in the regression, would it? So it has the same effect as panel attrition?



            • #7
              The mechanism is very different, and with that your options for dealing with it.
              Maarten L. Buis



              • #8
                So if data is missing across all variables, Stata ignores the observation.

                But if data is missing for one or more variables but not all, Stata does NOT ignore the observation.

                Is this correct?



                • #9
                  Ella:
                  no, it is not.
                  In both cases the observation will be listwise deleted by Stata.
                  Kind regards,
                  Carlo
                  (Stata 18.0 SE)



                  • #10
                    That is incorrect. Stata will ignore the observation if it has at least one missing value on any of the variables used in the command.

                    The mechanism I was referring to is the mechanism that led to the value being missing: e.g. refusing to answer a single question, the value not being applicable (wage when one does not have a wage), or not participating in a wave (the interviewer could not find you, or you refused to participate).

                    The "that" in "that is incorrect" refers to #8. I hadn't seen Carlo's response when typing that answer, and we are in agreement.
                    Maarten L. Buis
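
The listwise deletion described in #9 and #10 can be checked on a small made-up data set. The variables y, x1, x2 and used are invented for this sketch.

```stata
* toy data: rows 2 and 3 each have one missing regressor
clear
input y x1 x2
1 2 3
2 . 1
3 1 .
4 3 2
5 5 4
6 4 6
end

* any row with a missing value on y, x1 or x2 is dropped
regress y x1 x2

* number of observations actually used: 4 of the 6 rows
display e(N)

* e(sample) marks the rows that entered the estimation
gen byte used = e(sample)
list y x1 x2 used
```

The two rows with a single missing regressor are excluded just as a row with every variable missing would be.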
