Using a sampling weight to correct for unbalanced panel data

Ella Ki

Join Date: Mar 2017

Posts: 39
#1

10 Apr 2017, 20:19

Hi,

I am conducting regressions on panel data. My y variable is wage.

As the panel is unbalanced, the individuals contributing more wage observations carry more weight in the pooled sample, and the fact that they have more wage observations may be correlated with other variables.

Is it possible to use a sampling weight equal to the inverse of the probability that the individual is included in the sample for all years? I have noticed that the literature tends to do this to overcome the problem.

If so, how is this done in Stata?

Many thanks,

Ella
Tags: data, panel
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17072
#2

11 Apr 2017, 01:00

Ella:
see -help weight-.
However, a preliminary step would be to investigate whether the missingness for some individuals is informative or not.
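
A rough first look at whether the unbalancedness is informative could go as follows. This is only a sketch on Stata's nlswork example data, not Ella's data; the variable names n_obs and first are made up for illustration, and checks like these can only reveal relationships with variables that are actually observed.

```stata
* sketch only: example data, illustrative variable names
webuse nlswork, clear

* number of waves in which each woman is observed
bysort idcode: generate n_obs = _N

* mark one row per woman, so that each person counts once
egen byte first = tag(idcode)

* does the number of observed waves differ by time-constant characteristics?
tabulate race if first, summarize(n_obs)
regress n_obs i.race i.birth_yr if first

* do women with more observed waves have systematically different wages?
regress ln_wage n_obs, vce(cluster idcode)
```

If the number of observed waves is clearly related to the covariates or to the outcome, that would suggest the missingness is not ignorable.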

Kind regards,
Carlo
(Stata 18.0 SE)

Maarten Buis

Join Date: Mar 2014
Posts: 3255
#3

11 Apr 2017, 01:45

Weights can be created using variables that are fully observed. In the case of panel attrition, these could be variables that can reasonably be assumed to remain constant over time, like gender, race and birth year. In that case the weights adjust for differences in response probabilities between the genders, the races and the cohorts, but nothing else. So weights can be useful, but they are no magic solution to all problems.

The example data below is only for women, so I can't use gender to create the weights.

Code:

. // open and look at example data
. webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. xtset idcode year
       panel variable:  idcode (unbalanced)
        time variable:  year, 68 to 88, but with gaps
                delta:  1 unit

. xtdescribe

  idcode:  1, 2, ..., 5159                                   n =       4711
    year:  68, 69, ..., 88                                   T =         15
           Delta(year) = 1 unit
           Span(year)  = 21 periods
           (idcode*year uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       3         5         9      13      15

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-----------------------
      136      2.89    2.89 |  1....................
      114      2.42    5.31 |  ....................1
       89      1.89    7.20 |  .................1.11
       87      1.85    9.04 |  ...................11
       86      1.83   10.87 |  111111.1.11.1.11.1.11
       61      1.29   12.16 |  ..............11.1.11
       56      1.19   13.35 |  11...................
       54      1.15   14.50 |  ...............1.1.11
       54      1.15   15.64 |  .......1.11.1.11.1.11
     3974     84.36  100.00 | (other patterns)
 ---------------------------+-----------------------
     4711    100.00         |  XXXXXX.X.XX.X.XX.X.XX

.
. // number of observed years per person
. bys idcode : gen n = _N

.
. // look at the average number by birth_yr and race
. table birth_yr race, c(mean n)

----------------------------------------
birth     |             race            
year      |    white     black     other
----------+-----------------------------
       41 | 13.47059         9          
       42 | 8.432494  9.248176          
       43 | 8.744643    9.6514         9
       44 | 9.061225  9.836489  12.72727
       45 | 9.015494  8.800344  9.692307
       46 |  8.95058  9.416667         2
       47 | 8.813744  9.242945  3.307692
       48 | 9.242283  9.464953  6.217391
       49 | 8.712466  8.680653  6.641026
       50 | 7.518686  8.045553  8.866667
       51 | 7.645431  7.824108  7.594594
       52 |     7.64   6.96837      5.24
       53 | 7.280087  6.877095  6.703704
       54 |                  7          
----------------------------------------

.
. // create a variable with those average numbers
. bys race birth_yr : egen double mean_n = mean(n)

.
. // there are 15 waves, so the probability is mean_n/15
. // the weight is 1/probability,
. // so the weight is 15/mean_n
. gen double w = 15/mean_n

.
. // look at the weights
. table birth_yr race, c(mean w)

-------------------------------------------
birth     |              race              
year      |     white      black      other
----------+--------------------------------
       41 | 1.1135371  1.6666667           
       42 | 1.7788331  1.6219416           
       43 | 1.7153359  1.5541788  1.6666667
       44 | 1.6554054  1.5249344  1.1785714
       45 | 1.6638022  1.7044788   1.547619
       46 | 1.6758691  1.5929204        7.5
       47 | 1.7018876  1.6228594  4.5348837
       48 | 1.6229757  1.5847939  2.4125874
       49 |  1.721671  1.7279807  2.2586873
       50 | 1.9950294  1.8643839  1.6917293
       51 | 1.9619562  1.9171514   1.975089
       52 | 1.9633508  2.1525838  2.8625954
       53 | 2.0604148  2.1811535  2.2375691
       54 |            2.1428571           
-------------------------------------------
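
For anyone who wants to rerun this, the steps in the log can be collected in a do-file, followed by one possible way of using the weight. This is a sketch under assumptions: the regression specification is purely illustrative and was not part of the original example, and the `statistic()` form of -table- is what the newer syntax (Stata 17 and later) should look like.

```stata
* do-file version of the log above (same data, same variable names)
webuse nlswork, clear
bysort idcode: generate n = _N                      // observed waves per woman
bysort race birth_yr: egen double mean_n = mean(n)  // average per race/cohort cell
generate double w = 15/mean_n                       // inverse of mean_n/15

* the log uses the older -table- syntax, c(mean w);
* in Stata 17 and later the equivalent should be:
* table birth_yr race, statistic(mean w)

* illustrative model (an assumption, not from the thread):
* pooled regression with w as a probability weight,
* standard errors clustered on the person
regress ln_wage age c.age#c.age tenure i.race [pweight = w], vce(cluster idcode)
```

The weight w is constant within each race by birth-year cell, so it only corrects for differences in the number of observed waves between those cells.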

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------


Ella Ki

Join Date: Mar 2017
Posts: 39
#4

12 Apr 2017, 17:57

Thanks for your response Maarten.

I am trying to estimate the effect of being a mother on women's wages, and I am concerned about panel attrition, because non-response (i.e. no value for wage) might be related to the woman not working in a given year due to childcare responsibilities.

I would like to test whether this is the case and, if it is, to create a weight equal to the inverse of the probability that the individual is included in the data.

So would I use your code to do this?

And do I do this only once to the data, save it, and then run regressions, or would I have to do it every time?


Maarten Buis

Join Date: Mar 2014

Posts: 3255
#5

13 Apr 2017, 01:39

I consider panel attrition as a special case of missing values where a respondent is absent from an entire wave, not just missing values on one particular variable. So I would not refer to your problem as panel attrition.

If you want to know whether missing values on wage are due to not working, then your first step would be to consult the data manual or code book that comes with your data, and look at whether they describe how wage was coded for non-working respondents. If that is inconclusive, then you can look for a variable in your data that says whether or not the respondent works. If you have such a variable, then I would start with just a cross tabulation of that variable with an indicator variable for whether or not the wage is missing. If that variable does not exist, then you just don't have the empirical information necessary to check that.
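
Such a cross tabulation might look like the sketch below. The small dataset is made up purely for illustration, and the variable names works and miss_wage are hypothetical; in real data you would use whatever employment-status variable the code book documents.

```stata
* toy data, invented for illustration only
clear
input id year works wage
1 1 1 12.5
1 2 0 .
1 3 1 13.0
2 1 1  9.0
2 2 1 .
2 3 0 .
3 1 0 .
3 2 1 15.2
end

* indicator: 1 if wage is missing, 0 otherwise
generate byte miss_wage = missing(wage)

label define yesno 0 "no" 1 "yes"
label values works miss_wage yesno

* row percentages show how often wage is missing among workers and non-workers
tabulate works miss_wage, row
```

If nearly all missing wages fall in the non-working row, the missingness is mostly a matter of wage not being applicable rather than of respondents refusing to answer.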

If you want to ascertain whether the respondent is not working because of childcare responsibilities, then the respondent needs to have been asked that, and the answers need to be recorded in the data as a variable. If that was done in your survey, great; if not, then there is nothing you can do.

Creating weights is typically something you only do once.

Weights can only adjust for the distribution of fully observed variables, so they are not a solution to your problem. Women not having wages due to not working is the classic example used for introducing the Heckman selection model. So that would be something to look into.
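
As a starting point, here is a minimal sketch of a Heckman selection model using Stata's womenwk example data. It is cross-sectional, so it ignores the panel structure, and it is not Ella's data. The specification follows the standard textbook example and assumes that marital status and children affect whether a woman works but not her wage; for a study of the effect of motherhood on wages that assumption would probably not be acceptable, so a different exclusion restriction would be needed.

```stata
* example data in which wage is missing for women who do not work
webuse womenwk, clear

* how many women have no observed wage?
count if missing(wage)

* wage equation: education and age
* selection equation: marital status, children, education and age
heckman wage educ age, select(married children educ age)
```

The output reports both equations, together with a test of whether the two are independent (rho = 0).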

Ella Ki

Join Date: Mar 2017

Posts: 39
#6

13 Apr 2017, 06:59

Okay, but if just one variable is missing, the whole row wouldn't be included in the regression, would it? So it has the same effect as panel attrition?
Maarten Buis

Join Date: Mar 2014

Posts: 3255
#7

13 Apr 2017, 08:10

The mechanism is very different, and so are your options for dealing with it.

Ella Ki

Join Date: Mar 2017

Posts: 39
#8

13 Apr 2017, 08:14

So if data are missing across all variables, Stata ignores the observation.

But if data are missing for one or more variables but not all, Stata does NOT ignore the observation.

Is this correct?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17072
#9

13 Apr 2017, 08:43

Ella:
no, it is not.
In both cases the observation will be listwise deleted by Stata.
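
A quick way to see this for yourself, sketched with Stata's auto example data (the variable name used is made up for illustration):

```stata
sysuse auto, clear

* rep78 is missing for 5 of the 74 cars; price and mpg are complete
misstable summarize price mpg rep78

* the regression silently drops the 5 cars with missing rep78
regress price mpg rep78
display e(N)    // 69

* flag the observations that were actually used in the estimation
generate byte used = e(sample)
tabulate used
```

The same happens in panel data: a person-year with a missing value on any variable in the model is dropped, whether or not the other variables were observed.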

Kind regards,
Carlo
(Stata 18.0 SE)
Maarten Buis

Join Date: Mar 2014

Posts: 3255
#10

13 Apr 2017, 08:45

That is incorrect. Stata will ignore the observation if it has at least one missing value on any of the variables in the model.

The mechanism I was referring to is the mechanism that led to the value being missing: e.g. refusing to answer a single question, the value not being applicable (wage when one does not have a wage), or not participating in a wave (the interviewer could not find you, or you refused to participate).

The "that" in "That is incorrect" refers to #8. I hadn't seen Carlo's response when I typed that answer; we are in agreement.
