
  • Dropping certain observations, sample selection bias

    Hi,

    I have an unbalanced panel dataset (N=2976, T=13), using survey responses.
    My dependent variable is the household's ability to save (saving=1 if able to save, 0 otherwise), and I intend to use -xtprobit, re- to run my model.
    hhid is the Household's unique identifier, and the data is yearly.

    Code:
    . xtset hhid year
           panel variable:  hhid (unbalanced)
            time variable:  year, 2004 to 2016, but with gaps
                    delta:  1 unit
    
    .
    . xtdes
    
        hhid:  6, 21, ..., 89972                                 n =       3316
        year:  2004, 2005, ..., 2016                             T =         13
               Delta(year) = 1 unit
               Span(year)  = 13 periods
               (hhid*year uniquely identifies each observation)
    
    Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                             1       1       1         3         6      13      13
    
         Freq.  Percent    Cum. |  Pattern
     ---------------------------+---------------
          280      8.44    8.44 |  ...........11
          247      7.45   15.89 |  ............1
          211      6.36   22.26 |  1111111111111
          164      4.95   27.20 |  1............
           95      2.86   30.07 |  ..........111
           81      2.44   32.51 |  ...........1.
           80      2.41   34.92 |  ..........1..
           77      2.32   37.24 |  .1...........
           74      2.23   39.48 |  11...........
         2007     60.52  100.00 | (other patterns)
     ---------------------------+---------------
         3316    100.00         |  XXXXXXXXXXXXX
    The variable -position- tells me the position in the household of the interviewee:
    Code:
    . codebook position
    
    ----------------------------------------------------------------------------------
    position                                                 position in the household
    ----------------------------------------------------------------------------------
    
                      type:  numeric (double)
                     label:  positie
    
                     range:  [1,7]                        units:  1
             unique values:  7                        missing .:  1/14,145
    
                tabulation:  Freq.   Numeric  Label
                            13,217         1  head of the household
                               684         2  spouse
                               225         3  permanent partner (not married)
                                10         4  parent (in law)
                                 3         5  child living at home
                                 2         6  housemate
                                 3         7  family member or border
                                 1         .
    I would like to drop those who are not household heads: the literature I am basing my work on uses data solely from household heads, and I think the financial data (e.g. the amount saved) is likely to be more accurate when reported by household heads, who may be better informed than, say, their children about the household's financial affairs.

    Code:
    . drop if (position==2 | position==3 | position==4 | position==5 | position==6 | p
    > osition==7 | position==.)
    (928 observations deleted)
    Q1: I wonder, have I now biased my sample by dropping observations that were not household heads?
    Q2: If there is sample selection bias, please could you recommend how I may test for it? Is there a t-test that I could conduct, for example, to compare the difference in means before and after dropping observations?
    Q3: Would you recommend that I look into Heckman models, or is Heckman not relevant here?

    Many thanks
    Last edited by Rose Simmons; 11 Apr 2017, 05:01. Reason: I added that I intend to use -xtprobit, re- to run my model

  • #2
    If the literature you're working with uses household-head data, and has good arguments for doing so, then your dropping non-household-head observations would reduce rather than create bias.
    You could do t tests to identify differences, e.g., in reported income, with a variant of the following (see also: http://www.stata.com/manuals13/rttest.pdf):
    Code:
    sysuse auto
    ttest price, by(foreign)
    That said, differences in some variables do not mean you should not make a selection; as noted, the literature may have provided good reasons for a certain selection.

    It is difficult to judge whether or not you should look into -heckman- without knowing what sort of research question you are after, but it is not very likely to be needed when the selection is something like household member type.



    • #3
      Thank you for your helpful reply Jorrit Gosens

      You could do t tests to identify differences
      I have conducted a t-test to compare whether -saving- (my dependent variable, which measures the household's ability to save) varies according to household position.
      To do this test, I ran my do file again, but did not run the -drop- command shown in #1. Instead, I created a dummy variable to separate household heads from non-household heads. I hope this approach is correct:
      Code:
      . recode position (2=0) (3=0) (4=0) (5=0) (6=0) (7=0) (.=0)
      (position: 928 changes made)
      
      . tab position
      
            position in the household |      Freq.     Percent        Cum.
      --------------------------------+-----------------------------------
                                    0 |        928        6.56        6.56
                head of the household |     13,217       93.44      100.00
      --------------------------------+-----------------------------------
                                Total |     14,145      100.00
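      An alternative that would leave -position- untouched is to generate a separate 0/1 indicator, along these lines (a minimal sketch; here missing -position- is coded 0, mirroring the recode above):
      Code:
      * sketch: create a head-of-household dummy instead of recoding -position- in place
      generate byte head = (position == 1)    // 1 = head of the household, 0 otherwise (incl. missing)
      label define headlbl 0 "not head" 1 "head of the household"
      label values head headlbl
      tab head, missing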
      I then conduct a t-test:
      Code:
      . ttest saving, by(position)
      
      Two-sample t test with equal variances
      ------------------------------------------------------------------------------
         Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
      ---------+--------------------------------------------------------------------
             0 |     886    .4266366    .0166254    .4948679    .3940067    .4592664
       head of |  12,951    .3827504    .0042712    .4860769    .3743781    .3911226
      ---------+--------------------------------------------------------------------
      combined |  13,837    .3855605    .0041379     .486745    .3774496    .3936713
      ---------+--------------------------------------------------------------------
          diff |            .0438862    .0168991                .0107617    .0770107
      ------------------------------------------------------------------------------
          diff = mean(0) - mean(head of)                                t =   2.5970
      Ho: diff = 0                                     degrees of freedom =    13835
      
          Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
       Pr(T < t) = 0.9953         Pr(|T| > |t|) = 0.0094          Pr(T > t) = 0.0047
      My results are significant (p-value < 0.05). Does this suggest that I reject Ho (the statement that the means are equal), and hence that there is a difference between the two groups (position=1 vs position=0)?


      It is difficult to judge whether or not you should look into -heckman- without knowing what sort of research question you are after, but it is not very likely to be needed when the selection is something like household member type.
      My research question looks into determinants of saving (motives such as retirement, purchase, etc.). For example, if a household rates retirement as an important motive to save, is this associated with a higher probability of saving?

      Thanks



      • #4
        Ah, if your savings variable is not continuous, the t test is not your best tool. Use a proportions-based test, e.g., here (example 2 onwards), here (last page), and here.
        That said, even if the t test is not the best tool, with the sample size you have I would guess you'd also find differences with these other tests. The question is whether these differences matter, and if so, how. If the literature you're working with suggests using household-head data only, then the differences you find may be one reason why previous analysts have filtered out non-household-head data.

        A selection model would make more sense if, e.g., you were to assess levels of saving, given that there are many households that do not save at all. The selection in such a case would be saving any money vs not saving at all. If your outcome variable is categorical, a better method is logistic regression (here, section 16.3, and here).
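        For instance, a minimal sketch with your variables (using the recoded 0/1 -position- from #3 as the grouping variable) would be something like:
        Code:
        * two-sample test of proportions: share of savers among heads vs non-heads
        prtest saving, by(position)
        * with a binary outcome, a logistic regression is an alternative to the t test
        logit saving i.position, or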



        • #5
          Thank you Jorrit Gosens. Yes, I think it is fine for me to justify filtering out non-household-head data using the literature, rather than statistical tests.

          One key piece of literature drops retirees from the sample, because saving behaviour has been shown to differ between retirees and non-retirees.
          That study cites another article which found that saving behaviour differs between persons below 62 years (non-retirees) and those above 62 years (retirees).
          My -saving- variable is binary, so I have used -prtest- as per your suggestion.
          Code:
          . tab saving
          
            household |
               income |
             exceeded |
          spending in |
              past 12 |
              months, |
            excluding |
                major |
            purchases |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                   No |      8,502       61.44       61.44
                  Yes |      5,335       38.56      100.00
          ------------+-----------------------------------
                Total |     13,837      100.00
          
          . tab retired
          
           occupation |
            status is |
              retired |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                   No |      9,683       68.46       68.46
                  Yes |      4,462       31.54      100.00
          ------------+-----------------------------------
                Total |     14,145      100.00
          
          . prtest saving, by(retired) level(95)
          
          Two-sample test of proportions                    No: Number of obs =     9445
                                                           Yes: Number of obs =     4392
          ------------------------------------------------------------------------------
              Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                    No |   .3716252   .0049723                      .3618796    .3813708
                   Yes |   .4155282   .0074362                      .4009536    .4301029
          -------------+----------------------------------------------------------------
                  diff |   -.043903   .0089455                     -.0614358   -.0263703
                       |  under Ho:   .0088894    -4.94   0.000
          ------------------------------------------------------------------------------
                  diff = prop(No) - prop(Yes)                               z =  -4.9388
              Ho: diff = 0
          
              Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
           Pr(Z < z) = 0.0000         Pr(|Z| > |z|) = 0.0000          Pr(Z > z) = 1.0000
          This suggests that there is a significant difference in the proportion of savers between retirees and non-retirees.
          Should I drop the retirees (as done in the piece of literature I referred to earlier), or could I include -retired- as a dummy in my regression?


          A selection model would make more sense if, e.g., you were to assess levels of saving, given that there are many households that do not save at all. The selection in such a case would be saving any money vs not saving at all.
          So, as my -saving- variable is binary (either 1 or 0, not saving levels), a selection model makes less sense?

          If your outcome variable is categorical, a better method is logistic regression (here, section 16.3, and here)
          I think my outcome variable is not categorical:
          My key outcome variables are 5 categories of saving motives: Prec, Purchase, Retire, Bequest and Growth.
          To create these explanatory variables, I am combining survey responses using ordinal (ranked 1-7) variable responses.
          For example, to create -bequest-, I am adding -reason01- and -reason09-
          -reason01- describes the reason for saving as leaving a house/assets to children (importance of reason ranked 1-7, with 7 being very important).
          -reason09- describes the reason for saving as leaving money to children (again ranked 1-7).
          As these are both related to the -bequest- motive category, I have added them together, and -bequest- now ranges from 2 to 14.
          I have done this for all of my saving motives (in reality my dataset is much larger, but for simplicity in this example, I have created each motive using just two -reason- variables).
          Code:
          gen prec = reason10 + reason14
          gen purchase = reason06 + reason15
          gen retire = reason03 + reason11
          gen bequest = reason01 + reason09
          gen growth = reason07 + reason12
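          One thing I am keeping an eye on is missing values: -generate- with + returns a missing sum whenever any component reason is missing, whereas -egen, rowtotal()- by default treats missing components as zero, so the two constructions can give different estimation samples (a small sketch, with hypothetical variable names):
          Code:
          * the two constructions treat missing components differently
          generate bequest_sum = reason01 + reason09        // missing if either reason is missing
          egen bequest_row = rowtotal(reason01 reason09)    // missing reasons counted as 0 by default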
          Many thanks
          Last edited by Rose Simmons; 11 Apr 2017, 06:49.



          • #6
            Your outcome variable, as described so far, is the -saving- variable. That's a dummy, which is a type of categorical variable, and that means logistic regression better suits your needs.
            I haven't worked much with categorical explanatory variables, but adding them together seems a little odd to me. Consider asking others for more advice.



            • #7
              Adding explanatory/independent/right-hand-side/x-variables can make sense: it is an old school technique for constraining coefficients to be equal. See: http://www.stata-journal.com/article...article=st0261

              So the effect of bequest is the effect of reason01 and reason09 if you constrained their effects to be equal, i.e. the effect on saving of the importance you assign to leaving a house/assets to children is the same as the effect on saving of the importance you assign to leaving money to children. That does not sound too outlandish to me, but it is a testable assumption. So why not test it rather than assume it?
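              A minimal sketch of such a test, using the variable names from this thread (other covariates and options left out for brevity):
              Code:
              constraint 1 reason01 = reason09
              logit saving reason01 reason09, constraints(1)
              estimates store constr
              logit saving reason01 reason09
              estimates store unconstr
              lrtest constr unconstr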
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------



              • #8
                Apologies for the confusion Jorrit Gosens, I had mistakenly thought that "outcome variable" referred to my explanatory variables rather than my dependent variable.
                Yes, the -saving- variable is my dependent (outcome) variable and it is a dummy, so I will look into using -xtlogit- instead of -xtprobit-.

                With regards to #5:
                The literature piece said "The sample is limited to non-retired respondents since retirees have been shown to exhibit different saving behaviors than non-retirees [....] After deleting households in which the respondent was retired, the unweighted sample consists of 3823 respondents."
                However, I am unsure about whether I should drop retired respondents. Surely it can be argued that various groups can exhibit different behaviours - e.g. married respondents may exhibit different saving behaviours to non-married respondents? In which case, I could end up dropping many observations.

                Would it be acceptable for me to include a dummy variable for -retired- in my regression, rather than dropping retired respondents?

                Thanks



                • #9
                  Rose:
                  Would it be acceptable for me to include a dummy variable for -retired- in my regression, rather than dropping retired respondents?
                  Yes, I would go that way (categorical variable with 0=still working; 1=retired).
                  Kind regards,
                  Carlo
                  (Stata 19.0)



                  • #10
                    Maarten Buis, yes, I think it would be best to test this assumption; thank you for the link.
                    I have attempted the test below:

                    Code:
                    . tab reason01
                    
                         reasons for |
                    saving money: to |
                       leave a house |
                        and/or other |
                     valuable assets |
                            to my ch |      Freq.     Percent        Cum.
                    -----------------+-----------------------------------
                    Very unimportant |      3,889       32.56       32.56
                                   2 |      1,966       16.46       49.02
                                   3 |      1,547       12.95       61.97
                                   4 |      1,779       14.89       76.87
                                   5 |      1,343       11.24       88.11
                                   6 |        879        7.36       95.47
                      Very important |        541        4.53      100.00
                    -----------------+-----------------------------------
                               Total |     11,944      100.00
                    
                    . tab reason09
                    
                         reasons for |
                    saving money: to |
                      leave money to |
                       your children |
                           (or other |
                          relatives) |      Freq.     Percent        Cum.
                    -----------------+-----------------------------------
                    Very unimportant |      4,019       31.88       31.88
                                   2 |      2,165       17.17       49.05
                                   3 |      1,471       11.67       60.72
                                   4 |      1,812       14.37       75.09
                                   5 |      1,528       12.12       87.21
                                   6 |      1,055        8.37       95.57
                      Very important |        558        4.43      100.00
                    -----------------+-----------------------------------
                               Total |     12,608      100.00
                    
                    . generate byte bequest = reason01 + reason09
                    (2,263 missing values generated)
                    
                    . quietly logit saving bequest, or nolog
                    
                    . estimates store sum1
                    
                    . constraint 1 reason01 = reason09
                    
                    . quietly logit saving reason01 reason09, or constraint(1) nolog
                    
                    . estimates store constr1
                    
                    . quietly logit saving reason01 reason09, or nolog
                    
                    . estimates store unconstr1
                    
                    . estimates table sum1 constr1 unconstr1, stats(ll N) eform b(%9.3g) se(%9.3g) stfmt(%9
                    > .4g)
                    
                    --------------------------------------------------
                        Variable |   sum1       constr1    unconstr1  
                    -------------+------------------------------------
                         bequest |      1.01                          
                                 |    .00532                          
                        reason01 |                  1.01        .992  
                                 |                .00532       .0185  
                        reason09 |                  1.01        1.04  
                                 |                .00532        .019  
                           _cons |      .575        .575        .575  
                                 |     .0212       .0212       .0211  
                    -------------+------------------------------------
                              ll |     -7912       -7912       -7911  
                               N |     11882       11882       11882  
                    --------------------------------------------------
                                                          legend: b/se
                    
                    . lrtest constr1 unconstr1
                    
                    Likelihood-ratio test                                 LR chi2(1)  =      1.49
                    (Assumption: constr1 nested in unconstr1)             Prob > chi2 =    0.2224
                    My understanding of the above is as follows. Please could you let me know if I have interpreted it correctly?
                    - sum1 is the equation I have been using, with the dependent variable saving on the left-hand side, and the bequest motive variable created by summation on the right-hand side.
                    - constr1 constrains the effects of reason01 and reason09 to be equal
                    - unconstr1 does not have this constraint
                    - The likelihood ratio test compares constr1 and unconstr1
                    - As the p-value > 0.05, the lrtest result is insignificant, suggesting that we fail to reject the null hypothesis that the constraint reason01 = reason09 holds, so in this case adding the two variables (reason01 and reason09) to create bequest can be justified statistically?

                    Also, I wanted to ask, as my data is panel data, should I have used xtlogit/xtprobit instead of logit in the test above?

                    Many thanks
                    Last edited by Rose Simmons; 11 Apr 2017, 08:04. Reason: -tab reason01- and -tab reason09- commands added for clarity



                    • #11
                      Originally posted by Carlo Lazzaro View Post
                      Rose:


                      Yes, I would go that way (categorical variable with 0=still working; 1=retired).
                      Thank you Carlo Lazzaro
                      The inclusion of the -retired- dummy should account for the different saving behaviours of retirees and non-retirees (so there is no need to drop retirees from the sample).
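                      Something along these lines is what I have in mind (a sketch with placeholder covariates, using the motive variables from #5):
                      Code:
                      * sketch: keep retirees in the sample and include the retirement indicator as a covariate
                      xtset hhid year
                      xtlogit saving i.retired prec purchase retire bequest growth, re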



                      • #12
                        Rose:
                        correct.
                        I would only add:
                        account for the different saving behaviours of retirees and non-retirees
                        when adjusted for the remaining predictors.
                        Kind regards,
                        Carlo
                        (Stata 19.0)



                        • #13
                          Originally posted by Rose Simmons View Post
                          Also, I wanted to ask, as my data is panel data, should I have used xtlogit/xtprobit instead of logit in the test above?
                          Maarten Buis, I had another look at the link that you provided, and I noticed in the second example that you used -ologit- as you had "two or more ordinal or categorical variables that you want to combine".
                          Question 1: Was there a reason why you chose -logit- instead of -probit-?
                          Question 2: As shown in #10, my reason variables are ordinal (ranked 1-7 on the Likert Scale). Therefore, when I run my regression, should I be running it as -xtologit- , or even -xtoprobit-?

                          Many thanks



                          • #14
                            There's a ton of discussion on logit vs probit, but if you are trying to understand the differences, a good start is some lecture slides here and here, or an archived discussion here.
                            Short recap: the results (in significance and sign) will be similar across both, but logit coefficients are easier to interpret.
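                            If you want to see this for yourself, a quick side-by-side comparison would be something like this (a sketch, with -retired- as an illustrative covariate):
                            Code:
                            * fit both models and compare coefficients' signs and significance
                            quietly logit saving i.retired
                            estimates store m_logit
                            quietly probit saving i.retired
                            estimates store m_probit
                            estimates table m_logit m_probit, b(%9.3f) se(%9.3f)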

                            No need to look into ordered logit or probit. These models are applicable when your outcome variable is ordinal and has >2 values.



                            • #15
                              Thank you for the links Jorrit Gosens
                              I would like to report and compare marginal effects, so I think it would be best for me to use -xtprobit- rather than -xtlogit-.
                              Also, my research discipline is Economics, where probit tends to be used more.
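                              Something like this is what I am planning (a sketch with placeholder covariates; -pu0- is the predicted probability of saving with the random effect set to zero):
                              Code:
                              * sketch: random-effects probit, then average marginal effects
                              xtprobit saving i.retired prec purchase retire bequest growth, re
                              margins, dydx(*) predict(pu0)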

                              No need to look into ordered logit or probit. These models are applicable when your outcome variable is ordinal and has >2 values.
                              Indeed, when I run my main regression I will not use ordered logit/probit, as the outcome variable -saving- is binary.
                              I had wondered, though, whether ordered logit/probit was necessary for the assumption tested in #10. In the link provided by Maarten Buis (http://www.stata-journal.com/article...article=st0261), ordered logit is used in the second example (page 5), where there are "two or more ordinal or categorical variables that you want to combine". The variables being combined there are ordinal, which is why I thought -ologit- had been used; but looking again, I suppose the outcome variable in that example, -degree-, may also be ordinal, which could be why -ologit- was used in the test.
                              So, in my regression, -saving- is binary but the reason variables are ordinal: for the purposes of the test in #10, should I use xtlogit (rather than logit or xtologit)?

                              Thank you
                              Last edited by Rose Simmons; 12 Apr 2017, 06:26. Reason: Added that my research discipline is Economics

