
  • Dropping certain observations, sample selection bias

    Hi,

    I have an unbalanced panel dataset (N=2976, T=13), using survey responses.
    My dependent variable is the household's ability to save (saving=1 if able to save, 0 otherwise), and I intend to use -xtprobit, re- to run my model.
    hhid is the Household's unique identifier, and the data is yearly.

    Code:
    . xtset hhid year
           panel variable:  hhid (unbalanced)
            time variable:  year, 2004 to 2016, but with gaps
                    delta:  1 unit
    
    .
    . xtdes
    
        hhid:  6, 21, ..., 89972                                 n =       3316
        year:  2004, 2005, ..., 2016                             T =         13
               Delta(year) = 1 unit
               Span(year)  = 13 periods
               (hhid*year uniquely identifies each observation)
    
    Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                             1       1       1         3         6      13      13
    
         Freq.  Percent    Cum. |  Pattern
     ---------------------------+---------------
          280      8.44    8.44 |  ...........11
          247      7.45   15.89 |  ............1
          211      6.36   22.26 |  1111111111111
          164      4.95   27.20 |  1............
           95      2.86   30.07 |  ..........111
           81      2.44   32.51 |  ...........1.
           80      2.41   34.92 |  ..........1..
           77      2.32   37.24 |  .1...........
           74      2.23   39.48 |  11...........
         2007     60.52  100.00 | (other patterns)
     ---------------------------+---------------
         3316    100.00         |  XXXXXXXXXXXXX
    The variable -position- tells me the position in the household of the interviewee:
    Code:
    . codebook position
    
    ----------------------------------------------------------------------------------
    position                                                 position in the household
    ----------------------------------------------------------------------------------
    
                      type:  numeric (double)
                     label:  positie
    
                     range:  [1,7]                        units:  1
             unique values:  7                        missing .:  1/14,145
    
                tabulation:  Freq.   Numeric  Label
                            13,217         1  head of the household
                               684         2  spouse
                               225         3  permanent partner (not married)
                                10         4  parent (in law)
                                 3         5  child living at home
                                 2         6  housemate
                                 3         7  family member or border
                                 1         .
    I would like to drop those who are not household heads: the literature I am basing my work on uses data solely from household heads, and I think the financial data (e.g. the amount saved) is likely to be more accurate when reported by household heads, who may be better informed than, say, their children about the household's financial affairs.

    Code:
    . drop if (position==2 | position==3 | position==4 | position==5 | position==6 | p
    > osition==7 | position==.)
    (928 observations deleted)
    Q1: I wonder, have I now biased my sample by dropping observations that were not household heads?
    Q2: If there is sample selection bias, please could you recommend how I may test for it? Is there a t-test that I could conduct, for example, to compare the difference in means before and after dropping observations?
    Q3: Would you recommend that I look into Heckman models, or is Heckman not relevant here?

    Many thanks
    Last edited by Rose Simmons; 11 Apr 2017, 05:01. Reason: I added that I intend to use -xtprobit, re- to run my model

  • #2
    If the literature you're working with uses household-head data, and has good arguments for doing so, then your dropping non-household-head observations would reduce rather than create bias.
    You could do t tests to identify differences, e.g., in reported income, with a variant of the following (see also: http://www.stata.com/manuals13/rttest.pdf):
    Code:
    sysuse auto
    ttest price, by(foreign)
    That said, differences in some variables do not mean you should not make a selection; as noted, the literature may have provided good reasons for a certain selection.

    It is difficult to judge whether or not you should look into -heckman- without knowing what sort of research question you are after, but it is not very likely to be needed when the selection is something like household member type.



    • #3
      Thank you for your helpful reply Jorrit Gosens

      You could do t tests to identify differences
      I have conducted a t-test to compare whether -saving- (my dependent variable, which measures the household's ability to save) varies according to household position.
      To do this test, I ran my do file again, but did not run the -drop- command shown in #1. Instead, I created a dummy variable to separate household heads from non-household heads. I hope this approach is correct:
      Code:
      . recode position (2=0) (3=0) (4=0) (5=0) (6=0) (7=0) (.=0)
      (position: 928 changes made)
      
      . tab position
      
            position in the household |      Freq.     Percent        Cum.
      --------------------------------+-----------------------------------
                                    0 |        928        6.56        6.56
                head of the household |     13,217       93.44      100.00
      --------------------------------+-----------------------------------
                                Total |     14,145      100.00
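      An alternative that would leave -position- untouched is to generate a separate 0/1 indicator, along these lines (a minimal sketch; here missing -position- is coded 0, mirroring the recode above):
      Code:
      * sketch: create a head-of-household dummy instead of recoding -position- in place
      generate byte head = (position == 1)    // 1 = head of the household, 0 otherwise (incl. missing)
      label define headlbl 0 "not head" 1 "head of the household"
      label values head headlbl
      tab head, missing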
      I then conduct a t-test:
      Code:
      . ttest saving, by(position)
      
      Two-sample t test with equal variances
      ------------------------------------------------------------------------------
         Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
      ---------+--------------------------------------------------------------------
             0 |     886    .4266366    .0166254    .4948679    .3940067    .4592664
       head of |  12,951    .3827504    .0042712    .4860769    .3743781    .3911226
      ---------+--------------------------------------------------------------------
      combined |  13,837    .3855605    .0041379     .486745    .3774496    .3936713
      ---------+--------------------------------------------------------------------
          diff |            .0438862    .0168991                .0107617    .0770107
      ------------------------------------------------------------------------------
          diff = mean(0) - mean(head of)                                t =   2.5970
      Ho: diff = 0                                     degrees of freedom =    13835
      
          Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
       Pr(T < t) = 0.9953         Pr(|T| > |t|) = 0.0094          Pr(T > t) = 0.0047
      My results are significant (p-value < 0.05). Does this suggest that I reject Ho (the statement that the means are equal), and hence that there is a difference between the two groups (position=1 vs position=0)?


      It is difficult to judge whether or not you should look into -heckman- without knowing what sort of research question you are after, but it is not very likely to be needed when the selection is something like household member type.
      My research question looks into determinants of saving (motives such as retirement, purchase, etc.). For example, if a household rates retirement as an important motive to save, is this associated with a higher probability of saving?

      Thanks



      • #4
        Ah, if your savings variable is not continuous, the t test is not your best tool. Use a proportions-based test, e.g., here (example 2 onwards), here (last page), and here.
        That said, even if the t test is not the best tool, with the sample size you have I would guess you'd also find differences with these other tests. The question is whether these differences matter, and if so, how. If the literature you're working with suggests using household-head data only, then the differences you find may be one reason why previous analysts have filtered out non-household-head data.

        A selection model would make more sense if, e.g., you were to assess levels of saving, given that there are many households that do not save at all. The selection in such a case would be saving any money vs not saving at all. If your outcome variable is categorical, a better method is logistic regression (here, section 16.3, and here).
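        For instance, a minimal sketch with your variables (using the recoded 0/1 -position- from #3 as the grouping variable) would be something like:
        Code:
        * two-sample test of proportions: share of savers among heads vs non-heads
        prtest saving, by(position)
        * with a binary outcome, a logistic regression is an alternative to the t test
        logit saving i.position, or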



        • #5
          Thank you Jorrit Gosens. Yes, I think it is fine for me to justify filtering out non-household-head data using the literature, rather than statistical tests.

          One key piece of literature drops retirees from the sample, because saving behaviour has been shown to differ between retirees and non-retirees.
          That study cites another article which found that saving behaviour differs between persons below 62 years (non-retirees) and those above 62 years (retirees).
          My -saving- variable is binary, so I have used -prtest- as per your suggestion.
          Code:
          . tab saving
          
            household |
               income |
             exceeded |
          spending in |
              past 12 |
              months, |
            excluding |
                major |
            purchases |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                   No |      8,502       61.44       61.44
                  Yes |      5,335       38.56      100.00
          ------------+-----------------------------------
                Total |     13,837      100.00
          
          . tab retired
          
           occupation |
            status is |
              retired |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                   No |      9,683       68.46       68.46
                  Yes |      4,462       31.54      100.00
          ------------+-----------------------------------
                Total |     14,145      100.00
          
          . prtest saving, by(retired) level(95)
          
          Two-sample test of proportions                    No: Number of obs =     9445
                                                           Yes: Number of obs =     4392
          ------------------------------------------------------------------------------
              Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                    No |   .3716252   .0049723                      .3618796    .3813708
                   Yes |   .4155282   .0074362                      .4009536    .4301029
          -------------+----------------------------------------------------------------
                  diff |   -.043903   .0089455                     -.0614358   -.0263703
                       |  under Ho:   .0088894    -4.94   0.000
          ------------------------------------------------------------------------------
                  diff = prop(No) - prop(Yes)                               z =  -4.9388
              Ho: diff = 0
          
              Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
           Pr(Z < z) = 0.0000         Pr(|Z| > |z|) = 0.0000          Pr(Z > z) = 1.0000
          This suggests that there is a significant difference in the proportion of savers between retirees and non-retirees.
          Should I drop the retirees (as done in the piece of literature I referred to earlier), or could I include -retired- as a dummy in my regression?


          A selection model would make more sense if, e.g., you were to assess levels of saving, given that there are many households that do not save at all. The selection in such a case would be saving any money vs not saving at all.
          So, as my -saving- variable is binary (either 1 or 0, not saving levels), a selection model makes less sense?

          If your outcome variable is categorical, a better method is logistic regression (here, section 16.3, and here)
          I think my outcome variable is not categorical:
          My key outcome variables are 5 categories of saving motives: Prec, Purchase, Retire, Bequest and Growth.
          To create these explanatory variables, I am combining survey responses using ordinal (ranked 1-7) variable responses.
          For example, to create -bequest-, I am adding -reason01- and -reason09-
          -reason01- describes the reason for saving as leaving a house/assets to children (importance of reason ranked 1-7, with 7 being very important).
          -reason09- describes the reason for saving as leaving money to children (again ranked 1-7).
          As these are both related to the -bequest- motive category, I have added them together, and -bequest- now ranges from 2 to 14.
          I have done this for all of my saving motives (in reality my dataset is much larger, but for simplicity in this example, I have created each motive using just two -reason- variables).
          Code:
          gen prec = reason10 + reason14
          gen purchase = reason06 + reason15
          gen retire = reason03 + reason11
          gen bequest = reason01 + reason09
          gen growth = reason07 + reason12
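          One thing I am keeping an eye on is missing values: -generate- with + returns a missing sum whenever any component reason is missing, whereas -egen, rowtotal()- by default treats missing components as zero, so the two constructions can give different estimation samples (a small sketch, with hypothetical variable names):
          Code:
          * the two constructions treat missing components differently
          generate bequest_sum = reason01 + reason09        // missing if either reason is missing
          egen bequest_row = rowtotal(reason01 reason09)    // missing reasons counted as 0 by default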
          Many thanks
          Last edited by Rose Simmons; 11 Apr 2017, 06:49.



          • #6
            Your outcome variable, as described so far, is the -saving- variable. That's a dummy, which is a type of categorical variable, and that means logistic regression better suits your needs.
            I haven't worked much with categorical explanatory variables, but adding them together seems a little odd to me. Consider asking others for more advice.



            • #7
              Adding explanatory/independent/right-hand-side/x-variables can make sense: it is an old school technique for constraining coefficients to be equal. See: http://www.stata-journal.com/article...article=st0261

              So the effect of bequest is the effect of reason01 and reason09 if you constrained their effects to be equal, i.e. the effect on saving of the importance you assign to leaving a house/assets to children is the same as the effect on saving of the importance you assign to leaving money to children. That does not sound too outlandish to me, but it is a testable assumption. So why not test it rather than assume it?
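              A minimal sketch of such a test, using the variable names from this thread (other covariates and options left out for brevity):
              Code:
              constraint 1 reason01 = reason09
              logit saving reason01 reason09, constraints(1)
              estimates store constr
              logit saving reason01 reason09
              estimates store unconstr
              lrtest constr unconstr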
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------



              • #8
                Apologies for the confusion Jorrit Gosens, I had mistakenly thought that "outcome variable" referred to my explanatory variables rather than my dependent variable.
                Yes, the -saving- variable is my dependent (outcome) variable and it is a dummy, so I will look into using -xtlogit- instead of -xtprobit-.

                With regards to #5:
                The literature piece said "The sample is limited to non-retired respondents since retirees have been shown to exhibit different saving behaviors than non-retirees [....] After deleting households in which the respondent was retired, the unweighted sample consists of 3823 respondents."
                However, I am unsure about whether I should drop retired respondents. Surely it can be argued that various groups can exhibit different behaviours - e.g. married respondents may exhibit different saving behaviours to non-married respondents? In which case, I could end up dropping many observations.

                Would it be acceptable for me to include a dummy variable for -retired- in my regression, rather than dropping retired respondents?

                Thanks



                • #9
                  Rose:
                  Would it be acceptable for me to include a dummy variable for -retired- in my regression, rather than dropping retired respondents?
                  Yes, I would go that way (categorical variable with 0=still working; 1=retired).
                  Kind regards,
                  Carlo
                  (Stata 19.0)



                  • #10
                    Maarten Buis, yes, I think it would be best to test this assumption; thank you for the link.
                    I have attempted the test below:

                    Code:
                    . tab reason01
                    
                         reasons for |
                    saving money: to |
                       leave a house |
                        and/or other |
                     valuable assets |
                            to my ch |      Freq.     Percent        Cum.
                    -----------------+-----------------------------------
                    Very unimportant |      3,889       32.56       32.56
                                   2 |      1,966       16.46       49.02
                                   3 |      1,547       12.95       61.97
                                   4 |      1,779       14.89       76.87
                                   5 |      1,343       11.24       88.11
                                   6 |        879        7.36       95.47
                      Very important |        541        4.53      100.00
                    -----------------+-----------------------------------
                               Total |     11,944      100.00
                    
                    . tab reason09
                    
                         reasons for |
                    saving money: to |
                      leave money to |
                       your children |
                           (or other |
                          relatives) |      Freq.     Percent        Cum.
                    -----------------+-----------------------------------
                    Very unimportant |      4,019       31.88       31.88
                                   2 |      2,165       17.17       49.05
                                   3 |      1,471       11.67       60.72
                                   4 |      1,812       14.37       75.09
                                   5 |      1,528       12.12       87.21
                                   6 |      1,055        8.37       95.57
                      Very important |        558        4.43      100.00
                    -----------------+-----------------------------------
                               Total |     12,608      100.00
                    
                    . generate byte bequest = reason01 + reason09
                    (2,263 missing values generated)
                    
                    . quietly logit saving bequest, or nolog
                    
                    . estimates store sum1
                    
                    . constraint 1 reason01 = reason09
                    
                    . quietly logit saving reason01 reason09, or constraint(1) nolog
                    
                    . estimates store constr1
                    
                    . quietly logit saving reason01 reason09, or nolog
                    
                    . estimates store unconstr1
                    
                    . estimates table sum1 constr1 unconstr1, stats(ll N) eform b(%9.3g) se(%9.3g) stfmt(%9
                    > .4g)
                    
                    --------------------------------------------------
                        Variable |   sum1       constr1    unconstr1  
                    -------------+------------------------------------
                         bequest |      1.01                          
                                 |    .00532                          
                        reason01 |                  1.01        .992  
                                 |                .00532       .0185  
                        reason09 |                  1.01        1.04  
                                 |                .00532        .019  
                           _cons |      .575        .575        .575  
                                 |     .0212       .0212       .0211  
                    -------------+------------------------------------
                              ll |     -7912       -7912       -7911  
                               N |     11882       11882       11882  
                    --------------------------------------------------
                                                          legend: b/se
                    
                    . lrtest constr1 unconstr1
                    
                    Likelihood-ratio test                                 LR chi2(1)  =      1.49
                    (Assumption: constr1 nested in unconstr1)             Prob > chi2 =    0.2224
                    My understanding of the above is as follows. Please could you let me know if I have interpreted it correctly?
                    - sum1 is the equation I have been using, with the dependent variable saving on the left-hand side, and the bequest motive variable created by summation on the right-hand side.
                    - constr1 constrains the effects of reason01 and reason09 to be equal
                    - unconstr1 does not have this constraint
                    - The likelihood ratio test compares constr1 and unconstr1
                    - As the p-value > 0.05, the lrtest result is insignificant, suggesting that we fail to reject the null hypothesis that the constraint reason01 = reason09 holds, so in this case adding the two variables (reason01 and reason09) to create bequest can be justified statistically?

                    Also, I wanted to ask, as my data is panel data, should I have used xtlogit/xtprobit instead of logit in the test above?

                    Many thanks
                    Last edited by Rose Simmons; 11 Apr 2017, 08:04. Reason: -tab reason01- and -tab reason09- commands added for clarity



                    • #11
                      Originally posted by Carlo Lazzaro View Post
                      Rose:


                      Yes, I would go that way (categorical variable with 0=still working; 1=retired).
                      Thank you Carlo Lazzaro
                      The inclusion of the -retired- dummy should account for the different saving behaviours of retirees and non-retirees (so there is no need to drop retirees from the sample).
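                      Something along these lines is what I have in mind (a sketch with placeholder covariates, using the motive variables from #5):
                      Code:
                      * sketch: keep retirees in the sample and include the retirement indicator as a covariate
                      xtset hhid year
                      xtlogit saving i.retired prec purchase retire bequest growth, re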



                      • #12
                        Rose:
                        correct.
                        I would only add:
                        account for the different saving behaviours of retirees and non-retirees
                        when adjusted for the remaining predictors.
                        Kind regards,
                        Carlo
                        (Stata 19.0)



                        • #13
                          Originally posted by Rose Simmons View Post
                          Also, I wanted to ask, as my data is panel data, should I have used xtlogit/xtprobit instead of logit in the test above?
                          Maarten Buis, I had another look at the link that you provided, and I noticed in the second example that you used -ologit- as you had "two or more ordinal or categorical variables that you want to combine".
                          Question 1: Was there a reason why you chose -logit- instead of -probit-?
                          Question 2: As shown in #10, my reason variables are ordinal (ranked 1-7 on the Likert Scale). Therefore, when I run my regression, should I be running it as -xtologit- , or even -xtoprobit-?

                          Many thanks



                          • #14
                            There's a ton of discussion on logit vs probit, but if you are trying to understand the differences, a good start is some lecture slides here and here, or an archived discussion here.
                            Short recap: the results (in significance and sign) will be similar across both, but logit coefficients are easier to interpret.
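                            If you want to see this for yourself, a quick side-by-side comparison would be something like this (a sketch, with -retired- as an illustrative covariate):
                            Code:
                            * fit both models and compare coefficients' signs and significance
                            quietly logit saving i.retired
                            estimates store m_logit
                            quietly probit saving i.retired
                            estimates store m_probit
                            estimates table m_logit m_probit, b(%9.3f) se(%9.3f)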

                            No need to look into ordered logit or probit. These models are applicable when your outcome variable is ordinal and has >2 values.



                            • #15
                              Thank you for the links Jorrit Gosens
                              I would like to report and compare marginal effects, so I think it would be best for me to use -xtprobit- rather than -xtlogit-.
                              Also, my research discipline is Economics, where probit tends to be used more.
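                              Something like this is what I am planning (a sketch with placeholder covariates; -pu0- is the predicted probability of saving with the random effect set to zero):
                              Code:
                              * sketch: random-effects probit, then average marginal effects
                              xtprobit saving i.retired prec purchase retire bequest growth, re
                              margins, dydx(*) predict(pu0)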

                              No need to look into ordered logit or probit. These models are applicable when your outcome variable is ordinal and has >2 values.
                              Indeed, when I run my main regression I will not use ordered logit/probit, as the outcome variable -saving- is binary.
                              I had wondered, though, whether ordered logit/probit was necessary for the assumption tested in #10. In the link provided by Maarten Buis (http://www.stata-journal.com/article...article=st0261), ordered logit is used in the second example (page 5), where there are "two or more ordinal or categorical variables that you want to combine". The variables being combined there are ordinal, which is why I thought -ologit- had been used; but looking again, I suppose the outcome variable in that example, -degree-, may also be ordinal, which could be why -ologit- was used in the test.
                              So, in my regression, -saving- is binary but the reason variables are ordinal: for the purposes of the test in #10, should I use xtlogit (rather than logit or xtologit)?

                              Thank you
                              Last edited by Rose Simmons; 12 Apr 2017, 06:26. Reason: Added that my research discipline is Economics

