
  • Confusion with a fixed effect regression for panel data with lags/leads

    I ran a fixed-effects regression but got only around 7,000 observations when I should have more than 100,000.

    I am using Stata 15.

    For context, I am using panel data to study the effect that internal migration has on a person's subjective well-being.
    I think part of the problem is that I am using the explanatory variable M0 (an indicator for whether the person migrated internally) together with its lags and leads (e.g. L3.M0, F3.M0). A further concern: I created M0 from the original variable (movest), which also includes categories such as 'new entrant' and 'moved back to GB'. Will this affect my results, since M0 ignores whether a person was a new entrant or moved back to GB? Are there recommended solutions for this?

    This is what I got from the command:
    xtreg lfsato age nkids mastat L5.M0 L4.M0 L3.M0 L2.M0 L1.M0 M0 F1.M0 F2.M0 F3.M0 F4.M0 F5.M0, fe
    HTML Code:
    Fixed-effects (within) regression               Number of obs     =      7,667
    Group variable: pid                             Number of groups  =      4,027
    
    R-sq:                                           Obs per group:
         within  = 0.0083                                         min =          1
         between = 0.0274                                         avg =        1.9
         overall = 0.0238                                         max =          2
    
                                                    F(14,3626)        =       2.17
    corr(u_i, Xb)  = -0.4845                        Prob > F          =     0.0068
    
    ------------------------------------------------------------------------------
          lfsato |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0483116   .0175464     2.75   0.006     .0139098    .0827133
           nkids |  -.1035138   .0644501    -1.61   0.108     -.229876    .0228483
          mastat |  -.0206515   .0396681    -0.52   0.603    -.0984256    .0571225
                 |
              M0 |
             L5. |   .1455779   .0881316     1.65   0.099    -.0272146    .3183703
             L4. |   .2665378   .1237939     2.15   0.031     .0238253    .5092504
             L3. |   .2677237   .1488129     1.80   0.072    -.0240415     .559489
             L2. |   .0835615   .1711475     0.49   0.625    -.2519934    .4191163
             L1. |    .145695   .1847631     0.79   0.430    -.2165549     .507945
             --. |   .1794107   .1960381     0.92   0.360    -.2049451    .5637665
             F1. |  -.1019313    .203507    -0.50   0.616    -.5009308    .2970682
             F2. |  -.3197264   .2059309    -1.55   0.121    -.7234784    .0840255
             F3. |  -.2678047   .1987484    -1.35   0.178    -.6574743     .121865
             F4. |  -.1905153   .1841593    -1.03   0.301    -.5515814    .1705508
             F5. |    -.02172   .1577854    -0.14   0.891     -.331077     .287637
                 |
           _cons |   2.857905   .9339647     3.06   0.002     1.026757    4.689053
    -------------+----------------------------------------------------------------
         sigma_u |   1.224461
         sigma_e |   .7531382
             rho |  .72552085   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    F test that all u_i=0: F(4026, 3626) = 3.70                  Prob > F = 0.0000
    And here is a snippet of my data:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long pid byte lfsato int age byte(nkids mastat movest) float M0
    10007857 6 64 0 3 4 0
    10007857 5 65 0 3 1 0
    10007857 5 66 0 3 1 0
    10007857 . 67 0 3 1 0
    10014578 5 59 0 1 1 0
    10014578 7 61 0 1 1 0
    10014578 7 62 0 1 1 0
    10014578 6 63 0 1 1 0
    10014578 6 65 0 1 1 0
    10014578 6 66 0 1 1 0
    10014578 5 67 0 1 1 0
    10014578 5 68 0 1 1 0
    10014578 6 69 0 1 1 0
    10014578 7 70 0 1 2 1
    10014578 6 71 0 1 1 0
    10014608 5 62 0 1 1 0
    10014608 6 64 0 1 1 0
    10014608 6 65 0 1 1 0
    10014608 5 66 0 1 1 0
    10014608 6 68 0 1 1 0
    10014608 6 69 0 1 1 0
    10014608 6 70 0 1 1 0
    10014608 5 71 0 1 1 0
    10014608 3 72 0 1 1 0
    10014608 6 73 0 1 2 1
    10014608 3 74 0 1 1 0
    10016813 5 41 1 5 1 0
    10016813 . 43 0 4 1 0
    10016813 6 44 0 4 1 0
    10016813 . 45 0 4 1 0
    10016813 . 47 0 4 1 0
    10016813 . 49 0 3 1 0
    10016848 2 37 0 5 2 1
    10016848 1 39 0 4 1 0
    10016848 4 40 0 2 1 0
    10016848 3 41 0 2 1 0
    10016848 . 42 0 2 1 0
    10016848 6 43 0 2 1 0
    10016848 6 44 0 2 1 0
    10016848 5 45 0 2 1 0
    10016848 5 46 0 1 1 0
    10016848 6 47 0 1 1 0
    10016848 5 48 0 1 1 0
    10016848 4 49 0 1 1 0
    10016872 . 17 0 6 1 0
    10016872 6 18 0 6 1 0
    10016872 . 19 0 6 1 0
    10016872 . 19 0 6 1 0
    10016872 6 21 0 6 1 0
    10016872 6 22 0 2 1 0
    10016872 . 23 0 2 1 0
    10016872 6 24 0 2 1 0
    10016872 6 25 0 2 1 0
    10016872 6 26 0 2 1 0
    10016872 6 27 1 2 2 1
    10017933 6 54 0 5 1 0
    10017933 6 55 0 4 1 0
    10017933 6 56 0 4 1 0
    10017933 3 56 0 4 1 0
    10017933 5 58 0 4 1 0
    10017933 . 59 0 4 1 0
    10017933 5 60 0 4 1 0
    10017933 6 60 0 4 1 0
    10017933 4 62 0 4 1 0
    10017933 6 63 0 4 1 0
    10017933 5 64 0 4 1 0
    10017933 6 65 0 4 1 0
    10017933 6 66 0 4 1 0
    10017992 5 17 0 6 1 0
    10017992 6 18 0 6 1 0
    10017992 5 19 0 6 1 0
    10017992 4 19 0 6 1 0
    10017992 6 21 0 6 1 0
    10017992 . 22 0 6 1 0
    10017992 5 23 0 6 1 0
    10017992 . 23 0 6 1 0
    10017992 5 25 0 6 1 0
    10017992 7 26 0 6 1 0
    10017992 4 27 0 6 1 0
    10017992 5 28 0 6 1 0
    10019057 . 64 0 6 2 1
    10019057 6 65 0 6 1 0
    10019057 6 66 0 6 1 0
    10019057 6 67 0 6 1 0
    10019057 5 67 0 6 1 0
    10019057 . 68 0 6 1 0
    10019057 5 69 0 6 1 0
    10019057 . 71 0 6 1 0
    10019057 6 71 0 6 1 0
    10019057 5 73 0 6 1 0
    10019057 6 74 0 6 1 0
    10019057 6 75 0 6 1 0
    10019057 5 76 0 6 1 0
    10023526 4 43 0 4 1 0
    10023526 5 44 0 4 1 0
    10023526 . 48 0 1 1 0
    10023526 5 49 0 4 1 0
    10023526 5 50 0 4 1 0
    10023526 5 51 0 4 1 0
    10023526 5 52 0 4 1 0
    end
    label values lfsato flfsato
    label def flfsato 1 "not satisfied at all", modify
    label def flfsato 7 "completely satisfied", modify
    label values age fage
    label values nkids fnkids
    label def fnkids 0 "none", modify
    label values mastat fmastat
    label def fmastat 1 "married", modify
    label def fmastat 2 "living as couple", modify
    label def fmastat 3 "widowed", modify
    label def fmastat 4 "divorced", modify
    label def fmastat 5 "separated", modify
    label def fmastat 6 "never married", modify
    label values movest fmovest
    label def fmovest 1 "non-mover", modify
    label def fmovest 2 "mover within gb", modify
    label def fmovest 4 "mover back to gb", modify
    This follows on from an earlier thread in which I dropped individuals who migrated multiple times:
    https://www.statalist.org/forums/for...multiple-times

    Thanks for the help.

  • #2
    Consider your sample data for pid 10014578. You have 11 observations. Of these, only the sixth observation will have nonmissing values for all 10 lags and leads of M0. So xtreg will omit 10 of those 11 observations from the model, retaining only the sixth observation.

    Starting with your sample data, we'll generate a fake "time" variable because you neglected to include it in your sample data, and then apply xtset and xtdescribe and review the results.
    Code:
    . sort pid, stable
    
    . by pid: generate time=_n
    
    . xtset pid time
           panel variable:  pid (unbalanced)
            time variable:  time, 1 to 13
                    delta:  1 unit
    
    . xtdescribe
    
         pid:  10007857, 10014578, ..., 10023526                 n =         10
        time:  1, 2, ..., 13                                     T =         13
               Delta(time) = 1 unit
               Span(time)  = 13 periods
               (pid*time uniquely identifies each observation)
    
    Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                             4       4       7        11        12      13      13
    
         Freq.  Percent    Cum. |  Pattern
     ---------------------------+---------------
            3     30.00   30.00 |  11111111111..
            2     20.00   50.00 |  111111111111.
            2     20.00   70.00 |  1111111111111
            1     10.00   80.00 |  1111.........
            1     10.00   90.00 |  111111.......
            1     10.00  100.00 |  1111111......
     ---------------------------+---------------
           10    100.00         |  XXXXXXXXXXXXX
    So we have two pids with 13 periods, which will each yield 3 observations in xtreg; two pids with 12 periods, which will each yield 2 observations, and three pids with 11 periods, which will each yield 1 observation. All the other pids do not yield any observations. So from your 100 original observations, you will have 3*2 + 2*2 + 1*3 = 13 observations in xtreg.

    Note that your xtreg output told you that you apparently got at most 2 observations per pid.
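
    To see which observations survive in a case like this, one can flag the rows where M0 and all of its lags and leads are nonmissing. What follows is only a minimal sketch on a made-up panel (three hypothetical pids observed for 11, 12, and 13 consecutive periods, with M0 set to 0 throughout), not your data. The variable names T and complete are my own, and the /// line continuation assumes the code is run from a do-file.

```stata
* Hypothetical toy panel: pids 1, 2, 3 observed for 11, 12, 13 periods
clear
set obs 3
generate long pid = _n
generate byte T = 10 + _n
expand T
bysort pid: generate time = _n
generate byte M0 = 0
xtset pid time

* 1 only if M0 and all five lags and all five leads are nonmissing
generate byte complete = !missing(L5.M0, L4.M0, L3.M0, L2.M0, L1.M0, ///
    M0, F1.M0, F2.M0, F3.M0, F4.M0, F5.M0)

* Usable observations per pid: 1, 2, and 3 respectively
tabulate pid complete
count if complete
```

    Adding the outcome and the other regressors to the missing() list would track the estimation sample more closely. After actually running xtreg, count if e(sample) reports the number of observations used directly.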
    Last edited by William Lisowski; 12 Aug 2018, 09:59.



    • #3
      Originally posted by William Lisowski View Post
      So we have two pids with 13 periods, which will each yield 3 observations in xtreg; two pids with 12 periods, which will each yield 2 observations, and three pids with 11 periods, which will each yield 1 observation. All the other pids do not yield any observations. So from your 100 original observations, you will have 3*2 + 2*2 + 1*3 = 13 observations in xtreg.

      Note that your xtreg output told you that you apparently got at most 2 observations per pid.
      Thank you for this. Is there a way to take all observations per pid into account?



      • #4
        Completely my bad; I understand it a bit better now. Would the reduced number of observations be an issue for my regression?



        • #5
          I cannot comment. My own thought is that there might be a better way to parameterize this. Each observation represents SWB at some point before or after a move, except for the pids that do not move. And in that case, I don't think they tell you anything about the effect of migration. So I'd be inclined to drop nonmovers and replace your lags and leads with something that indicates the number of years before or number of years after the move.

          Beyond that, your outcome would perhaps be better modeled with an ordered logit or ordered probit model.



          • #6
            Originally posted by William Lisowski View Post
            So I'd be inclined to drop non-movers and replace your lags and leads with something that indicates the number of years before or number of years after the move.
            Thank you for the suggestion. I am trying it with movers only. Is there a command I can use to indicate the number of years before and after a move?



            • #7
              The following should point you in a useful direction.
              Code:
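              * Fake time index (the sample data has no wave or year variable)
              sort pid, stable
              by pid: generate time = _n
              * Period in which the move occurred (missing for non-movers)
              bysort pid (time): egen movetime = max(cond(M0==1,time,.))
              * Years relative to the move: negative before, positive after
              generate years = time-movetime
              list if pid <= 10014578, nolabel sepby(pid)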
              Code:
              . list if pid <= 10014578, nolabel sepby(pid)
              
                   +----------------------------------------------------------------------------------+
                   |      pid   lfsato   age   nkids   mastat   movest   M0   time   movetime   years |
                   |----------------------------------------------------------------------------------|
                1. | 10007857        6    64       0        3        4    0      1          .       . |
                2. | 10007857        5    65       0        3        1    0      2          .       . |
                3. | 10007857        5    66       0        3        1    0      3          .       . |
                4. | 10007857        .    67       0        3        1    0      4          .       . |
                   |----------------------------------------------------------------------------------|
                5. | 10014578        5    59       0        1        1    0      1         10      -9 |
                6. | 10014578        7    61       0        1        1    0      2         10      -8 |
                7. | 10014578        7    62       0        1        1    0      3         10      -7 |
                8. | 10014578        6    63       0        1        1    0      4         10      -6 |
                9. | 10014578        6    65       0        1        1    0      5         10      -5 |
               10. | 10014578        6    66       0        1        1    0      6         10      -4 |
               11. | 10014578        5    67       0        1        1    0      7         10      -3 |
               12. | 10014578        5    68       0        1        1    0      8         10      -2 |
               13. | 10014578        6    69       0        1        1    0      9         10      -1 |
               14. | 10014578        7    70       0        1        2    1     10         10       0 |
               15. | 10014578        6    71       0        1        1    0     11         10       1 |
                   +----------------------------------------------------------------------------------+
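
              If your real data contain a wave or year variable, you would use it in place of the fabricated time index, so that gaps between waves show up in years. A rough sketch on invented data follows. The names wave, movewave, and evtime are assumptions of mine, as is the idea that one wave corresponds to one year. It also drops non-movers and shifts years to be nonnegative, since factor variables must be nonnegative integers.

```stata
* Invented data: pid 1 moves in wave 4 and is not observed in wave 3;
* pid 2 never moves
clear
input long pid byte(wave M0)
1 1 0
1 2 0
1 4 1
1 5 0
2 1 0
2 2 0
2 3 0
end

* Wave of the move, missing for non-movers
bysort pid (wave): egen movewave = max(cond(M0 == 1, wave, .))
drop if missing(movewave)

* Years relative to the move; the gap at wave 3 is respected
generate years = wave - movewave

* Shift so the smallest value is 0, as factor variables require
summarize years, meanonly
generate byte evtime = years - r(min)
list, sepby(pid)
```

              The shifted variable could then enter xtreg as i.evtime (with a base level chosen through the ib#. operator) in place of the lags and leads. Whether that is the right specification is for you to judge.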



              • #8
                Thank you, but what if the data are unbalanced (as mine are)? Is there a way to account for this?



                • #9
                  Stata handles unbalanced panel data naturally. The documentation is perhaps a little more matter-of-fact about it than it should be - the only explicit acknowledgement I can find is in Example 2 in the xt chapter in the Stata Longitudinal-Data/Panel-Data Reference Manual PDF accessible from Stata's Help menu.

                  I guess they think concerns about balanced v. unbalanced are passé but they underestimate the inertia of received knowledge.



                  • #10
                    I meant with respect to the method you proposed for indicating the number of years before and after a move.
                    As you can see, waves 7 and 11 are missing, but the corresponding value of 'years' doesn't take this into account. Is there a way to do this?
                    HTML Code:
                           +-------------------------------------------------------------------------+
                           | lfsato        pid   age   wave   M0   time   movetime   years   regyear |
                           |-------------------------------------------------------------------------|
                        1. |      5   10014578    59      6    0      1         10      -9         . |
                        2. |      7   10014578    61      8    0      2         10      -8         . |
                        3. |      7   10014578    62      9    0      3         10      -7         . |
                        4. |      6   10014578    63     10    0      4         10      -6         . |
                        5. |      6   10014578    65     12    0      5         10      -5         . |
                        6. |      6   10014578    66     13    0      6         10      -4         . |
                        7. |      5   10014578    67     14    0      7         10      -3         . |
                        8. |      5   10014578    68     15    0      8         10      -2         . |
                        9. |      6   10014578    69     16    0      9         10      -1         . |
                       10. |      7   10014578    70     17    1     10         10       0         . |
                       11. |      6   10014578    71     18    0     11         10       1         . |
                           +-------------------------------------------------------------------------+



                    • #11
                      The only reason for creating and using the time variable was that your sample data in post #1 did not include any indication of wave, or year, or much of anything from which to make a time variable for your panel, although you had obviously used one in an xtset command to enable time-series lag and lead notation. As I commented in my post #2

                      we'll generate a fake "time" variable because you neglected to include it in your sample data
                      and even in post #8, where you raised the issue of unbalancedness, you did not make it clear what your problem was, nor provide sample data to illustrate what your data actually look like.

                      Having been given the sample code in post #7, your task now is to read it and understand it, using the help files and manuals to figure out how it works, and if those aren't sufficient, to reply back with questions on specific issues that you've been unable to figure out. It's not very difficult code to figure out, especially with all the intermediate results available in the data.

                      Others may have other motivations for responding on Statalist, but my hope is to help other members become better Stata programmers, as others have helped me. Teaching them to fish, as the old adage goes, rather than giving them a fish.



                      • #12
                        Thank you for the help so far.
