
  • What does xtset do exactly?

    Hi Statalisters

    If I declare a dataset to be a panel dataset using xtset, what exactly is Stata doing? For example, I always thought that if I say:

    logit y x1 x2 i.period i.person

    then it should be the same as

    xtset period person
    xtlogit y x1 x2

    Meaning, I thought setting up a panel with xtset is the same as controlling for each person and each period in your dataset.
    But I just ran both versions on the same dataset and I get different results, now I am not sure what xtset actually does.

    Any advice would be greatly appreciated.

    Thanks

  • #2
    On the question of what xtset does here, essentially it declares to Stata that you have panel data with particular panel identifier and time variables. And it makes various checks that you can fairly say that. In particular, xtset won't work if you have duplicate observations for (identifier, time) pairs. Once xtset is successful xtlogit is one of various possible commands, but its default is not at all the same as the logit model you tried.
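
    A minimal sketch of the point above, assuming Stata's shipped example dataset union (loaded with webuse; it is not the poster's data). As far as I know, xtset takes the panel identifier first and the time variable second, and xtlogit with no options fits a random-effects model, so it is not expected to reproduce a logit with indicator variables.

    ```stata
    * Assumption: the union example dataset stands in for the poster's data
    webuse union, clear

    * isid exits with an error if (idcode, year) pairs are not unique
    isid idcode year

    * Panel identifier first, time variable second
    xtset idcode year

    * With no options, xtlogit fits a random-effects model (same as adding -re-)
    xtlogit union age grade

    * Pooled logit with year indicators: a different model
    logit union age grade i.year
    ```

    Comparing the two coefficient tables side by side shows that the estimates differ, which is the behaviour the original question ran into.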



    • #3
      Nettie:
      as an aside to Nick's helpful explanation, please note that, outside the linear realm, plugging -i.panelid- into the right-hand side of a non-linear regression won't do the trick it does in the linear case (where -regress- with -i.panelid- and -xtreg,fe- return the very same results as far as the shared coefficients are concerned):
      Code:
      . use "https://www.stata-press.com/data/r17/nlswork.dta"
      (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
      
      . xtreg ln_wage c.age##c.age if idcode<=3, fe
      
      Fixed-effects (within) regression               Number of obs     =         39
      Group variable: idcode                          Number of groups  =          3
      
      R-squared:                                      Obs per group:
           Within  = 0.6382                                         min =         12
           Between = 0.8744                                         avg =       13.0
           Overall = 0.2765                                         max =         15
      
                                                      F(2,34)           =      29.99
      corr(u_i, Xb) = -0.2473                         Prob > F          =     0.0000
      
      ------------------------------------------------------------------------------
           ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
               age |   .2512762   .0450106     5.58   0.000     .1598037    .3427487
                   |
       c.age#c.age |  -.0037603   .0007625    -4.93   0.000    -.0053098   -.0022107
                   |
             _cons |  -2.189815   .6402959    -3.42   0.002    -3.491053   -.8885773
      -------------+----------------------------------------------------------------
           sigma_u |  .31366066
           sigma_e |  .19867104
               rho |  .71367959   (fraction of variance due to u_i)
      ------------------------------------------------------------------------------
      F test that all u_i=0: F(2, 34) = 29.72                      Prob > F = 0.0000
      
      
      . regress ln_wage c.age##c.age i.idcode if idcode<=3
      
            Source |       SS           df       MS      Number of obs   =        39
      -------------+----------------------------------   F(4, 34)        =     24.28
             Model |  3.83375281         4  .958438203   Prob > F        =    0.0000
          Residual |  1.34198615        34  .039470181   R-squared       =    0.7407
      -------------+----------------------------------   Adj R-squared   =    0.7102
             Total |  5.17573896        38  .136203657   Root MSE        =    .19867
      
      ------------------------------------------------------------------------------
           ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
               age |   .2512762   .0450106     5.58   0.000     .1598037    .3427487
                   |
       c.age#c.age |  -.0037603   .0007625    -4.93   0.000    -.0053098   -.0022107
                   |
            idcode |
                2  |  -.4231615   .0816747    -5.18   0.000    -.5891444   -.2571786
                3  |  -.6126416   .0809386    -7.57   0.000    -.7771285   -.4481546
                   |
             _cons |   -1.82398   .6366167    -2.87   0.007    -3.117741   -.5302195
      ------------------------------------------------------------------------------
      
      
      . xtlogit msp c.age##c.age grade if idcode<=3, fe
      note: grade omitted because of collinearity.
      note: multiple positive outcomes within groups encountered.
      note: 1 group (15 obs) omitted because of all positive or
            all negative outcomes.
      
      Iteration 0:   log likelihood = -4.2834822  
      Iteration 1:   log likelihood = -2.6704086  
      Iteration 2:   log likelihood = -1.3716674  
      Iteration 3:   log likelihood = -4.159e-06  
      Iteration 4:   log likelihood = -6.382e-07  
      Iteration 5:   log likelihood = -5.633e-07  
      
      Conditional fixed-effects logistic regression        Number of obs    =     24
      Group variable: idcode                               Number of groups =      2
      
                                                           Obs per group:
                                                                        min =     12
                                                                        avg =   12.0
                                                                        max =     12
      
                                                           LR chi2(2)       =  23.20
      Log likelihood = -5.633e-07                          Prob > chi2      = 0.0000
      
      ------------------------------------------------------------------------------
               msp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
               age |   114.3199   20804.81     0.01   0.996    -40662.36       40891
                   |
       c.age#c.age |   -2.70013   538.7396    -0.01   0.996     -1058.61     1053.21
                   |
             grade |          0  (omitted)
      ------------------------------------------------------------------------------
      
      . logit msp c.age##c.age i.idcode grade if idcode<=3
      
      note: 3.idcode != 0 predicts failure perfectly;
            3.idcode omitted and 15 obs not used.
      
      note: grade omitted because of collinearity.
      Iteration 0:   log likelihood = -16.552102  
      Iteration 1:   log likelihood = -4.4671141  
      Iteration 2:   log likelihood = -3.0383448  
      Iteration 3:   log likelihood = -1.1392775  
      Iteration 4:   log likelihood =          0  
      Iteration 5:   log likelihood =          0  
      
      Logistic regression                                     Number of obs =     24
                                                              LR chi2(-1)   =  33.10
                                                              Prob > chi2   =      .
      Log likelihood = 0                                      Pseudo R2     = 1.0000
      
      ------------------------------------------------------------------------------
               msp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
               age |   1460.447          .        .       .            .           .
                   |
       c.age#c.age |  -33.81325          .        .       .            .           .
                   |
            idcode |
                2  |   2938.854          .        .       .            .           .
                3  |          0  (empty)
                   |
             grade |          0  (omitted)
             _cons |   -15522.3          .        .       .            .           .
      ------------------------------------------------------------------------------
      Note: 11 failures and 13 successes completely determined.
      
      .
      As a sidelight, please note that -xtlogit,fe- means conditional fixed effects (incidental parameter bias, you know: http://www.econ.brown.edu/Faculty/To...meters1948.pdf), which is totally different from the -xtreg- -fe- specification.
      Kind regards,
      Carlo
      (Stata 18.0 SE)
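
      A small sketch of what "conditional fixed effects" means in practice, assuming the shipped union example dataset (not used in the post above) and an arbitrary choice of regressors. To my understanding, -xtlogit, fe- maximizes the same conditional likelihood as -clogit- grouped on the panel identifier, so the panel intercepts are conditioned out rather than estimated as indicator coefficients.

      ```stata
      * Assumption: union example dataset; age and not_smsa are illustrative regressors
      webuse union, clear
      xtset idcode year

      * Conditional fixed-effects logit via the xt command
      xtlogit union age not_smsa, fe
      matrix b_xt = e(b)

      * Same conditional likelihood, grouped on the panel identifier
      clogit union age not_smsa, group(idcode)
      matrix b_cl = e(b)

      * Largest relative difference between the two coefficient vectors
      display mreldif(b_xt, b_cl)
      ```

      A value close to zero from the last line indicates the two commands agree, whereas -logit- with -i.idcode- would estimate one intercept per panel and is subject to the incidental parameters problem.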



      • #4
        xtset and tsset are educators, and by God, you shall learn! Let me explain. For an interview, I was asked to clean some data and produce visualizations of it. I won't say where since I don't work there yet, but all they wanted was some visualizations of time series data. Not bad, right? Well, tsset is sort of an educator here. How? Assuming we have proper time series data, we will have one unique time period for each row, yes? Well, nope!
        Code:
        webuse grunfeld, clear
        
        keep if company == 1
        
        replace year = 1936 in 1
        
        
        tsset company year, y
        
        br
        Before I tsset my data, I checked if my data structure was correct. I did
        Code:
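        capture isid datetime
        
        /* Checks for unique time IDs. If we have two 1 PMs, for example,
        then it is likely that data collection errors or other
        mismanagement happened. */
        
        if _rc {
            tempvar dup
            
            // dup counts the surplus copies of each datetime value (0 = unique)
            duplicates tag datetime, generate(`dup')
            
            // > 0 also catches values that occur three or more times
            list datetime MW if `dup' > 0
            
            // gcollapse comes from the community-contributed gtools package
            gcollapse (mean) MW, by(datetime)
            
            label variable MW "MW/hr"
        }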
        That was to see whether I had unique time series IDs. Otherwise, if you just go in guns blazing, as one does, you won't realize you have duplicate time observations, and you'll discover your error long after you should have looked for it. It turns out I did in fact have 4 duplicate observations for my time variable, buried in the nethers of a 45k-observation dataset. In other words, stats aside, xtset and tsset put part of your data integrity to the test. If there are gaps, missing observations, repeated time values within a panel, or repeated time values in a time series, these declarations will help save you from yourself. Once these gaps and so on are discovered, you must check whether they're legitimate and figure out what to do with them. So in my opinion, it's always a good idea to declare your data, even if you don't ultimately use any sexy time series or panel commands, since the declaration will find with syntax what human eyes will not in a lake of data.
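
        The check described above can be sketched as follows, again on the grunfeld example data with a duplicate introduced on purpose. Using capture noisily and the duplicates commands here is one possible workflow, not the only one.

        ```stata
        webuse grunfeld, clear

        * Introduce a repeated (company, year) pair on purpose
        replace year = 1936 in 1

        * capture noisily shows the error message but lets the do-file continue
        capture noisily xtset company year
        display "xtset return code: " _rc

        * Locate the offending rows before deciding how to handle them
        duplicates report company year
        duplicates list company year
        ```

        A nonzero return code means the declaration failed, and the duplicates output points to the rows that need a decision (drop, average, or correct the time value).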
