What does xtset do exactly?

Nettie vd Merwe

Join Date: Mar 2020

Posts: 19
#1

18 Dec 2022, 00:46

Hi Statalisters

If I declare a dataset to be panel data using xtset, what exactly is Stata doing? For example, I always thought that if I say:

logit y x1 x2 i.period i.person

then it should be the same as

xtset person period
xtlogit y x1 x2

Meaning, I thought setting up a panel with xtset is the same as controlling for each person and each period in your dataset.
But I just ran both versions on the same dataset and got different results, so now I am not sure what xtset actually does.

Any advice would be greatly appreciated.

Thanks

Nick Cox

Join Date: Mar 2014

Posts: 35754
#2

18 Dec 2022, 03:41

On the question of what xtset does here: essentially, it declares to Stata that you have panel data with a particular panel identifier and time variable, and it checks that the declaration is defensible. In particular, xtset won't work if you have duplicate observations for (identifier, time) pairs. Once xtset succeeds, xtlogit is one of various possible commands, but its default (a random-effects model) is not at all the same as the logit model you tried.
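A minimal sketch of that declaration and its duplicate check, assuming the nlswork example data that ships with Stata. The choice of dataset and the deliberately duplicated row are illustrative assumptions, not part of the original question.

```stata
webuse nlswork, clear

* Declare the panel identifier first, then the time variable
xtset idcode year

* Summarize the participation pattern of the declared panel
xtdescribe

* Duplicate one observation on purpose: xtset should now refuse
expand 2 in 1
capture noisily xtset idcode year

* A nonzero return code signals that the declaration failed
display _rc

* Count how many (idcode, year) pairs are repeated
duplicates report idcode year
```

Note the argument order: the panel identifier comes before the time variable.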

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17724
#3

18 Dec 2022, 04:04

Nettie:
as an aside to Nick's helpful explanation, please note that, outside the linear realm, adding -i.panelid- to the right-hand side of a non-linear regression won't do the expected trick. In the linear case, -regress- with -i.panelid- and -xtreg, fe- return the very same results for the shared coefficients; the non-linear pair, -logit- with -i.panelid- and -xtlogit, fe-, does not:

Code:

. use "https://www.stata-press.com/data/r17/nlswork.dta"
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)

. xtreg ln_wage c.age##c.age if idcode<=3, fe

Fixed-effects (within) regression               Number of obs     =         39
Group variable: idcode                          Number of groups  =          3

R-squared:                                      Obs per group:
     Within  = 0.6382                                         min =         12
     Between = 0.8744                                         avg =       13.0
     Overall = 0.2765                                         max =         15

                                                F(2,34)           =      29.99
corr(u_i, Xb) = -0.2473                         Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .2512762   .0450106     5.58   0.000     .1598037    .3427487
             |
 c.age#c.age |  -.0037603   .0007625    -4.93   0.000    -.0053098   -.0022107
             |
       _cons |  -2.189815   .6402959    -3.42   0.002    -3.491053   -.8885773
-------------+----------------------------------------------------------------
     sigma_u |  .31366066
     sigma_e |  .19867104
         rho |  .71367959   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(2, 34) = 29.72                      Prob > F = 0.0000


. regress ln_wage c.age##c.age i.idcode if idcode<=3

      Source |       SS           df       MS      Number of obs   =        39
-------------+----------------------------------   F(4, 34)        =     24.28
       Model |  3.83375281         4  .958438203   Prob > F        =    0.0000
    Residual |  1.34198615        34  .039470181   R-squared       =    0.7407
-------------+----------------------------------   Adj R-squared   =    0.7102
       Total |  5.17573896        38  .136203657   Root MSE        =    .19867

------------------------------------------------------------------------------
     ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .2512762   .0450106     5.58   0.000     .1598037    .3427487
             |
 c.age#c.age |  -.0037603   .0007625    -4.93   0.000    -.0053098   -.0022107
             |
      idcode |
          2  |  -.4231615   .0816747    -5.18   0.000    -.5891444   -.2571786
          3  |  -.6126416   .0809386    -7.57   0.000    -.7771285   -.4481546
             |
       _cons |   -1.82398   .6366167    -2.87   0.007    -3.117741   -.5302195
------------------------------------------------------------------------------


. xtlogit msp c.age##c.age grade if idcode<=3, fe
note: grade omitted because of collinearity.
note: multiple positive outcomes within groups encountered.
note: 1 group (15 obs) omitted because of all positive or
      all negative outcomes.

Iteration 0:   log likelihood = -4.2834822  
Iteration 1:   log likelihood = -2.6704086  
Iteration 2:   log likelihood = -1.3716674  
Iteration 3:   log likelihood = -4.159e-06  
Iteration 4:   log likelihood = -6.382e-07  
Iteration 5:   log likelihood = -5.633e-07  

Conditional fixed-effects logistic regression        Number of obs    =     24
Group variable: idcode                               Number of groups =      2

                                                     Obs per group:
                                                                  min =     12
                                                                  avg =   12.0
                                                                  max =     12

                                                     LR chi2(2)       =  23.20
Log likelihood = -5.633e-07                          Prob > chi2      = 0.0000

------------------------------------------------------------------------------
         msp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   114.3199   20804.81     0.01   0.996    -40662.36       40891
             |
 c.age#c.age |   -2.70013   538.7396    -0.01   0.996     -1058.61     1053.21
             |
       grade |          0  (omitted)
------------------------------------------------------------------------------

. logit msp c.age##c.age i.idcode grade if idcode<=3

note: 3.idcode != 0 predicts failure perfectly;
      3.idcode omitted and 15 obs not used.

note: grade omitted because of collinearity.
Iteration 0:   log likelihood = -16.552102  
Iteration 1:   log likelihood = -4.4671141  
Iteration 2:   log likelihood = -3.0383448  
Iteration 3:   log likelihood = -1.1392775  
Iteration 4:   log likelihood =          0  
Iteration 5:   log likelihood =          0  

Logistic regression                                     Number of obs =     24
                                                        LR chi2(-1)   =  33.10
                                                        Prob > chi2   =      .
Log likelihood = 0                                      Pseudo R2     = 1.0000

------------------------------------------------------------------------------
         msp | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   1460.447          .        .       .            .           .
             |
 c.age#c.age |  -33.81325          .        .       .            .           .
             |
      idcode |
          2  |   2938.854          .        .       .            .           .
          3  |          0  (empty)
             |
       grade |          0  (omitted)
       _cons |   -15522.3          .        .       .            .           .
------------------------------------------------------------------------------
Note: 11 failures and 13 successes completely determined.

.

As a sidelight, please note that -xtlogit, fe- fits a conditional fixed-effects model (because of incidental parameter bias, you know: http://www.econ.brown.edu/Faculty/To...meters1948.pdf), which is totally different from the -xtreg, fe- specification.
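To make the contrast with the original question concrete, here is a hedged sketch that fits the three models side by side. It assumes the union example dataset that ships with Stata; the dataset and the covariates are illustrative choices, not the original poster's data.

```stata
webuse union, clear
xtset idcode year

* Default of xtlogit: the random-effects estimator
xtlogit union age grade

* Conditional fixed effects: panels with no variation in union drop out
xtlogit union age grade, fe

* Pooled logit with year dummies only; adding i.idcode as well would
* estimate one incidental parameter per person
logit union age grade i.year
```

The three sets of coefficients need not agree, because each command fits a different model.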

Kind regards,
Carlo
(Stata 19.0)


Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#4

18 Dec 2022, 13:32

xtset and tsset are educators, and by God, you shall learn! Let me explain. For an interview, I was asked to clean some data and produce visualizations of it. I won't say where since I don't work there yet, but all they wanted was some visualizations of time series data. Not bad, right? Well, tsset is sort of an educator here. How? Assuming we have proper time series data, we will have one unique time period for each row, yes? Well, nope!

Code:

webuse grunfeld, clear
keep if company == 1
* Deliberately create a duplicate year
replace year = 1936 in 1
* tsset now complains about the repeated time value
tsset company year, y
br

Before I tsset my data, I checked whether my data structure was correct. I did

Code:

capture isid datetime
* Checks for unique time IDs. If we have two 1PMs, for example, then it is
* likely that data collection errors or other mismanagement happened.
if _rc {
    tempvar dup
    duplicates tag datetime, generate(`dup')
    list datetime MW if `dup' > 0
    * gcollapse is from the community-contributed gtools package
    gcollapse (mean) MW, by(datetime)
    label var MW "MW/hr"
}

to see if I had unique time series IDs. Otherwise, if you just go in guns blazing, as one does, you won't realize you have duplicate time observations, and you'll discover your error long after you should have looked for it. It turned out that I did in fact have 4 duplicate observations for my time variable, buried in the depths of a 45k-observation dataset. In other words, stats aside, xtset and tsset put part of your data integrity to the test. If there are gaps, missing observations, repeated time values within a panel, or repeated time values in a time series, these declarations will help save you from yourself. Once these gaps and so on are discovered, you must see whether they're legitimate and figure out what you'll do with them. So in my opinion, it's always a good idea to declare your data, even if you don't ultimately use any sexy time series/panel commands, since the declaration will find with syntax what human eyes will not in a lake of data.

Shivam Dandgavhal

Join Date: Sep 2024

Posts: 5
#5

03 Oct 2024, 12:55

Originally posted by Nick Cox (#2)

Dear Sir,
I am working on a matched sample, matched with replacement, in which the same control observations are matched to different observations in my treatment group. I am bound to have duplicates in my sample because of the 1:many matching procedure. How do I declare xtset in such a case? If I delete the duplicates, I will have fewer control observations than treated observations. Any suggestions are welcome. Thanks.

Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#6

03 Oct 2024, 13:10

Your post-matching data set must contain a variable, created during the matching, that identifies each treated case and all of the controls that matched to it. Such a variable might be named tuple_num, or something like that.

If the original pre-matching data was panel data, then this gets complicated because you really don't have panel data any more: you are now working with 3-level data, for which there is no really satisfactory fixed-effects analysis, but, of course, random-effects analyses have important limitations. Probably the most commonly used, least bad, compromise in this situation is to -xtset- the original panel variable, with no time variable, and cluster the standard errors in your -xt- calculations at the matched-tuple level.

If the pre-matching data was not itself panel data, however, then you are still at 2-level data. In this case, you -xtset- your data with the matched-tuple variable as the "panel" variable. Again, do not attempt to specify a time variable in the -xtset- command.
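A small sketch of the 2-level case, using made-up data: tuple_num follows the naming suggested above, while treated, y, and x are hypothetical variables invented for illustration.

```stata
clear
input tuple_num treated y x
1 1 2.1 0.5
1 0 1.7 0.3
1 0 1.9 0.4
2 1 3.0 0.9
2 0 2.2 0.6
2 0 2.4 0.8
3 1 2.6 0.7
3 0 2.0 0.2
3 0 2.3 0.5
end

* Panel variable only: no time variable is specified
xtset tuple_num

* A within-tuple (fixed-effects) regression runs as usual
xtreg y treated x, fe

* Time-series operators need a time variable, so this fails
capture noisily generate lag_x = L.x
display _rc
```

The nonzero return code from the last step reflects the missing time variable; everything before it works without one.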

So what do you lose by not specifying a time variable? Well, without a time variable, you cannot do analyses involving lags and leads (nor those that use them indirectly such as autoregressive structure in residuals). But anything else that you would ordinarily use -xt- commands for will work just fine with no time variable.

Shivam Dandgavhal

Join Date: Sep 2024

Posts: 5
#7

03 Oct 2024, 13:29

@Clyde Schechter Thank you, sir, for your reply. I am truly grateful.

These are the matching variables generated after executing the -mahapick- command in Stata:

_prime_id   FirmIDYearIndustry   _matchnum   _score
71          71                   0           0
71          25935                1           1.983381

My pre-matching data was panel data. I used Mahalanobis matching to create my matched sample. Before matching, I grouped FirmID, FinancialYear and TwoDigitIndustryID to create the FirmIDYearIndustry variable, thinking that it would help me match a treatment firm with a control firm belonging to the same industry and time period. I didn't really understand what you mean by 3-level data. Why won't fixed effects work, and what limitations does random-effects analysis have in such a case? If you could point me to any reading, I would be grateful. Also, when you say I should cluster the standard errors at the matched-tuple level, do you mean the FirmIDYearIndustry variable in this case, or the original FirmID from the pre-matching phase? Taking lags is important for me to account for reverse causality in my regression models. Hence, I am really worried about how to approach this dilemma. Any help is appreciated. Thanks.

Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#8

03 Oct 2024, 14:43

Simple panel data is two-level data. You have repeated observations at different time points nested within firms. Actually, in your case your data was already 3-level, because your firms are then further nested within industries. Putting aside the matching, how would you have analyzed this data without matching? Most likely you would have used -xtset firm year- and then -vce(cluster industry)-. That is the standard patch for the problems of 3-level data.

The problems of three level data are as follows. You can't have fixed effects for both firm and industry, because industry is invariant over time within firm, so industry will get dropped in any fixed effects analysis that attempts to include both. Models that include random effects are not consistent unless the random effects are independent of all of the model's predictor variables--which is why they are little used for observational studies in economics and finance. In those disciplines, the usual solution to the dilemma posed by 3-level data is to use fixed effects at the firm level and then cluster standard errors at the industry level. This approach, while it fails to fully reflect the true three-dimensional structure of the data, at least acknowledges, through clustering at the industry level, that the observations of different firms cannot be considered independent, as they may share common attributes from being in the same industry.
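The following sketch illustrates both points on simulated data. Every variable name and number here (firm, industry, year, x, y, 40 firms in 8 industries) is a hypothetical chosen for illustration only.

```stata
clear
set seed 12345

* 40 hypothetical firms, 5 per industry, 6 years each
set obs 40
generate firm = _n
generate industry = ceil(firm/5)
expand 6
bysort firm: generate year = 2000 + _n

generate x = rnormal()
generate y = 0.5*x + industry/4 + rnormal()

xtset firm year

* Firm fixed effects with standard errors clustered by industry
xtreg y x, fe vce(cluster industry)

* Industry is constant within firm, so its dummies are omitted
xtreg y x i.industry, fe
```

In the second regression Stata reports the industry indicators as omitted, which is the "industry will get dropped" behavior described above.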

Now, I find myself confused by your description of the matching procedure. If you truly matched on a FirmYearIndustry combination variable, you would only be able to match any observation in your data set with itself! Given that the original data was panel data, meaning one observation per year for each firm, and given that each firm belongs only in a single industry, any two observations that agree on all three of firm, year, and industry, would have to be the same observation. So I don't know what you actually did.

I'm guessing that what you actually did was match on the combination of Industry and Year. If that's the case, then you have a different data structure altogether: it is no longer a simple nested design. Rather, firms are crossed with years, firm-years are nested within matched tuples, and the matched tuples are nested in industry-years. I would handle this by using the matched tuple variable (_prime_id) as the "panel" variable for -xtset-, with no time variable. And then I would cluster on industry-years.

Priscah Kyalo

Join Date: Oct 2024

Posts: 6
#9

11 Oct 2024, 06:44

@Clyde Schechter I'm also having a similar challenge and would like to request your help. I'm looking at what drives foreign banks to the host country, and I definitely have duplicate combinations of time and C_ID, with variation in banks. Basically, I have Country_ID, Year, and bank. The banks are from different parent countries, so I would like to maintain country, time, and bank fixed effects. The bank variable is a string. How do I -xtset- my panel data, yet maintain fixed effects for the three variables? I've tried to group C_ID and Bank, but I'm getting an error:

Code:

gen id = group(C_ID Bank)
C_IDBank not found
r(111);

Kindly guide.

Nick Cox

Join Date: Mar 2014

Posts: 35754
#10

11 Oct 2024, 06:57

I don't know what would be a good plan for you but your immediate problem is clear.

There is an undocumented group() function that works with generate but it is never what anyone wants in this context.

The immediate problem is that you need egen not generate.

Code:

egen id = group(C_ID Bank)
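As a hedged illustration on made-up data (the six rows below are invented, and declaring the grouped identifier with Year is an assumption about the intended next step, not a recommendation on model choice):

```stata
clear
input C_ID str10 Bank Year
1 "Alpha" 2019
1 "Alpha" 2020
1 "Beta"  2019
1 "Beta"  2020
2 "Alpha" 2019
2 "Alpha" 2020
end

* egen's group() accepts numeric and string variables together
egen id = group(C_ID Bank)

* Each (country, bank) pair is now one panel, observed once per year
xtset id Year
```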

If you look back at where you saw the advice, I am confident that it will show use of egen.

Priscah Kyalo

Join Date: Oct 2024

Posts: 6
#11

11 Oct 2024, 07:46

Many thanks @ Nick Cox. egen has worked very well.
