Repeated time values in sample.

Georgios Kyrkos

Join Date: Oct 2016
Posts: 12


21 Oct 2016, 05:11

Dear all

I am new to STATA and would like to ask for suggestions about a problem I am facing.
The objective is to predict a score, taking into consideration some past scores (2013-2015). My dataset has the following form:

ID	Date	Score	Region	INK	Gender	BS	PMT	UN	GDP	Growth	Income	POV	INS	CF	PC	BTF
393368	05.06.2013	517	3	0	1	584	12	4	31123	-1	21	16	.10733219	82	1517	1374
393290	05.06.2013	454	5	1	0	352	12	6	34796	0	21	16	.19055245	59	6774	3873
393254	05.06.2013	471	5	0	1	233	12	6	34796	0	21	16	.19055245	59	6774	3873
394099	05.06.2013	459	5	1	1	533	9	6	34796	0	21	16	.19055245	59	6774	3873
393390	05.06.2013	550	9	0	1	536	9	4	40446	1	30	15	.09240682	19	4752	1203
393379	05.06.2013	436	16	0	0	183	12	5	24663	1	24	12	.01918129	3	821	184
393235	05.06.2013	501	7	0	0	323	14	5	31226	0	20	17	.03945251	28	429	449

The same ID may appear again in 2014 or 2015.

However, when I try to define Date as the time variable, I receive the following:

Code:

. tsset D
repeated time values in sample
r(451);

Thank you in advance for your response.

Best Regards

George

Last edited by Georgios Kyrkos; 21 Oct 2016, 05:17.


Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

21 Oct 2016, 05:17

Georgios:
welcome to the list.
The issue there is that you should consider the panel structure of your dataset:

Code:

. use http://www.stata-press.com/data/r14/nlswork.dta
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. tsset year
repeated time values in sample
r(451);

. xtset idcode year
       panel variable:  idcode (unbalanced)
        time variable:  year, 68 to 88, but with gaps
                delta:  1 unit

For the future, please post your examples (or dataset excerpts) via -dataex- (-search dataex-). Thanks.

Kind regards,
Carlo
(Stata 19.0)


Georgios Kyrkos

Join Date: Oct 2016

Posts: 12
#3

21 Oct 2016, 05:55

Dear Mr. Lazzaro

Thank you for your quick response, and I apologize for the wrong post.
So, in order to be structured as a panel, does the id need to appear in all years (2013-2014-2015)?
Do you recommend any STATA document regarding panel data preparation?

Kind Regards

Georgios
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

21 Oct 2016, 06:05

Start with

Code:

help xt

and follow its links.
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#5

21 Oct 2016, 06:06

Originally posted by Georgios Kyrkos

So, in order to be structured as a panel, does the id need to appear in all years (2013-2014-2015)?

Yes, this is the meaning of panel data: the same ids are observed several times (here in 2013, 2014 and 2015).
To declare your panel you will have to use xtset (or tsset), so I advise you to take a look at the xtset help file (type help xtset). It provides details about panel data preparation.

Best,
Charlie

Edit: Nick was the first to answer. I thought he would have taken a little more time, to make a point about the Stata spelling in #3.

Last edited by Charlie Joyez; 21 Oct 2016, 06:11.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#6

21 Oct 2016, 06:12

To make myself predictable: http://www.statalist.org/forums/help#spelling
Georgios Kyrkos

Join Date: Oct 2016

Posts: 12
#7

21 Oct 2016, 06:16

Thank you all for the response.

Yes, this is the meaning of panel data: the same ids are observed several times (here in 2013, 2014 and 2015).

However, in my dataset this does not happen systematically. The majority of IDs appear in only one year of the period, so it looks more like pooled cross-sectional data.

Best,

George
Nick Cox

Join Date: Mar 2014

Posts: 35724
#8

21 Oct 2016, 06:23

Working backwards, if you want to apply tsset or xtset, then these are the rules:

1. With a time variable only, no time can appear more than once.

2. With a panel variable and a time variable, no (panel, time) pair can appear more than once.

It is not a problem in general that panels may be unbalanced, but if you want to use any commands that require tsset or xtset, you must apply one of those commands first.
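
As a rough sketch of how one might check these rules before calling xtset (the variable names id and D are assumptions here, taken from the xtset call later in the thread):

Code:

* Assumption: id is the panel identifier and D is the daily date variable
* Summarize how many (id, D) pairs occur once, twice, and so on
duplicates report id D

* Flag the offending observations and look at them
duplicates tag id D, generate(dup)
list id D if dup > 0

* isid exits with an error unless id and D jointly identify every observation
isid id D

If isid returns nothing, rule 2 holds and xtset id D should go through.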
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#9

21 Oct 2016, 06:44

As Nick said, having an unbalanced panel (where not all identifiers are observed at each time) might not be a problem in general.

However, in your specific case you state that the majority of individuals are observed in only one year.
In this case, I would warn you that using fixed effects after having set the panel structure would effectively remove all of these singly observed individuals from the estimation.

I don't know whether you plan to use fixed effects, but beware of the potential bias this could cause (especially if the singly observed individuals are not randomly distributed).
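
A minimal sketch of how to see how many individuals are observed only once, assuming the panel identifier is called id (the names n_obs and first are made up for this illustration):

Code:

* Assumption: id is the panel identifier
* Number of observations recorded for each id
bysort id: generate n_obs = _N

* Tag one observation per id so that each individual is counted once
egen byte first = tag(id)

* Distribution of observations per individual; the row n_obs == 1 is the group
* that contributes no within-individual variation to a fixed-effects estimate
tabulate n_obs if first

Once the data are xtset, xtdescribe reports similar information on the participation patterns.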
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#10

21 Oct 2016, 06:50

Georgios:
in the same fashion, beware of tailoring your dataset in order to obtain (if feasible) a balanced panel.
That approach would seriously bias your regression results, because you would be dealing with a sample whose relationship with the original one is tenuous at best (especially if missingness is, as often occurs, informative).

Kind regards,
Carlo
(Stata 19.0)
Georgios Kyrkos

Join Date: Oct 2016

Posts: 12
#11

23 Oct 2016, 09:35

Dear all

Thank you very much for the feedback. It took me some time to comprehend those terms, since I am not familiar with panel datasets. However, since my dataset is dominated by unique observations per year, I removed the duplicated observations, and the panel is weakly unbalanced (13 530 out of 170 797 observations were dropped).
Now I have a unique ID for each date, and N>T. Is this going to bias my regression?

Charlie, you mentioned fixed effects. I am planning to use random effects instead of fixed effects.

Thank you and kind regards

Georgios
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#12

23 Oct 2016, 11:13

Georgios:
the bias may rest on the fact that you removed all the second entries for the ids with two observations per year, unless you had methodologically sound reasons to do so (that is, two entries per year for the same id were mistakenly entered).
Assuming that you had them, you are now probably dealing with a large N, small T panel dataset.
In that instance -xtreg- is the way to go, and the outcome of the -hausman- specification test should put you on the right track as to whether the -fe- or -re- specification fits your data appropriately.
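
The usual sequence can be sketched on the nlswork data used earlier in this thread. The choice of regressors below is purely illustrative and is not a recommendation for your model:

Code:

* Illustration on the nlswork example data; the regressors are picked arbitrarily
webuse nlswork, clear
xtset idcode year

* Fixed-effects fit, stored under the name fe
xtreg ln_wage age tenure ttl_exp, fe
estimates store fe

* Random-effects fit, stored under the name re
xtreg ln_wage age tenure ttl_exp, re
estimates store re

* Hausman test: the consistent estimator (fe) goes first, the efficient one (re) second
hausman fe re

Note that -hausman- needs both models fitted with the default standard errors, so vce(robust) or vce(cluster) should be left out of the two -xtreg- calls used for the test.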

Kind regards,
Carlo
(Stata 19.0)
Georgios Kyrkos

Join Date: Oct 2016

Posts: 12
#13

23 Oct 2016, 13:00

Mr. Lazzaro:
Actually, the replicated transactions were just the same customer buying twice on the same day: same score, region and payment method, but a different amount. So I suppose it is a bug in the system of the company I study. To be more precise, I can either merge or remove the duplicated observations.
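
For illustration only, the two options could be sketched as follows. They are alternatives, so only one of them would be run. The variable name amount is hypothetical (it stands for whichever column holds the purchase amount), and id and D are the panel and date variables:

Code:

* Option 1 (remove): keep one observation per (id, D) pair.
* force is required because the dropped rows differ in other variables (the amount).
duplicates drop id D, force

* Option 2 (merge): add up the amounts within each (id, D) pair and keep the
* first value of the variables that are identical across the duplicates.
* amount is a hypothetical variable name.
collapse (sum) amount (first) SC REG PMT, by(id D)

Keep in mind that collapse drops every variable that is not listed, so the remaining covariates would have to be added to the (first) list as well.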

Regarding the panel data set:

Code:

xtset id D
       panel variable:  id (unbalanced)
        time variable:  D, 5/6/2013 to 12/30/2015
                delta:  1 day

Then, with fixed effects, all variables are omitted because of collinearity. I also tried without the dummy variables. Command used:

Code:

xtreg SC REG INK GEN BS PMT UN GDP GR INC POV INS CF PC BTF, fe vce(robust)

For the Random Effects

Code:

xtreg SC REG INK GEN BS PMT UN GDP GR INC POV INS CF PC BTF, re
insufficient observations
r(2001);

However the Between regression works.

I know that for the Hausman specification test I need the results from both RE and FE.

Kind Regards

Georgios

Last edited by Georgios Kyrkos; 23 Oct 2016, 13:02.

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

#14

24 Oct 2016, 00:18

Georgios:
thanks for providing further details.
As per your reply, it seems that you have only one observation per panel_id.
If that is the case, it is not surprising that -xtreg- gives back results with the -be- specification only.
On top of that, you may find that the results you got with -xtreg, be- are the same as those you would obtain with -regress- (with only one wave of data, each group mean is the observation itself, so the between regression reduces to OLS).
A toy example supports what is stated above:

Code:

. sysuse auto.dta
(1978 Automobile Data)

. g year=1

. g panel_id=_n

. xtset panel_id year
       panel variable:  panel_id (strongly balanced)
        time variable:  year, 1 to 1
                delta:  1 unit

. xtreg price mpg i.rep78, fe
note: mpg omitted because of collinearity
note: 2.rep78 omitted because of collinearity
note: 3.rep78 omitted because of collinearity
note: 4.rep78 omitted because of collinearity
note: 5.rep78 omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =         69
Group variable: panel_id                        Number of groups  =         69

R-sq:                                           Obs per group:
     within  =      .                                         min =          1
     between =      .                                         avg =        1.0
     overall =      .                                         max =          1

                                                F(0,0)            =       0.00
corr(u_i, Xb)  =      .                         Prob > F          =          .

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |          0  (omitted)
             |
       rep78 |
          2  |          0  (omitted)
          3  |          0  (omitted)
          4  |          0  (omitted)
          5  |          0  (omitted)
             |
       _cons |   6146.043          .        .       .            .           .
-------------+----------------------------------------------------------------
     sigma_u |  2912.4403
     sigma_e |          .
         rho |          .   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(68, 0) = .                          Prob > F =      .

. xtreg price mpg i.rep78, re
insufficient observations
r(2001);

. xtreg price mpg i.rep78, be

Between regression (regression on group means)  Number of obs     =         69
Group variable: panel_id                        Number of groups  =         69

R-sq:                                           Obs per group:
     within  =      .                                         min =          1
     between = 0.2584                                         avg =        1.0
     overall = 0.2584                                         max =          1

                                                F(5,63)           =       4.39
sd(u_i + avg(e_i.))=  2605.782                  Prob > F          =     0.0017

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -280.2615   61.57666    -4.55   0.000    -403.3126   -157.2103
             |
       rep78 |
          2  |   877.6347   2063.285     0.43   0.672     -3245.51     5000.78
          3  |   1425.657   1905.438     0.75   0.457    -2382.057    5233.371
          4  |   1693.841   1942.669     0.87   0.387    -2188.274    5575.956
          5  |   3131.982   2041.049     1.53   0.130    -946.7282    7210.693
             |
       _cons |   10449.99   2251.041     4.64   0.000     5951.646    14948.34
------------------------------------------------------------------------------

. reg price mpg i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      4.39
       Model |   149020603         5  29804120.7   Prob > F        =    0.0017
    Residual |   427776355        63  6790100.88   R-squared       =    0.2584
-------------+----------------------------------   Adj R-squared   =    0.1995
       Total |   576796959        68  8482308.22   Root MSE        =    2605.8

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -280.2615   61.57666    -4.55   0.000    -403.3126   -157.2103
             |
       rep78 |
          2  |   877.6347   2063.285     0.43   0.672     -3245.51     5000.78
          3  |   1425.657   1905.438     0.75   0.457    -2382.057    5233.371
          4  |   1693.841   1942.669     0.87   0.387    -2188.274    5575.956
          5  |   3131.982   2041.049     1.53   0.130    -946.7282    7210.693
             |
       _cons |   10449.99   2251.041     4.64   0.000     5951.646    14948.34
------------------------------------------------------------------------------

.

Kind regards,
Carlo
(Stata 19.0)


Georgios Kyrkos

Join Date: Oct 2016
Posts: 12

#15

24 Oct 2016, 02:13

Dear Mr. Lazzaro

Exactly. My results are exactly the same (-xtreg, be- vs. -regress-). I suppose this has to do with my data structure (time variable and panel_id).

Code:

  xtreg SC REG INK GEN BS PMT UN GDP GR INC POV INS CF PC BTF, be

Between regression (regression on group means)  Number of obs     =    328,280
Group variable: id                              Number of groups  =    328,280

R-sq:                                           Obs per group:
     within  =      .                                         min =          1
     between = 0.0869                                         avg =        1.0
     overall = 0.0869                                         max =          1

                                                F(14,328265)      =    2230.94
sd(u_i + avg(e_i.))=  52.52907                  Prob > F          =     0.0000

------------------------------------------------------------------------------
          SC |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         REG |   .0902898   .0517752     1.74   0.081    -.0111881    .1917677
         INK |  -44.00994   .3700228  -118.94   0.000    -44.73517    -43.2847
         GEN |   1.334461   .1836945     7.26   0.000      .974425    1.694497
          BS |  -.0082948   .0005337   -15.54   0.000    -.0093408   -.0072488
         PMT |  -2.769994   .0318891   -86.86   0.000    -2.832496   -2.707492
          UN |  -3.348294   .0672922   -49.76   0.000    -3.480184   -3.216403
         GDP |  -.0010511   .0000289   -36.35   0.000    -.0011077   -.0009944
          GR |   3.438825   .1324835    25.96   0.000     3.179161    3.698489
         INC |   .0807456   .0404129     2.00   0.046     .0015375    .1599537
         POV |   2.186218   .1657191    13.19   0.000     1.861413    2.511023
         INS |    -71.812     5.2909   -13.57   0.000    -82.18201   -61.44199
          CF |   .0780609   .0064883    12.03   0.000      .065344    .0907778
          PC |   .0013398   .0000871    15.39   0.000     .0011692    .0015105
         BTF |  -.0011731   .0000736   -15.94   0.000    -.0013174   -.0010288
       _cons |   591.0429   2.570625   229.92   0.000     586.0046    596.0813
------------------------------------------------------------------------------

.

Kind Regards

Georgios Kyrkos
