
  • Duplicates in panel settings

    A question that has come up on the forum before.
    I am trying to declare my dataset to be a panel and I am getting an error:
    Code:
    xtset ID ts
    repeated time values within panel
    ID stands for the country identifier and ts for the year.

    That means I have duplicates. I am aware that duplicate ts values occur in some years; this comes from the nature of the dataset, and I want the dataset to remain as it is. Is there any way to solve this while keeping the multiple observations in the years where they occur?
    I searched the forum before posting but did not find a suitable solution. From some old posts, I think I might have to create a new time variable, but I did not understand how.



  • #2
    Mario:
    if duplicates are actually a matter of fact and you cannot or do not want to get rid of them, you can simply -xtset- your dataset with -panelid- only:
    Code:
    xtset ID
    This will work provided that you do not plan to use time-series related commands such as lags and leads.
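
    For instance, an illustrative sketch only, borrowing the variable names pm and gv from the example data posted later in the thread:
    Code:
    * declare the panel with the identifier only; no time variable
    xtset ID
    * panel estimators that do not need the time dimension still run
    xtreg pm gv, fe
    * lag and lead operators are NOT available here, e.g. -generate lag_pm = L.pm-
    * would fail because no time variable has been declared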
    Last edited by Carlo Lazzaro; 15 Feb 2020, 05:26.
    Kind regards,
    Carlo
    (Stata 18.0 SE)



    • #3
      Originally posted by Carlo Lazzaro View Post
      Mario:
      if duplicates are actually a matter of fact and you cannot or do not want to get rid of them, you can simply -xtset- your dataset with -panelid- only:
      Code:
      xtset ID
      This will work provided that you do not plan to use time-series related commands such as lags and leads.
      Well, that is the case. It is a macro panel; I still have to check stationarity for all variables, and I want to cluster by country and year. For some of them, like inflation and exchange rates, I suspect I will have to make the data stationary in about 90 percent of the cases. I will most likely have to use a rolling-windows panel VAR or GMM; I still have to decide on that.

      Some people suggested that I proceed by giving the within-ID observations an identifier of their own:
      Code:
      bys ID ts: gen withinID = _n
      egen newID = group(ID withinID)
      xtset newID ts
      Or others suggested:
      Code:
      duplicates tag ID ts, generate(duplicate)
      egen time = concat(ts duplicate)
      xtset ID time
      The thinking is that this should keep the order of time more or less intact in my data.

      Are these approaches correct?



      • #4
        Fooling Stata about your data structure won't get you good results. If your identifiers aren't on the same level, you won't be able to interpret results easily. Nor will examiners or reviewers. This needs hard thought about what kind of data generation process you have and what your goals are.



        • #5
          An example of my data can be found below

          ---------------------- copy starting from the next line -----------------------
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long ts str15 country float(pm gv ID)
          1990 "Australia"   -4.5   -4.5 1
          1990 "Australia"  -14.9  -14.9 1
          1991 "Australia"  -14.9  -14.9 1
          1991 "Australia"  -14.9  -14.9 1
          1992 "Australia"  -14.9  -14.9 1
          1993 "Australia"  -14.9  -14.9 1
          1993 "Australia"  -.165  -.165 1
          1994 "Australia"  -.165  -.165 1
          1995 "Australia"  -.165  -.165 1
          1996 "Australia"  -.165  -.165 1
          1996 "Australia" 22.593 22.593 1
          1997 "Australia" 22.593 22.593 1
          1998 "Australia" 22.593 22.593 1
          1998 "Australia" 48.458 48.458 1
          1999 "Australia" 48.458 48.458 1
          2000 "Australia" 48.458 48.458 1
          end
          ------------------ copy up to and including the previous line ------------------



          How is it possible to solve this here without dropping the duplicates?
          Last edited by Mario Ferri; 15 Feb 2020, 12:34.



          • #6
            Originally posted by Nick Cox View Post
            Fooling Stata about your data structure won't get you good results. If your identifiers aren't on the same level, you won't be able to interpret results easily. Nor will examiners or reviewers. This needs hard thought about what kind of data generation process you have and what your goals are.
            Allow me a naive question. If I use a Bayesian approach, like the TVP I mentioned in a previous post, will time still be an important issue? In other words, will I obtain a solution and good results if, instead of using a time-series model, I use a Bayesian approach?



            • #7
              Why do you not want to drop the duplicate observations? Evidently they are getting in your way, and no information is lost by removing them (other than information about the existence of the duplicates--which you could get around by first generating a new variable indicating the presence, or perhaps the number, of duplicates).
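
              A minimal sketch of that idea, using the ID and ts variables from the posted example (the new variable name ndup is arbitrary):
              Code:
              * ndup is 0 for unique country-year pairs and positive otherwise
              duplicates tag ID ts, generate(ndup)
              * keep one observation per country-year (the first in the current sort order)
              duplicates drop ID ts, force
              * the panel can now be declared with its time variable
              xtset ID ts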

              Please note that the approach in #3 would produce bizarre results. You have two observations for Australia in 1998. For one of them L1.pm would be 48.458 and for the other it would be 22.593. Obviously at least one of those, and perhaps both, must be wrong.

              Thinking about using time series operations on data like this is something like planning to do an appendectomy on a carrot--the results will not be useful.



              • #8
                Originally posted by Clyde Schechter View Post
                Why do you not want to drop the duplicate observations? Evidently they are getting in your way, and no information is lost by removing them (other than information about the existence of the duplicates--which you could get around by first generating a new variable indicating the presence, or perhaps the number, of duplicates).

                Please note that the approach in #3 would produce bizarre results. You have two observations for Australia in 1998. For one of them L1.pm would be 48.458 and for the other it would be 22.593. Obviously at least one of those, and perhaps both, must be wrong.

                Thinking about using time series operations on data like this is something like planning to do an appendectomy on a carrot--the results will not be useful.
                Simply because, from the real dataset you saw in a previous thread, if I drop the duplicate observations I will lose the information on what I call the regime change. In other words, if I drop them I will lose information on some data indexes (not present in this data example but present in the one in the previous thread) that appear only when there is more than one observation in a year, and I will not be able to answer a key research question of the project: what happens when there are multiple regimes in a year.
                I have still not decided whether to use a rolling-windows panel VAR or GMM; I still have to think about it.

                Please allow me a naive question. As an alternative approach, if instead of regular time series I adopt a Bayesian approach, like TVP or any other Bayesian method, will the duplicate time observations still be an important issue?
                In other words, will I obtain a solution and good results by going Bayesian instead of using time-series models?



                • #9
                  Simply because, from the real dataset you saw in a previous thread, if I drop the duplicate observations I will lose the information on what I call the regime change. In other words, if I drop them I will lose information on some data indexes (not present in this data example but present in the one in the previous thread) that appear only when there is more than one observation in a year, and I will not be able to answer a key research question of the project: what happens when there are multiple regimes in a year.
                  I see what your situation is, but no matter how hard you try, you will not be able to do the impossible; at most you will be able to delude yourself, and perhaps some others, into thinking you have.

                  Look closely at your example data. What is the lagged value of pm for ID 1 (Australia) in 1997? There are two ID 1 1996 observations, and one of them has pm = 22.593 and the other has it as -.165. Which of those is the correct "lagged value"? A similar situation arises for ID 1 in 1994: there are two 1993 ID 1 observations with different pm values, -.165 and -14.9. If there is some systematic way to answer this question, then perhaps we can move forward from there, but in that case the solution will almost surely involve deleting one of the two values from the data.

                  I am not myself a user of GMM or VAR, so I can't advise you on that issue.

                  While I have some familiarity with Bayesian statistics, I am not at all expert in it. I don't know what TVP stands for. Perhaps I am missing something, but I don't see any way that a Bayesian approach gets around the problem that the very notion of a lagged observation is undefinable in this kind of data.



                  • #10
                    Originally posted by Clyde Schechter View Post

                    I see what your situation is, but no matter how hard you try, you will not be able to do the impossible; at most you will be able to delude yourself, and perhaps some others, into thinking you have.

                    Look closely at your example data. What is the lagged value of pm for ID 1 (Australia) in 1997? There are two ID 1 1996 observations, and one of them has pm = 22.593 and the other has it as -.165. Which of those is the correct "lagged value"? A similar situation arises for ID 1 in 1994: there are two 1993 ID 1 observations with different pm values, -.165 and -14.9. If there is some systematic way to answer this question, then perhaps we can move forward from there, but in that case the solution will almost surely involve deleting one of the two values from the data.

                    I am not myself a user of GMM or VAR, so I can't advise you on that issue.

                    While I have some familiarity with Bayesian statistics, I am not at all expert in it. I don't know what TVP stands for. Perhaps I am missing something, but I don't see any way that a Bayesian approach gets around the problem that the very notion of a lagged observation is undefinable in this kind of data.
                    TVP stands for time-varying parameter. It is assumed that the parameters are time varying and stochastically volatile. I have only some basic familiarity with Bayesian statistics. From the little I know of Bayesian econometrics, you do not take lags and you do not have to consider the data to be stationary; you simply take priors. If that is the case, then going Bayesian might be a way to overcome the problem. If there are any Bayesian theorists or experts in the forum, they might wish to shed some light on this.
                    Last edited by Mario Ferri; 16 Feb 2020, 19:52.



                    • #11
                      Thanks for the explanations.



                      • #12
                        Going Bayesian won’t remove the need for an adequate model of the data generation process. And it’s hard to see that time is not central to your problem. (Outside Bayesian statistics or econometrics, stationarity isn't an essential assumption either.)

                        In #5 it seems that some values change within a calendar year. We need the story on why that happens. Concretely, examples like this

                        Code:
                        1990 "Australia"   -4.5   -4.5 1
                        1990 "Australia"  -14.9  -14.9 1
                        1993 "Australia"  -14.9  -14.9 1
                        1993 "Australia"  -.165  -.165 1
                        1996 "Australia"  -.165  -.165 1
                        1996 "Australia" 22.593 22.593 1
                        1998 "Australia" 22.593 22.593 1
                        1998 "Australia" 48.458 48.458 1
                        suggest that your data arise from an irregular time series with jumps at arbitrary points within years. Bayesian theorists or experts here will need to know what is going on just as much as anyone else.



                        • #13
                          Originally posted by Nick Cox View Post
                          Going Bayesian won’t remove the need for an adequate model of the data generation process. And it’s hard to see that time is not central to your problem. (Outside Bayesian statistics or econometrics, statiionarity isn't an essential assumption either.)

                          In #5 it seems that some values change within a calendar year. We need the story on why that happens. Concretely, examples like this

                          Code:
                          1990 "Australia" -4.5 -4.5 1 1990 "Australia" -14.9 -14.9 1 1993 "Australia" -14.9 -14.9 1 1993 "Australia" -.165 -.165 1 1996 "Australia" -.165 -.165 1 1996 "Australia" 22.593 22.593 1 1998 "Australia" 22.593 22.593 1 1998 "Australia" 48.458 48.458 1
                          suggest that your data arise from an irregular time series with jumps at arbitrary points within years. Bayesian theorists or experts here will need to know what is going on just as much as anyone else.


                          The dataset refers to the start and end dates of the actual periods of governments in office; the dates are omitted in this example. The values pm and gv are some sort of data indexes for each government. So the jumps within some years are the cases where multiple (more than one) governments occurred in that year, each with its own start date, end date, and index values. Macro data (not present in the example) are associated with the longest-duration government in a year.
                          I would appreciate any help you can give me on this.
                          Last edited by Mario Ferri; 17 Feb 2020, 06:29.



                          • #14
                            That's some progress, thanks....

                            The over-arching principle here is that it is your project. Statalist can't determine your goals or make oracular judgements on what is best for you. If you're a student, there should be people locally to advise or instruct.

                            I call that an irregular time series. If you're determined to reduce it to regular panel data with at most one observation for each identifier and time, then you need a protocol for combining different values within the same year for the same country. I can't suggest what is a best fit for your project beyond wondering about some kind of weighted average.
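
                            For instance, a minimal sketch of such a protocol, assuming a hypothetical variable days that records how long each government was in office within the year (not part of the posted example):
                            Code:
                            * combine to one observation per country-year, weighting each
                            * government's index values by its time in office in that year
                            collapse (mean) pm gv [aweight=days], by(ID country ts)
                            xtset ID ts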



                            • #15
                              Originally posted by Nick Cox View Post
                              That's some progress, thanks....

                              The over-arching principle here is that it is your project. Statalist can't determine your goals or make oracular judgements on what is best for you. If you're a student, there should be people locally to advise or instruct.

                              I call that an irregular time series. If you're determined to reduce it to regular panel data with at most one observation for each identifier and time, then you need a protocol for combining different values within the same year for the same country. I can't suggest what is a best fit for your project beyond wondering about some kind of weighted average.
                              I am not a student. I am supposed to solve this on my own.
                              As I explained above, reducing the dataset is not an option, as I would be losing important information. On the other hand, I have created a variable called duration, which gives the duration of the government in days within a single year. So, as a thought, could I just use that variable as the time variable instead of the regular time (ts in my example)? It is not the same and will not show the effects of and on each year, but it is still a way to go.

