
  • Duplicates in panel settings

    A question that has come up on the forum before.
    I am trying to declare my dataset to be a panel and I am getting an error:
    Code:
    xtset ID ts
    repeated time values within panel
    ID stands for the country identifier and ts for the year.

    That means I have duplicates. I am aware that duplicate ts values occur in some years; this comes from the nature of the dataset, and I want the dataset to remain as it is. Is there any way to solve this while keeping the multiple observations in the years where they occur?
    I searched the forum before posting but did not find a suitable solution. From some old posts, I think I might have to create a new time variable, but I did not understand how.



  • #2
    Mario:
    if duplicates are actually a matter of fact and you cannot or do not want to get rid of them, you can simply -xtset- your dataset with -panelid- only:
    Code:
    xtset ID
    This will work provided that you do not plan to use time-series related commands such as lags and leads.
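
    For instance, an illustrative sketch only, borrowing the variable names pm and gv from the example data posted later in the thread:
    Code:
    * declare the panel with the identifier only; no time variable
    xtset ID
    * panel estimators that do not need the time dimension still run
    xtreg pm gv, fe
    * lag and lead operators are NOT available here, e.g. -generate lag_pm = L.pm-
    * would fail because no time variable has been declared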
    Last edited by Carlo Lazzaro; 15 Feb 2020, 05:26.
    Kind regards,
    Carlo
    (Stata 18.0 SE)



    • #3
      Originally posted by Carlo Lazzaro View Post
      Mario:
      if duplicates are actually a matter of fact and you cannot or do not want to get rid of them, you can simply -xtset- your dataset with -panelid- only:
      Code:
      xtset ID
      This will work provided that you do not plan to use time-series related commands such as lags and leads.
      Well, that is the case. It is a macro panel; I still have to check stationarity for all variables, and I want to cluster by country and year. For some of them, like inflation and exchange rates, I suspect I will have to make the data stationary in about 90 percent of the cases. I will most likely have to use a rolling-windows panel VAR or GMM; I still have to decide on that.

      Some people suggested that I proceed by giving the within-ID observations an identifier of their own:
      Code:
      bys ID ts: gen withinID = _n
      egen newID = group(ID withinID)
      xtset newID ts
      Or others suggested:
      Code:
      duplicates tag ID ts, generate(duplicate)
      egen time = concat(ts duplicate)
      xtset ID time
      The thinking is that this should keep the order of time more or less intact in my data.

      Are these approaches correct?



      • #4
        Fooling Stata about your data structure won't get you good results. If your identifiers aren't on the same level, you won't be able to interpret results easily. Nor will examiners or reviewers. This needs hard thought about what kind of data generation process you have and what your goals are.



        • #5
          An example of my data can be found below

          ---------------------- copy starting from the next line -----------------------
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long ts str15 country float(pm gv ID)
          1990 "Australia"   -4.5   -4.5 1
          1990 "Australia"  -14.9  -14.9 1
          1991 "Australia"  -14.9  -14.9 1
          1991 "Australia"  -14.9  -14.9 1
          1992 "Australia"  -14.9  -14.9 1
          1993 "Australia"  -14.9  -14.9 1
          1993 "Australia"  -.165  -.165 1
          1994 "Australia"  -.165  -.165 1
          1995 "Australia"  -.165  -.165 1
          1996 "Australia"  -.165  -.165 1
          1996 "Australia" 22.593 22.593 1
          1997 "Australia" 22.593 22.593 1
          1998 "Australia" 22.593 22.593 1
          1998 "Australia" 48.458 48.458 1
          1999 "Australia" 48.458 48.458 1
          2000 "Australia" 48.458 48.458 1
          end
          ------------------ copy up to and including the previous line ------------------



          How is it possible to solve this here without dropping the duplicates?
          Last edited by Mario Ferri; 15 Feb 2020, 12:34.



          • #6
            Originally posted by Nick Cox View Post
            Fooling Stata about your data structure won't get you good results. If your identifiers aren't on the same level, you won't be able to interpret results easily. Nor will examiners or reviewers. This needs hard thought about what kind of data generation process you have and what your goals are.
            Allow me a naive question. If I use a Bayesian approach, like the TVP I mentioned in a previous post, will time still be an important issue? In other words, will I obtain a solution and good results if, instead of using a time-series model, I use a Bayesian approach?



            • #7
              Why do you not want to drop the duplicate observations? Evidently they are getting in your way, and no information is lost by removing them (other than information about the existence of the duplicates--which you could get around by first generating a new variable indicating the presence, or perhaps the number, of duplicates).
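
              A minimal sketch of that idea, using the ID and ts variables from the posted example (the new variable name ndup is arbitrary):
              Code:
              * ndup is 0 for unique country-year pairs and positive otherwise
              duplicates tag ID ts, generate(ndup)
              * keep one observation per country-year (the first in the current sort order)
              duplicates drop ID ts, force
              * the panel can now be declared with its time variable
              xtset ID ts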

              Please note that the approach in #3 would produce bizarre results. You have two observations for Australia in 1998. For one of them L1.pm would be 48.458 and for the other it would be 22.593. Obviously at least one of those, and perhaps both, must be wrong.

              Thinking about using time series operations on data like this is something like planning to do an appendectomy on a carrot--the results will not be useful.



              • #8
                Originally posted by Clyde Schechter View Post
                Why do you not want to drop the duplicate observations? Evidently they are getting in your way, and no information is lost by removing them (other than information about the existence of the duplicates--which you could get around by first generating a new variable indicating the presence, or perhaps the number, of duplicates).

                Please note that the approach in #3 would produce bizarre results. You have two observations for Australia in 1998. For one of them L1.pm would be 48.458 and for the other it would be 22.593. Obviously at least one of those, and perhaps both, must be wrong.

                Thinking about using time series operations on data like this is something like planning to do an appendectomy on a carrot--the results will not be useful.
                Simply because, from the real dataset you saw in a previous thread, if I drop the duplicate observations I will lose the information on what I call the regime change. In other words, if I drop them I will lose information on some data indexes (not present in this data example but present in the one in the previous thread) that appear only when there is more than one observation in a year, and I will not be able to answer a key research question of the project: what happens when there are multiple regimes in a year.
                I have still not decided whether to use a rolling-windows panel VAR or GMM; I still have to think about it.

                Please allow me a naive question. As an alternative approach, if instead of regular time series I adopt a Bayesian approach, like TVP or any other Bayesian method, will the duplicate time observations still be an important issue?
                In other words, will I obtain a solution and good results by going Bayesian instead of using time-series models?



                • #9
                  Simply because, from the real dataset you saw in a previous thread, if I drop the duplicate observations I will lose the information on what I call the regime change. In other words, if I drop them I will lose information on some data indexes (not present in this data example but present in the one in the previous thread) that appear only when there is more than one observation in a year, and I will not be able to answer a key research question of the project: what happens when there are multiple regimes in a year.
                  I see what your situation is, but no matter how hard you try, you will not be able to do the impossible; at most you will be able to delude yourself, and perhaps some others, into thinking you have.

                  Look closely at your example data. What is the lagged value of pm for ID 1 (Australia) in 1997? There are two ID 1 1996 observations, and one of them has pm = 22.593 and the other has it as -.165. Which of those is the correct "lagged value"? A similar situation arises for ID 1 in 1994: there are two 1993 ID 1 observations with different pm values, -.165 and -14.9. If there is some systematic way to answer this question, then perhaps we can move forward from there, but in that case the solution will almost surely involve deleting one of the two values from the data.

                  I am not myself a user of GMM or VAR, so I can't advise you on that issue.

                  While I have some familiarity with Bayesian statistics, I am not at all expert in it. I don't know what TVP stands for. Perhaps I am missing something, but I don't see any way that a Bayesian approach gets around the problem that the very notion of a lagged observation is undefinable in this kind of data.



                  • #10
                    Originally posted by Clyde Schechter View Post

                    I see what your situation is, but no matter how hard you try, you will not be able to do the impossible; at most you will be able to delude yourself, and perhaps some others, into thinking you have.

                    Look closely at your example data. What is the lagged value of pm for ID 1 (Australia) in 1997? There are two ID 1 1996 observations, and one of them has pm = 22.593 and the other has it as -.165. Which of those is the correct "lagged value"? A similar situation arises for ID 1 in 1994: there are two 1993 ID 1 observations with different pm values, -.165 and -14.9. If there is some systematic way to answer this question, then perhaps we can move forward from there, but in that case the solution will almost surely involve deleting one of the two values from the data.

                    I am not myself a user of GMM or VAR, so I can't advise you on that issue.

                    While I have some familiarity with Bayesian statistics, I am not at all expert in it. I don't know what TVP stands for. Perhaps I am missing something, but I don't see any way that a Bayesian approach gets around the problem that the very notion of a lagged observation is undefinable in this kind of data.
                    TVP stands for time-varying parameter. It is assumed that the parameters are time varying and stochastically volatile. I have only some basic familiarity with Bayesian statistics. From the little I know of Bayesian econometrics, you do not take lags and you do not have to consider the data to be stationary; you simply take priors. If that is the case, then going Bayesian might be a way to overcome the problem. If there are any Bayesian theorists or experts in the forum, they might wish to shed some light on this.
                    Last edited by Mario Ferri; 16 Feb 2020, 19:52.



                    • #11
                      Thanks for the explanations.



                      • #12
                        Going Bayesian won’t remove the need for an adequate model of the data generation process. And it’s hard to see that time is not central to your problem. (Outside Bayesian statistics or econometrics, stationarity isn't an essential assumption either.)

                        In #5 it seems that some values change within a calendar year. We need the story on why that happens. Concretely, examples like this

                        Code:
                        1990 "Australia"   -4.5   -4.5 1
                        1990 "Australia"  -14.9  -14.9 1
                        1993 "Australia"  -14.9  -14.9 1
                        1993 "Australia"  -.165  -.165 1
                        1996 "Australia"  -.165  -.165 1
                        1996 "Australia" 22.593 22.593 1
                        1998 "Australia" 22.593 22.593 1
                        1998 "Australia" 48.458 48.458 1
                        suggest that your data arise from an irregular time series with jumps at arbitrary points within years. Bayesian theorists or experts here will need to know what is going on just as much as anyone else.



                        • #13
                          Originally posted by Nick Cox View Post
                          Going Bayesian won’t remove the need for an adequate model of the data generation process. And it’s hard to see that time is not central to your problem. (Outside Bayesian statistics or econometrics, statiionarity isn't an essential assumption either.)

                          In #5 it seems that some values change within a calendar year. We need the story on why that happens. Concretely, examples like this

                          Code:
                          1990 "Australia" -4.5 -4.5 1 1990 "Australia" -14.9 -14.9 1 1993 "Australia" -14.9 -14.9 1 1993 "Australia" -.165 -.165 1 1996 "Australia" -.165 -.165 1 1996 "Australia" 22.593 22.593 1 1998 "Australia" 22.593 22.593 1 1998 "Australia" 48.458 48.458 1
                          suggest that your data arise from an irregular time series with jumps at arbitrary points within years. Bayesian theorists or experts here will need to know what is going on just as much as anyone else.


                          The dataset refers to the start and end dates of the actual periods of governments in office; the dates are omitted in this example. The values pm and gv are some sort of data indexes for each government. So the jumps within some years are the cases where multiple (more than one) governments occurred in that year, each with its own start date, end date, and index values. Macro data (not present in the example) are associated with the longest-duration government in a year.
                          I would appreciate any help you can give me on this.
                          Last edited by Mario Ferri; 17 Feb 2020, 06:29.



                          • #14
                            That's some progress, thanks....

                            The over-arching principle here is that it is your project. Statalist can't determine your goals or make oracular judgements on what is best for you. If you're a student, there should be people locally to advise or instruct.

                            I call that an irregular time series. If you're determined to reduce it to regular panel data with at most one observation for each identifier and time, then you need a protocol for combining different values within the same year for the same country. I can't suggest what is a best fit for your project beyond wondering about some kind of weighted average.
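
                            For instance, a minimal sketch of such a protocol, assuming a hypothetical variable days that records how long each government was in office within the year (not part of the posted example):
                            Code:
                            * combine to one observation per country-year, weighting each
                            * government's index values by its time in office in that year
                            collapse (mean) pm gv [aweight=days], by(ID country ts)
                            xtset ID ts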



                            • #15
                              Originally posted by Nick Cox View Post
                              That's some progress, thanks....

                              The over-arching principle here is that it is your project. Statalist can't determine your goals or make oracular judgements on what is best for you. If you're a student, there should be people locally to advise or instruct.

                              I call that an irregular time series. If you're determined to reduce it to regular panel data with at most one observation for each identifier and time, then you need a protocol for combining different values within the same year for the same country. I can't suggest what is a best fit for your project beyond wondering about some kind of weighted average.
                              I am not a student. I am supposed to solve this on my own.
                              As I explained above, reducing the dataset is not an option, as I would be losing important information. On the other hand, I have created a variable called duration, which gives the duration of the government in days within a single year. So, as a thought, could I just use that variable as the time variable instead of the regular time (ts in my example)? It is not the same and will not show the effects of and on each year, but it is still a way to go.

