  • All Observations are Duplicates - Panel Data r(451) Repeated Time Variables within panel

    Hello,

    Using Stata 12, I have a panel dataset covering 2010 to 2019, with 23 variables and 1,484,908 observations.
    It contains fish sale prices ($) and amounts sold (kg) by year, region and species category.
    I wish to study the effect of different predictors on sale prices and amounts sold by species category (category and subcategory).
    To do that, I want to estimate a multilevel fixed effects model, but whenever I try to declare the dataset as panel data, I get this error message:

    xtset SousCatégorie_Espèce Year
    repeated time values within panel
    r(451);

    While searching the forum for a fix, I found that the -duplicates- set of commands helps to show and delete duplicate observations within variables.
    However, after running the commands on my variables DateKey (DD/MM/YYYY) and Year (YYYY), both show that almost all observations are duplicates (except for the first occurrence).
    The following screenshots show the commands used and the results:
    [Screenshots attached: Sum.png, Duplicaltes_All.png, Duplictes_Report&List.png, Duplicates_Year.png, Deleting_Duplicates_Year.png]

    As you can see from the "duplicates list" command, there are no duplicates in the dataset.
    However, the "duplicates report Year", the "duplicates tag Year, gen(dup_Year)" and the "drop if Year==Year[_n-1]" commands show that only the first occurrence of each date value counts as an observation; all the rest are treated as duplicates.
    Is there any way around this?
    How can I tell Stata to take the values of all the other variables into consideration, so that it doesn't count all my dates as duplicates?

    Any kind of help is highly appreciated,
    Thanks in advance
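
    A note on why the two checks disagree: -duplicates report Year- compares observations on Year alone, so every row after the first within a year counts as a surplus copy, while -duplicates list- with no variable list compares all variables at once. What -xtset- requires is that the pair (panel variable, time variable) be unique. A minimal sketch with made-up data follows; the variable names sce, year and volume are hypothetical, not the ones in the real dataset.

    ```stata
    * Toy data (hypothetical values): two sales of the same species in 2010
    clear
    input str20 sce year volume
    "SARDINE"      2010 25540
    "SARDINE"      2010 120
    "SOLE COMMUNE" 2010 975
    "SOLE COMMUNE" 2011 810
    end

    * Compares on year only: the repeated 2010 rows are reported as surplus
    duplicates report year

    * Compares on all variables: no two rows are fully identical here
    duplicates report

    * Compares on the (panel, time) pair, which is what -xtset- checks
    duplicates report sce year
    ```

    If the last report shows surplus observations, -xtset- on that pair stops with r(451). Note that "drop if Year==Year[_n-1]" would then discard genuine transactions rather than true duplicates.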

  • #2
    Lamia:
    the usual, simpler, fix is to -xtset- your dataset with -panelid- only.
    However, it comes at the cost of making time-series related commands, such as lags and leads, unfeasible.
    Kind regards,
    Carlo
    (Stata 19.0)
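
    The panel-only declaration suggested above could look like the sketch below. The data are invented and the names sce_id, year, price and volume are assumptions for illustration only.

    ```stata
    * Toy data (hypothetical): repeated years within each panel
    clear
    input sce_id year price volume
    1 2010 11.6 975
    1 2010 14.8 810
    1 2011 12.1 640
    2 2010 25.0 14
    2 2010 23.0 30
    2 2011 27.5 22
    end

    * Declare the panel variable only; no time variable, so no r(451)
    xtset sce_id

    * Fixed-effects regression still runs on the panel structure
    xtreg price volume, fe

    * Time-series operators such as L.price are not available here,
    * because no time variable has been declared
    ```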


    • #3
      Thanks for your suggestion Carlo,
      Since I would like to include a lag of the dependent variable as an independent variable, and I want the model to account for time, I must have the time variable.
      The data are as follows: daily observations of the category of fish species caught, type of boat, region, city, category of fishmonger, .....
      So for the same day I have several observations. Stata only recognizes the first occurrence of the date as an observation and treats the rest as duplicates.
      How can I tell Stata that for the same date I have multiple observations and not only the first, so that it doesn't count them as duplicates?


      • #4
        Lamia:
        the only way out I can envisage is to create a -timevar- that stops Stata from throwing the -repeated time values within panel- error message.
        Maybe you can include fictitious -hrs- data in order to fix the issue.
        Kind regards,
        Carlo
        (Stata 19.0)
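
        One way to read this suggestion is to build an artificial sequence number within each panel and use it as the -timevar-. This is only a sketch on invented data; sce_id, datestr, datekey, seq and price are hypothetical names.

        ```stata
        * Toy data (hypothetical): several sales per sub-category and day
        clear
        input sce_id str9 datestr price
        1 "01jan2010" 11.6
        1 "01jan2010" 14.8
        1 "02jan2010" 12.0
        2 "01jan2010" 25.0
        2 "02jan2010" 27.5
        end

        * Convert the string date to a Stata daily date
        gen datekey = date(datestr, "DMY")
        format datekey %td

        * Fictitious time variable: running count of sales within each panel,
        * ordered by date (the order among same-day sales is arbitrary)
        bysort sce_id (datekey): gen seq = _n

        * (sce_id, seq) is unique by construction, so -xtset- accepts it
        xtset sce_id seq
        gen lag_price = L.price
        ```

        With this setup L.price means the previous sale in the sorted order, not the previous day, so the lag only has a clear meaning if the within-day ordering does.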


        • #5
          The problem for me is that it is not clear how your data are laid out.
          Your basic data are prices and quantities.
          Do you have, for each year,
          an observation for each region,
          an observation for each species,
          and an observation for each type of boat?
          It would be good to see a sample of your data.


          • #6
            Carlo :
            The standard format of the date before any normalizing had hours in it, and it still showed the r(451) error.
            I actually tried many date formats, but it doesn't seem to help much in my case for this issue.
            I don't know where the problem is coming from


            • #7
              Hello Eric,
              I have daily data on the species caught, the price at which they were sold, the amounts sold, region, type of boat, .... Each line represents one operation and its characteristics.
              Here is a small sample to show the problem I am facing:

              Year | DateKey | Volume(Kg) | CA(Dh) | Regions | Groupe_Espèce | Catégorie_Espèce | SousCatégorie_Espèce | Type_Mareyeur | Libellé_Envt_Travail | Catégorie_Génerale_Bateau | Genre_Bateau | Libellé_Destination
              2010 | 01-jan-10 | 25540 | null | ATLANTIQUE SUD | POISSON PELAGIQUES | SARDINE | SARDINE | Personne Morale | Voie de mer et transit part bateau | SARDINIER | BATEAU | FARINE
              2010 | 01-jan-10 | 975 | 11300 | ATLANTIQUE CENTRE | POISSON BLANC | SOLE COMMUNE | SOLE COMMUNE | null | vente à lui même | CHALUTIER | BATEAU | CONSOMMATION
              2010 | 01-jan-10 | 810 | 12000 | ATLANTIQUE CENTRE | POISSON BLANC | SOLE COMMUNE | SOLE COMMUNE(PETIT) | Personne Morale | Voie de mer et transit part bateau | CHALUTIER | BATEAU | CONSOMMATION
              2010 | 01-jan-10 | 14 | 350 | ATLANTIQUE CENTRE | CEPHALOPODES | CALAMAR | VRAI CALMAR (ENCORNET) | Personne Physique | Voie de mer et transit part bateau | CHALUTIER | BATEAU | CONGÉLATION
              2010 | 01-jan-10 | 540 | 6250 | ATLANTIQUE CENTRE | POISSON BLANC | LANGUE | LANGUE(PETIT) | Personne Physique | Voie de mer et transit part bateau | CHALUTIER | BATEAU | CONSOMMATION
              2010 | 01-jan-10 | 30 | 300 | ATLANTIQUE CENTRE | POISSON BLANC | MERLU | MERLU COMMUN(PETIT) | Personne Physique | Voie de mer et transit part bateau | CHALUTIER | BATEAU | CONSOMMATION

              For 01-Jan-2010 I have several operations with different characteristics, but Stata only counts the first one as an observation. As you saw in the head post, all the rest is treated as duplicates, when they are in fact different observations too (on the same day).

              I hope this makes it clearer,
              Any suggestions ?
              Last edited by Lamia Ben; 27 Oct 2020, 13:27.


              • #8
                Lamia:
                I was not able to spot any -panelid- in your dataset.
                Hence, I can't tell whether you actually have repeated observations on the same panels or a repeated cross-sectional design.
                Kind regards,
                Carlo
                (Stata 19.0)


                • #9
                  Hello Carlo,
                  Thanks for your answers.
                  The -panelid- in this dataset comes from "SousCatégorie_Espèce"; I use the command "egen sce_id = group(SousCatégorie_Espèce), label" to create it.
                  I have daily observations from 2010 to 2019 for each variable, which means I have repeated lines with the same date (many lines for each day). Stata only counts the first line as an observation; all the rest are considered duplicates.
                  How can I tell Stata that these are independent observations and not duplicates? Is there a command or some code that can help in my case?




                  • #10
                    I saw your post yesterday evening just as I was about to shut down my office computer. As I understand your problem you want to study the relation between prices (Ca(Dh) ?) and quantities (Volume(Kg)).
                    Either you choose for each date a unique (price quantity) pair
                    or you do a pooled OLS estimation.
                    But without seeing the model you want to estimate it is not possible (for me at least) to add anything.
                    By the way, as far as I can see, SousCatégorie_Espèce is just one variable, not a group.
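
                    The first option above (one price and quantity pair per date) does not have to mean dropping rows; the sales could instead be aggregated per sub-category and day. The sketch below uses invented data and assumes, without confirmation from the thread, that CA(Dh) is the sale value, so that value divided by volume gives a unit price. The names sce_id, datestr, datekey, volume, ca and price are hypothetical.

                    ```stata
                    * Toy data (hypothetical): several sales per sub-category and day
                    clear
                    input sce_id str9 datestr volume ca
                    1 "01jan2010" 975 11300
                    1 "01jan2010" 810 12000
                    1 "02jan2010" 500 6000
                    2 "01jan2010" 14  350
                    2 "02jan2010" 30  700
                    end

                    * Convert the string date to a Stata daily date
                    gen datekey = date(datestr, "DMY")
                    format datekey %td

                    * One row per sub-category and day: total volume and total value
                    collapse (sum) volume ca, by(sce_id datekey)

                    * Assumed unit price: total value divided by total volume
                    gen price = ca / volume

                    * (sce_id, datekey) is now unique, so time-series operators work
                    xtset sce_id datekey
                    gen lag_price = L.price
                    ```

                    The cost of this approach is that variables differing across sales on the same day (boat type, destination and so on) are lost unless they are added to by() or summarized in some other way.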


                    • #11
                      Lamia:
                      I share Eric's take that you do not seem to have a -panelid-.
                      I would have expected the panels to be boats (or fishing companies) that were repeatedly measured on the same variables over a given timespan.
                      Kind regards,
                      Carlo
                      (Stata 19.0)


                      • #12
                        Eric :
                        Thanks for your answers,
                        I have two models: one for prices and the other for quantities. I wish to study the effect of all the other variables on the two dependent variables.
                        Since I have the group, category and subcategory of species, I thought that a multilevel fixed effects model would be relevant.
                        Choosing a unique (price quantity) pair for each date would mean losing more than half the dataset.
                        And to run a pooled OLS estimation, I have to declare the data as a panel, which brings me back to the initial issue.
                        Any insights might be helpful
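
                        For what it is worth, -regress- itself does not require the data to be -xtset-, so a pooled OLS can be run directly on the transaction-level rows. A minimal sketch on invented data; sce_id, price and volume are hypothetical names, and whether these specifications suit the real data is an open question.

                        ```stata
                        * Toy data (hypothetical): several sales per sub-category, no xtset
                        clear
                        input sce_id price volume
                        1 11.6 975
                        1 14.8 810
                        2 25.0 14
                        2 23.0 30
                        3 11.5 540
                        3 10.0 300
                        end

                        * Pooled OLS with sub-category dummies
                        regress price volume i.sce_id

                        * Pooled OLS with standard errors clustered on sub-category
                        regress price volume, vce(cluster sce_id)
                        ```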


                        • #13
                          Carlo:
                          There were variables I could have used as panelid, such as the fishing companies and the names of the boats, but due to confidentiality issues I wasn't given that data.
                          Since I want to study the prices and amounts of fish species, I figured I would use a multilevel fixed effects model, with the group, category and subcategory of fish species as panelid.
                          Is that possible ?


                          • #14
                            As far as I can see from the data extract you posted, you cannot study the effect of all the other variables on the two "dependent variables": there is a lot of overlapping. Some variables are subcategories of others.
                            When you say "two dependent variables" do you have a two equation model?


                            • #15
                              Lamia:
                              it is still obscure to me how you can run a panel data regression with a fictitious -id- (as the results cannot be customized to the original -id- once you have run the regression).
                              As far as multilevel models are concerned, you should have a nested design, something like: fishes nested within lakes nested within regions nested within countries.
                              Finally, I do not understand whether you're talking about a random intercept mixed model or something else.
                              Kind regards,
                              Carlo
                              (Stata 19.0)
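
                              To make the nested-design point concrete, here is a sketch of a random intercept model with sub-categories nested within species groups, fitted to simulated data. Everything in it is an assumption for illustration: the names grp_id, sce_id, price and volume, the number of levels, and the data-generating process. It uses -xtmixed-, the Stata 12 command name; later releases call it -mixed-.

                              ```stata
                              * Simulated nested data (hypothetical): 6 groups, 5 sub-categories each
                              clear
                              set seed 2020
                              set obs 6
                              gen grp_id = _n
                              gen u_grp = rnormal(0, 3)      // group-level random intercept

                              expand 5                       // 5 sub-categories per group
                              gen sce_id = _n
                              gen u_sce = rnormal(0, 2)      // sub-category random intercept

                              expand 20                      // 20 sales per sub-category
                              gen volume = runiform() * 100
                              gen price = 20 + 0.1 * volume + u_grp + u_sce + rnormal()

                              * Random intercepts for groups and for sub-categories within groups
                              xtmixed price volume || grp_id: || sce_id:
                              ```

                              The order of the levels matters: the highest level (grp_id) comes first and each later level is nested within the one before it.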
