All Observations are Duplicates - Panel Data r(451) Repeated Time Variables within panel

Lamia Ben

Join Date: Oct 2018

Posts: 26
#1

27 Oct 2020, 08:37

Hello,

Using Stata 12, I have a panel dataset covering 2010 to 2019 with 23 variables and 1,484,908 observations.
It contains fish sales prices ($) and amounts sold (kg) by year, region and species category.
I wish to study the effect of different predictors on sales prices and amounts by species category (category and subcategory).
To do that, I want to estimate a multilevel fixed effects model, but whenever I try to declare the dataset as panel data, I get this error message:

xtset SousCatégorie_Espèce Year
repeated time values within panel
r(451);

While searching the forum for a fix, I found that the "duplicates" set of commands can show and delete duplicate observations.
However, after running the commands on my variable DateKey (DD/MM/YYYY) and my variable Year (YYYY), both show that almost all observations are duplicates (except for the first occurrence).
The following screenshots show the commands used and the results:

As you can see, the "duplicates list" command shows that there are no duplicates in the dataset.
However, the "duplicates report Year", "duplicates tag Year, gen(dup_Year)" and "drop if Year==Year[_n-1]" commands treat only the first occurrence of each date value as an observation; all the rest are flagged as duplicates.
Is there any way around this?
How can I tell Stata to take the values of all the other variables into account so that it doesn't count all my dates as duplicates?
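
A minimal sketch of why these checks disagree, using only the variable names from the xtset call above: -duplicates- compares observations on the variables listed after the subcommand, and on all variables when none are listed.

```stata
* No varlist: observations count as duplicates only if they match on every variable
duplicates report

* Year alone: every observation that shares its year with an earlier one is a surplus copy
duplicates report Year

* The pair that xtset requires to be unique: panel variable plus time variable
duplicates report SousCatégorie_Espèce Year
```

If the last report shows surplus copies, -xtset SousCatégorie_Espèce Year- stops with r(451), whether or not the other variables differ.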

Any kind of help is highly appreciated,
Thanks in advance
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#2

27 Oct 2020, 08:49

Lamia:
the usual, simpler fix is to -xtset- your dataset with -panelid- only.
However, this comes at the cost of making time-series operators, such as lags and leads, unavailable.
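
As a rough sketch of this fix (sce_id, price and volume are placeholder names, not variables confirmed in the dataset), the panel is declared without a time variable, so the uniqueness check on time values never runs:

```stata
* sce_id: hypothetical numeric panel identifier built from the subcategory
egen sce_id = group(SousCatégorie_Espèce), label

* Declare the panel variable only; with no time variable there is no r(451)
xtset sce_id

* Fixed-effects regression; price and volume are placeholder variable names
xtreg price volume, fe
```

With this declaration, operators such as L.price are not available, because Stata has no time variable to lag on.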

Kind regards,
Carlo
(Stata 19.0)
Lamia Ben

Join Date: Oct 2018

Posts: 26
#3

27 Oct 2020, 09:10

Thanks for your suggestion, Carlo.
Since I would like to include a lag of the dependent variable as a regressor, and I want the model to account for time, I must keep the time variable.
The data are as follows: daily observations of the category of fish species caught, type of boat, region, city, category of fishmonger, ...
So for the same day I have several observations. Stata only recognizes the first occurrence of the date as an observation and treats the rest as duplicates.
How can I tell Stata that I have multiple observations for the same date, and not only the first, so that it doesn't count them as duplicates?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#4

27 Oct 2020, 09:55

Lamia:
the only way out I can envisage is to create a -timevar- that keeps Stata from throwing the -repeated time values within panel- error message.
Maybe you can include fictitious -hrs- data in order to fix the issue.
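
One possible variant of this idea, sketched under assumptions: instead of fictitious hours, a running record index within each panel serves as the time variable. It assumes DateKey is a numeric Stata daily date and sce_id is a numeric panel identifier; seq is a hypothetical name.

```stata
* Order records by date within each panel and number them 1, 2, 3, ...
* Records sharing a date are ordered arbitrarily among themselves
bysort sce_id (DateKey): generate long seq = _n

* seq is unique within each panel, so xtset accepts it
xtset sce_id seq
```

Note that a lag such as L.x then refers to the previous record in the panel, not to the previous day, so any dynamic interpretation needs care.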

Kind regards,
Carlo
(Stata 19.0)
Eric de Souza

Join Date: Mar 2014

Posts: 587
#5

27 Oct 2020, 10:43

The problem for me is that it is not clear how your data are laid out.
Your basic data are prices and quantities.
Do you have, for each year,
an observation for each region,
an observation for each species,
and an observation for each type of boat?
It would be good to see a sample of your data.
Lamia Ben

Join Date: Oct 2018

Posts: 26
#6

27 Oct 2020, 13:11

Carlo:
The original date format, before any normalizing, included hours, and it still produced the r(451) error.
I have tried many date formats, but that doesn't seem to help with this issue.
I don't know where the problem is coming from.

Lamia Ben

Join Date: Oct 2018
Posts: 26
#7

27 Oct 2020, 13:23

Hello Eric,
I have daily data on the species caught, the price at which they were sold, amounts sold, region, type of boat, ... Each line represents one operation and its characteristics.
Here is a small sample to show the problem I am facing:

Year | DateKey | Volume(Kg) | CA(Dh) | Regions | Groupe_Espèce | Catégorie_Espèce | SousCatégorie_Espèce | Type_Mareyeur | Libellé_Envt_Travail | Catégorie_Génerale_Bateau | Genre_Bateau | Libellé_Destination
2010 | 01-jan-10 | 25540 | null | ATLANTIQUE SUD | POISSON PELAGIQUES | SARDINE | | Personne Morale | Voie de mer et transit part bateau | SARDINIER | BATEAU | FARINE
2010 | 01-jan-10 | 975 | 11300 | ATLANTIQUE CENTRE | POISSON BLANC | SOLE COMMUNE | | null | vente à lui même | CHALUTIER | BATEAU | CONSOMMATION
2010 | 01-jan-10 | 810 | 12000 | ATLANTIQUE CENTRE | POISSON BLANC | SOLE COMMUNE | SOLE COMMUNE(PETIT) | Personne Morale | Voie de mer et transit part bateau | CHALUTIER | BATEAU | CONSOMMATION
2010 | 01-jan-10 | 350 | | ATLANTIQUE CENTRE | CEPHALOPODES | CALAMAR VRAI | CALMAR (ENCORNET) | Personne Physique | Voie de mer et transit part bateau | CHALUTIER | BATEAU | CONGÉLATION
2010 | 01-jan-10 | 540 | 6250 | ATLANTIQUE CENTRE | POISSON BLANC | LANGUE | LANGUE(PETIT) | Personne Physique | Voie de mer et transit part bateau | CHALUTIER | BATEAU | CONSOMMATION
2010 | 01-jan-10 | 300 | | ATLANTIQUE CENTRE | POISSON BLANC | MERLU | MERLU COMMUN(PETIT) | Personne Physique | Voie de mer et transit part bateau | CHALUTIER | BATEAU | CONSOMMATION

For 01-Jan-2010 I have several operations with different characteristics, but Stata only counts the first one as an observation. As you saw in the head post, all the rest are treated as duplicates, when they are in fact different observations too (on the same day).

I hope this makes it clearer.
Any suggestions?

Last edited by Lamia Ben; 27 Oct 2020, 13:27.


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#8

28 Oct 2020, 03:11

Lamia:
I was not able to spot any -panelid- in your dataset.
Hence, I can't tell whether you actually have repeated observations on the same panels or a repeated cross-sectional design.
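
A quick way to inspect the structure, sketched under the assumption that a candidate numeric panel identifier exists (here the hypothetical sce_id) and that DateKey holds the date; n_ops is a hypothetical name:

```stata
* Count how many records share each panel-date combination
bysort sce_id DateKey: generate long n_ops = _N
summarize n_ops

* isid exits with an error when the pair does not uniquely identify
* observations; capture noisily shows the message without stopping a do-file
capture noisily isid sce_id DateKey
```

If n_ops is 1 everywhere, the pair identifies a panel that -xtset- will accept; values above 1 mean several records per panel and date.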

Kind regards,
Carlo
(Stata 19.0)
Lamia Ben

Join Date: Oct 2018

Posts: 26
#9

28 Oct 2020, 03:48

Hello Carlo,
Thanks for your answers.
The -panelid- in this dataset comes from "SousCatégorie_Espèce"; I use the command "egen sce_id = group(SousCatégorie_Espèce), label" to create it.
I have daily observations from 2010 to 2019 for each variable, which means I have repeated lines with the same date (many lines for each day). Stata only counts the first line as an observation; all the rest are considered duplicates.
How can I tell Stata that those are independent observations and not duplicates? Is there a command or code that can help in my case?
Eric de Souza

Join Date: Mar 2014

Posts: 587
#10

28 Oct 2020, 04:05

I saw your post yesterday evening just as I was about to shut down my office computer. As I understand your problem, you want to study the relation between prices (CA(Dh)?) and quantities (Volume(Kg)).
Either you choose a unique (price, quantity) pair for each date,
or you do a pooled OLS estimation.
But without seeing the model you want to estimate, it is not possible (for me at least) to add anything.
By the way, as far as I can see, SousCatégorie_Espèce is just one variable, not a group.
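
Both options can be sketched roughly as follows. All names here are assumptions: sce_id is a numeric panel identifier, DateKey is a numeric Stata daily date, and volume and revenue are numeric stand-ins for Volume(Kg) and CA(Dh), the latter taken to be sales revenue.

```stata
* Option 1: aggregate to one record per subcategory and day, then declare the panel
preserve
collapse (sum) volume revenue, by(sce_id DateKey)
generate double price = revenue / volume    // average price per kg on that day
xtset sce_id DateKey
xtreg price volume L.price, fe
restore

* Option 2: pooled OLS on the original records; regress does not require xtset
generate double price = revenue / volume
regress price volume, vce(cluster sce_id)
```

Summing within each day keeps every record's volume and revenue in the daily totals instead of discarding rows; whether that level of aggregation suits the research question is a separate judgment.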
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#11

28 Oct 2020, 04:08

Lamia:
I share Eric's take that you do not seem to have a -panelid-.
I would have expected the panels to be boats (or fishing companies) that were repeatedly measured on the same variables during a given timespan.

Kind regards,
Carlo
(Stata 19.0)
Lamia Ben

Join Date: Oct 2018

Posts: 26
#12

28 Oct 2020, 04:20

Eric:
Thanks for your answers.
I have two models: one for prices and the other for quantities. I wish to study the effect of all the other variables on the two dependent variables.
Since I have the group, category and subcategory of species, I thought that a multilevel fixed effects model would be relevant.
Choosing a unique (price, quantity) pair for each date would mean losing more than half of the dataset.
And to run a pooled OLS estimation, I have to declare the data as a panel, which brings me back to the initial issue.
Any insights would be helpful.
Lamia Ben

Join Date: Oct 2018

Posts: 26
#13

28 Oct 2020, 04:29

Carlo:
There were variables I could have used as panelid, such as the fishing companies and boat names, but due to confidentiality issues I wasn't given access to that data.
I figured that, since I want to study the prices and amounts of fish species, I would use a multilevel fixed effects model, with the group, category and subcategory of fish species as panelid.
Is that possible?
Eric de Souza

Join Date: Mar 2014

Posts: 587
#14

28 Oct 2020, 04:36

As far as I can see from the data extract you posted, you cannot study the effect of all the other variables on the two "dependent variables": there is a lot of overlap. Some variables are subcategories of others.
When you say "two dependent variables", do you have a two-equation model?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#15

28 Oct 2020, 05:13

Lamia:
it is still obscure to me how you can run a panel data regression with a fictitious -id- (as the results cannot be mapped back to the original -id- once you have run the regression).
As far as multilevel models are concerned, you should have a nested design, something like: fishes nested within lakes nested within regions nested within countries.
Finally, I do not understand whether you're talking about a random intercept mixed model or something else.
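
For illustration only, a random-intercept mixed model on a nested species hierarchy might look like the sketch below in Stata 12, where the command is -xtmixed- (renamed -mixed- from Stata 13). The id names grp_id, cat_id and sce_id, and the variables price and volume, are hypothetical placeholders.

```stata
* Numeric identifiers for each level of the hierarchy (hypothetical names)
egen grp_id = group(Groupe_Espèce)
egen cat_id = group(Catégorie_Espèce)
egen sce_id = group(SousCatégorie_Espèce)

* Random intercepts, listed from the highest level down:
* subcategories nested within categories nested within groups
xtmixed price volume || grp_id: || cat_id: || sce_id:
```

This treats the level effects as random, which is not the same as a fixed effects panel estimator, so it only applies if a random intercept model is what is intended.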

Kind regards,
Carlo
(Stata 19.0)
