Duplicates after -collapse- and -egen, group()-

Mallory Montgomery

Join Date: Jun 2015
Posts: 8


07 Dec 2017, 14:20

In brief, here's my question:
I use -collapse- to guarantee no duplicates in terms of person, location, and date. Then I run -egen, group(person location)- and find that there are multiple person-location combinations with the same group identifier. Why is this?

Here is a sample of the original dataset, before -collapse-:

Code:

location     person      date       action_type  action_count
11512816225  K98501875   12-Nov-16  1            3
11512816225  K98501875   12-Nov-16  2            2
11512816225  K98501875   19-Nov-16  1            1
11512816225  K98501875   19-Nov-16  2            1
11512816225  K98501875   26-Nov-16  1            2
11512816225  K98501875   26-Nov-16  2            1
11512816225  K98501875   3-Dec-16   1            1
11512816225  K98501875   10-Dec-16  2            1
11512816225  K98501875   10-Dec-16  1            1
11512816225  K98501875   24-Dec-16  1            1
11512816225  K98501875   14-Jan-17  2            2
11512816225  L218975436  12-Nov-16  1            1
11512816225  L218975436  12-Nov-16  2            2
11512816225  L218975436  19-Nov-16  1            4
11512816225  L218975436  19-Nov-16  2            1
11512816225  L218975436  26-Nov-16  2            12
11512816225  L218975436  26-Nov-16  1            6
11512816225  L218975436  3-Dec-16   1            1
11512816225  L218975436  10-Dec-16  1            2
11512816225  L218975436  10-Dec-16  2            1
11512816225  L218975436  24-Dec-16  1            1
11512816225  L218975436  24-Dec-16  2            1
11512816225  L218975436  14-Jan-17  1            1

I use -collapse- to sum the actions by person-location-date, and create a person-location identifier using -egen, group()-:

Code:

. collapse (sum) action_count, by(location person date)

. egen person_loc = group(person location)

Using person_loc as a unique identifier for each person-location pair, I want to use -tsfill- to create rows for each missing date for each person-location. However, I am unable to use -tsset- because of duplicates:

Code:

. tsset person_loc date

repeated time values within panel

Some digging reveals that there are duplicates of (person_loc date) but not (person location date). Looking only at the duplicates, and running the same -egen, group()- code, yields no duplicates.

Code:

. duplicates tag person_loc date, gen(dup1) // omitted: sum shows about 10% duplicates

. duplicates tag person location date, gen(dup2) // omitted: sum shows no duplicates

. keep if dup1==1

. egen person_loc2 = group(person location)

. duplicates report person_loc2 date

Duplicates in terms of person_loc2 date

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |      7398210             0
--------------------------------------

Here is a sample of my data with both group variables:

Code:

location     person      date       action_count  person_loc  person_loc2
11512816225  B01LY7RJ3S  12-Nov-16  5             16777216    1
11512816225  B01LY7RJ3S  19-Nov-16  2             16777216    1
11512816225  B01LY7RJ3S  26-Nov-16  3             16777216    1
11512816225  B01LY7RJ3S  3-Dec-16   1             16777216    1
11512816225  B01LY7RJ3S  10-Dec-16  2             16777216    1
11512816225  B01LY7RJ3S  24-Dec-16  1             16777216    1
11512816225  B01LY7RJ3S  14-Jan-17  2             16777216    1
11512816225  B01LY8OC7F  12-Nov-16  3             16777216    2
11512816225  B01LY8OC7F  19-Nov-16  5             16777216    2
11512816225  B01LY8OC7F  26-Nov-16  18            16777216    2
11512816225  B01LY8OC7F  3-Dec-16   1             16777216    2
11512816225  B01LY8OC7F  10-Dec-16  3             16777216    2
11512816225  B01LY8OC7F  24-Dec-16  2             16777216    2
11512816225  B01LY8OC7F  14-Jan-17  1             16777216    2

A couple notes:

There are hundreds of millions of rows and hundreds of thousands of people/locations
There are no missing values of person, location, or date
location is stored as a double, person is stored as a string, date is stored as a date-formatted float
I cannot seem to recreate the problem on a dataset small enough to share, which makes me think it has something to do with how large it is

Is this expected behavior of -egen, group()-? Is there a limit to the number of groups or something? Why did -egen, group()- not distinguish the 2 different entries of person the first time, but did the second time (after deleting the non-duplicated values)?


Robert Picard

Join Date: Mar 2014

Posts: 1536
#2

08 Dec 2017, 09:21

Please use dataex to generate your data examples. It's included with Stata 15.1 and available from SSC if you are using an older version of Stata.

I'm going to take a wild guess here that you have many more person-location groups than you think, more than can be stored accurately in a float. Try

Code:

egen long person_loc = group(person location)
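
To see why the storage type matters, here is a minimal sketch on made-up values (not the poster's data). It assumes the new variable gets Stata's default storage type, float, which is what happens unless a type is requested or -set type- has been changed. A float holds integers exactly only up to 2^24 = 16,777,216, the very value that appears as person_loc in the sample above. The variable names id_true, id_float, and id_long are hypothetical.

```stata
* Toy demonstration: integer ids just past 2^24 stored as float vs. long.
clear
set obs 3

* Three consecutive ids straddling 2^24: 16777216, 16777217, 16777218
generate double id_true  = 16777215 + _n

* The same ids stored as float (the default type) and as long
generate float  id_float = id_true
generate long   id_long  = id_true

format id_true id_float id_long %12.0f
list, noobs

* 16777217 cannot be stored exactly as a float, so id_float repeats a
* neighboring value there, while id_long keeps all three ids distinct.
count if id_float != id_true
```

A long holds integers exactly up to roughly 2.1 billion, which is why asking for `egen long` is enough here; `double` would be the next step for even more groups.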
Mallory Montgomery

Join Date: Jun 2015

Posts: 8
#3

08 Dec 2017, 10:43

Thanks, Robert Picard! You were absolutely correct.

I suspected the data type might be the problem, but I attempted to address it in the wrong way, by trying to change the data type of location before using -egen, group()-. I see now that location is ok as-is, but because there are so many groups, I need the group identifier variable to be a long.

Hopefully someone else working with giant datasets will see this post and avoid my mistake! Thanks again.
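
For anyone landing here later, the corrected workflow can be sketched on a few made-up rows that reuse the variable names from this thread. This is an illustration under assumptions, not the code run on the real data: the values are invented, datestr is a hypothetical helper variable, and delta(7) assumes the observations are weekly, as the sample dates suggest.

```stata
* Toy data with the thread's variable names; the values are made up.
clear
input double location str10 person str9 datestr byte action_type int action_count
11512816225 "K98501875"  "12nov2016" 1 3
11512816225 "K98501875"  "12nov2016" 2 2
11512816225 "K98501875"  "26nov2016" 1 2
11512816225 "L218975436" "12nov2016" 1 1
11512816225 "L218975436" "19nov2016" 2 1
11512816225 "L218975436" "03dec2016" 1 1
end

* Convert the string dates to Stata daily dates
generate date = daily(datestr, "DMY")
format date %td
drop datestr

* One row per person-location-date
collapse (sum) action_count, by(location person date)

* Request a long explicitly so the group ids stay exact past 2^24 groups
egen long person_loc = group(person location)

* isid exits with an error if (person_loc, date) does not uniquely
* identify the rows, so this guards against the original problem
isid person_loc date

* Declare the panel (weekly spacing is an assumption) and fill the gaps
tsset person_loc date, delta(7)
tsfill
```

Rows added by -tsfill- contain only person_loc and date; the other variables (person, location, action_count) are missing in those rows and need to be filled in separately if required.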