In brief, here's my question:
I use -collapse- to guarantee no duplicates in terms of person, location, and date. Then I run -egen, group(person location)- and find that there are multiple person-location combinations with the same group identifier. Why is this?
Here is a sample of the original dataset, before -collapse-:
I use -collapse- to sum the actions by person-location-date, and create a person-location identifier using -egen, group()-:
Using person_loc as a unique identifier for each person-location pair, I want to use -tsfill- to create rows for each missing date for each person-location. However, I am unable to use -tsset- because of duplicates:
Some digging reveals that there are duplicates of (person_loc date) but not (person location date). Looking only at the duplicates, and running the same -egen, group()- code, yields no duplicates.
Here is a sample of my data with both group variables:
A couple notes:
Code (original data, before -collapse-):
location      person       date        action_type   action_count
11512816225   K98501875    12-Nov-16   1             3
11512816225   K98501875    12-Nov-16   2             2
11512816225   K98501875    19-Nov-16   1             1
11512816225   K98501875    19-Nov-16   2             1
11512816225   K98501875    26-Nov-16   1             2
11512816225   K98501875    26-Nov-16   2             1
11512816225   K98501875    3-Dec-16    1             1
11512816225   K98501875    10-Dec-16   2             1
11512816225   K98501875    10-Dec-16   1             1
11512816225   K98501875    24-Dec-16   1             1
11512816225   K98501875    14-Jan-17   2             2
11512816225   L218975436   12-Nov-16   1             1
11512816225   L218975436   12-Nov-16   2             2
11512816225   L218975436   19-Nov-16   1             4
11512816225   L218975436   19-Nov-16   2             1
11512816225   L218975436   26-Nov-16   2             12
11512816225   L218975436   26-Nov-16   1             6
11512816225   L218975436   3-Dec-16    1             1
11512816225   L218975436   10-Dec-16   1             2
11512816225   L218975436   10-Dec-16   2             1
11512816225   L218975436   24-Dec-16   1             1
11512816225   L218975436   24-Dec-16   2             1
11512816225   L218975436   14-Jan-17   1             1
Code (-collapse- and -egen, group()-):
. collapse (sum) action_count, by(location person date)
. egen person_loc = group(person location)
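To double-check that the collapsed data really are unique on (location person date), -isid- can be run right after -collapse-; it exits with an error if the variables do not uniquely identify the observations. A minimal sketch on three made-up rows in the layout of the sample above (datestr is only a hypothetical helper variable for typing in the dates):

```stata
clear
* toy rows in the layout of the sample; datestr is a hypothetical helper
input double location str10 person str9 datestr byte action_type int action_count
11512816225 "K98501875" "12nov2016" 1 3
11512816225 "K98501875" "12nov2016" 2 2
11512816225 "K98501875" "19nov2016" 1 1
end
generate date = daily(datestr, "DMY")
format date %td
drop datestr

* sum the actions within person-location-date
collapse (sum) action_count, by(location person date)

* errors out if (location person date) is not a unique key
isid location person date
list
```

If -isid- passes here but -tsset- still complains, the problem would seem to lie in the identifier built afterwards rather than in the collapsed data.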
Code (-tsset- attempt):
. tsset sellerasin date
repeated time values within panel
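For reference, this is roughly the -tsset- / -tsfill- step I am aiming for once the identifier is unique within date. It is only a sketch on made-up rows, and it assumes the dates are spaced exactly seven days apart, as they appear to be in the sample; filling the new rows with zero is likewise an assumption about what a missing date means.

```stata
clear
* toy panel: two person-location ids with gaps in the weekly dates
input long person_loc str9 datestr int action_count
1 "12nov2016" 5
1 "19nov2016" 2
1 "03dec2016" 1
2 "12nov2016" 3
2 "26nov2016" 18
end
generate date = daily(datestr, "DMY")
format date %td
drop datestr

* assumption: observations are exactly 7 days apart, so delta(7)
tsset person_loc date, delta(7)

* add one row per missing week inside each panel's own date range
tsfill

* assumption: a week with no row means zero actions
replace action_count = 0 if missing(action_count)
list, sepby(person_loc)
```

Without delta(7), -tsfill- would create a row for every missing day rather than every missing week.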
Code (checking for duplicates):
. duplicates tag person_loc date, gen(dup1)
// omitted: sum shows about 10% duplicates
. duplicates tag person location date, gen(dup2)
// omitted: sum shows no duplicates
. keep if dup1==1
. egen person_loc2 = group(person location)
. duplicates report person_loc2 date

Duplicates in terms of person_loc2 date

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |      7398210             0
--------------------------------------
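Another way to see the collision directly is to count how many distinct person-location pairs share each value of the identifier. A minimal sketch on made-up rows (the person value B01LZZZZZZ and the identifier 16777218 are hypothetical, added only so that one identifier is not shared):

```stata
clear
* toy rows: two different persons share identifier 16777216
input double location str10 person float person_loc
11512816225 "B01LY7RJ3S" 16777216
11512816225 "B01LY7RJ3S" 16777216
11512816225 "B01LY8OC7F" 16777216
11512816225 "B01LZZZZZZ" 16777218
end

* mark one row per distinct (person_loc, person, location) combination
egen byte first_pair = tag(person_loc person location)

* number of distinct person-location pairs behind each identifier value
bysort person_loc: egen n_pairs = total(first_pair)

* any identifier with n_pairs > 1 is shared by more than one pair
list if n_pairs > 1, sepby(person_loc)
```

On the real data, tabulating n_pairs would show how many identifier values are affected and whether they cluster at particular values.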
Code (sample with both group variables):
location      person       date        action_count   person_loc   person_loc2
11512816225   B01LY7RJ3S   12-Nov-16   5              16777216     1
11512816225   B01LY7RJ3S   19-Nov-16   2              16777216     1
11512816225   B01LY7RJ3S   26-Nov-16   3              16777216     1
11512816225   B01LY7RJ3S   3-Dec-16    1              16777216     1
11512816225   B01LY7RJ3S   10-Dec-16   2              16777216     1
11512816225   B01LY7RJ3S   24-Dec-16   1              16777216     1
11512816225   B01LY7RJ3S   14-Jan-17   2              16777216     1
11512816225   B01LY8OC7F   12-Nov-16   3              16777216     2
11512816225   B01LY8OC7F   19-Nov-16   5              16777216     2
11512816225   B01LY8OC7F   26-Nov-16   18             16777216     2
11512816225   B01LY8OC7F   3-Dec-16    1              16777216     2
11512816225   B01LY8OC7F   10-Dec-16   3              16777216     2
11512816225   B01LY8OC7F   24-Dec-16   2              16777216     2
11512816225   B01LY8OC7F   14-Jan-17   1              16777216     2
- There are hundreds of millions of rows and hundreds of thousands of people/locations
- There are no missing values of person, location, or date
- location is stored as a double, person is stored as a string, date is stored as a date-formatted float
- I cannot seem to recreate the problem on a dataset small enough to share, which makes me think it has something to do with how large it is
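One thing that may be worth checking, given the last note: the shared identifier in the sample, 16777216, is exactly 2^24, which is the point beyond which a float can no longer hold every integer. The sketch below is not my actual data; it only illustrates, under the assumption that person_loc was stored as a float (the default type for -egen- unless another is specified or the default has been changed), how neighbouring integers above 2^24 can collapse onto the same stored value. The variable names id_double, id_float and id_long are made up for the illustration.

```stata
clear
set obs 3

* three consecutive integers starting at 2^24 = 16777216
generate double id_double = 16777215 + _n

* the same values forced into float and long storage
generate float id_float = id_double
generate long  id_long  = id_double

format id_double id_float id_long %12.0f
list

* with the real data, the analogous steps would be (not run here):
*   describe person_loc
*   egen long person_loc_long = group(person location)
*   duplicates report person_loc_long date
```

In the listing, the first two rows of id_float hold the same value while id_long keeps all three distinct. If -describe- shows person_loc as a float, requesting long (or double) storage in the -egen- call may be enough to make the identifier unique, which would also fit with the problem not showing up on a small extract.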