  • Duplicates after -collapse- and -egen, group()-

    In brief, here's my question:
    I use -collapse- to guarantee no duplicates in terms of person, location, and date. Then I run -egen, group(person location)- and find that there are multiple person-location combinations with the same group identifier. Why is this?

    Here is a sample of the original dataset, before -collapse-:
    Code:
     
    location person date action_type action_count
    11512816225 K98501875 12-Nov-16 1 3
    11512816225 K98501875 12-Nov-16 2 2
    11512816225 K98501875 19-Nov-16 1 1
    11512816225 K98501875 19-Nov-16 2 1
    11512816225 K98501875 26-Nov-16 1 2
    11512816225 K98501875 26-Nov-16 2 1
    11512816225 K98501875 3-Dec-16 1 1
    11512816225 K98501875 10-Dec-16 2 1
    11512816225 K98501875 10-Dec-16 1 1
    11512816225 K98501875 24-Dec-16 1 1
    11512816225 K98501875 14-Jan-17 2 2
    11512816225 L218975436 12-Nov-16 1 1
    11512816225 L218975436 12-Nov-16 2 2
    11512816225 L218975436 19-Nov-16 1 4
    11512816225 L218975436 19-Nov-16 2 1
    11512816225 L218975436 26-Nov-16 2 12
    11512816225 L218975436 26-Nov-16 1 6
    11512816225 L218975436 3-Dec-16 1 1
    11512816225 L218975436 10-Dec-16 1 2
    11512816225 L218975436 10-Dec-16 2 1
    11512816225 L218975436 24-Dec-16 1 1
    11512816225 L218975436 24-Dec-16 2 1
    11512816225 L218975436 14-Jan-17 1 1
    I use -collapse- to sum the actions by person-location-date, and create a person-location identifier using -egen, group()-:
    Code:
    . collapse (sum) action_count, by(location person date)
    
    . egen person_loc = group(person location)
    Using person_loc as a unique identifier for each person-location pair, I want to use -tsfill- to create rows for each missing date for each person-location. However, I am unable to use -tsset- because of duplicates:
    Code:
    . tsset person_loc date
    
    repeated time values within panel
    Some digging reveals that there are duplicates of (person_loc date) but not (person location date). Looking only at the duplicates, and running the same -egen, group()- code, yields no duplicates.
    Code:
    . duplicates tag person_loc date, gen(dup1) // omitted: sum shows about 10% duplicates
    
    . duplicates tag person location date, gen(dup2) // omitted: sum shows no duplicates
    
    . keep if dup1==1
    
    . egen person_loc2 = group(person location)
    
    . duplicates report person_loc2 date
    
    Duplicates in terms of person_loc2 date
    
    --------------------------------------
       copies | observations       surplus
    ----------+---------------------------
            1 |      7398210             0
    --------------------------------------
    Here is a sample of my data with both group variables:
    Code:
     
    location person date action_count person_loc person_loc2
    11512816225 B01LY7RJ3S 12-Nov-16 5 16777216 1
    11512816225 B01LY7RJ3S 19-Nov-16 2 16777216 1
    11512816225 B01LY7RJ3S 26-Nov-16 3 16777216 1
    11512816225 B01LY7RJ3S 3-Dec-16 1 16777216 1
    11512816225 B01LY7RJ3S 10-Dec-16 2 16777216 1
    11512816225 B01LY7RJ3S 24-Dec-16 1 16777216 1
    11512816225 B01LY7RJ3S 14-Jan-17 2 16777216 1
    11512816225 B01LY8OC7F 12-Nov-16 3 16777216 2
    11512816225 B01LY8OC7F 19-Nov-16 5 16777216 2
    11512816225 B01LY8OC7F 26-Nov-16 18 16777216 2
    11512816225 B01LY8OC7F 3-Dec-16 1 16777216 2
    11512816225 B01LY8OC7F 10-Dec-16 3 16777216 2
    11512816225 B01LY8OC7F 24-Dec-16 2 16777216 2
    11512816225 B01LY8OC7F 14-Jan-17 1 16777216 2
    A couple notes:
    • There are hundreds of millions of rows and hundreds of thousands of people/locations
    • There are no missing values of person, location, or date
    • location is stored as a double, person is stored as a string, date is stored as a date-formatted float
    • I cannot seem to recreate the problem on a dataset small enough to share, which makes me think it has something to do with how large it is
    Is this expected behavior of -egen, group()-? Is there a limit to the number of groups or something? Why did -egen, group()- not distinguish the 2 different entries of person the first time, but did the second time (after deleting the non-duplicated values)?

  • #2
    Please use dataex to generate your data examples. It's included with Stata 15.1 and is available from SSC if you are using an older version of Stata.

    I'm going to take a wild guess here that you have many more person-location groups than you think, more than can be stored accurately in a float. Try
    Code:
    egen long person_loc = group(person location)
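
    To see why a float runs out of room: a float holds integers exactly only up to 16,777,216 (2^24), which is the person_loc value that appears in the listing in #1. The lines below are a minimal sketch using made-up integers, not the original data, and they start with -clear-, so run them only in a session with nothing you need in memory. The variable names id_d, id_f, and id_l are hypothetical.

```stata
* float() rounds its argument to float precision
display %12.0f float(16777216)
display %12.0f float(16777217)   // not exactly representable as a float

* the same values stored in variables of different types (toy data)
clear
set obs 3
generate double id_d = 16777215 + _n   // 16777216, 16777217, 16777218
generate float  id_f = id_d            // rounded to the nearest float
generate long   id_l = id_d            // long holds these integers exactly
format id_d id_f id_l %12.0f
list, noobs
```

    In the listing, id_f should fail to distinguish all three values while id_l keeps them apart, which is the same collision that merges distinct person-location pairs once the group counter passes 2^24.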


    • #3
      Thanks, Robert Picard! You were absolutely correct.

      I suspected the data type might be the problem, but I attempted to address it in the wrong way, by trying to change the data type of location before using -egen, group()-. I see now that location is ok as-is, but because there are so many groups, I need the group identifier variable to be a long.
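
      For anyone landing here with the same problem, a rough sketch of how the fix could be checked before moving on to -tsset-. It assumes the variable names used in #1 and that new variables default to float (Stata's default unless -set type- has been changed). The -bysort- check sorts the data, which may be slow on a dataset this large.

```stata
* recreate the identifier as a long instead of the default float
capture drop person_loc
egen long person_loc = group(person location)

* each group id should map to exactly one person and one location
bysort person_loc (person location): assert person[1] == person[_N] & location[1] == location[_N]

* one row per group id and date, which is what -tsset- requires
isid person_loc date
```

      If both checks pass silently, person_loc identifies person-location pairs uniquely and -tsset person_loc date- should no longer complain about repeated time values.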

      Hopefully someone else working with giant datasets will see this post and avoid my mistake! Thanks again.
