Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • egen newvar=group(varlist) does not generate a variable based on all unique combinations in the varlist

    I have been using egen newvar=group(varname1 varname2) for a while, with the understanding that newvar would be a new variable containing all unique combinations of varname1 and varname2. Since I want to collapse my data to 1 observation per unique combination of varname1 and varname2, but I also need to have these variables (as well as newvar) in the final dataset, I decided to test the following code (a somewhat pathological example, I agree, but I regularly work with datasets with tens of millions of observations) to see if it was faster to put varname1 and varname2 in the by() option or in the variable list. Remember, there should be no variation in varname1 and varname2 within values of newvar, so the means of varname1 and varname2 that come out of collapse are just their values for each specific value of newvar, and both approaches should give the same thing. However, I got an unfortunate surprise: egen group does not appear to generate its new variable based on all unique combinations in the varlist. Here is the code I tested:

    Code:
    clear
    set obs 20000000
    gen indid=floor(uniform()*10000)
    gen firmid=floor(uniform()*10000)
    sort indid firmid
    egen matchid=group(indid firmid)
    drawnorm x y
    preserve
    timer on 1 sort matchid collapse x y indid firmid, by(matchid) summarize timer off 1
    restore preserve
    timer on 2 sort matchid indid firmid collapse x y, by(matchid indid firmid) summarize timer off 2
    restore timer list 1 timer list 2
    The first summarize gives
    Variable | Obs Mean Std. Dev. Min Max
    -------------+---------------------------------------------------------
    matchid | 17,451,332 8738687 5060250 1 1.81e+07
    x | 17,451,332 .0003232 .9687208 -5.402949 5.76085
    y | 17,451,332 .000109 .9687222 -5.663941 5.742223
    indid | 17,451,332 4820.521 2791.404 0 9999
    firmid | 17,451,332 4998.954 2886.053 0 9999


    and the second gives
    Variable | Obs Mean Std. Dev. Min Max
    -------------+---------------------------------------------------------
    indid | 18,125,449 4999.293 2886.385 0 9999
    firmid | 18,125,449 4999.083 2886.161 0 9999
    matchid | 18,125,449 9062725 5232367 1 1.81e+07
    x | 18,125,449 .0002974 .9751242 -5.402949 5.76085
    y | 18,125,449 .0001436 .9751012 -5.663941 5.742223


    Note the different number of observations; since matchid was defined by combinations of indid and firmid, there should be no variation in these variables within matchid and adding them to the by() option should not change the number of observations or anything else in the descriptive statistics (this holds, by the way, when I choose 1000 different values for indid and firmid, instead of 10 000).

    To follow up on this, I checked to see that matchid was indeed not getting unique combinations with this code:

    Code:
    format matchid %9.0f
    tempfile hold
    sort matchid
    save `hold'
    collapse (sd) sdi=indid sdf=firmid, by (matchid)
    merge matchid using `hold'
    gsort -sdi -sdf matchid
    order matchid indid firmid sdi sdf
    list in 1/10
    and Stata returns:
    +----------------------------------------------------------------------------------+
    | matchid indid firmid sdi sdf x y _merge |
    |----------------------------------------------------------------------------------|
    1. | 17283644 9534 9999 .5773503 5772.637 .1314288 .5337735 3 |
    2. | 17283644 9535 0 .5773503 5772.637 -1.074206 1.676847 3 |
    3. | 17283644 9535 1 .5773503 5772.637 -.2784413 -2.624709 3 |
    4. | 17998432 9929 9999 .5773503 5772.637 .654556 .3150313 3 |
    5. | 17998432 9930 1 .5773503 5772.637 -1.504835 .3309812 3 |
    |----------------------------------------------------------------------------------|
    6. | 17998432 9930 0 .5773503 5772.637 -.0689836 -1.354481 3 |
    7. | 16986156 9370 9998 .5773503 5772.06 -.1597866 .414003 3 |
    8. | 16986156 9371 0 .5773503 5772.06 .0601182 -.0195616 3 |
    9. | 16986156 9370 9998 .5773503 5772.06 -.1587963 1.086275 3 |
    10. | 16986156 9371 1 .5773503 5772.06 .5464685 .4928921 3 |
    +----------------------------------------------------------------------------------+


    where you can see that the same matchid has multiple indid-firmid combinations associated with it.

    Clearly, I seem to have misunderstood what egen group does. Is this a commonly known issue? If so, is there a better way to generate a variable that refers to unique combinations of variables in the varlist?

    Thank you all for your time.

    P.S. Here are the timers, if you are curious:
    . timer list 1
    1: 128.52 / 4 = 32.1290

    . timer list 2
    2: 217.01 / 3 = 72.3383





  • #2
    My guess is that you need to specify long as a variable type to get each distinct (not unique!) value represented as such.

    Comment


    • #3
      Dear Nick,

      Thanks, your intuition was perfect for my example. Somewhat surprisingly (at least to me), specifying int (or double) as the variable type also fixes the problem, whereas float doesn't work despite it also being a 4-byte data type. Given the ranges for int variables in the Stata documentation (-32,767 to 32,740), I wasn't expecting it to be able to represent a variable that takes on tens of millions of distinct integer values accurately, but I guess it can.

      Thanks again for the quick, and helpful, suggestion!

      Comment


      • #4
        Glad it helped.

        group() won't produce results less than 1, so the possibilities for an int type are fewer than you say. For that and other reasons I wouldn't trust int here. You may get the same results with either method, but my guess is that just means that some different values were lumped together in the same way.

        Comment


        • #5
          For what it's worth, these are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.

          byte - 7 bits -127 100
          int - 15 bits -32,767 32,740
          long - 31 bits -2,147,483,647 2,147,483,620
          float - 24 bits -16,777,216 16,777,216
          double - 53 bits -9,007,199,254,740,992 9,007,199,254,740,992










          Addressing the question of why int worked as well as long, once egen tried to insert a value larger than 32740 into the int, its type was escalated from int to long.
          Code:
          . clear
          
          . set obs 40000
          number of observations (_N) was 0, now 40,000
          
          . generate long x = _n
          
          . egen int g = group(x)
          
          . describe
          
          Contains data
            obs:        40,000                          
           vars:             2                          
          ------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ------------------------------------------------------------
          x               long    %12.0g                
          g               long    %8.0g                 group(x)
          ------------------------------------------------------------
          Last edited by William Lisowski; 02 Jan 2020, 19:46.

          Comment

          Working...
          X