egen newvar=group(varlist) does not generate a variable based on all unique combinations in the varlist

David Margolis

Join Date: Jan 2020

Posts: 4
#1

egen newvar=group(varlist) does not generate a variable based on all unique combinations in the varlist

02 Jan 2020, 05:38

I have been using egen newvar=group(varname1 varname2) for a while, with the understanding that newvar would be a new variable containing all unique combinations of varname1 and varname2. Since I want to collapse my data to 1 observation per unique combination of varname1 and varname2, but I also need to have these variables (as well as newvar) in the final dataset, I decided to test the following code (a somewhat pathological example, I agree, but I regularly work with datasets with tens of millions of observations) to see if it was faster to put varname1 and varname2 in the by() option or in the variable list. Remember, there should be no variation in varname1 and varname2 within values of newvar, so the means of varname1 and varname2 that come out of collapse are just their values for each specific value of newvar, and both approaches should give the same thing. However, I got an unfortunate surprise: egen group does not appear to generate its new variable based on all unique combinations in the varlist. Here is the code I tested:

Code:

clear set obs 20000000 gen indid=floor(uniform()*10000) gen firmid=floor(uniform()*10000) sort indid firmid egen matchid=group(indid firmid) drawnorm x y preserve
timer on 1 sort matchid collapse x y indid firmid, by(matchid) summarize timer off 1
restore preserve
timer on 2 sort matchid indid firmid collapse x y, by(matchid indid firmid) summarize timer off 2
restore timer list 1 timer list 2

The first summarize gives
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
matchid | 17,451,332 8738687 5060250 1 1.81e+07
x | 17,451,332 .0003232 .9687208 -5.402949 5.76085
y | 17,451,332 .000109 .9687222 -5.663941 5.742223
indid | 17,451,332 4820.521 2791.404 0 9999
firmid | 17,451,332 4998.954 2886.053 0 9999

and the second gives
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
indid | 18,125,449 4999.293 2886.385 0 9999
firmid | 18,125,449 4999.083 2886.161 0 9999
matchid | 18,125,449 9062725 5232367 1 1.81e+07
x | 18,125,449 .0002974 .9751242 -5.402949 5.76085
y | 18,125,449 .0001436 .9751012 -5.663941 5.742223

Note the different number of observations; since matchid was defined by combinations of indid and firmid, there should be no variation in these variables within matchid and adding them to the by() option should not change the number of observations or anything else in the descriptive statistics (this holds, by the way, when I choose 1000 different values for indid and firmid, instead of 10 000).

To follow up on this, I checked to see that matchid was indeed not getting unique combinations with this code:

Code:

format matchid %9.0f tempfile hold sort matchid save `hold' collapse (sd) sdi=indid sdf=firmid, by (matchid) merge matchid using `hold' gsort -sdi -sdf matchid order matchid indid firmid sdi sdf list in 1/10

and Stata returns:
+----------------------------------------------------------------------------------+
| matchid indid firmid sdi sdf x y _merge |
|----------------------------------------------------------------------------------|
1. | 17283644 9534 9999 .5773503 5772.637 .1314288 .5337735 3 |
2. | 17283644 9535 0 .5773503 5772.637 -1.074206 1.676847 3 |
3. | 17283644 9535 1 .5773503 5772.637 -.2784413 -2.624709 3 |
4. | 17998432 9929 9999 .5773503 5772.637 .654556 .3150313 3 |
5. | 17998432 9930 1 .5773503 5772.637 -1.504835 .3309812 3 |
|----------------------------------------------------------------------------------|
6. | 17998432 9930 0 .5773503 5772.637 -.0689836 -1.354481 3 |
7. | 16986156 9370 9998 .5773503 5772.06 -.1597866 .414003 3 |
8. | 16986156 9371 0 .5773503 5772.06 .0601182 -.0195616 3 |
9. | 16986156 9370 9998 .5773503 5772.06 -.1587963 1.086275 3 |
10. | 16986156 9371 1 .5773503 5772.06 .5464685 .4928921 3 |
+----------------------------------------------------------------------------------+

where you can see that the same matchid has multiple indid-firmid combinations associated with it.

Clearly, I seem to have misunderstood what egen group does. Is this a commonly known issue? If so, is there a better way to generate a variable that refers to unique combinations of variables in the varlist?

Thank you all for your time.

P.S. Here are the timers, if you are curious:
. timer list 1
1: 128.52 / 4 = 32.1290

. timer list 2
2: 217.01 / 3 = 72.3383
Tags: collapse, egen, group
Nick Cox

Join Date: Mar 2014

Posts: 35711
#2

02 Jan 2020, 05:45

My guess is that you need to specify long as a variable type to get each distinct (not unique!) value represented as such.
Comment
David Margolis

Join Date: Jan 2020

Posts: 4
#3

02 Jan 2020, 06:33

Dear Nick,

Thanks, your intuition was perfect for my example. Somewhat surprisingly (at least to me), specifying int (or double) as the variable type also fixes the problem, whereas float doesn't work despite it also being a 4-byte data type. Given the ranges for int variables in the Stata documentation (-32,767 to 32,740), I wasn't expecting it to be able to represent a variable that takes on tens of millions of distinct integer values accurately, but I guess it can.

Thanks again for the quick, and helpful, suggestion!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35711
#4

02 Jan 2020, 06:40

Glad it helped.

group() won't produce results less than 1, so the possibilities for an int type are fewer than you say. For that and other reasons I wouldn't trust int here. You may get the same results with either method, but my guess is that just means that some different values were lumped together in the same way.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

02 Jan 2020, 19:19

For what it's worth, these are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.

byte - 7 bits	-127	100
int - 15 bits	-32,767	32,740
long - 31 bits	-2,147,483,647	2,147,483,620
float - 24 bits	-16,777,216	16,777,216
double - 53 bits	-9,007,199,254,740,992	9,007,199,254,740,992

Addressing the question of why int worked as well as long, once egen tried to insert a value larger than 32740 into the int, its type was escalated from int to long.

Code:

. clear

. set obs 40000
number of observations (_N) was 0, now 40,000

. generate long x = _n

. egen int g = group(x)

. describe

Contains data
  obs:        40,000                          
 vars:             2                          
------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------
x               long    %12.0g                
g               long    %8.0g                 group(x)
------------------------------------------------------------

Last edited by William Lisowski; 02 Jan 2020, 19:46.

Announcement

egen newvar=group(varlist) does not generate a variable based on all unique combinations in the varlist

Comment

Comment

Comment

Comment