  • Why doesn't -egen, group- produce int or long variables by default?

    I was just bitten by unexpected behavior: I have so many unique combinations of two ID variables that -egen newid = group(id1 id2)- produced a float variable with a maximum of 5.60e+07, which does not have enough precision to make the whole exercise useful (i.e., different newid values get rounded together). I hope -egen long newid = …- will fix this, but I am at a loss as to why this is left for the user to take care of. In any case, I am flagging this here for other users, since I did not find any trace of this issue on the web.
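
    To illustrate (a minimal sketch, assuming a data set with the two ID variables id1 and id2; none of this is from the original post):

    Code:
    * a float stores integers exactly only up to 16,777,215
    display float(16777216) == float(16777217)    // 1: the two values collide

    * by default -egen- creates a float, so beyond that limit
    * distinct id combinations can be rounded to the same newid
    egen newid = group(id1 id2)

    * the fix: ask for a long (exact integers up to 2,147,483,620)
    drop newid
    egen long newid = group(id1 id2)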

  • #2
    FWIW, I, too, have been bitten by this on several occasions. I just make it a matter of practice to always specify -egen long new_id = group(id1 id2)-, even if I'm working in a small enough data set that I don't need to. Now it's a habit, and I don't run into the problem any more--at the cost of perhaps wasting a little memory.

    I suppose that StataCorp would say that all commands that create new variables do so as float, or as the type specified in a -set type- command, unless you override it in the command itself, and that the consistency of that rule is worth preserving.

    But I agree with you: it would be nice if -egen- would recognize this situation and, if not handle it directly, at least issue a warning.
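
    As an aside, a quick sketch of that -set type- rule in action (a throwaway 10-observation example, not from this thread):

    Code:
    * the default storage type follows -set type- unless overridden
    clear
    set obs 10
    gen id = _n
    set type float               // the factory default
    egen g1 = group(id)          // stored as float
    set type double
    egen g2 = group(id)          // stored as double
    egen long g3 = group(id)     // an explicit type always wins
    describe g1 g2 g3
    set type float               // restore the default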

    • #3
      One can certainly argue that egen group() should always create a long instead of a float. Both types use 4 bytes. But what should be done if the default type is double (set type double)?

      As is explained in help precision, "Floats can store up to 16,777,215 exactly". So in order for this to bite, you must be trying to create more groups than that. If I were working with that many observations, I would not use egen group(), because it is slower than doing the job from first principles. That's true for most egen functions, in large part because of the preserve / restore steps they include.

      I would think that anyone working with this many observations would have an excellent understanding of help data_types. I'm not saying that you can't be bitten by this, as Clyde was, but it should not come as a surprise, and the solution should be obvious.

      Code:
      * reset timers
      timer clear
      local ntimer 0
      
      
      * number of groups that can be stored in a float
      clear
      set type float
      local obs = 2e7
      set obs `obs'
      gen x = _n
      bysort x: gen N = _N
      tab N
      * the first value that collides with a neighbor, minus 1,
      * is the largest run of consecutive integers a float holds exactly
      sum x if N > 1, meanonly
      dis r(min) - 1
      
      
      * ----------- no type specified --------------
      clear
      set seed 12345
      set obs `obs'
      gen long x = -_n if runiform() < .99    // ~99% unique ids, rest missing
      
      timer on `++ntimer'
      egen group1 = group(x)    // no type given, so group1 is a float
      timer off `ntimer'
      
      timer on `++ntimer'
      bysort x: gen firstobs = _n == 1
      gen group2 = sum(firstobs) if !mi(x)
      timer off `ntimer'
      
      * both versions are float, so they round identically
      assert group1 == group2

      * any N > 1 means distinct groups were rounded to the same id
      bys group2: gen N = _N
      tab N if !mi(group2)
      
      
      * ----------- redo with long --------------
      clear
      set seed 12345
      set obs `obs'
      gen long x = -_n if runiform() < .99
      
      timer on `++ntimer'
      egen long group1 = group(x)
      timer off `ntimer'
      
      timer on `++ntimer'
      bysort x: gen firstobs = _n == 1
      gen long group2 = sum(firstobs) if !mi(x)
      timer off `ntimer'
      
      * with long ids there is no rounding
      assert group1 == group2

      * every group is now distinct: N is 1 throughout
      bys group2: gen N = _N
      tab N if !mi(group2)
      
      timer list

      • #4
        Robert, thanks for that script. Interesting: first principles runs about 40-55% faster than -egen, group()-. I'm surprised the difference is that large.

        And, particularly if this were being iterated on an even larger data set, I would certainly see the advantage of coding it the faster way.

        But I don't work with huge data sets very often. The "several" times I've made the same mistake Laszlo noted amount to about 6 or 7 times over 20 years. Most of my work involves medium-size data sets (several thousand observations), and the computationally intensive part typically involves things that I'm not capable of programming directly myself, such as fitting a non-linear multi-level random effects model. The data management components of my work are usually pretty short and sweet. So I've usually taken the attitude that even if I could shave some minutes off the run time, I'd rather the code be as transparent as possible. Otherwise, I'll give those saved minutes back, with a lot of interest, later when I come back to the code and try to figure out what I did.

        I think, though, if I worked often with very large data sets and analyses that pushed the limits of my hardware and software, I would feel otherwise and program the way you suggest.

        Thanks.

        • #5
          Thanks. Most importantly, which other -egen- functions are likely to be this much faster when done from first principles? I know -egen- is slow, but I never bothered with first principles before. This would be very useful for teams struggling with Stata on big data. Also see http://nber.org/stata/efficient/
          (I was also hoping that with a fast SSD holding our temporary folder, even an occasional -preserve- would be acceptable if necessary. But I don't see why -egen- should need a preserve at all.)
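
          To give a flavor of the first-principles idiom, here is a sketch for two common cases (x and g are made-up variable names; these are standard by-group recipes, not code from this thread):

          Code:
          * stand-in for: egen double tot = total(x), by(g)
          bysort g: gen double tot = sum(x)        // running sum within g
          by g: replace tot = tot[_N]              // spread the group total

          * stand-in for: egen double mu = mean(x), by(g)
          by g: gen long cnt = sum(!missing(x))    // running nonmissing count
          by g: gen double mu = tot[_N] / cnt[_N]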

          • #6
            Also, I am reasonably savvy about Stata's ways: I did read Bill Gould's posts on precision and internalized their lessons. I was still bitten by this error. I checked the documentation and did not find even a brief warning about the 16,777,215 limit.

            • #7
              For many, if not most, people, setting up a monstrous data set is something they might do once every several years, rather than something they do on a routine basis. For such people, egen may be slow, but writing faster customized code may take 10 times as long, if they can do it at all. A lot of people who know little about database manipulation are forced to learn because nobody else is around to do it for them. They'll just be happy to get something that works right, even if it is slow.

              Like Clyde, I would put more effort into my coding if I confronted major challenges routinely as opposed to, say, every year or two.
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 19.5 MP (2 processor)

              EMAIL: [email protected]
              WWW: https://www3.nd.edu/~rwilliam

              • #8
                I think that in Stata's world you constantly need to generate variables for various purposes. Many commands don't accept expressions. (Not that -egen- results would be that easy to express inline anyway.)
