  • Why doesn't -egen, group- produce int or long variables by default?

    I was just bitten by unexpected behavior: I have so many unique combinations of two ID variables that -egen newid = group(id1 id2)- produced a float variable with a maximum of 5.60e+07, which does not have enough precision to make the whole exercise useful (i.e., different newid values get rounded together). I hope -egen long newid = …- will fix this, but I am at a loss as to why this is left for the user to take care of. In any case, I am flagging this here for other users, since I did not find any trace of this issue on the web.
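
    To illustrate (a minimal sketch, assuming a data set with the two ID variables id1 and id2; none of this is from the original post):

    Code:
    * a float stores integers exactly only up to 16,777,215
    display float(16777216) == float(16777217)    // 1: the two values collide

    * by default -egen- creates a float, so beyond that limit
    * distinct id combinations can be rounded to the same newid
    egen newid = group(id1 id2)

    * the fix: ask for a long (exact integers up to 2,147,483,620)
    drop newid
    egen long newid = group(id1 id2)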

  • #2
    FWIW, I, too, have been bitten by this on several occasions. I just make it a matter of practice to always specify -egen long new_id = group(id1 id2)-, even if I'm working in a small enough data set that I don't need to. Now it's a habit, and I don't run into the problem any more--at the cost of perhaps wasting a little memory.

    I suppose that StataCorp would say that all commands that create new variables do so as float, or as the type specified in a -set type- command, unless you override it in the command itself, and that the consistency of that rule is worth preserving.

    But I agree with you: it would be nice if -egen- would recognize this situation and, if not handle it directly, at least issue a warning.
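
    As an aside, a quick sketch of that -set type- rule in action (a throwaway 10-observation example, not from this thread):

    Code:
    * the default storage type follows -set type- unless overridden
    clear
    set obs 10
    gen id = _n
    set type float               // the factory default
    egen g1 = group(id)          // stored as float
    set type double
    egen g2 = group(id)          // stored as double
    egen long g3 = group(id)     // an explicit type always wins
    describe g1 g2 g3
    set type float               // restore the default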

    • #3
      One can certainly argue that egen group() should always create a long instead of a float. Both types use 4 bytes. But what should be done if the default type is double (set type double)?

      As is explained in help precision, "Floats can store up to 16,777,215 exactly". So in order for this to bite, you must be trying to create more groups than that. If I were working with that many observations, I would not use egen group(), because it is slower than doing the job from first principles. That's true for most egen functions, in large part because of the preserve / restore steps they include.

      I would think that anyone working with this many observations would have an excellent understanding of help data_types. I'm not saying that you can't be bitten by this, as Clyde was, but it should not come as a surprise, and the solution should be obvious.

      Code:
      * reset timers
      timer clear
      local ntimer 0
      
      
      * number of groups that can be stored in a float
      clear
      set type float
      local obs = 2e7
      set obs `obs'
      gen x = _n
      bysort x: gen N = _N
      tab N
      * the first value that collides with a neighbor, minus 1,
      * is the largest run of consecutive integers a float holds exactly
      sum x if N > 1, meanonly
      dis r(min) - 1
      
      
      * ----------- no type specified --------------
      clear
      set seed 12345
      set obs `obs'
      gen long x = -_n if runiform() < .99    // ~99% unique ids, rest missing
      
      timer on `++ntimer'
      egen group1 = group(x)    // no type given, so group1 is a float
      timer off `ntimer'
      
      timer on `++ntimer'
      bysort x: gen firstobs = _n == 1
      gen group2 = sum(firstobs) if !mi(x)
      timer off `ntimer'
      
      * both versions are float, so they round identically
      assert group1 == group2

      * any N > 1 means distinct groups were rounded to the same id
      bys group2: gen N = _N
      tab N if !mi(group2)
      
      
      * ----------- redo with long --------------
      clear
      set seed 12345
      set obs `obs'
      gen long x = -_n if runiform() < .99
      
      timer on `++ntimer'
      egen long group1 = group(x)
      timer off `ntimer'
      
      timer on `++ntimer'
      bysort x: gen firstobs = _n == 1
      gen long group2 = sum(firstobs) if !mi(x)
      timer off `ntimer'
      
      * with long ids there is no rounding
      assert group1 == group2

      * every group is now distinct: N is 1 throughout
      bys group2: gen N = _N
      tab N if !mi(group2)
      
      timer list

      • #4
        Robert, thanks for that script. Interesting: first principles runs about 40-55% faster than -egen, group()-. I'm surprised the difference is that large.

        And, particularly if this were being iterated on an even larger data set, I would certainly see the advantage of coding it the faster way.

        But I don't work with huge data sets very often. The "several" times I've made the same mistake Laszlo noted amount to about 6 or 7 times over 20 years. Most of my work involves medium-size data sets (several thousand observations), and the computationally intensive part typically involves things that I'm not capable of programming directly myself, such as fitting a non-linear multi-level random effects model. The data management components of my work are usually pretty short and sweet. So I've usually taken the attitude that even if I could shave some minutes off the run time, I'd rather the code be as transparent as possible. Otherwise, I'll give those saved minutes back, with a lot of interest, later when I come back to the code and try to figure out what I did.

        I think, though, if I worked often with very large data sets and analyses that pushed the limits of my hardware and software, I would feel otherwise and program the way you suggest.

        Thanks.

        • #5
          Thanks. Most importantly, which other -egen- functions are likely to be this much faster when done from first principles? I know -egen- is slow, but I never bothered with first principles before. This would be very useful for teams struggling with Stata on big data. Also see http://nber.org/stata/efficient/
          (I was also hoping that with a fast SSD holding our temporary folder, even an occasional -preserve- would be acceptable if necessary. But I don't see why -egen- should need a preserve at all.)
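
          To give a flavor of the first-principles idiom, here is a sketch for two common cases (x and g are made-up variable names; these are standard by-group recipes, not code from this thread):

          Code:
          * stand-in for: egen double tot = total(x), by(g)
          bysort g: gen double tot = sum(x)        // running sum within g
          by g: replace tot = tot[_N]              // spread the group total

          * stand-in for: egen double mu = mean(x), by(g)
          by g: gen long cnt = sum(!missing(x))    // running nonmissing count
          by g: gen double mu = tot[_N] / cnt[_N]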

          • #6
            Also, I am reasonably savvy about Stata's ways: I did read Bill Gould's posts on precision and internalized their lessons. I was still bitten by this error. I checked the documentation and did not find even a brief warning about the 16,777,215 limit.

            • #7
              For many, if not most, people, setting up a monstrous data set is something they might do once every several years, rather than something they do on a routine basis. For such people, egen may be slow, but writing faster customized code may take 10 times as long, if they can do it at all. A lot of people who know little about database manipulation are forced to learn because nobody else is around to do it for them. They'll just be happy to get something that works right, even if it is slow.

              Like Clyde, I would put more effort into my coding if I confronted major challenges routinely as opposed to, say, every year or two.
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 19.5 MP (2 processor)

              EMAIL: [email protected]
              WWW: https://www3.nd.edu/~rwilliam

              • #8
                I think that in Stata's world you constantly need to generate variables for various purposes. Many commands don't accept expressions. (Not that -egen- results would be that easy to express inline anyway.)
