  • Unexpected behavior of "_n" within egen command while using a by group.

    Hello,
    I've encountered a peculiar issue when using "_n==1" inside an egen call combined with a by group.
    The data are structured this way: each observation is a job, with various variables that describe it. A single person (id) can hold several jobs. The goal is to assign the occupation code of the highest-paying job to all of that person's observations.

    This is the code I tried first:
    Code:
    gsort id -wage
    by id: egen main_occ_1=max(occ/(_n==1))
    It didn't work correctly: only the first id got the expected values, and every other id was left missing.


    This code worked correctly and produced what I expected:
    Code:
    gsort id -wage
    by id: g first=_n==1
    by id: egen main_occ_2=max(occ/(first==1))
    So the question is: why didn't the first version work, despite seemingly having the same logic as the second?

    Here is code to recreate the data, including the variables produced by both versions (main_occ holds the result of the first version):
    Code:
    clear
    input float(id wage occupation_code main_occ first main_occ_2)
    1 5000 9 9 1 9
    1 3000 4 9 0 9
    1 2000 8 9 0 9
    1 1000 5 9 0 9
    2 3000 2 . 1 2
    2 2000 7 . 0 2
    2 2000 1 . 0 2
    3 4000 5 . 1 5
    3 1000 8 . 0 5
    end

    Thanks.

  • #2
    The help for egen warns against references to subscripts — because many of its functions sort temporarily and may break your intended interpretation.
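
    A minimal sketch of one way to heed that warning: store the within-group position in an ordinary variable before calling egen, so that no _n appears inside the egen expression. The names pos and main_occ_safe are made up for illustration.

    ```stata
    * sort so that each person's highest-paying job comes first
    gsort id -wage
    * freeze the within-id position in a variable (pos is a hypothetical name)
    by id: gen long pos = _n
    * the expression is missing except where pos == 1, so max() returns that one value
    egen main_occ_safe = max(occupation_code / (pos == 1)), by(id)
    ```

    This is the same idea as the second version in #1, written with the by() option instead of the by: prefix.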



    • #3
      Got it. Never bothered to look at the actual command description for some reason...
      Thanks Nick.



      • #4
        To compound the mystery, the problem arises with only some egen functions. The max function in your egen call is just a dummy: all you want is to pick up the single nonmissing value of the expression in parentheses, so many other functions could replace it. For this kind of task I usually use mean as the dummy function. As you can see below, mean and total do not fail us, but min and max do:

        Code:
        . gsort id -wage
        
        . by id: egen main_occ_max=max(occ/(_n==1))
        (5 missing values generated)
        
        . by id: egen main_occ_min=min(occ/(_n==1))
        (5 missing values generated)
        
        . by id: egen main_occ_mean=mean(occ/(_n==1))
        
        . by id: egen main_occ_total=total(occ/(_n==1))
        
        . list, sep(0) abb(16)
        
             +--------------------------------------------------------------------------------------------+
             | id   wage   occupation_code   main_occ_max   main_occ_min   main_occ_mean   main_occ_total |
             |--------------------------------------------------------------------------------------------|
          1. |  1   5000                 9              9              9               9                9 |
          2. |  1   3000                 4              9              9               9                9 |
          3. |  1   2000                 8              9              9               9                9 |
          4. |  1   1000                 5              9              9               9                9 |
          5. |  2   3000                 2              .              .               2                2 |
          6. |  2   2000                 1              .              .               2                2 |
          7. |  2   2000                 7              .              .               2                2 |
          8. |  3   4000                 5              .              .               5                5 |
          9. |  3   1000                 8              .              .               5                5 |
             +--------------------------------------------------------------------------------------------+
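
        The max and min columns above are what you would get if the expression were evaluated over the whole dataset, where _n==1 is true only for the very first observation, and the function were applied by group afterwards. Whether egen actually does this internally is an assumption; it is not confirmed anywhere in this thread. A sketch that reproduces the pattern by hand, with hypothetical names ratio_all and check_max:

        ```stata
        gsort id -wage
        * no by: prefix here, so _n counts across the entire dataset
        gen double ratio_all = occupation_code / (_n == 1)
        * only observation 1 is nonmissing, so only the first id gets a value
        by id: egen check_max = max(ratio_all)
        list id wage occupation_code ratio_all check_max, sep(0)
        ```

        With the example data, check_max is 9 for id 1 and missing for ids 2 and 3, the same as main_occ_max.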



        • #5
          Yeah, it's a dummy function in this case, but sometimes there's a more complicated condition after the division that would result in more than one possible observation per group for which to compute something with egen. It's a useful "trick" I learned from Nick a while ago.

          It is funny that it works with other egen functions, but in any case, lesson learned: don't use subscripting with egen. It wouldn't hurt for StataCorp to implement a hardwired sanity check that raises an error when subscripting is used with egen, given its unpredictable nature.



          • #6
            The trick Nick shows in one of his columns (that division by zero in Stata does not break the computer but simply produces a missing value) is useful, but in your case what you want can easily be done with alternative methods. There is also the question of what to do in badly defined situations, e.g. when the highest salary occurs in more than one occupation, or when the salary is missing for all occupations.

            Alternative solution one, only sorting, no egens:
            Code:
            . gsort id -wage
            
            . by id: gen occsort = occupation_code[1]
            Alternative solution two, no sorting, only egens:

            Code:
            . egen maxwage = max(wage), by(id)
            
            . egen occup = max(occupation_code) if maxwage==wage, by(id)
            (6 missing values generated)
            
            . egen occegen = mean(occup), by(id)
            
            . list, sep(0)
            
                 +------------------------------------------------------------------------------------------+
                 | id   wage   occupa~e   main_occ   first   main_o~2   occsort   maxwage   occup   occegen |
                 |------------------------------------------------------------------------------------------|
              1. |  1   5000          9          9       1          9         9      5000       9         9 |
              2. |  1   3000          4          9       0          9         9      5000       .         9 |
              3. |  1   2000          8          9       0          9         9      5000       .         9 |
              4. |  1   1000          5          9       0          9         9      5000       .         9 |
              5. |  2   3000          2          .       1          2         2      3000       2         2 |
              6. |  2   2000          7          .       0          2         2      3000       .         2 |
              7. |  2   2000          1          .       0          2         2      3000       .         2 |
              8. |  3   4000          5          .       1          5         5      4000       5         5 |
              9. |  3   1000          8          .       0          5         5      4000       .         5 |
                 +------------------------------------------------------------------------------------------+
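
            The badly defined situations mentioned above can be flagged before choosing an occupation. A minimal sketch, assuming the same id and wage variables; topwage, n_top, tie_at_top and no_wage are hypothetical names:

            ```stata
            * highest wage per person; missing if every wage for that id is missing
            egen double topwage = max(wage), by(id)
            * number of jobs that attain the top wage (a missing wage never counts)
            egen n_top = total(wage == topwage & !missing(wage)), by(id)
            * 1 if the top wage is shared by two or more jobs
            gen byte tie_at_top = n_top > 1
            * 1 if the person has no nonmissing wage at all
            gen byte no_wage = missing(topwage)
            ```

            The !missing(wage) guard matters because Stata evaluates missing == missing as true. For the same reason, solution two above keeps every job of a person whose wages are all missing, and among tied jobs it takes the largest occupation code.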

            Originally posted by Evgeny Sironov
            Yeah, it's a dummy function in this case, but sometimes there's a more complicated condition after the division that would result in more than one possible observation per group for which to compute something with egen. It's a useful "trick" I learned from Nick a while ago.

            It is funny that it works with other egen functions, but in any case, lesson learned: don't use subscripting with egen. It wouldn't hurt for StataCorp to implement a hardwired sanity check that raises an error when subscripting is used with egen, given its unpredictable nature.



            • #7
              The allusion in #6 is to https://www.stata-journal.com/articl...article=dm0055 which is freely visible.

              The division by zero trick is in Section 10 of that paper, while earlier sections will be found to endorse precisely the kind of more basic techniques that Joro Kolev rightly recommends.

              It's very hard in Stata, and ultimately quite unimportant, to know whether you read some trick earlier somewhere and didn't register at the time, but then later re-invented it yourself. It's just possible that the division by zero trick is a little original, but otherwise dm0055 is, may I say, one of the more useful of my columns if I can judge by the number of times it has been relevant to a question here.
