  • egen with by(varlist)

    On StackOverflow, the estimable Nick Cox posted the code quoted below in answer to a question. The second egen command caught my eye with the unfamiliar (to me) by(varlist) option. Unlike the by varlist: prefix, it apparently does not require the data to be sorted.

    I have not been able to find support in the Stata help files or documentation for the by(varlist) option with egen. [Documentation and help from Stata/SE 13.1 for Mac (64-bit Intel) Revision 19 Dec 2014.] I did stumble across by(varlist) for collapse and statsby, and noted that no mention was made of sorting one way or the other for those commands.

    Can anyone point this newbie to further enlightenment in Stata's documentation?

    Code:
    clear 
    input str1 pos str5 name  flag
    A   Joe   1
    A   Joe   1
    B   Frank 0
    C   Mike  2
    C   Ted   0
    D   Mike  2
    D   Mike  2
    E   Bill  1
    F   Bill  1
    end 
    egen tag = tag(name pos) 
    egen npos = total(tag), by(name) 
    list , sepby(pos) 
    
         +---------------------------------+
         | pos    name   flag   tag   npos |
         |---------------------------------|
      1. |   A     Joe      1     1      1 |
      2. |   A     Joe      1     0      1 |
         |---------------------------------|
      3. |   B   Frank      0     1      1 |
         |---------------------------------|
      4. |   C    Mike      2     1      2 |
      5. |   C     Ted      0     1      1 |
         |---------------------------------|
      6. |   D    Mike      2     1      2 |
      7. |   D    Mike      2     0      2 |
         |---------------------------------|
      8. |   E    Bill      1     1      2 |
         |---------------------------------|
      9. |   F    Bill      1     1      2 |
         +---------------------------------+

  • #2
    It's the historic syntax, which is now undocumented but continues to work. I'd have to rummage in old manuals to say when it went undocumented, but at a wild guess it was about 6 versions ago. There is a small risk that StataCorp will withdraw support for it, but my guess is that they have no reason to do that, especially while its usage remains moderately common.

    I surmise that it's not documented because StataCorp decided to encourage you to think of the prefix command by: as the syntax to use. When egen was made by-able, that made the old by() options redundant in principle, but they weren't withdrawn. One reason I am still fond of by() is that you don't have to sort explicitly. I don't think it's necessarily true that all egen functions support a by() option even if it's documented that they allow a by: prefix.

    Inside the code, the by() option is converted to a call to by:.
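
    As a minimal sketch of that equivalence, assuming the example data from #1 are in memory with tag and npos already created, the prefix form would be expected to give the same result as the by() option. The variable name npos2 is made up for illustration, and note that bysort changes the sort order of the data.

    ```stata
    * prefix form: bysort sorts on name, then egen runs within each group
    bysort name: egen npos2 = total(tag)

    * check against the variable created with the by() option
    assert npos2 == npos
    ```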

    • #3
      Thank you, Nick. After my experience finding regular expression documentation relegated to a FAQ, I thought this was a similar situation of a new feature incompletely documented, rather than a deprecated feature sent down the memory hole. I'll pretend I never heard about it and continue doing my own sorting, rather than invest in figuring out when it's usable.

      • #4
        I think that's what StataCorp wants you to do. I wouldn't call this deprecated. StataCorp was trying to tidy up.

        • #5
          I see now that my use of the term "deprecation" is an uncommon one, or at least, not the primary definition found in many dictionaries. I meant not that StataCorp was "expressing disapproval of" the syntax but rather in the software development sense that Wikipedia describes:

          Deprecation is an attribute applied to a computer software feature, characteristic, or practice to indicate that it should be avoided (often because it is being superseded).
          My apologies to StataCorp for any misunderstanding.


          • #6
            I count myself as an enthusiast for the nuances of the English language but did not know of that sense of the word. Naturally, learning of new meanings all the time is how I got to be an enthusiast.

            I see a tension between the meanings of by() and by:. Loosely, it is between whether you do things groupwise at the same time, or groupwise but one after the other. That's hard to sustain and hard to be systematic about. If I use by() on some graph command I expect to see different panels in the same display. (StataCorp themselves were late to the distinction between by() and over().) If I use by: with a statistical command I expect to see results emerging one by one.
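
            To make the contrast concrete, here is a small sketch using the auto dataset shipped with Stata. The choice of dataset and variables is only for illustration.

            ```stata
            sysuse auto, clear

            * by() option on a graph command: one display, one panel per group
            histogram mpg, by(foreign)

            * by: prefix on a statistical command: results emerge group after group
            bysort foreign: summarize mpg
            ```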

            For me the appeal of writing egen calls with by() options is twofold. I learned of such options when I first started using egen and they seemed naturally named and I haven't needed to unlearn the habit because while they went undocumented they didn't go unsupported. Further, I've written various egen functions that support by() options, so the syntax is doubly congenial. That's more by way of commenting on my own code, which was part of the original question, than a recommendation or an implication that anybody else should write in the same way.

            • #7
              In general, does Stata now encourage use of the by: prefix instead of the by() option? I.e., when implementing a new statistical command, should I write it to work with by: or by()?
              (This question goes beyond egen.)

              • #8
                That depends on how you want it to work. If by() basically indicates a grouping variable then a by: prefix would often be wrong. So, it pivots on whether results are to be presented jointly or separately. I've had occasion to implement both but left one undocumented. See also [P] byable for different flavors of byable anyway.
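
                A rough sketch of the two designs follows. The program names mymean_sep and mymean_joint are hypothetical, and the bodies are deliberately trivial; this is one way to set things up, not a recommended template.

                ```stata
                * separate results: declare the program byable so it works with by:
                program define mymean_sep, byable(recall)
                    version 13
                    syntax varname(numeric) [if] [in]
                    * with byable(recall), marksample also restricts to the current by-group
                    marksample touse
                    summarize `varlist' if `touse', meanonly
                    display as text "mean of `varlist': " as result r(mean)
                end

                * joint results: take the grouping variable as a by() option
                program define mymean_joint
                    version 13
                    syntax varname(numeric) [if] [in], by(varname)
                    marksample touse
                    * one table with a row per group
                    tabstat `varlist' if `touse', by(`by') statistics(mean)
                end

                sysuse auto, clear
                bysort foreign: mymean_sep mpg
                mymean_joint mpg, by(foreign)
                ```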

                • #9
                  If a command with the by() option only takes one by variable, how can it be modified to take two or more?

                  • #10
                    If you mean, how do you modify the code of a program to allow the -by()- option to take a varlist, that would depend on the code in the program: it clearly depends on how the -by()- variable is used, and what would be a sensible way to handle multiple variables in that context.
                    If you mean, given a program that has a -by(varname)- option, and you want to run it using all combinations of a varlist instead, then it's:

                    Code:
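                    * combine the variables into a single group identifier
                    egen v = group(insert_varlist_here)
                    * "program whatever" is a placeholder for the command that takes by(varname)
                    program whatever, by(v) maybe_other_options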
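
                    As a concrete instance of the same pattern, assuming the auto dataset shipped with Stata and using tabstat as an example of a command whose by() option takes a single variable:

                    ```stata
                    sysuse auto, clear

                    * one identifier for each observed combination of foreign and rep78;
                    * the label option attaches readable value labels to the groups
                    egen v = group(foreign rep78), label

                    * tabstat's by() takes one variable, so pass the combined identifier
                    tabstat mpg, by(v) statistics(mean n)
                    ```

                    Observations with missing values on any of the grouping variables get a missing v by default, so they drop out of the grouped results.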
