Why doesn't by/bysort work with everything?

Nick Huntington-Klein

Join Date: Jul 2014

Posts: 22
#1

24 Aug 2017, 11:52

This is more out of my own curiosity than trying to solve a problem.

I will occasionally use "by x:" or "bysort x:" with a command. This does not work for all commands. For example, "bysort x: mean y" will return the error "mean may not be combined with by"

And when this happens, I grunt and sigh and write a simple for loop of the format

* assumes x is numeric
levelsof x, local(xvalues)
foreach i of local xvalues {
    mean y if x == `i'
}

I've always imagined that the "by" prefix is basically just a shorthand for that loop that I wrote (with some bells and whistles tacked on top). But it must not be, because if it were, there's no reason why any command that accepts "if" would be unable to handle "by." But there are several commands that can't handle "by".

So what is actually going on behind the scenes with "by"? How is it structured that it can't do this seemingly simple thing?
Nick Cox

Join Date: Mar 2014

Posts: 35755
#2

24 Aug 2017, 12:19

Just one thing, but it's often crucial. If you run something under by: then typically only the last set of estimation results remains accessible. For many groupwise commands, that is not what you want.
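A minimal sketch of that point, assuming the shipped auto dataset and a by-able estimation command (-regress-, chosen here only for illustration): after the by: run, e() holds the results for the last group alone.

```stata
sysuse auto, clear

* Run a by-able estimation command once per group
bysort foreign: regress price weight

* Only the last group's results (foreign == 1) are still in e()
display e(N)
matrix list e(b)
```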

Also, -if- tends to have just one meaning: selecting observations.

by: has at least two meanings that are distinct:

Do something separately for groups but in essence all at once.

Do something separately for groups but one after the other.
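The two meanings can be sketched with the auto dataset. The variable names n_in_group and rank_in_group are made up for this illustration.

```stata
sysuse auto, clear

* Meaning 1: separately for groups, but in essence all at once.
* One pass creates a single variable covering every group.
bysort foreign: generate n_in_group = _N
bysort foreign (price): generate rank_in_group = _n

* Meaning 2: separately for groups, one after the other.
* The command runs and reports once per group, in sequence.
by foreign: summarize price
```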

Richard Williams

Join Date: Apr 2014

Posts: 5011
#3

24 Aug 2017, 12:37

Interesting question. Mean has the -over- option, so you don't need by. It still seems it could be supported though.

What are some of the other commands that do not work with by? If we had a list, we could maybe see what they have in common.

If you use the -over- option, you do get all the results stored, not just the last one. Maybe that has something to do with it. e.g.

Code:

. webuse nhanes2f, clear

. mean weight height, over(race)

Mean estimation                   Number of obs   =     10,337

        White: race = White
        Black: race = Black
        Other: race = Other

--------------------------------------------------------------
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
weight       |
       White |    71.7112   .1586525      71.40021    72.02219
       Black |   75.09187   .5137014      74.08491    76.09882
       Other |   63.15765   .9725929      61.25118    65.06412
-------------+------------------------------------------------
height       |
       White |   167.7604   .1014584      167.5615    167.9592
       Black |   167.8115   .2897411      167.2435    168.3794
       Other |   161.8428   .6246221      160.6184    163.0672
--------------------------------------------------------------

. mat list e(b)

e(b)[1,6]
       weight:    weight:    weight:    height:    height:    height:
        White      Black      Other      White      Black      Other
y1  71.711204  75.091869   63.15765  167.76037  167.81149  161.84282

. mat list e(_N)

e(_N)[1,6]
     weight:  weight:  weight:  height:  height:  height:
      White    Black    Other    White    Black    Other
r1     9051     1086      200     9051     1086      200

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://academicweb.nd.edu/~rwilliam/


Richard Williams

Join Date: Apr 2014

Posts: 5011
#4

24 Aug 2017, 12:48

Now you've got me wondering why more commands don't support the -over- option. It seems superior to by because all the results get stored, not just the last one. Although I suppose that could get messy with commands that store a lot of results, e.g. a lot of stuff that gets stored as scalars would instead have to be stored as matrices.
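For commands without -over-, one existing route to keeping every group's results is the -statsby- prefix, which collects the requested results into a dataset with one observation per group. A minimal sketch, assuming the auto dataset; the new variable names mean, sd and n are chosen here for illustration:

```stata
sysuse auto, clear

* Collect r(mean), r(sd) and r(N) from -summarize- for each group.
* -clear- replaces the data in memory with the collected results.
statsby mean=r(mean) sd=r(sd) n=r(N), by(foreign) clear: summarize price

list
```

This sidesteps the scalars-versus-matrices problem by storing each result as a variable rather than in r() or e().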

Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#5

24 Aug 2017, 12:48

I do not know how Stata implements -by- internally, nor how they decide which commands to make -by-able. Perhaps a StataCorp employee will respond, if it isn't a proprietary secret.

But I can tell you quite confidently that -by- is not implemented as an internal -levelsof- followed by a -foreach- loop. The reason is clear if you look at some timings. Here's some output I generated from a toy data set of 1,000,000 observations in 1,000 equally sized groups. The iterated command simply summarizes one variable that has random values. I timed four different approaches. The first uses -by group: summ x-. The second uses -levelsof- followed by -foreach-. The third is the same as the second, but it summarizes all 1,000,000 x's each time through: the point of this one is to get a sense of how much of the time used by the second approach is spent managing the loop iteration and interpreting the -summ- command. Finally, the fourth is an approach to emulating -by- in situations where the -levelsof-/-foreach- approach is too slow to tolerate (computationally intensive commands with large data sets). It relies on creating "pointers" to the observations in each group so that -in-, rather than -if-, can be used to select the active subsample, and then iterating the pointers through the data set.

Code:

. clear*
.
. set obs 1000
number of observations (_N) was 0, now 1,000
. set seed 1234
. gen int group = _n
. expand 1000
(999,000 observations created)
.
. gen x = rnormal()
.
. sort group
.
. timer clear
.
. // USING -by-
. timer on 1
. quietly by group: summ x
. timer off 1
.
. // USING -levelsof- AND SELECTING WITH -if-
. quietly levelsof group, local(groups)
. timer on 2
. foreach g of local groups {
  2.     quietly summ x if group == `g'
  3. }
. timer off 2
.
. // TIME SPENT INTERPRETING AND SUMMARIZING
. timer on 3
. foreach g of local groups {
  2.     quietly summ x
  3. }
. timer off 3
.
. // USING "POINTERS" TO EMULATE -by-
. timer on 4
. by group: gen int gcount = _N
. local N = _N
. local first 1
. local last = gcount[1]
. while `first' <= `N' {
  2.     quietly summ x in `first'/`last'
  3.     local first = `first' + gcount[`first']
  4.     local last = `first' + gcount[`first'] - 1
  5. }
. timer off 4
.
. timer list
   1:      0.09 /        1 =       0.0900
   2:     33.43 /        1 =      33.4270
   3:     22.43 /        1 =      22.4320
   4:      0.11 /        1 =       0.1140

You can see that -by- is much faster than -levelsof-/-foreach-. You can also see that about 1/3 of the time spent by -levelsof-/-foreach- is spent implementing the condition specified by -if-. The pointer-based emulation of -by- runs almost as quickly as -by- itself, and it is conceivable that the difference between them is the overhead of interpreting the commands that manipulate the pointers, as opposed to having those commands in compiled code. Of course, the real implementation of -by- cannot rely on -by group: gen int gcount = _N- to initialize the process, so they presumably have some very fast compiled routine that does that. I would hazard a guess that -by-'s internals are something like the pointer method, but it's just speculation on my part.
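A small check that the pointer approach selects the same observations as -by- can be sketched as follows. This is an illustration on a 20-observation toy data set, not a statement about -by-'s internals; the loop is a slightly rearranged version of the one in the log, and the variable names mean_by and mean_ptr are made up here.

```stata
clear
set seed 1234
set obs 5
gen int group = _n
expand 4
gen x = rnormal()
sort group

* Reference: group means computed under -by-
by group: egen double mean_by = mean(x)

* Pointer-style emulation: walk through the sorted data block by block
by group: gen long gcount = _N
gen double mean_ptr = .
local N = _N
local first 1
while `first' <= `N' {
    * last observation of the block that starts at `first'
    local last = `first' + gcount[`first'] - 1
    quietly summarize x in `first'/`last', meanonly
    quietly replace mean_ptr = r(mean) in `first'/`last'
    local first = `last' + 1
}

* Both routes should agree up to rounding
assert reldif(mean_by, mean_ptr) < 1e-8
```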

Added: Crossed with #2 and #3. Also, if my speculation is correct, that would imply that -by- can be used with any command that leaves the sort order of the data undisturbed. I can't see any reason why -mean- would shuffle the data around, yet it is not -by-able, which suggests my speculation is not correct, though conceivably there is some other reason (such as, as Richard points out, the availability of the -over()- option to provide the same functionality).

Last edited by Clyde Schechter; 24 Aug 2017, 12:59.
Nick Huntington-Klein

Join Date: Jul 2014

Posts: 22
#6

24 Aug 2017, 13:47

That's interesting. It is a good point that by requires the data to be sorted, and so it is likely stepping through the groups in a way that is faster and does not use "if." But it is curious that it doesn't apply to mean.

I wonder if it's something that is shut off manually for commands that have a superior approach built in, like over for mean.
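For user-written programs, at least, by-ability is opt-in rather than shut off: a program works under by: only if it is declared with the byable() option of -program define-. This says nothing certain about how StataCorp decides for built-in commands such as mean. A minimal sketch, where mymean is a hypothetical program name invented for this example:

```stata
capture program drop mymean
program define mymean, rclass byable(recall)
    * mymean is a made-up wrapper used only for illustration
    syntax varname [if] [in]
    * under byable(recall), marksample also restricts to the current by-group
    marksample touse
    quietly summarize `varlist' if `touse', meanonly
    display as text "mean of `varlist' = " as result r(mean)
    return scalar mean = r(mean)
end

sysuse auto, clear
bysort foreign: mymean price
```

Dropping byable(recall) from the definition makes the same by: call fail with a "may not be combined with by" error, like the one quoted in #1.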

I wish I had a list of other commands that by doesn't work with off the top of my head - I have definitely run across others, but mean is the only one that springs to mind.

edit: as an additional note, I think Stata stores some additional information about the data set's group structure when you sort, and so I wonder whether the "by group: gen int gcount = _N" step in Method 4, which counts the observations per group, can simply be skipped by "by" itself if group sizes are already stored at sort time.
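What can be checked directly is that Stata records which variables the data are currently sorted by; whether group sizes are stored as well is not something this sketch can show. The local macro name sortvars is chosen here for illustration.

```stata
sysuse auto, clear
sort foreign price

* The sort order is recorded with the data and can be read back
local sortvars : sortedby
display "`sortvars'"

* -describe- reports the same information in its "Sorted by:" line
describe, short
```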

Last edited by Nick Huntington-Klein; 24 Aug 2017, 14:03.
Andrew Lover

Join Date: Apr 2014

Posts: 182
#7

24 Aug 2017, 16:34

Again, this doesn't address the "why" of Nick's question, but -tabstat- does all the same and more. Potentially -by- has simply never been implemented for -mean- because there are other (potentially better) options? Clyde, does timing a long -tabstat- provide any useful info?

Code:

webuse nhanes2f, clear
tabstat weight, by(race) stat(mean p50)

Last edited by Andrew Lover; 24 Aug 2017, 16:45.

____________________________________________________
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#8

24 Aug 2017, 16:50

Interesting question, Andrew. It turns out that -tabstat- is the slowest of all, at 68.5760 seconds using the same data.