-by- syntax of adding another variable in brackets

Junran Cao

Join Date: May 2019

Posts: 75
#1


06 Jul 2019, 23:58

Hi Statalist

This is an admittedly general question, but I've struggled to find the answer to it despite having looked in many places. At https://www.stata.com/support/faqs/d...ions-in-group/, the code used is:

Code:

by eid (egenotype), sort: gen diff = egenotype[1] != egenotype[_N]

Question: when do we need to add another variable in brackets following the -by- variable (in this case, (egenotype))? Moreover, why wouldn't the above work if it were simply

Code:

bysort eid: gen diff = egenotype[1] != egenotype[_N]

Thanks.
Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#2

07 Jul 2019, 01:21

Originally posted by Junran Cao

why wouldn't the above work if it is simply

Code:

bysort eid: gen diff = egenotype[1] != egenotype[_N]

See below. I illustrate using your dataset from the other thread, changing the first observation for the second ID.

.
. version 16.0

.
. clear *

.
. input byte ID int Year str1 Gender

           ID      Year     Gender
  1. 1       2007    M
  2. 1       2008    M
  3. 1       2009    M
  4. 2       2007    M // Changed
  5. 2       2008    F
  6. 2       2009    F
  7. 2       2010    M
  8. 2       2011    M
  9. 3       2007    F
 10. 4       2007    F
 11. 4       2008    F
 12. 4       2009    F
 13. 5       2007    M
 14. 5       2008    F
 15. end

.
. bysort ID: generate byte bad_flag = Gender[1] != Gender[_N]

. bysort ID (Gender): generate byte good_flag = Gender[1] != Gender[_N]

.
. list , noobs sepby(ID) abbreviate(20)

  +-------------------------------------------+
  | ID   Year   Gender   bad_flag   good_flag |
  |-------------------------------------------|
  |  1   2008        M          0           0 |
  |  1   2009        M          0           0 |
  |  1   2007        M          0           0 |
  |-------------------------------------------|
  |  2   2008        F          0           1 |
  |  2   2009        F          0           1 |
  |  2   2011        M          0           1 |
  |  2   2007        M          0           1 |
  |  2   2010        M          0           1 |
  |-------------------------------------------|
  |  3   2007        F          0           0 |
  |-------------------------------------------|
  |  4   2009        F          0           0 |
  |  4   2008        F          0           0 |
  |  4   2007        F          0           0 |
  |-------------------------------------------|
  |  5   2008        F          1           1 |
  |  5   2007        M          1           1 |
  +-------------------------------------------+

.
. exit

end of do-file

.

Nick Cox

Join Date: Mar 2014
Posts: 35724

07 Jul 2019, 03:55

The data example in the FAQ cited in #1 can also be used to make the key point. Here it is in self-contained form:

Code:

 clear
 input  eid     str3 egenotype  
 0       vv        
 0       vv        
 1       vv        
 1       ww        
 2       ww        
 2       vv        
 2       ww
end

For identifier 0 we have the same egenotype for all (two) observations.

For identifier 1 we have different egenotypes in the (two) observations.

Identifier 2 is slightly tricky. Its three observations do not all share the same egenotype. The criterion that the first and last observations for each identifier have different values works fine for identifiers with two observations, but it won't work in this case because, fortuitously, the first and last observations have the same value:

Code:

 bysort eid : gen different = egenotype[1] != egenotype[_N]
 
 list, sepby(eid)
 
     +---------------------------+
     | eid   egenot~e   differ~t |
     |---------------------------|
  1. |   0         vv          0 |
  2. |   0         vv          0 |
     |---------------------------|
  3. |   1         vv          1 |
  4. |   1         ww          1 |
     |---------------------------|
  5. |   2         ww          0 |
  6. |   2         vv          0 |
  7. |   2         ww          0 |
     +---------------------------+

We need to sort within identifiers first, then compare first and last observations in each group.

This is why the () syntax is helpful. It lets us do that all at once.

Code:

 bysort eid (egenotype) : replace different = egenotype[1] != egenotype[_N]
 
 list, sepby(eid)

     +---------------------------+
     | eid   egenot~e   differ~t |
     |---------------------------|
  1. |   0         vv          0 |
  2. |   0         vv          0 |
     |---------------------------|
  3. |   1         vv          1 |
  4. |   1         ww          1 |
     |---------------------------|
  5. |   2         vv          1 |
  6. |   2         ww          1 |
  7. |   2         ww          1 |
     +---------------------------+
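
For comparison, here is a minimal sketch of what I understand to be the two-step equivalent of the one-line command above: sort on both variables first, then use plain by: on the identifier alone. The variable name different2 is invented for this illustration; the data are the FAQ example again.

```stata
clear
input byte eid str3 egenotype
0 vv
0 vv
1 vv
1 ww
2 ww
2 vv
2 ww
end

* One step: group on eid, sort on egenotype within eid
bysort eid (egenotype): generate byte different = egenotype[1] != egenotype[_N]

* Two steps: sort explicitly, then group on eid alone
sort eid egenotype
by eid: generate byte different2 = egenotype[1] != egenotype[_N]

* The two flags should agree in every observation
assert different == different2
```

The final assert is only a check for the sketch; in practice the one-line form is less typing and harder to get wrong.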

Code:

bysort a:

says "sort on a, then do calculations within groups defined by distinct values of a".

Code:

bysort a b:

says "sort on a then on b, then do calculations within groups defined by distinct values of a and b jointly".

Code:

bysort a (b):

says "sort on a then on b, then do calculations within groups defined by distinct values of a"

The help for by: explains that the parentheses syntax makes a difference.
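
As a rough, self-contained sketch of the three forms just described, using the FAQ example data with eid in the role of a and egenotype in the role of b. The variable names n_a, n_ab and differs are made up for this illustration.

```stata
clear
input byte eid str3 egenotype
0 vv
0 vv
1 vv
1 ww
2 ww
2 vv
2 ww
end

* bysort a: groups are defined by eid alone,
* so _N is the number of observations per identifier (2, 2, 3)
bysort eid: generate n_a = _N

* bysort a b: groups are defined by eid and egenotype jointly,
* so _N counts observations per (eid, egenotype) combination
bysort eid egenotype: generate n_ab = _N

* bysort a (b): groups are still defined by eid alone,
* but observations are sorted on egenotype within each eid,
* so [1] and [_N] pick out the lowest and highest values
bysort eid (egenotype): generate byte differs = egenotype[1] != egenotype[_N]

list, sepby(eid)
```

Comparing n_a with n_ab shows how the grouping changes between the first two forms, while differs relies on the within-group sort order that only the third form provides.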

Last edited by Nick Cox; 07 Jul 2019, 03:57.


Junran Cao

Join Date: May 2019

Posts: 75
#4

08 Jul 2019, 06:48

Thank you Joseph for illustrating the difference using my sample data.

And thank you Nick for the additional explanations to your FAQ article. This has helped me understand the differences between -bysort a: -, -bysort a b:- and -bysort a (b):- immensely. Much appreciated.