  • -by- syntax of adding another variable in brackets

    Hi Statalist

    This is an admittedly general question but I've struggled to find the answer to it despite having looked at many places. From https://www.stata.com/support/faqs/d...ions-in-group/, the syntax of the code used is:

    Code:
     
     by eid (egenotype), sort: gen diff = egenotype[1] != egenotype[_N] 
    Question: when do we need to add another variable in brackets following the -by- option (in this case, (egenotype))? Moreover, why wouldn't the above work if it were simply

    Code:
     
     bysort eid: gen diff = egenotype[1] != egenotype[_N] 
    Thanks.

  • #2
    Originally posted by Junran Cao
    why wouldn't the above work if it is simply

    Code:
    bysort eid: gen diff = egenotype[1] != egenotype[_N] 
    See below. I illustrate using your dataset from the other thread, changing the first observation for the second ID.

    .
    . version 16.0

    .
    . clear *

    .
    . input byte ID int Year str1 Gender

               ID      Year     Gender
      1. 1       2007    M
      2. 1       2008    M
      3. 1       2009    M
      4. 2       2007    M // Changed
      5. 2       2008    F
      6. 2       2009    F
      7. 2       2010    M
      8. 2       2011    M
      9. 3       2007    F
     10. 4       2007    F
     11. 4       2008    F
     12. 4       2009    F
     13. 5       2007    M
     14. 5       2008    F
     15. end

    .
    . bysort ID: generate byte bad_flag = Gender[1] != Gender[_N]

    . bysort ID (Gender): generate byte good_flag = Gender[1] != Gender[_N]

    .
    . list , noobs sepby(ID) abbreviate(20)

      +-------------------------------------------+
      | ID   Year   Gender   bad_flag   good_flag |
      |-------------------------------------------|
      |  1   2008        M          0           0 |
      |  1   2009        M          0           0 |
      |  1   2007        M          0           0 |
      |-------------------------------------------|
      |  2   2008        F          0           1 |
      |  2   2009        F          0           1 |
      |  2   2011        M          0           1 |
      |  2   2007        M          0           1 |
      |  2   2010        M          0           1 |
      |-------------------------------------------|
      |  3   2007        F          0           0 |
      |-------------------------------------------|
      |  4   2009        F          0           0 |
      |  4   2008        F          0           0 |
      |  4   2007        F          0           0 |
      |-------------------------------------------|
      |  5   2008        F          1           1 |
      |  5   2007        M          1           1 |
      +-------------------------------------------+

    .
    . exit

    end of do-file


    .



    • #3
      The data example in the FAQ cited in #1 can also be used to make the key point. Here it is in self-contained form:

      Code:
       clear
       input  eid     str3 egenotype  
       0       vv        
       0       vv        
       1       vv        
       1       ww        
       2       ww        
       2       vv        
       2       ww
      end
      For identifier 0 we have the same egenotype for all (two) observations.

      For identifier 1 the two observations have different values of egenotype.

      Identifier 2 is slightly tricky: its three observations do not all have the same egenotype. The criterion that the first and last observations for each identifier have different values works fine for identifiers with two observations, but won't work in this case, because by accident the first and last observations have the same value:

      Code:
       bysort eid : gen different = egenotype[1] != egenotype[_N]
       
       list, sepby(eid)
       
           +---------------------------+
           | eid   egenot~e   differ~t |
           |---------------------------|
        1. |   0         vv          0 |
        2. |   0         vv          0 |
           |---------------------------|
        3. |   1         vv          1 |
        4. |   1         ww          1 |
           |---------------------------|
        5. |   2         ww          0 |
        6. |   2         vv          0 |
        7. |   2         ww          0 |
           +---------------------------+
      We need to sort within identifiers first, then compare first and last observations in each group.

      This is why the () syntax is helpful. It lets us do that all at once.

      Code:
       bysort eid (egenotype) : replace different = egenotype[1] != egenotype[_N]
       
       list, sepby(eid)
      
           +---------------------------+
           | eid   egenot~e   differ~t |
           |---------------------------|
        1. |   0         vv          0 |
        2. |   0         vv          0 |
           |---------------------------|
        3. |   1         vv          1 |
        4. |   1         ww          1 |
           |---------------------------|
        5. |   2         vv          1 |
        6. |   2         ww          1 |
        7. |   2         ww          1 |
           +---------------------------+
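
      As a hedged sketch of what "all at once" means here: the one-line form should give the same result as an explicit sort followed by a plain by:. The variable names different1 and different2 are made up for this illustration and do not come from the FAQ.

      ```stata
      * Sketch only: two-step equivalent of -bysort eid (egenotype):-
      * different1 and different2 are hypothetical names for this comparison
      clear
      input eid str3 egenotype
      0 vv
      0 vv
      1 vv
      1 ww
      2 ww
      2 vv
      2 ww
      end

      * Step 1: sort on eid, then on egenotype within eid
      sort eid egenotype

      * Step 2: within each eid, compare the first and last sorted values
      by eid: generate byte different2 = egenotype[1] != egenotype[_N]

      * One-step form, for comparison
      bysort eid (egenotype): generate byte different1 = egenotype[1] != egenotype[_N]

      * The two approaches should agree in every observation
      assert different1 == different2

      list, sepby(eid)
      ```

      After sorting within eid, differing values are necessarily pushed to opposite ends of each group, which is why comparing the first and last observations is enough.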
      Code:
      bysort a:
      says "sort on a, then do calculations within groups defined by distinct values of a".

      Code:
      bysort a  b:
      says "sort on a then on b, then do calculations within groups defined by distinct values of a and b jointly"

      Code:
      bysort a (b):
      says "sort on a then on b, then do calculations within groups defined by distinct values of a"
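
      One way to see the three forms side by side is to record the group size _N under each of them. This is a minimal sketch on a toy dataset invented for the purpose; the variables a and b and the names n_a, n_ab, n_a_b and first_b are illustrative only.

      ```stata
      * Sketch only: toy data, not taken from the thread or the FAQ
      clear
      input a b
      1 1
      1 1
      1 2
      2 1
      2 2
      end

      * Groups defined by a alone; order within a is not controlled
      bysort a: generate n_a = _N

      * Groups defined by a and b jointly, so groups are smaller
      bysort a b: generate n_ab = _N

      * Groups defined by a alone, but sorted on b within each group
      bysort a (b): generate n_a_b = _N

      * Because of the within-group sort, b[1] is the smallest b in each a
      bysort a (b): generate first_b = b[1]

      * Parentheses change the sort order, not the grouping
      assert n_a == n_a_b
      assert n_ab <= n_a

      list, sepby(a)
      ```

      The group sizes under a and under a (b) are identical; what the parentheses add is a guaranteed order within each group, which matters whenever the command refers to [1], [_N] or _n.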

      The help for by: explains that the parentheses syntax makes a difference.
      Last edited by Nick Cox; 07 Jul 2019, 03:57.



      • #4
        Thank you Joseph for illustrating the difference using my sample data.

        And thank you Nick for the additional explanations to your FAQ article. This has helped me understand the differences between -bysort a: -, -bysort a b:- and -bysort a (b):- immensely. Much appreciated.
