Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sum of observations in different columns with a condition

    Good morning,

    I'm Daniel, thanks for reading this comment. I'm working with a dataset that reports the sibling characteristics of my observations in different columns (find attached with dataex an example using just observations with two siblings _01= first sibling, _02= second sibling). That said, I would like to create 3 different variables, a sum of the siblings, a sum of male siblings, and a sum of female siblings, only for the siblings who are still alive

    I tried to use the following command:

    Code:
    egen sumsiblings=rowtotal(mm1_*) if mm1_*!=. & mm2_*==1
    egen summalesiblings=rowtotal(mm1_*) if mm1_*==1 & mm2_*==1
    egen sumfemalesiblings=rowtotal(mm1_*) if mm1_*==2 & mm2_*==1
    Sadly, the asterisk does not work with the "if" condition and I run out of ideas on how to overcome this issue considering that the database contains up to 20 registries for siblings (from _01 to _20).

    PD: mm3 is the age of the sibling.

    Thank you in advance,

    Daniel.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(mm1_01 mm2_01 mm3_01 mm1_02 mm2_02 mm3_02)
    1 1 13 1 0  .
    2 1 32 2 1 29
    1 1 34 1 0  .
    1 1 13 1 1 11
    1 1 35 1 1 32
    2 0  . 2 0  .
    1 0  . 2 0  .
    1 1 11 1 1  8
    . .  . . .  .
    2 1 13 1 0  .
    2 1 12 1 1  9
    2 1 25 1 0  .
    1 0  . 2 1  5
    2 0  . 1 0  .
    1 1 20 1 1 14
    1 1 30 1 1 25
    1 1 17 1 1 13
    2 0  . 1 0  .
    2 0  . 2 1 37
    2 1 20 1 1 15
    1 1 29 2 1 24
    1 1 21 1 0  .
    2 1 18 2 0  .
    1 1 27 1 1 24
    1 1 30 2 1 27
    1 1 14 2 0  .
    2 1 28 2 1 24
    2 0  . 2 0  .
    1 0  . 1 1 19
    1 1 17 1 0  .
    2 0  . 2 0  .
    1 1 26 2 1 23
    2 0  . 1 0  .
    1 1 21 2 1 18
    1 1 14 2 1 11
    1 0  . 2 0  .
    2 1 37 2 1 34
    1 1 33 1 0  .
    2 0  . 2 1 30
    2 1 29 1 1 26
    2 1 22 2 1 19
    2 1 17 2 1 13
    1 1 20 2 0  .
    2 1 23 1 1 20
    1 0  . 1 1 15
    1 1 34 1 1 32
    2 1 26 1 1 23
    1 1 23 1 0  .
    1 0  . 1 1  6
    1 1 14 1 1 11
    2 1 18 1 1 12
    1 1 26 1 1 24
    1 1 26 1 1 24
    1 0  . 2 1 26
    1 0  . 1 0  .
    2 0  . 1 1 23
    2 1 18 1 0  .
    1 1 25 1 1 25
    1 1  7 . .  .
    1 1 35 1 1 33
    1 1 28 2 0  .
    2 0  . 1 0  .
    1 1 20 2 0  .
    1 1 32 2 0  .
    1 0  . 1 1 21
    2 0  . 1 1 35
    1 0  . 1 1 33
    1 1 33 2 1 28
    1 0  . 1 0  .
    2 1 25 1 1 22
    1 1 13 2 1 10
    2 0  . 2 1 25
    1 1 23 1 0  .
    1 1 17 1 1 13
    2 1 29 2 1 27
    2 1 13 2 1 10
    1 1 26 1 1 23
    2 1 20 1 1 13
    2 1 27 2 1 24
    1 1 30 1 1 27
    1 0  . 2 1 18
    1 1 27 2 1 21
    2 1 32 1 1 29
    2 0  . 2 1 21
    1 0  . 2 1 23
    2 1 42 1 1 39
    1 0  . 2 1 33
    2 0  . 1 0  .
    2 1 23 1 1 12
    2 0  . 2 0  .
    2 1 26 1 1 22
    1 0  . 1 1 12
    2 1 22 1 0  .
    2 1 19 2 1 13
    1 1 19 . .  .
    2 1 11 2 0  .
    2 1 18 2 0  .
    2 1 27 2 1 24
    2 1 25 2 1 22
    2 0  . 1 1  8
    end
    label values mm1_01 mm1_01
    label def mm1_01 1 "male", modify
    label def mm1_01 2 "female", modify
    label values mm2_01 mm2_01
    label def mm2_01 0 "dead", modify
    label def mm2_01 1 "alive", modify
    label values mm3_01 mm3_01
    label values mm1_02 mm1_02
    label def mm1_02 1 "male", modify
    label def mm1_02 2 "female", modify
    label values mm2_02 mm2_02
    label def mm2_02 0 "dead", modify
    label def mm2_02 1 "alive", modify
    label values mm3_02 mm3_02
    Last edited by Daniel Perez Parra; 24 Oct 2022, 02:30.

  • #2
    The strategy here uses reshape. You may need to modify it depending on the other variables present in your dataset.

    Your data did not have any identifier. I am assuming every row is a separate ID, and have created a variable for that. You can skip that part in your actual code:

    Code:
    gen `c(obs)' id = _n
    
    reshape long mm1_ mm2_ mm3_, i(id) j(sibnum) string
    
    sort id
    by id: egen sumsiblings = total(mm2)
    by id: egen summalesiblings = total(mm2*(mm1 == 1))
    by id: egen sumfemalesiblings = total(mm2*(mm1 == 2))
    
    reshape wide mm1 mm2 mm3, i(id sumsiblings summalesiblings sumfemalesiblings) j(sibnum) string

    Comment


    • #3
      Originally posted by Hemanshu Kumar View Post
      The strategy here uses reshape. You may need to modify it depending on the other variables present in your dataset.

      Your data did not have any identifier. I am assuming every row is a separate ID, and have created a variable for that. You can skip that part in your actual code:

      Code:
      gen `c(obs)' id = _n
      
      reshape long mm1_ mm2_ mm3_, i(id) j(sibnum) string
      
      sort id
      by id: egen sumsiblings = total(mm2)
      by id: egen summalesiblings = total(mm2*(mm1 == 1))
      by id: egen sumfemalesiblings = total(mm2*(mm1 == 2))
      
      reshape wide mm1 mm2 mm3, i(id sumsiblings summalesiblings sumfemalesiblings) j(sibnum) string
      Thank you for your response and clarity. Indeed, every row is a separate id.

      Daniel.

      Comment


      • #4
        Here is another way to do what I think you want.



        Code:
        gen summalesiblings = 0 
        gen sumfemalesiblings = 0 
        
        unab mm1 : mm1_* 
        local m1 : subinstr local mm1 "mm1_" "", all 
        
        foreach x of local m1 { 
            replace summalesiblings = summalesiblings + (mm1_`x' == 1 & mm2_`x' == 1)
            replace sumfemalesibling = sumfemalesiblings + (mm1_`x' == 1 & mm2_`x' == 2)
        }
        
        gen sumsiblings = summalesiblings + sumfemalesiblings
        As you have found, wildcards are not supported in if qualifiers. Even if they were they wouldn't have the interpretation you seek, as you want variables to be paired off specifically, which needs a loop as here, or a different layout, as in Hemanshu Kumar 's helpful answer.

        Comment


        • #5
          Originally posted by Nick Cox View Post
          Here is another way to do what I think you want.



          Code:
          gen summalesiblings = 0
          gen sumfemalesiblings = 0
          
          unab mm1 : mm1_*
          local m1 : subinstr local mm1 "mm1_" "", all
          
          foreach x of local m1 {
          replace summalesiblings = summalesiblings + (mm1_`x' == 1 & mm2_`x' == 1)
          replace sumfemalesibling = sumfemalesiblings + (mm1_`x' == 1 & mm2_`x' == 2)
          }
          
          gen sumsiblings = summalesiblings + sumfemalesiblings
          As you have found, wildcards are not supported in if qualifiers. Even if they were they wouldn't have the interpretation you seek, as you want variables to be paired off specifically, which needs a loop as here, or a different layout, as in Hemanshu Kumar 's helpful answer.
          Thank you Mr. Cox, it perfectly works

          Just one little thing for other colleagues that may check your comment in the future, in your foreach function, line 2 (sumfemalesiblings), mm1_`x' should be equal to 2, not 1, since 2 is for females. Also, mm2_`x' == 1, not equal to 2, since 1 indicates whether the sibling is still alive. Probably you just swapped "1" and "2" by mistake.

          Code:
          foreach x of local m1 { 
              replace summalesiblings = summalesiblings + (mm1_`x' == 1 & mm2_`x' == 1)
              replace sumfemalesibling = sumfemalesiblings + (mm1_`x' == 2 & mm2_`x' == 1)
          }
          Thanks again! It's very appreciated.

          Comment


          • #6
            No doubt you're right. You caught the spirit of the idea even if the letter was wrong.

            Comment

            Working...
            X