Sum of observations in different columns with a condition

Daniel Perez Parra

Join Date: Jul 2021
Posts: 63

Sum of observations in different columns with a condition

24 Oct 2022, 02:20

Good morning,

I'm Daniel, thanks for reading this comment. I'm working with a dataset that reports the sibling characteristics of my observations in different columns (find attached with dataex an example using just observations with two siblings _01= first sibling, _02= second sibling). That said, I would like to create 3 different variables, a sum of the siblings, a sum of male siblings, and a sum of female siblings, only for the siblings who are still alive

I tried to use the following command:

Code:

egen sumsiblings=rowtotal(mm1_*) if mm1_*!=. & mm2_*==1
egen summalesiblings=rowtotal(mm1_*) if mm1_*==1 & mm2_*==1
egen sumfemalesiblings=rowtotal(mm1_*) if mm1_*==2 & mm2_*==1

Sadly, the asterisk does not work with the "if" condition and I run out of ideas on how to overcome this issue considering that the database contains up to 20 registries for siblings (from _01 to _20).

PD: mm3 is the age of the sibling.

Thank you in advance,

Daniel.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(mm1_01 mm2_01 mm3_01 mm1_02 mm2_02 mm3_02)
1 1 13 1 0  .
2 1 32 2 1 29
1 1 34 1 0  .
1 1 13 1 1 11
1 1 35 1 1 32
2 0  . 2 0  .
1 0  . 2 0  .
1 1 11 1 1  8
. .  . . .  .
2 1 13 1 0  .
2 1 12 1 1  9
2 1 25 1 0  .
1 0  . 2 1  5
2 0  . 1 0  .
1 1 20 1 1 14
1 1 30 1 1 25
1 1 17 1 1 13
2 0  . 1 0  .
2 0  . 2 1 37
2 1 20 1 1 15
1 1 29 2 1 24
1 1 21 1 0  .
2 1 18 2 0  .
1 1 27 1 1 24
1 1 30 2 1 27
1 1 14 2 0  .
2 1 28 2 1 24
2 0  . 2 0  .
1 0  . 1 1 19
1 1 17 1 0  .
2 0  . 2 0  .
1 1 26 2 1 23
2 0  . 1 0  .
1 1 21 2 1 18
1 1 14 2 1 11
1 0  . 2 0  .
2 1 37 2 1 34
1 1 33 1 0  .
2 0  . 2 1 30
2 1 29 1 1 26
2 1 22 2 1 19
2 1 17 2 1 13
1 1 20 2 0  .
2 1 23 1 1 20
1 0  . 1 1 15
1 1 34 1 1 32
2 1 26 1 1 23
1 1 23 1 0  .
1 0  . 1 1  6
1 1 14 1 1 11
2 1 18 1 1 12
1 1 26 1 1 24
1 1 26 1 1 24
1 0  . 2 1 26
1 0  . 1 0  .
2 0  . 1 1 23
2 1 18 1 0  .
1 1 25 1 1 25
1 1  7 . .  .
1 1 35 1 1 33
1 1 28 2 0  .
2 0  . 1 0  .
1 1 20 2 0  .
1 1 32 2 0  .
1 0  . 1 1 21
2 0  . 1 1 35
1 0  . 1 1 33
1 1 33 2 1 28
1 0  . 1 0  .
2 1 25 1 1 22
1 1 13 2 1 10
2 0  . 2 1 25
1 1 23 1 0  .
1 1 17 1 1 13
2 1 29 2 1 27
2 1 13 2 1 10
1 1 26 1 1 23
2 1 20 1 1 13
2 1 27 2 1 24
1 1 30 1 1 27
1 0  . 2 1 18
1 1 27 2 1 21
2 1 32 1 1 29
2 0  . 2 1 21
1 0  . 2 1 23
2 1 42 1 1 39
1 0  . 2 1 33
2 0  . 1 0  .
2 1 23 1 1 12
2 0  . 2 0  .
2 1 26 1 1 22
1 0  . 1 1 12
2 1 22 1 0  .
2 1 19 2 1 13
1 1 19 . .  .
2 1 11 2 0  .
2 1 18 2 0  .
2 1 27 2 1 24
2 1 25 2 1 22
2 0  . 1 1  8
end
label values mm1_01 mm1_01
label def mm1_01 1 "male", modify
label def mm1_01 2 "female", modify
label values mm2_01 mm2_01
label def mm2_01 0 "dead", modify
label def mm2_01 1 "alive", modify
label values mm3_01 mm3_01
label values mm1_02 mm1_02
label def mm1_02 1 "male", modify
label def mm1_02 2 "female", modify
label values mm2_02 mm2_02
label def mm2_02 0 "dead", modify
label def mm2_02 1 "alive", modify
label values mm3_02 mm3_02

Last edited by Daniel Perez Parra; 24 Oct 2022, 02:30.

Tags: None

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1548
#2

24 Oct 2022, 02:40

The strategy here uses reshape. You may need to modify it depending on the other variables present in your dataset.

Your data did not have any identifier. I am assuming every row is a separate ID, and have created a variable for that. You can skip that part in your actual code:

Code:

gen `c(obs)' id = _n reshape long mm1_ mm2_ mm3_, i(id) j(sibnum) string sort id by id: egen sumsiblings = total(mm2) by id: egen summalesiblings = total(mm2*(mm1 == 1)) by id: egen sumfemalesiblings = total(mm2*(mm1 == 2)) reshape wide mm1 mm2 mm3, i(id sumsiblings summalesiblings sumfemalesiblings) j(sibnum) string
1 like
Comment
Daniel Perez Parra

Join Date: Jul 2021

Posts: 63
#3

24 Oct 2022, 03:47

Originally posted by Hemanshu Kumar View Post

The strategy here uses reshape. You may need to modify it depending on the other variables present in your dataset.

Your data did not have any identifier. I am assuming every row is a separate ID, and have created a variable for that. You can skip that part in your actual code:

Code:

gen `c(obs)' id = _n reshape long mm1_ mm2_ mm3_, i(id) j(sibnum) string sort id by id: egen sumsiblings = total(mm2) by id: egen summalesiblings = total(mm2*(mm1 == 1)) by id: egen sumfemalesiblings = total(mm2*(mm1 == 2)) reshape wide mm1 mm2 mm3, i(id sumsiblings summalesiblings sumfemalesiblings) j(sibnum) string

Thank you for your response and clarity. Indeed, every row is a separate id.

Daniel.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#4

24 Oct 2022, 03:59

Here is another way to do what I think you want.

Code:

gen summalesiblings = 0 gen sumfemalesiblings = 0 unab mm1 : mm1_* local m1 : subinstr local mm1 "mm1_" "", all foreach x of local m1 { replace summalesiblings = summalesiblings + (mm1_`x' == 1 & mm2_`x' == 1) replace sumfemalesibling = sumfemalesiblings + (mm1_`x' == 1 & mm2_`x' == 2) } gen sumsiblings = summalesiblings + sumfemalesiblings

As you have found, wildcards are not supported in if qualifiers. Even if they were they wouldn't have the interpretation you seek, as you want variables to be paired off specifically, which needs a loop as here, or a different layout, as in Hemanshu Kumar 's helpful answer.
1 like
Comment
Daniel Perez Parra

Join Date: Jul 2021

Posts: 63
#5

24 Oct 2022, 05:00

Originally posted by Nick Cox View Post

Here is another way to do what I think you want.

Code:

gen summalesiblings = 0 gen sumfemalesiblings = 0 unab mm1 : mm1_* local m1 : subinstr local mm1 "mm1_" "", all foreach x of local m1 { replace summalesiblings = summalesiblings + (mm1_`x' == 1 & mm2_`x' == 1) replace sumfemalesibling = sumfemalesiblings + (mm1_`x' == 1 & mm2_`x' == 2) } gen sumsiblings = summalesiblings + sumfemalesiblings

As you have found, wildcards are not supported in if qualifiers. Even if they were they wouldn't have the interpretation you seek, as you want variables to be paired off specifically, which needs a loop as here, or a different layout, as in Hemanshu Kumar 's helpful answer.

Thank you Mr. Cox, it perfectly works

Just one little thing for other colleagues that may check your comment in the future, in your foreach function, line 2 (sumfemalesiblings), mm1_`x' should be equal to 2, not 1, since 2 is for females. Also, mm2_`x' == 1, not equal to 2, since 1 indicates whether the sibling is still alive. Probably you just swapped "1" and "2" by mistake.

Code:

foreach x of local m1 { replace summalesiblings = summalesiblings + (mm1_`x' == 1 & mm2_`x' == 1) replace sumfemalesibling = sumfemalesiblings + (mm1_`x' == 2 & mm2_`x' == 1) }

Thanks again! It's very appreciated.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#6

24 Oct 2022, 08:49

No doubt you're right. You caught the spirit of the idea even if the letter was wrong.
Comment

Announcement

Sum of observations in different columns with a condition

Comment

Comment

Comment

Comment

Comment