Creating a new variable

Varsha Vaishnav

Join Date: Oct 2022

Posts: 144
#1

Creating a new variable

22 Nov 2022, 06:01

Hello,

Code:

bysort year: egen totalsum = total(A), m

The above code gives me the overall sum of variable A for each year (my dataset has 2 years). I want to create a new variable 'B' that equals (totalsum of year 1 / totalsum of year2). How can I do that?
How can we get the total sum of variable A for each year in a separate column as we can then calculate B easily? I have attached an example dataset below.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte(year a) 1 1 1 2 1 3 1 . 1 5 2 10 2 9 2 . 2 7 2 6 end

Thank you.
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35668

22 Nov 2022, 06:27

The result you want is just a scalar, and also easily accessible after summarize, but here is some technique. See also section 9 in https://www.stata-journal.com/articl...article=dm0055 for how to populate variables so that they occur in the same observations.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(year a)
1  1
1  2
1  3
1  .
1  5
2 10
2  9
2  .
2  7
2  6
end

egen total1 = total(cond(year == 1, a, .)) 
egen total2 = total(cond(year == 2, a, .)) 

su total? 

gen ratio = total2 / total1 

list, sepby(year) 

     +----------------------------------------+
     | year    a   total1   total2      ratio |
     |----------------------------------------|
  1. |    1    1       11       32   2.909091 |
  2. |    1    2       11       32   2.909091 |
  3. |    1    3       11       32   2.909091 |
  4. |    1    .       11       32   2.909091 |
  5. |    1    5       11       32   2.909091 |
     |----------------------------------------|
  6. |    2   10       11       32   2.909091 |
  7. |    2    9       11       32   2.909091 |
  8. |    2    .       11       32   2.909091 |
  9. |    2    7       11       32   2.909091 |
 10. |    2    6       11       32   2.909091 |
     +----------------------------------------+

I see you want the reciprocal of that ratio, so tweak the code accordingly.

Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1385

22 Nov 2022, 06:35

Two methods, the second being the one that creates columns the way you suggested, but which -- as the first method shows -- is not really necessary.

Code:

bysort year: egen totalsum = total(a), m

forval i = 1/2 {
    sum totalsum if year == `i', meanonly
    local sum`i' = r(mean)
}
gen b1 = `sum1'/`sum2'

forval i = 1/2 {
    egen totalsum`i' = max(totalsum*(year == `i'))
}
gen b2 = totalsum1/totalsum2

which produces:

Code:

. li , noobs sepby(year) ab(20)

  +----------------------------------------------------------------+
  | year    a   totalsum       b1   totalsum1   totalsum2       b2 |
  |----------------------------------------------------------------|
  |    1    1         11   .34375          11          32   .34375 |
  |    1    2         11   .34375          11          32   .34375 |
  |    1    3         11   .34375          11          32   .34375 |
  |    1    .         11   .34375          11          32   .34375 |
  |    1    5         11   .34375          11          32   .34375 |
  |----------------------------------------------------------------|
  |    2   10         32   .34375          11          32   .34375 |
  |    2    9         32   .34375          11          32   .34375 |
  |    2    .         32   .34375          11          32   .34375 |
  |    2    7         32   .34375          11          32   .34375 |
  |    2    6         32   .34375          11          32   .34375 |
  +----------------------------------------------------------------+

Comment

Varsha Vaishnav

Join Date: Oct 2022

Posts: 144
#4

22 Nov 2022, 11:28

Thank you, Nick and Hemanshu.

Nick, the end part of section 9 in https://www.stata-journal.com/articl...article=dm0055 talks about the trick of averaging. I'm not able to understand it. Can you please explain that here, if possible?
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35668

22 Nov 2022, 11:36

The point is to compare different ways of getting a mean into a variable. If we want the mean weight -- given mpg > 25 -- one method below is right and the other quite wrong, as a direct summarize shows.

Code:

. sysuse auto, clear
(1978 Automobile Data)

. egen mean1 = mean(cond(mpg > 25, weight, .))

. egen mean2 = mean((mpg > 25) * weight)

. su weight if mpg > 25

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      weight |         14    2122.857    386.0536       1760       3260

. l mean1 mean2 in 1

     +---------------------+
     |    mean1      mean2 |
     |---------------------|
  1. | 2122.857   401.6216 |
     +---------------------+

Comment

Varsha Vaishnav

Join Date: Oct 2022

Posts: 144
#6

23 Nov 2022, 02:58

Understood it now. Thank you, Nick.
Comment

Varsha Vaishnav

Join Date: Oct 2022
Posts: 144

23 Jan 2023, 04:49

Hello, in the attached image, there are different types of household compositions. I need all these types as separate variables. Can someone help me with this? I have attached an example dataset for the same.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double(hh_id ind_id) byte(relationship sex age marital)
1010001 101000101 1 1 58 2
1010001 101000102 2 2 53 2
1010001 101000103 3 1 32 1
1010001 101000104 3 2 29 1
1010010 101001001 1 1 38 2
1010010 101001002 2 2 32 2
1010010 101001003 3 2  4 .
1010010 101001004 4 1 75 2
1010010 101001005 4 2 68 2
1010014 101001401 1 1 35 2
1010014 101001402 2 2 33 2
1010014 101001403 3 2  8 .
1010014 101001404 3 1  6 .
1010014 101001405 4 1 55 2
1010014 101001406 4 2 53 2
1010017 101001701 1 1 25 1
1010017 101001702 4 1 23 1
1010017 101001704 4 1 54 2
1010017 101001705 4 2 49 2
1010022 101002201 1 1 31 2
1010022 101002202 2 2 29 2
1010022 101002203 3 2 13 1
1010022 101002204 3 1 11 .
1010022 101002205 3 1  9 .
1010022 101002206 4 1 56 3
1010023 101002301 1 1 37 2
1010023 101002302 2 2 38 2
1010023 101002303 4 1 69 2
1010023 101002304 4 2 64 2
1010025 101002501 1 1 27 1
1010025 101002502 4 1 24 1
1010025 101002504 4 1 65 2
1010025 101002505 4 2 57 2
1010027 101002701 1 1 35 2
1010027 101002702 2 2 34 2
1010027 101002703 3 1  2 .
1010027 101002704 4 1 73 3
1010036 101003601 1 1 40 2
1010036 101003602 2 2 39 2
1010036 101003603 3 1 10 .
1010036 101003604 3 2  8 .
1010036 101003605 3 2  6 .
1010037 101003701 1 1 28 1
1010037 101003702 4 1 24 1
1010037 101003703 4 1 57 2
1010037 101003704 4 2 55 2
1010044 101004401 1 1 27 1
1010044 101004403 4 1 22 1
1010044 101004404 4 1 66 2
1010044 101004405 4 2 63 2
end
label values relationship lab_rel
label def lab_rel 1 "head", modify
label def lab_rel 2 "spouse", modify
label def lab_rel 3 "child", modify
label def lab_rel 4 "extended family", modify
label values sex HV104
label def HV104 1 "male", modify
label def HV104 2 "female", modify
label values age HV105
label values marital lab_mar
label def lab_mar 1 "never married", modify
label def lab_mar 2 "currently married", modify
label def lab_mar 3 "widowed", modify

Thanks.

Attached Files

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35668
#8

23 Jan 2023, 05:51

#7 should have been the start of a new thread.

egen is your friend, in a nutshell. One kind of calculation is a count such as

Code:

egen adult_female = total(sex == 2 & inrange(age, 18, 64)), by(hhid)

and other calculations depend on min() and max().

Watch out that missing values count as arbitrarily large and that some variables need to be generated in two or three steps.
Comment
Varsha Vaishnav

Join Date: Oct 2022

Posts: 144
#9

23 Jan 2023, 07:03

Thank you, Nick.
Comment

Announcement