creating a dummy variable based on a condition on another variable

Diego Malo

Join Date: Jun 2019
Posts: 140

creating a dummy variable based on a condition on another variable

21 Aug 2023, 01:45

Hello,

I would like to generate a dummy variable equal to 1 if at least one crop has a production greater than 4000 for each ID. In other words, I want a dummy takes value 1 for all the ID which has at least one crop with a production greater than 4000.

This is the data set I have:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int ID str14 crop float production
0 "alfalfa"                0
0 "barley"                 0
0 "beans"                  0
0 "cabbage"                0
0 "carrots"                0
0 "citrus"                 0
0 "cotton"                 0
0 "cowpea"                 0
0 "groundnut"              0
0 "millet_foxtail"         0
0 "millet_pearl"           0
0 "olive"                  0
0 "onion"                  0
0 "potatoes"               0
0 "rapeseed"               0
0 "rice_dry"               0
0 "rice_wetland"           0
0 "rye"                    0
0 "soybean"                0
0 "sugarbeet"              0
0 "sunflower"              0
0 "tobacco"                0
0 "tomatoes"               0
0 "wheat"                  0
1 "alfalfa"                0
1 "barley"                 0
1 "beans"                  0
1 "cabbage"                0
1 "carrots"                0
1 "citrus"                 0
1 "cotton"          280.3889
1 "cowpea"                 0
1 "groundnut"              0
1 "millet_foxtail"  18.13889
1 "millet_pearl"      137.25
1 "olive"                  0
1 "onion"                  0
1 "potatoes"               0
1 "rapeseed"               0
1 "rice_dry"               0
1 "rice_wetland"           0
1 "rye"                    0
1 "soybean"                0
1 "sugarbeet"              0
1 "sunflower"              0
1 "tobacco"                0
1 "tomatoes"               0
1 "wheat"                  0
2 "alfalfa"                0
2 "barley"                 0
2 "beans"           1247.393
2 "cabbage"                0
2 "carrots"        2401.8572
2 "citrus"                 0
2 "cotton"          3532.607
2 "cowpea"         2234.7856
2 "groundnut"      2397.3572
2 "millet_foxtail"         0
2 "millet_pearl"   4555.3213
2 "olive"                  0
2 "onion"          1220.5714
2 "potatoes"               0
2 "rapeseed"               0
2 "rice_dry"        203.4643
2 "rice_wetland"           0
2 "rye"                    0
2 "soybean"         782.9286
2 "sugarbeet"              0
2 "sunflower"       616.4286
2 "tobacco"        1487.9642
2 "tomatoes"        767.5357
2 "wheat"                  0
3 "alfalfa"        516.94446
3 "barley"                 0
3 "beans"           2719.528
3 "cabbage"                0
3 "carrots"         2739.889
3 "citrus"         142.33333
3 "cotton"          3807.806
3 "cowpea"          2565.639
3 "groundnut"         2711.5
3 "millet_foxtail"    1130.5
3 "millet_pearl"   4537.1113
3 "olive"                  0
3 "onion"           2408.861
3 "potatoes"               0
3 "rapeseed"               0
3 "rice_dry"        878.3055
3 "rice_wetland"           0
3 "rye"                    0
3 "soybean"        2377.6943
3 "sugarbeet"              0
3 "sunflower"      2298.0278
3 "tobacco"         2787.222
3 "tomatoes"       2388.2778
3 "wheat"            2050.75
4 "alfalfa"                0
4 "barley"                 0
4 "beans"            4656.75
4 "cabbage"                0
end

I am trying two things:

Code:

1)
gen dummy = 0
bysort ID: replace dummy = 1 if production >= 4000

2)
bysort ID: gen dummy = 1 if production >= 4000

It works but it is just give me a value of 1 for the exact line in which the crop is greater than 4000. I want that all the ID which at least one crop is greater than 4000 takes the value 1, not just the row in which it is satisfied. Please, let me know if it is not clear.

Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35778

21 Aug 2023, 02:25

Thanks for your clear question and data example. However, watch out: your text says "greater than 4000" but your operator is >=, greater than or equal to.

There is a simple method based on sorting that works fine if there are no missing values. If missing values are present you need more care.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int ID str14 crop float production
0 "alfalfa"                0
0 "barley"                 0
0 "beans"                  0
0 "cabbage"                0
0 "carrots"                0
0 "citrus"                 0
0 "cotton"                 0
0 "cowpea"                 0
0 "groundnut"              0
0 "millet_foxtail"         0
0 "millet_pearl"           0
0 "olive"                  0
0 "onion"                  0
0 "potatoes"               0
0 "rapeseed"               0
0 "rice_dry"               0
0 "rice_wetland"           0
0 "rye"                    0
0 "soybean"                0
0 "sugarbeet"              0
0 "sunflower"              0
0 "tobacco"                0
0 "tomatoes"               0
0 "wheat"                  0
1 "alfalfa"                0
1 "barley"                 0
1 "beans"                  0
1 "cabbage"                0
1 "carrots"                0
1 "citrus"                 0
1 "cotton"          280.3889
1 "cowpea"                 0
1 "groundnut"              0
1 "millet_foxtail"  18.13889
1 "millet_pearl"      137.25
1 "olive"                  0
1 "onion"                  0
1 "potatoes"               0
1 "rapeseed"               0
1 "rice_dry"               0
1 "rice_wetland"           0
1 "rye"                    0
1 "soybean"                0
1 "sugarbeet"              0
1 "sunflower"              0
1 "tobacco"                0
1 "tomatoes"               0
1 "wheat"                  0
2 "alfalfa"                0
2 "barley"                 0
2 "beans"           1247.393
2 "cabbage"                0
2 "carrots"        2401.8572
2 "citrus"                 0
2 "cotton"          3532.607
2 "cowpea"         2234.7856
2 "groundnut"      2397.3572
2 "millet_foxtail"         0
2 "millet_pearl"   4555.3213
2 "olive"                  0
2 "onion"          1220.5714
2 "potatoes"               0
2 "rapeseed"               0
2 "rice_dry"        203.4643
2 "rice_wetland"           0
2 "rye"                    0
2 "soybean"         782.9286
2 "sugarbeet"              0
2 "sunflower"       616.4286
2 "tobacco"        1487.9642
2 "tomatoes"        767.5357
2 "wheat"                  0
3 "alfalfa"        516.94446
3 "barley"                 0
3 "beans"           2719.528
3 "cabbage"                0
3 "carrots"         2739.889
3 "citrus"         142.33333
3 "cotton"          3807.806
3 "cowpea"          2565.639
3 "groundnut"         2711.5
3 "millet_foxtail"    1130.5
3 "millet_pearl"   4537.1113
3 "olive"                  0
3 "onion"           2408.861
3 "potatoes"               0
3 "rapeseed"               0
3 "rice_dry"        878.3055
3 "rice_wetland"           0
3 "rye"                    0
3 "soybean"        2377.6943
3 "sugarbeet"              0
3 "sunflower"      2298.0278
3 "tobacco"         2787.222
3 "tomatoes"       2388.2778
3 "wheat"            2050.75
4 "alfalfa"                0
4 "barley"                 0
4 "beans"            4656.75
4 "cabbage"                0
end

bysort ID (production) : gen wanted1 = production[_N] >= 4000 

egen max = max(production), by(ID)
gen wanted2 = max >= 4000 if max < . 

tabdisp ID, c(max wanted?)

----------------------------------------------
       ID |        max     wanted1     wanted2
----------+-----------------------------------
        0 |          0           0           0
        1 |   280.3889           0           0
        2 |   4555.321           1           1
        3 |   4537.111           1           1
        4 |    4656.75           1           1
----------------------------------------------

The method of generating a (0, 1) indicator (*) by a generate with value 1 and a replace with value 0 is defensible if and only if you or your readers find that much clearer. (Your second code fragment didn't include a replace.)

Otherwise it's worth exploiting the fact that true or false evaluations yield results of 1 or 0 automatically.

See https://www.stata.com/support/faqs/d...rue-and-false/
and https://journals.sagepub.com/doi/pdf...36867X19830921

for more on this point.

(*) See the paper just cited for more on why "dummy variable" is a term to be avoided.

Comment

Diego Malo

Join Date: Jun 2019

Posts: 140
#3

21 Aug 2023, 02:54

Thank you Nick Cox for the answer and the references. It really works. Absolutely, I wrote "greater", and I coded "greater or equal", my text was bad, my code was right .

I would like to ask you, if you do not mind, what exactly does when we write a variable between parenthesis after bysort. What is it exactly the difference of writing the brackets or not?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35778
#4

21 Aug 2023, 03:46

That is documented. See

Code:

help by

In your case, a sort on production within identifier is needed to guarantee that the last observation for each identifier holds the maximum.

As said, missing values are a problem for that device, as any missing values would sort to the end of each block of observations.
Comment
Diego Malo

Join Date: Jun 2019

Posts: 140
#5

21 Aug 2023, 06:48

I understand now! Thanks for the clarification.

Diego.
Comment

Announcement

creating a dummy variable based on a condition on another variable

Comment

Comment

Comment

Comment