Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • creating a dummy variable based on a condition on another variable

    Hello,

    I would like to generate a dummy variable equal to 1 if at least one crop has a production greater than 4000 for each ID. In other words, I want a dummy takes value 1 for all the ID which has at least one crop with a production greater than 4000.

    This is the data set I have:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int ID str14 crop float production
    0 "alfalfa"                0
    0 "barley"                 0
    0 "beans"                  0
    0 "cabbage"                0
    0 "carrots"                0
    0 "citrus"                 0
    0 "cotton"                 0
    0 "cowpea"                 0
    0 "groundnut"              0
    0 "millet_foxtail"         0
    0 "millet_pearl"           0
    0 "olive"                  0
    0 "onion"                  0
    0 "potatoes"               0
    0 "rapeseed"               0
    0 "rice_dry"               0
    0 "rice_wetland"           0
    0 "rye"                    0
    0 "soybean"                0
    0 "sugarbeet"              0
    0 "sunflower"              0
    0 "tobacco"                0
    0 "tomatoes"               0
    0 "wheat"                  0
    1 "alfalfa"                0
    1 "barley"                 0
    1 "beans"                  0
    1 "cabbage"                0
    1 "carrots"                0
    1 "citrus"                 0
    1 "cotton"          280.3889
    1 "cowpea"                 0
    1 "groundnut"              0
    1 "millet_foxtail"  18.13889
    1 "millet_pearl"      137.25
    1 "olive"                  0
    1 "onion"                  0
    1 "potatoes"               0
    1 "rapeseed"               0
    1 "rice_dry"               0
    1 "rice_wetland"           0
    1 "rye"                    0
    1 "soybean"                0
    1 "sugarbeet"              0
    1 "sunflower"              0
    1 "tobacco"                0
    1 "tomatoes"               0
    1 "wheat"                  0
    2 "alfalfa"                0
    2 "barley"                 0
    2 "beans"           1247.393
    2 "cabbage"                0
    2 "carrots"        2401.8572
    2 "citrus"                 0
    2 "cotton"          3532.607
    2 "cowpea"         2234.7856
    2 "groundnut"      2397.3572
    2 "millet_foxtail"         0
    2 "millet_pearl"   4555.3213
    2 "olive"                  0
    2 "onion"          1220.5714
    2 "potatoes"               0
    2 "rapeseed"               0
    2 "rice_dry"        203.4643
    2 "rice_wetland"           0
    2 "rye"                    0
    2 "soybean"         782.9286
    2 "sugarbeet"              0
    2 "sunflower"       616.4286
    2 "tobacco"        1487.9642
    2 "tomatoes"        767.5357
    2 "wheat"                  0
    3 "alfalfa"        516.94446
    3 "barley"                 0
    3 "beans"           2719.528
    3 "cabbage"                0
    3 "carrots"         2739.889
    3 "citrus"         142.33333
    3 "cotton"          3807.806
    3 "cowpea"          2565.639
    3 "groundnut"         2711.5
    3 "millet_foxtail"    1130.5
    3 "millet_pearl"   4537.1113
    3 "olive"                  0
    3 "onion"           2408.861
    3 "potatoes"               0
    3 "rapeseed"               0
    3 "rice_dry"        878.3055
    3 "rice_wetland"           0
    3 "rye"                    0
    3 "soybean"        2377.6943
    3 "sugarbeet"              0
    3 "sunflower"      2298.0278
    3 "tobacco"         2787.222
    3 "tomatoes"       2388.2778
    3 "wheat"            2050.75
    4 "alfalfa"                0
    4 "barley"                 0
    4 "beans"            4656.75
    4 "cabbage"                0
    end

    I am trying two things:

    Code:
    1)
    gen dummy = 0
    bysort ID: replace dummy = 1 if production >= 4000
    
    2)
    bysort ID: gen dummy = 1 if production >= 4000

    It works but it is just give me a value of 1 for the exact line in which the crop is greater than 4000. I want that all the ID which at least one crop is greater than 4000 takes the value 1, not just the row in which it is satisfied. Please, let me know if it is not clear.

  • #2
    Thanks for your clear question and data example. However, watch out: your text says "greater than 4000" but your operator is >=, greater than or equal to.

    There is a simple method based on sorting that works fine if there are no missing values. If missing values are present you need more care.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int ID str14 crop float production
    0 "alfalfa"                0
    0 "barley"                 0
    0 "beans"                  0
    0 "cabbage"                0
    0 "carrots"                0
    0 "citrus"                 0
    0 "cotton"                 0
    0 "cowpea"                 0
    0 "groundnut"              0
    0 "millet_foxtail"         0
    0 "millet_pearl"           0
    0 "olive"                  0
    0 "onion"                  0
    0 "potatoes"               0
    0 "rapeseed"               0
    0 "rice_dry"               0
    0 "rice_wetland"           0
    0 "rye"                    0
    0 "soybean"                0
    0 "sugarbeet"              0
    0 "sunflower"              0
    0 "tobacco"                0
    0 "tomatoes"               0
    0 "wheat"                  0
    1 "alfalfa"                0
    1 "barley"                 0
    1 "beans"                  0
    1 "cabbage"                0
    1 "carrots"                0
    1 "citrus"                 0
    1 "cotton"          280.3889
    1 "cowpea"                 0
    1 "groundnut"              0
    1 "millet_foxtail"  18.13889
    1 "millet_pearl"      137.25
    1 "olive"                  0
    1 "onion"                  0
    1 "potatoes"               0
    1 "rapeseed"               0
    1 "rice_dry"               0
    1 "rice_wetland"           0
    1 "rye"                    0
    1 "soybean"                0
    1 "sugarbeet"              0
    1 "sunflower"              0
    1 "tobacco"                0
    1 "tomatoes"               0
    1 "wheat"                  0
    2 "alfalfa"                0
    2 "barley"                 0
    2 "beans"           1247.393
    2 "cabbage"                0
    2 "carrots"        2401.8572
    2 "citrus"                 0
    2 "cotton"          3532.607
    2 "cowpea"         2234.7856
    2 "groundnut"      2397.3572
    2 "millet_foxtail"         0
    2 "millet_pearl"   4555.3213
    2 "olive"                  0
    2 "onion"          1220.5714
    2 "potatoes"               0
    2 "rapeseed"               0
    2 "rice_dry"        203.4643
    2 "rice_wetland"           0
    2 "rye"                    0
    2 "soybean"         782.9286
    2 "sugarbeet"              0
    2 "sunflower"       616.4286
    2 "tobacco"        1487.9642
    2 "tomatoes"        767.5357
    2 "wheat"                  0
    3 "alfalfa"        516.94446
    3 "barley"                 0
    3 "beans"           2719.528
    3 "cabbage"                0
    3 "carrots"         2739.889
    3 "citrus"         142.33333
    3 "cotton"          3807.806
    3 "cowpea"          2565.639
    3 "groundnut"         2711.5
    3 "millet_foxtail"    1130.5
    3 "millet_pearl"   4537.1113
    3 "olive"                  0
    3 "onion"           2408.861
    3 "potatoes"               0
    3 "rapeseed"               0
    3 "rice_dry"        878.3055
    3 "rice_wetland"           0
    3 "rye"                    0
    3 "soybean"        2377.6943
    3 "sugarbeet"              0
    3 "sunflower"      2298.0278
    3 "tobacco"         2787.222
    3 "tomatoes"       2388.2778
    3 "wheat"            2050.75
    4 "alfalfa"                0
    4 "barley"                 0
    4 "beans"            4656.75
    4 "cabbage"                0
    end
    
    bysort ID (production) : gen wanted1 = production[_N] >= 4000 
    
    egen max = max(production), by(ID)
    gen wanted2 = max >= 4000 if max < . 
    
    tabdisp ID, c(max wanted?)
    
    ----------------------------------------------
           ID |        max     wanted1     wanted2
    ----------+-----------------------------------
            0 |          0           0           0
            1 |   280.3889           0           0
            2 |   4555.321           1           1
            3 |   4537.111           1           1
            4 |    4656.75           1           1
    ----------------------------------------------
    The method of generating a (0, 1) indicator (*) by a generate with value 1 and a replace with value 0 is defensible if and only if you or your readers find that much clearer. (Your second code fragment didn't include a replace.)

    Otherwise it's worth exploiting the fact that true or false evaluations yield results of 1 or 0 automatically.

    See https://www.stata.com/support/faqs/d...rue-and-false/
    and https://journals.sagepub.com/doi/pdf...36867X19830921

    for more on this point.

    (*) See the paper just cited for more on why "dummy variable" is a term to be avoided.

    Comment


    • #3
      Thank you Nick Cox for the answer and the references. It really works. Absolutely, I wrote "greater", and I coded "greater or equal", my text was bad, my code was right .

      I would like to ask you, if you do not mind, what exactly does when we write a variable between parenthesis after bysort. What is it exactly the difference of writing the brackets or not?

      Comment


      • #4
        That is documented. See

        Code:
        help by
        In your case, a sort on production within identifier is needed to guarantee that the last observation for each identifier holds the maximum.

        As said, missing values are a problem for that device, as any missing values would sort to the end of each block of observations.



        Comment


        • #5
          I understand now! Thanks for the clarification.

          Diego.

          Comment

          Working...
          X