Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating dummy variables if a variable is in a range

    Hi all,

    I am working with a large firm level data. the firms are divided into various industry groups and sub-groups. For each industry group, I want to generate a dummy variable which takes value 1 if a given firm is in that industry and zero otherwise. Eg. in the data below, there are three main industry groups- food, textiles and chemicals (each with sub-groups).

    So for chemicals for instance, I want to define a dummy which = 1 if Industry_Code lies in the range [10101200000, 10101202500] and zero otherwise. Using "|" with the if qualifier does not seem to work.
    Industry_Code Industry
    10101110000 Food
    10101110101 Dairy products
    10101110105 Tea
    10101150000 Textiles
    10101150100 Cotton & blended yarn
    10101150500 Cloth
    10101151000 Man-made filaments & fibres
    10101151500 Textile processing
    10101200000 Chemicals
    10101200100 Inorganic chemicals
    10101200500 Caustic soda
    10101201000 Soda ash
    10101201500 Fertilisers
    10101202000 Pesticides
    10101202500 Dyes & pigments

    How can I do this?

    Best,
    Ananya

  • #2
    If it's a string variable

    Code:
    gen indicator = inrange(Industry_Code, "10101200000", "10101202500")
    and if it's a numeric variable omit the quotation marks. On unseen code that didn't work: Consider http://statadaily.ikonomiya.com/2011...ctions-inlist/

    Something of the form

    Code:
    gen foobar = answer == 42 | 43
    will be parsed as

    Code:
    gen foobar = (answer == 42) | 43
    and will always be evaluated as true as 43 is non-zero and therefore true, regardless of the first argument.

    Comment


    • #3
      thank you so much! that was very helpful!

      Comment


      • #4
        Nick Cox

        Hi Nick,

        For a DiD regression model, I need to generate a dummy variable Disaster according to the duration of the disaster. For example,
        tq Damage_Cost Duration Disaster
        2010:Q1 0 0 0
        2010:Q2 0 0 0
        2010:Q3 2356 3 1
        2010:Q4 0 0 1
        2011:Q1 0 0 1
        2011:Q2 0 0 0
        2011:Q3 0 0 0
        2011:Q4 0 0 0
        2012:Q1 4562 2 1
        2012:Q2 0 0 1
        2012:Q3 0 0 0
        2012:Q4 0 0 0
        2013:Q1 0 0 0
        2013:Q2 0 0 0
        ​​​​​​​To simply generate a dummy variable to consider whether a disaster happened or not using the following code:
        Code:
        gen disaster = 0
        Code:
        replace disaster =1 if Damage_Cost>0
        However, I am not sure how I can accommodate duration in the dummy variable. My sample size is almost one million. Therefore, it is impossible to do it manually.

        Thanks in advance.


        Kind Regards,
        Woahid
        Last edited by S. M. Woahid Murad; 08 Nov 2023, 05:32.

        Comment


        • #5
          #4 invites questions in return which are (a) How do you want to accommodate duration too? (b) Is this a good idea? Otherwise note that

          Code:
          gen disaster = Damage_Cost > 0
          is equivalent to your code and

          Code:
          gen disaster = Damage_Cost > 0  if Damage_Cost < .
          may be more appropriate if you have any missing values. For more see e.g. https://journals.sagepub.com/doi/pdf...36867X19830921


          Comment


          • #6
            Hi Nick, Thank you for your response. However, my query is how to accommodate duration in my dummy variable. Suppose, the duration of a flood is two quarters. In that case, my dummy variable would be 1 for two consecutive quarters.

            Comment


            • #7
              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str8 tq int damage_cost byte duration
              "2010:Q1 "    0 0
              "2010:Q2 "    0 0
              "2010:Q3 " 2356 3
              "2010:Q4 "    0 0
              "2011:Q1 "    0 0
              "2011:Q2 "    0 0
              "2011:Q3 "    0 0
              "2011:Q4 "    0 0
              "2012:Q1 " 4562 2
              "2012:Q2 "    0 0
              "2012:Q3 "    0 0
              "2012:Q4 "    0 0
              "2013:Q1 "    0 0
              "2013:Q2 "    0 0
              end
              
              
              gen qdate = quarterly(tq, "YQ")
              assert missing(qdate) == missing(tq)
              format qdate %tq
              drop tq
              
              gen `c(obs_t)' obs_no = _n
              tempfile copy
              save `copy'
              
              keep if damage_cost > 0 & !missing(damage_cost)
              gen qdate_end = qdate + duration - 1
              format qdate_end %tq
              
              rangejoin qdate qdate qdate_end using `copy'
              keep *_U
              rename *_U *
              merge 1:1 obs_no using `copy'
              gen byte disaster = _merge == 3
              drop _merge
              sort qdate
              -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

              In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

              Comment


              • #8
                Hi Clyde, It works! Thank you so much.

                Comment


                • #9
                  S. M. Woahid Murad

                  My post in #5 was stupid. Your example did show your rule. Sorry about that.

                  Here is another way to do it. .

                  Note that if your dataset is really panel data, I think all you need to change is the tsset call. But I didn't test that.


                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input str7 q int damage_cost byte(duration disaster)
                  "2010:Q1"    0 0 0
                  "2010:Q2"    0 0 0
                  "2010:Q3" 2356 3 1
                  "2010:Q4"    0 0 1
                  "2011:Q1"    0 0 1
                  "2011:Q2"    0 0 0
                  "2011:Q3"    0 0 0
                  "2011:Q4"    0 0 0
                  "2012:Q1" 4562 2 1
                  "2012:Q2"    0 0 1
                  "2012:Q3"    0 0 0
                  "2012:Q4"    0 0 0
                  "2013:Q1"    0 0 0
                  "2013:Q2"    0 0 0
                  end
                  
                  gen qdate = quarterly(q, "YQ")
                  format qdate %tq 
                  
                  tsset qdate 
                  clonevar countdown = duration
                  replace countdown = max(countdown, L.countdown - 1, 0) 
                  gen wanted = countdown > 0 
                  
                  list, sep(4)
                  
                       +-----------------------------------------------------------------------+
                       |       q   damage~t   duration   disaster    qdate   countd~n   wanted |
                       |-----------------------------------------------------------------------|
                    1. | 2010:Q1          0          0          0   2010q1          0        0 |
                    2. | 2010:Q2          0          0          0   2010q2          0        0 |
                    3. | 2010:Q3       2356          3          1   2010q3          3        1 |
                    4. | 2010:Q4          0          0          1   2010q4          2        1 |
                       |-----------------------------------------------------------------------|
                    5. | 2011:Q1          0          0          1   2011q1          1        1 |
                    6. | 2011:Q2          0          0          0   2011q2          0        0 |
                    7. | 2011:Q3          0          0          0   2011q3          0        0 |
                    8. | 2011:Q4          0          0          0   2011q4          0        0 |
                       |-----------------------------------------------------------------------|
                    9. | 2012:Q1       4562          2          1   2012q1          2        1 |
                   10. | 2012:Q2          0          0          1   2012q2          1        1 |
                   11. | 2012:Q3          0          0          0   2012q3          0        0 |
                   12. | 2012:Q4          0          0          0   2012q4          0        0 |
                       |-----------------------------------------------------------------------|
                   13. | 2013:Q1          0          0          0   2013q1          0        0 |
                   14. | 2013:Q2          0          0          0   2013q2          0        0 |
                       +-----------------------------------------------------------------------+

                  Comment


                  • #10
                    Hi Nick, thank you so much for your support once again. I am using panel data. A sample is given below:

                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input str7 tm long County double(damage_cost duration)
                    "2012m3"   5   91.32811737060547 2
                    "2012m5"   5  .04050000011920929 0
                    "2013m1"   5                   0 0
                    "2016m4"   5                   0 0
                    "2016m12"  5                   0 0
                    "2017m2"   5                   0 0
                    "2017m4"   5                   0 0
                    "2017m10"  5                   0 0
                    "2018m4"   5                   0 0
                    "2018m11"  5                   0 0
                    "2018m12"  5                   0 0
                    "2019m6"   5  .44005998969078064 2
                    "2019m7"   5                   0 0
                    "2021m7"   5 .036309998482465744 1
                    "2021m12"  5                   0 0
                    "1992m10"  9   .5514199733734131 30
                    "1996m1"   9                   0 0
                    "1996m5"   9                   0 0
                    "1999m11"  9                   0 0
                    "2001m1"   9                   0 0
                    "2001m4"   9                   0 0
                    "2008m6"   9 .010099999606609344 1
                    "2011m2"   9                   0 0
                    "2012m2"   9                   0 0
                    "2012m5"   9                   0 0
                    "2012m11"  9                   0 0
                    "2013m6"   9                   0 0
                    "2013m7"   9                   0 0
                    "2013m12"  9                   0 0
                    "2015m12"  9                   0 0
                    "2016m4"   9                   0 0
                    "2018m2"   9                   0 0
                    "1999m2"  10                   0 0
                    "2000m1"  10                   0 0
                    "2002m2"  10                   0 0
                    "2003m1"  10                   0 0
                    "2003m12" 10                   0 0
                    "2004m1"  10                   0 0
                    "2005m1"  10                   0 0
                    "2005m2"  10                   0 0
                    "2005m7"  10                   0 0
                    "2006m1"  10                   0 0
                    "2006m5"  10                   0 0
                    "2007m3"  10                   0 0
                    "2007m10" 10                   0 0
                    "2010m8"  10  .41451001167297363 8
                    "2011m3"  10                   0 0
                    "2011m12" 10                   0 0
                    "2013m6"  10  1.4656000137329102 1
                    "2014m12" 10                   0 0
                    "2015m2"  10  3.5347399711608887 15
                    "2016m2"  10                   0 0
                    end
                    
                    label values County County
                    label def County 5 "ADAMS", modify
                    label def County 9 "ALACHUA", modify
                    label def County 10 "ALAMANCE", modify
                    
                    gen t = monthly(tm, "YM")
                    format t %tm
                    xtset County t, monthly
                    
                    clonevar countdown = duration
                    by County: replace countdown = max(countdown, L.countdown - 1, 0)
                    by County: gen wanted = countdown>0
                    I am trying to get the wanted variable using this code. However, it does not work.

                    I have also tried to generate the variable using the following code as suggested by Clyde:

                    Code:
                    gen mdate = monthly(tm, "YM")
                    assert missing(mdate) == missing(tm)
                    format mdate %tm
                    drop tm
                    
                    sort County mdate
                    by County: gen `c(obs_t)' obs_no = _n
                    tempfile copy
                    save `copy'
                    
                    by County: keep if damage_cost > 0 & !missing(damage_cost)
                    by County: gen mdate_end = mdate + duration - 1
                    format mdate_end %tm
                    
                    rangejoin mdate mdate mdate_end using `copy', by(County)
                    keep *_U
                    rename *_U *
                    sort County mdate
                    quietly by  County mdate :  gen dup = cond(_N==1,0,_n)
                    drop if dup>1
                    drop dup
                    merge 1:1 County obs_no using `copy'
                    by County: gen byte disaster = _merge == 3
                    drop _merge
                    sort County mdate
                    Could any of you please identify my mistake in this code and fix it? Thanks in advance.



                    Comment


                    • #11
                      This dataset contains many gaps, but unlike the previous, so that may well break either code solution. I will look at this soon.

                      Comment


                      • #12
                        With regard to my solution in #7, it is robust to the gaps in the data. However, your original data example was a single time series, not panel data, and that solution therefore did not take into account the County variable. You made a mostly correct attempt to modify the code to work with panel data, but you got a few of the details wrong. Here's a corrected version:
                        Code:
                        gen mdate = monthly(tm, "YM")
                        assert missing(mdate) == missing(tm)
                        format mdate %tm
                        drop tm
                        
                        sort County mdate
                        by County: gen `c(obs_t)' obs_no = _n
                        tempfile copy
                        save `copy'
                        
                        by County: keep if damage_cost > 0 & !missing(damage_cost)
                        by County: gen mdate_end = mdate + duration - 1
                        format mdate_end %tm
                        
                        rangejoin mdate mdate mdate_end using `copy', by(County)
                        keep *_U County
                        rename *_U *
                        sort County mdate
                        quietly by  County mdate :  gen dup = cond(_N==1,0,_n)
                        drop if dup>1
                        drop dup
                        merge 1:1 County obs_no using `copy'
                        by County, sort: gen byte disaster = _merge == 3
                        drop _merge
                        sort County mdate
                        (Changes in bold face.)

                        I have no doubt that Nick can come up with a patch to his method on his own, but as I am on the list now and he, it seems, will be back later, I'll assume he won't mind if I jump in. He has correctly observed that the problem with his method comes in due to the gaps in the data. So a very simple solution is to fill in the gaps. Now, that also clutters up the data set with a lot of observations for county-months where you have no data. But it is easy enough to remove them afterward.
                        Code:
                        gen t = monthly(tm, "YM")
                        format t %tm
                        xtset County t, monthly
                        gen byte original = 1
                        tsfill
                        
                        clonevar countdown = duration
                        by County: replace countdown = max(countdown, L.countdown - 1, 0)
                        by County: gen wanted = countdown>0
                        keep if original == 1
                        (Changes, again, in bold face.)

                        So now you have two ways to do it. As between them, I like Nick's solution better. It is shorter and more transparent. In a large data set it will run faster. It also relies only on official Stata commands.

                        Comment


                        • #13
                          Perhaps you could spell out in more detail what you want.

                          For example, take county 5 and an observation at 2012m3 of duration 2. But there is no observation for 2012m4. Nothing to do there or do you need tsfill to add observations to the dataset?

                          Or county 10 and an observation at 2015m2 of duration 15, so 14 extra months, of which only one is in the dataset?

                          Comment


                          • #14
                            Nick makes a good point in #13. In my revision of his approach in #12, I remove all the extra observations at the end--I tacitly assumed you wanted to end up with only the same observations you started with. But it may be that you want to keep the extra observations, particularly if they turn out to be disaster months. So you could remove -keep if original == 1- so that you end up with a complete panel data set that has no gaps at all, or, to keep extra observations only if part of a disaster, you could change that line to -keep if inlist(1, original, disaster)-.

                            Comment


                            • #15
                              Hi Nick and Clyde, Thank you so much for your cooperation. I am happy with the Stata team members like you.

                              Comment

                              Working...
                              X