Generating dummy variables if a variable is in a range

Ananya Kotia

Join Date: Feb 2015
Posts: 22

Generating dummy variables if a variable is in a range

05 Mar 2015, 14:36

Hi all,

I am working with a large firm level data. the firms are divided into various industry groups and sub-groups. For each industry group, I want to generate a dummy variable which takes value 1 if a given firm is in that industry and zero otherwise. Eg. in the data below, there are three main industry groups- food, textiles and chemicals (each with sub-groups).

So for chemicals for instance, I want to define a dummy which = 1 if Industry_Code lies in the range [10101200000, 10101202500] and zero otherwise. Using "|" with the if qualifier does not seem to work.

Industry_Code	Industry
10101110000	Food
10101110101	Dairy products
10101110105	Tea
10101150000	Textiles
10101150100	Cotton & blended yarn
10101150500	Cloth
10101151000	Man-made filaments & fibres
10101151500	Textile processing
10101200000	Chemicals
10101200100	Inorganic chemicals
10101200500	Caustic soda
10101201000	Soda ash
10101201500	Fertilisers
10101202000	Pesticides
10101202500	Dyes & pigments

How can I do this?

Best,
Ananya

Tags: None

Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

05 Mar 2015, 14:45

If it's a string variable

Code:

gen indicator = inrange(Industry_Code, "10101200000", "10101202500")

and if it's a numeric variable omit the quotation marks. On unseen code that didn't work: Consider http://statadaily.ikonomiya.com/2011...ctions-inlist/

Something of the form

Code:

gen foobar = answer == 42 | 43

will be parsed as

Code:

gen foobar = (answer == 42) | 43

and will always be evaluated as true as 43 is non-zero and therefore true, regardless of the first argument.
Comment
Ananya Kotia

Join Date: Feb 2015

Posts: 22
#3

05 Mar 2015, 18:53

thank you so much! that was very helpful!
Comment

S. M. Woahid Murad

Join Date: Apr 2015
Posts: 35

08 Nov 2023, 05:25

Nick Cox

Hi Nick,

For a DiD regression model, I need to generate a dummy variable Disaster according to the duration of the disaster. For example,

tq	Damage_Cost	Duration	Disaster
2010:Q1	0	0	0
2010:Q2	0	0	0
2010:Q3	2356	3	1
2010:Q4	0	0	1
2011:Q1	0	0	1
2011:Q2	0	0	0
2011:Q3	0	0	0
2011:Q4	0	0	0
2012:Q1	4562	2	1
2012:Q2	0	0	1
2012:Q3	0	0	0
2012:Q4	0	0	0
2013:Q1	0	0	0
2013:Q2	0	0	0

To simply generate a dummy variable to consider whether a disaster happened or not using the following code:

Code:

gen disaster = 0

Code:

replace disaster =1 if Damage_Cost>0

However, I am not sure how I can accommodate duration in the dummy variable. My sample size is almost one million. Therefore, it is impossible to do it manually.

Thanks in advance.

Kind Regards,
Woahid

Last edited by S. M. Woahid Murad; 08 Nov 2023, 05:32.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

08 Nov 2023, 08:14

#4 invites questions in return which are (a) How do you want to accommodate duration too? (b) Is this a good idea? Otherwise note that

Code:

gen disaster = Damage_Cost > 0

is equivalent to your code and

Code:

gen disaster = Damage_Cost > 0 if Damage_Cost < .

may be more appropriate if you have any missing values. For more see e.g. https://journals.sagepub.com/doi/pdf...36867X19830921
Comment
S. M. Woahid Murad

Join Date: Apr 2015

Posts: 35
#6

08 Nov 2023, 08:21

Hi Nick, Thank you for your response. However, my query is how to accommodate duration in my dummy variable. Suppose, the duration of a flood is two quarters. In that case, my dummy variable would be 1 for two consecutive quarters.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#7

08 Nov 2023, 09:31

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str8 tq int damage_cost byte duration "2010:Q1 " 0 0 "2010:Q2 " 0 0 "2010:Q3 " 2356 3 "2010:Q4 " 0 0 "2011:Q1 " 0 0 "2011:Q2 " 0 0 "2011:Q3 " 0 0 "2011:Q4 " 0 0 "2012:Q1 " 4562 2 "2012:Q2 " 0 0 "2012:Q3 " 0 0 "2012:Q4 " 0 0 "2013:Q1 " 0 0 "2013:Q2 " 0 0 end gen qdate = quarterly(tq, "YQ") assert missing(qdate) == missing(tq) format qdate %tq drop tq gen `c(obs_t)' obs_no = _n tempfile copy save `copy' keep if damage_cost > 0 & !missing(damage_cost) gen qdate_end = qdate + duration - 1 format qdate_end %tq rangejoin qdate qdate qdate_end using `copy' keep *_U rename *_U * merge 1:1 obs_no using `copy' gen byte disaster = _merge == 3 drop _merge sort qdate

-rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
S. M. Woahid Murad

Join Date: Apr 2015

Posts: 35
#8

08 Nov 2023, 18:19

Hi Clyde, It works! Thank you so much.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

09 Nov 2023, 01:31

S. M. Woahid Murad

My post in #5 was stupid. Your example did show your rule. Sorry about that.

Here is another way to do it. .

Note that if your dataset is really panel data, I think all you need to change is the tsset call. But I didn't test that.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str7 q int damage_cost byte(duration disaster)
"2010:Q1"    0 0 0
"2010:Q2"    0 0 0
"2010:Q3" 2356 3 1
"2010:Q4"    0 0 1
"2011:Q1"    0 0 1
"2011:Q2"    0 0 0
"2011:Q3"    0 0 0
"2011:Q4"    0 0 0
"2012:Q1" 4562 2 1
"2012:Q2"    0 0 1
"2012:Q3"    0 0 0
"2012:Q4"    0 0 0
"2013:Q1"    0 0 0
"2013:Q2"    0 0 0
end

gen qdate = quarterly(q, "YQ")
format qdate %tq 

tsset qdate 
clonevar countdown = duration
replace countdown = max(countdown, L.countdown - 1, 0) 
gen wanted = countdown > 0 

list, sep(4)

     +-----------------------------------------------------------------------+
     |       q   damage~t   duration   disaster    qdate   countd~n   wanted |
     |-----------------------------------------------------------------------|
  1. | 2010:Q1          0          0          0   2010q1          0        0 |
  2. | 2010:Q2          0          0          0   2010q2          0        0 |
  3. | 2010:Q3       2356          3          1   2010q3          3        1 |
  4. | 2010:Q4          0          0          1   2010q4          2        1 |
     |-----------------------------------------------------------------------|
  5. | 2011:Q1          0          0          1   2011q1          1        1 |
  6. | 2011:Q2          0          0          0   2011q2          0        0 |
  7. | 2011:Q3          0          0          0   2011q3          0        0 |
  8. | 2011:Q4          0          0          0   2011q4          0        0 |
     |-----------------------------------------------------------------------|
  9. | 2012:Q1       4562          2          1   2012q1          2        1 |
 10. | 2012:Q2          0          0          1   2012q2          1        1 |
 11. | 2012:Q3          0          0          0   2012q3          0        0 |
 12. | 2012:Q4          0          0          0   2012q4          0        0 |
     |-----------------------------------------------------------------------|
 13. | 2013:Q1          0          0          0   2013q1          0        0 |
 14. | 2013:Q2          0          0          0   2013q2          0        0 |
     +-----------------------------------------------------------------------+

Comment

S. M. Woahid Murad

Join Date: Apr 2015
Posts: 35

#10

09 Nov 2023, 07:34

Hi Nick, thank you so much for your support once again. I am using panel data. A sample is given below:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str7 tm long County double(damage_cost duration)
"2012m3"   5   91.32811737060547 2
"2012m5"   5  .04050000011920929 0
"2013m1"   5                   0 0
"2016m4"   5                   0 0
"2016m12"  5                   0 0
"2017m2"   5                   0 0
"2017m4"   5                   0 0
"2017m10"  5                   0 0
"2018m4"   5                   0 0
"2018m11"  5                   0 0
"2018m12"  5                   0 0
"2019m6"   5  .44005998969078064 2
"2019m7"   5                   0 0
"2021m7"   5 .036309998482465744 1
"2021m12"  5                   0 0
"1992m10"  9   .5514199733734131 30
"1996m1"   9                   0 0
"1996m5"   9                   0 0
"1999m11"  9                   0 0
"2001m1"   9                   0 0
"2001m4"   9                   0 0
"2008m6"   9 .010099999606609344 1
"2011m2"   9                   0 0
"2012m2"   9                   0 0
"2012m5"   9                   0 0
"2012m11"  9                   0 0
"2013m6"   9                   0 0
"2013m7"   9                   0 0
"2013m12"  9                   0 0
"2015m12"  9                   0 0
"2016m4"   9                   0 0
"2018m2"   9                   0 0
"1999m2"  10                   0 0
"2000m1"  10                   0 0
"2002m2"  10                   0 0
"2003m1"  10                   0 0
"2003m12" 10                   0 0
"2004m1"  10                   0 0
"2005m1"  10                   0 0
"2005m2"  10                   0 0
"2005m7"  10                   0 0
"2006m1"  10                   0 0
"2006m5"  10                   0 0
"2007m3"  10                   0 0
"2007m10" 10                   0 0
"2010m8"  10  .41451001167297363 8
"2011m3"  10                   0 0
"2011m12" 10                   0 0
"2013m6"  10  1.4656000137329102 1
"2014m12" 10                   0 0
"2015m2"  10  3.5347399711608887 15
"2016m2"  10                   0 0
end

label values County County
label def County 5 "ADAMS", modify
label def County 9 "ALACHUA", modify
label def County 10 "ALAMANCE", modify

gen t = monthly(tm, "YM")
format t %tm
xtset County t, monthly

clonevar countdown = duration
by County: replace countdown = max(countdown, L.countdown - 1, 0)
by County: gen wanted = countdown>0

I am trying to get the wanted variable using this code. However, it does not work.

I have also tried to generate the variable using the following code as suggested by Clyde:

Code:

gen mdate = monthly(tm, "YM")
assert missing(mdate) == missing(tm)
format mdate %tm
drop tm

sort County mdate
by County: gen `c(obs_t)' obs_no = _n
tempfile copy
save `copy'

by County: keep if damage_cost > 0 & !missing(damage_cost)
by County: gen mdate_end = mdate + duration - 1
format mdate_end %tm

rangejoin mdate mdate mdate_end using `copy', by(County)
keep *_U
rename *_U *
sort County mdate
quietly by  County mdate :  gen dup = cond(_N==1,0,_n)
drop if dup>1
drop dup
merge 1:1 County obs_no using `copy'
by County: gen byte disaster = _merge == 3
drop _merge
sort County mdate

Could any of you please identify my mistake in this code and fix it? Thanks in advance.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35724
#11

09 Nov 2023, 07:47

This dataset contains many gaps, but unlike the previous, so that may well break either code solution. I will look at this soon.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#12

09 Nov 2023, 10:08

With regard to my solution in #7, it is robust to the gaps in the data. However, your original data example was a single time series, not panel data, and that solution therefore did not take into account the County variable. You made a mostly correct attempt to modify the code to work with panel data, but you got a few of the details wrong. Here's a corrected version:

Code:

gen mdate = monthly(tm, "YM") assert missing(mdate) == missing(tm) format mdate %tm drop tm sort County mdate by County: gen `c(obs_t)' obs_no = _n tempfile copy save `copy' by County: keep if damage_cost > 0 & !missing(damage_cost) by County: gen mdate_end = mdate + duration - 1 format mdate_end %tm rangejoin mdate mdate mdate_end using `copy', by(County) keep *_U County rename *_U * sort County mdate quietly by County mdate : gen dup = cond(_N==1,0,_n) drop if dup>1 drop dup merge 1:1 County obs_no using `copy' by County, sort: gen byte disaster = _merge == 3 drop _merge sort County mdate

(Changes in bold face.)

I have no doubt that Nick can come up with a patch to his method on his own, but as I am on the list now and he, it seems, will be back later, I'll assume he won't mind if I jump in. He has correctly observed that the problem with his method comes in due to the gaps in the data. So a very simple solution is to fill in the gaps. Now, that also clutters up the data set with a lot of observations for county-months where you have no data. But it is easy enough to remove them afterward.

Code:

gen t = monthly(tm, "YM") format t %tm xtset County t, monthly gen byte original = 1 tsfill clonevar countdown = duration by County: replace countdown = max(countdown, L.countdown - 1, 0) by County: gen wanted = countdown>0 keep if original == 1

(Changes, again, in bold face.)

So now you have two ways to do it. As between them, I like Nick's solution better. It is shorter and more transparent. In a large data set it will run faster. It also relies only on official Stata commands.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#13

09 Nov 2023, 10:17

Perhaps you could spell out in more detail what you want.

For example, take county 5 and an observation at 2012m3 of duration 2. But there is no observation for 2012m4. Nothing to do there or do you need tsfill to add observations to the dataset?

Or county 10 and an observation at 2015m2 of duration 15, so 14 extra months, of which only one is in the dataset?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#14

09 Nov 2023, 10:27

Nick makes a good point in #13. In my revision of his approach in #12, I remove all the extra observations at the end--I tacitly assumed you wanted to end up with only the same observations you started with. But it may be that you want to keep the extra observations, particularly if they turn out to be disaster months. So you could remove -keep if original == 1- so that you end up with a complete panel data set that has no gaps at all, or, to keep extra observations only if part of a disaster, you could change that line to -keep if inlist(1, original, disaster)-.
1 like
Comment
S. M. Woahid Murad

Join Date: Apr 2015

Posts: 35
#15

13 Nov 2023, 06:54

Hi Nick and Clyde, Thank you so much for your cooperation. I am happy with the Stata team members like you.
Comment

Announcement

Generating dummy variables if a variable is in a range

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment