Using OR condition for dummy variable (expression too long)

Yahya Ghazali

Join Date: Jul 2015

Posts: 51
#1

07 Feb 2016, 16:24

Hi there

I am having a problem with a long expression caused by the OR condition. I want to create a dummy variable, analyst=1, if a particular analyst is associated with a particular institution. In some cases the analyst names are the same but the institutions differ. As there are 800 analysts, the expression becomes very long when I use the following command:

by StockCode year, sort: gen analyst=1 if (AnalystName1=="Zhang Gengyun" & InstitutionCode=="BIZXZQ01" | AnalystName1=="Zhang Xiaoqing" & InstitutionCode=="BIDFZQ01")

The following is the data example

* Example generated by -dataex-. To install: ssc install dataex
clear
input long StockCode float year str10 InstitutionCode str20 AnalystName1
600036 2003 "BIZXZQ01" "Zhang Gengyun"
600422 2004 "BIDFZQ01" "Zhang Xiaoqing"
600057 2004 "BIGTJA01" "Wang Zhanqiang"
22 2004 "BIDFZQ01" "Zhang Dingjie"
542 2004 "BIGTJA01" "Wang Zhanqiang"
157 2004 "BIBHZQ01" "Wang Gang"
600020 2004 "BIJQZQ01" "Li Weiqi"
725 2003 "BIGTJA01" "Wei Xingyun"
157 2004 "BIGTJA01" "Xu Yunkai"
63 2003 "BIXJZQ01" "Pan Huanhuan"
600170 2003 "BIGTJA01" "Chen Liang"
600460 2003 "BIGTJA01" "Xiao Lijuan"
600528 2004 "BIGTJAQ1" "Chen Liang"
2031 2004 "BIGTJA01" "Xu Yunkai"
24 2003 "BILHZQ01" "Dai Lihong"
24 2003 "BIZXZQ01" "Shi Yanping"
600020 2003 "BIGTJA01" "Gao Guangxin"
end

Is there a shorter command that would save me from writing out all 800 names and institutions?

Thank you
Clyde Schechter

Join Date: Apr 2014

Posts: 30153
#2

07 Feb 2016, 16:47

Well, you can slightly shorten the code by applying some boolean algebra. The following code would give the same results as your example:

Code:

by StockCode year, sort: gen analyst=1 if AnalystName1=="Zhang Gengyun" & (InstitutionCode=="BIZXZQ01" | InstitutionCode=="BIDFZQ01")

And you can get some reduction in length by using the inlist() function:

Code:

by StockCode year, sort: gen analyst = 1 if AnalystName1 == "Zhang Gengyun" & inlist(InstitutionCode, "BIZXZQ01", "BIDFZQ01")

I would also point out, though it is not directly relevant to your question, that the -by StockCode year, sort- prefix does nothing here since nothing in the -gen- command refers to within by-group data.
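A minimal sketch of that point, using three rows from the data example in #1. The variable names analyst_by and analyst_plain are made up for illustration, and the /// line continuations assume the code is run from a do-file:

```stata
clear
input long StockCode float year str10 InstitutionCode str20 AnalystName1
600036 2003 "BIZXZQ01" "Zhang Gengyun"
600422 2004 "BIDFZQ01" "Zhang Xiaoqing"
600057 2004 "BIGTJA01" "Wang Zhanqiang"
end

* With the by-prefix, as in the original post
by StockCode year, sort: gen analyst_by = 1 ///
    if (AnalystName1 == "Zhang Gengyun" & InstitutionCode == "BIZXZQ01") ///
     | (AnalystName1 == "Zhang Xiaoqing" & InstitutionCode == "BIDFZQ01")

* Same condition without the prefix
gen analyst_plain = 1 ///
    if (AnalystName1 == "Zhang Gengyun" & InstitutionCode == "BIZXZQ01") ///
     | (AnalystName1 == "Zhang Xiaoqing" & InstitutionCode == "BIDFZQ01")

* The two variables agree in every observation (missing == missing is true in Stata)
assert analyst_by == analyst_plain
```

The condition is evaluated observation by observation, so grouping the data first changes only the sort order.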

But if you have 800 different analysts to deal with, there is no way you will reduce this to a single command that doesn't exceed length limits. Not even close.

I would take a very different approach. I would create a separate data set with two variables: AnalystName1 and InstitutionCode. I would then put in that data set observations for each combination of AnalystName1 and InstitutionCode for which you want to assign analyst = 1 (and only for those combinations). Perhaps you even have a data set like that lying around somewhere (or perhaps you have it in a spreadsheet that you can import into Stata). In fact it's hard for me to imagine how you would even know which pairs of AnalystName1 and InstitutionCode you want to identify in this way if you don't have such a data file somewhere.

In any case, once that data set exists, let's call it primary_analyst_pairs, you could then do this:

Code:

// START WITH YOUR ORIGINAL DATA SET IN MEMORY
merge m:1 AnalystName1 InstitutionCode using primary_analyst_pairs
gen byte analyst = (_merge == 3)
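For concreteness, here is a sketch of the whole workflow on a toy scale. It assumes the lookup file holds only the two pairs named in #1 and uses four rows of the example data as the main data set; in practice the lookup would more likely be imported from a spreadsheet. The keep(master match) option and the -isid- check are additions for this illustration: the first stops lookup pairs that never occur in the main data from being appended as extra observations, the second confirms that each pair appears only once in the lookup, which m:1 requires.

```stata
* Hypothetical lookup file: one observation per pair that should get analyst = 1
clear
input str20 AnalystName1 str10 InstitutionCode
"Zhang Gengyun"  "BIZXZQ01"
"Zhang Xiaoqing" "BIDFZQ01"
end
isid AnalystName1 InstitutionCode        // each pair must be unique for m:1
save primary_analyst_pairs, replace

* Main data (first four rows of the example in #1)
clear
input long StockCode float year str10 InstitutionCode str20 AnalystName1
600036 2003 "BIZXZQ01" "Zhang Gengyun"
600422 2004 "BIDFZQ01" "Zhang Xiaoqing"
600057 2004 "BIGTJA01" "Wang Zhanqiang"
22 2004 "BIDFZQ01" "Zhang Dingjie"
end

* Flag observations whose (name, institution) pair is in the lookup file
merge m:1 AnalystName1 InstitutionCode using primary_analyst_pairs, keep(master match)
gen byte analyst = (_merge == 3)
drop _merge
list StockCode year InstitutionCode AnalystName1 analyst
```

Note that this produces a 0/1 indicator, whereas the command in #1 leaves non-matching observations missing.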
Clyde Schechter

Join Date: Apr 2014

Posts: 30153
#3

07 Feb 2016, 18:22

Re-reading the original post and my response, I notice that my suggestions in the first two code blocks are incorrect: I had misread the original post as involving Zhang Gengyun in both of the expressions being ORed, but I now see that the second disjunct involves Zhang Xiaoqing. So the shortening I proposed would give different results.
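A quick sketch of the discrepancy, using three name and institution pairs taken from the data example in #1. The variable names original and shortened are made up for illustration, and the /// continuations assume a do-file:

```stata
clear
input str10 InstitutionCode str20 AnalystName1
"BIZXZQ01" "Zhang Gengyun"
"BIDFZQ01" "Zhang Xiaoqing"
"BIDFZQ01" "Zhang Dingjie"
end

* Condition as written in #1: two distinct (name, institution) pairs
gen byte original = (AnalystName1 == "Zhang Gengyun" & InstitutionCode == "BIZXZQ01") ///
    | (AnalystName1 == "Zhang Xiaoqing" & InstitutionCode == "BIDFZQ01")

* Shortened condition from #2: one name, two institutions
gen byte shortened = AnalystName1 == "Zhang Gengyun" ///
    & inlist(InstitutionCode, "BIZXZQ01", "BIDFZQ01")

* Show the observations where the two conditions disagree
list if original != shortened
```

The Zhang Xiaoqing observation satisfies the original condition but not the shortened one.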

I think the approach using -merge- is the only practical way for Yahya to proceed.
Yahya Ghazali

Join Date: Jul 2015

Posts: 51
#4

08 Feb 2016, 02:20

It's great. Thank you, Clyde, for your feedback!