Generating dummy variable based on lagged quartile data

Wali Ullah

Join Date: Mar 2019

Posts: 51
#1


08 Aug 2021, 19:38

Hi everyone,

I am trying to generate a dummy variable that is conditional on whether the variable of interest (ma_score) is in the top quartile in both years t-2 and t-1. Specifically, the paper's instructions are as follows, "To identify high-ability managers, we first form quartiles (by industry and year) of the MA-Score. We define High-Ability Managers as those in the top quartile of MA-Score in both years t-2 and t-1. This approach reduces the likelihood that idiosyncratic performance in a single year affects our identification of high-ability managers. Note that we do not expect managerial ability to change in the short run. Rather, we consider the scores across 2 years to reduce possible measurement error".

Can anyone help me with these instructions? I don't understand how to make the variable based on quartiles (by industry and year) as well.

Sample data:
input double year long gvkey float sic_2 double MA_SCORE_2018_w
1984 1001 58 .1674201
1985 1001 58 .0530939
1983 1003 57 .048832
1984 1003 57 .0081078
1986 1003 57 .0695462
1987 1003 57 .1106393
1988 1003 57 .0730525
1989 1003 57 .0283304
1980 1004 50 -.0183764
1981 1004 50 -.0333748
1982 1004 50 -.0341477
1983 1004 50 -.0444578
1984 1004 50 -.0505183
1985 1004 50 -.0110314
1986 1004 50 -.0288378
1987 1004 50 -.0385843
1988 1004 50 -.0431447
1989 1004 50 -.0293015
1990 1004 50 -.0577608
1991 1004 50 -.0493341
1992 1004 50 -.0543336
1993 1004 50 -.0513123
1994 1004 50 -.0827768
1995 1004 50 -.0793594
1996 1004 50 -.0692754
1997 1004 50 .0072931
1998 1004 50 -.0387679
1999 1004 50 -.0730754
2000 1004 50 -.0696314
2001 1004 50 -.020128
2002 1004 50 -.0847194
2003 1004 50 -.0698555
2004 1004 50 -.0677616
2005 1004 50 -.0726101
2006 1004 50 -.0773012
2007 1004 50 -.0549248
2008 1004 50 -.089513
2009 1004 50 -.0799284
2010 1004 50 -.0529496
2011 1004 50 -.0370836
end
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

08 Aug 2021, 21:00

For any given industry-year combination, your example data contains either no observations or just one, so there are no quartiles to be formed by industry and year. I'll assume that your actual data set is much larger and does not suffer this limitation.

Also, your example data contains no variable identifying managers. For the purposes of illustrating the code here, I will pretend that gvkey identifies managers. But you will need to replace gvkey by the actual manager identifier variable to make this work for you.

Code:

by sic_2 year, sort: egen ma_quartile = xtile(MA_SCORE_2018_w), nq(4) // CREATE QUARTILES
xtset gvkey year // REPLACE gvkey WITH MANAGER ID VARIABLE
gen byte high_MA = L1.ma_quartile == 4 & L2.ma_quartile == 4 // TOP QUARTILE IN BOTH t-1 AND t-2

Notes:

1. Official Stata does not have an -egen, xtile()- function. It is, rather, to be found in the -egenmore- package, available from SSC. It is easier to go this route than to develop loops around the official -xtile- command.

2. I understand you are just following instructions here. But if the concern is dealing with measurement error, this approach seems rather poorly designed for it. A much better way to reduce measurement error would be to use the average of the two lagging MA_SCORE_2018_w values, rather than making a dichotomous variable that discards information and adds noise.
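
As a rough sketch of the two notes above (not part of the paper's method): the first line installs -egenmore- from SSC, and the rest computes the average of the two lagged scores. The variable name ma_score_lag_avg is made up for illustration, and gvkey is again assumed to stand in for the manager identifier.

```stata
ssc install egenmore   // one-time install; provides -egen, xtile()-

xtset gvkey year       // assumes gvkey identifies managers, as above
* Mean of the t-1 and t-2 scores; missing when either lag is unavailable.
gen double ma_score_lag_avg = (L1.MA_SCORE_2018_w + L2.MA_SCORE_2018_w) / 2
```

The result is missing for any observation that lacks either lag, including observations next to gaps in the year sequence.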
Wali Ullah

Join Date: Mar 2019

Posts: 51
#3

08 Aug 2021, 22:19

Hi Clyde Schechter, yes, the full dataset has enough observations to form quartiles by industry and year. Also, gvkey identifies the firms, which in turn identifies the managers here. However, when I ran the quartile-creation code, it sat for ages without showing any output. What could be the reason for that?
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

09 Aug 2021, 09:21

However, when I ran the quartile-creation code, it sat for ages without showing any output. What could be the reason for that?

I have not encountered that difficulty with -egen, xtile()-. If your data set is very large, this can take a long time as it requires sorting. I don't know what you mean by "ages" and I don't know how many observations your data set has, so I can't assess whether there is really a problem or you just need more patience.

Here's another way to calculate the quartiles which should be faster. It also has the advantage that it will provide periodic updates on its progress in the Results window, so you will be able to know if something has gone wrong. I should also add that even this faster way can be speeded up dramatically if you -drop- from your data set any variables that are not needed for further calculation.

Code:

capture program drop one_group
program define one_group
    xtile ma_quartile = MA_SCORE_2018_w, nq(4)
    exit
end

runby one_group, by(sic_2 year) status

-runby- is written by Robert Picard and me, and is available from SSC.

Note: If there are any combinations of sic_2 and year that do not have enough observations to generate quartiles, these will be identified as "by-groups with errors" in the final output on the Results screen, and those observations also will not appear in the final data.
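
The set-up steps mentioned above might look like the following sketch. It assumes that only the four variables in the example data are needed afterwards; adjust the -keep- list to your own data.

```stata
ssc install runby                        // one-time install from SSC
keep gvkey year sic_2 MA_SCORE_2018_w    // assumed list: keep only what is needed
count                                    // record N before -runby- for comparison
```

Running -count- again after -runby- shows how many observations, if any, were lost to by-groups with errors.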
Wali Ullah

Join Date: Mar 2019

Posts: 51
#5

10 Aug 2021, 19:22

Thanks, Clyde Schechter. It was an issue with the size of the dataset; the code finished after running for around 10 minutes. Thanks again.