  • Lag Values Excluding Observation

    Hi,

    First, I am sorry for not being able to find a better title than this, but my problem is a little more complicated. The data example below shows directors (with unique director ids) belonging to firms (with unique firm ids). Fiscal years are denoted by fyear, while tsize, ind, city, and perf are the size tercile the company belongs to, its industry code, the city its headquarters are located in, and its past performance, respectively.

    I am trying to find the determinants of director hires. Specifically, I want to know how a director's positions at other firms in the previous year influence the probability of being selected by the current company (assuming the director is newly hired). For this reason, I need to compute whether, in the previous year, the director served at a company of the same size, and to do the same separately for industry and city. For performance, I have to calculate the average performance of all companies the director served at in the previous year.

    Here's the example dataset.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(gvkey fyear directorid) byte(tsize ind) str4 city float perf
    1001 2000 12345 2 12 "Hel"    .2
    1001 2000   123 2 12 "Hel"    .3
    1001 2000  4531 2 12 "Hel"    .1
    1002 2000    98 3 15 "Tamp"   .1
    1002 2000 12345 3 15 "Tamp"  .05
    1002 2000   123 3 15 "Tamp"  .04
    1003 2000   100 1  9 "Hel"   .12
    1003 2000    98 1  9 "Hel"   .02
    1003 2000   109 1  9 "Hel"  -.01
    1001 2001 12345 2 12 "Hel"    .2
    1001 2001   123 2 12 "Hel"    .3
    1001 2001  4531 2 12 "Hel"    .1
    1002 2001    98 3 15 "Tamp"   .1
    1002 2001 12345 3 15 "Tamp"  .05
    1002 2001   123 3 15 "Tamp"  .04
    1003 2001   100 1  9 "Hel"   .12
    1003 2001    98 1  9 "Hel"   .02
    1003 2001   109 1  9 "Hel"  -.01
    end

  • #2
    Thank you for using -dataex- on your first post!

    I believe you want the following:
    Code:
    tempfile copy
    save `copy'
    
    //    PAIR UP EACH OBSERVATION WITH ALL OF THE SAME DIRECTOR'S OBSERVATIONS FROM
    //    THE PRECEDING YEAR
    rangejoin fyear -1 -1 using `copy', by(directorid) prefix(prior_)
    
    //    DETERMINE WHETHER ANY WERE IN THE SAME INDUSTRY OR FIRM OF SAME SIZE TERCILE
    by directorid fyear, sort: egen byte same_industry = max(ind == prior_ind) if !missing(prior_ind)
    by directorid fyear: egen byte same_size = max(tsize == prior_tsize) if !missing(prior_tsize)
    
    //    CALCULATE MEAN PERFORMANCE IN PRIOR YEAR
    by directorid fyear: egen mean_prior_perf = mean(prior_perf)
    
    //    REDUCE TO ONE OBSERVATION PER DIRECTOR PER YEAR
    drop prior_*
    by directorid fyear: keep if _n == 1

    -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
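
    (Both can be installed with -ssc install rangejoin- and -ssc install rangestat-.) The question also asks about city; following the same pattern, a minimal sketch (assuming the city and prior_city variables as in the example data, and inserted before the -drop prior_*- line) would be:
    Code:
    //    SKETCH ONLY: SAME-CITY INDICATOR, ANALOGOUS TO same_industry AND same_size
    by directorid fyear: egen byte same_city = max(city == prior_city) if !missing(prior_city)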



    • #3
      Thank you so much Clyde. This works like a charm. One more question, how would I assign 1,000 random directors (not picked) to every director who was picked. I have a big list of all directors along with their information and want to know how to pair the picked director with 1,000 random directors and then run a logit model.



      • #4
        How large is the file of directors? How many are picked? How many are unpicked?



        • #5
          The list of picked directors is around 80,000 over 14 years, but for each director selected I have to randomly assign 1,000 other directors from the 80,000 as "not picked" ones. I am trying to replicate a paper which uses this technique to proxy for the other potential candidates who could have been picked but weren't.



          • #6
            I don't understand. There are 80,000 directors that are "picked" and the same list is also the "not picked" directors? I don't get it at all.

            And in any case, you do realize that when you start with 80,000 picked directors and you match each of them with 1,000 other directors (regardless of where those come from), you end up with 80 million pairs of directors. Are you sure that is what you want?



            • #7
              Sorry for the confusion here. There are 80,000 new directors who are available to be picked. Some are picked by one firm and some by another. I have to retain the director who was chosen and randomly add 1,000 directors from the 79,999 left as "unpicked" directors. It will lead to 80 million pairs, and that's exactly what I am looking for.



              • #8
                You will potentially run up against two problems here: memory issues and speed issues. So the first thing is to strip the data set down to just the directorid's. Later you can merge the results back onto the original full data set to get the other variables (see the sketch at the end of this post).

                The following code illustrates how you can do it. I have created a toy data set with just 8000 directors to demonstrate the code. (I don't have the patience to run it with 80,000.) You would actually begin by using your real data set, keeping only the directorid variable, dropping any duplicates and then running this code from where it says "MATCHING BEGINS HERE."

                Code:
                //    CREATE TOY DATA SET
                clear*
                set obs 8000
                
                gen long directorid = _n
                
                //    MATCHING BEGINS HERE
                tempfile copy
                rename directorid unpicked
                save `copy'
                
                rename unpicked picked
                gen filename = `"`copy'"'
                
                set seed 1234
                
                capture program drop one_match
                program define one_match
                    local copy = filename[1]
                    cross using `copy'
                    drop if picked == unpicked
                    gen double shuffle = runiform()
                    sort shuffle
                    keep in 1/1000
                    drop shuffle
                    exit
                end
                
                runby one_match, by(picked) status    
                drop filename

                At no time will the data in memory be appreciably larger than just 80,000,000 observations, and those observations will each only contain two longs and a double. As for running time, with 8,000 observations it takes a bit under 2 minutes on my nothing-special computer. Since the execution time should go up as the square of the number of observations, I would expect this to take around 3 hours with 80,000 observations.

                The -status- option on -runby- will cause Stata to give you ongoing progress reports, so you won't have to wonder whether your computer is hung or is still working, and the progress report also gives you an estimate of the time remaining.

                -runby- is written by Robert Picard and me, and is available from SSC.
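
                As a sketch of the strip-down and merge-back steps described above (the file name directors_full.dta is hypothetical; adjust it to your own data set):
                Code:
                //    SKETCH ONLY (directors_full.dta IS A HYPOTHETICAL NAME FOR YOUR FULL DATA SET)
                //    PREPARATION: REDUCE TO JUST THE IDs BEFORE MATCHING
                use directors_full, clear
                keep directorid
                duplicates drop
                
                //    ... RUN THE MATCHING CODE ABOVE FROM "MATCHING BEGINS HERE" ...
                
                //    MERGE THE OTHER VARIABLES BACK FOR THE UNPICKED DIRECTORS
                //    (ASSUMES directors_full.dta HAS ONE OBSERVATION PER directorid; OTHERWISE
                //    MERGE AGAINST A DIRECTOR-LEVEL FILE INSTEAD)
                rename unpicked directorid
                merge m:1 directorid using directors_full, keep(master match) nogenerate
                rename directorid unpicked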



                • #9
                  Thank you so much once again, this works great. What would I have to do to make the picked and unpicked directors appear as one variable taking the values 1 and 0, respectively?



                  • #10
                    I don't follow what you want here. Right now you have a data set where each of the 80,000 picked directorid's appears in one variable in a block of 1000 observations--the other variable in the data set is the id of the matched unpicked director. It appears that you want to somehow stack them into one variable with a new variable that indicates picked vs unpicked status. But if you do that, you will lose the information about which picked director a given unpicked one is actually paired with. What am I missing here? Can you maybe show what the result you want would look like?



                    • #11
                      The data have a couple of other variables, namely company code and fiscal year. I want to create a dummy equal to 1 if the firm picks the director and 0 otherwise. By definition, each firm will pick a director, and I will have 1,000 unpicked directors for each company.



                      • #12
                        Will something like this work:

                        Code:
                        capture program drop one_match
                        program define one_match
                            local copy = filename[1]
                            cross using `copy'
                            gen double shuffle = runiform()
                            sort shuffle
                            keep in 1/1001
                            drop shuffle
                            exit
                        end
                        
                        gen dummy=(picked==unpicked)
                        



                        • #13
                          I have divided the data by fiscal year to make things work more smoothly. Here's an example for 2005.

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input str6 gvkey float fyear double directorid
                          "001004" 2005 346677
                          "001004" 2005 487132
                          "001004" 2005 320199
                          "164506" 2005  63027
                          "164506" 2005  51772
                          "164506" 2005  51647
                          "164506" 2005 374419
                          "164506" 2005  51697
                          "164506" 2005 374415
                          "164506" 2005 374410
                          "164506" 2005 374412
                          "164506" 2005   2455
                          "065569" 2005 334583
                          "021686" 2005 347086
                          "061523" 2005 345630
                          "011903" 2005  14468
                          "001449" 2005 333115
                          "025180" 2005  36502
                          "011361" 2005 449622
                          "004090" 2005  91707
                          "004090" 2005 373195
                          "065576" 2005  65022
                          "065072" 2005 443707
                          "144520" 2005  89327
                          "061407" 2005 370434
                          "061407" 2005 347235
                          "162076" 2005  32197
                          "162076" 2005 372988
                          "029478" 2005 444680
                          "029478" 2005 349916
                          "030584" 2005 204185
                          "030584" 2005 335869
                          "009899" 2005  51246
                          "009899" 2005  33072
                          "009899" 2005  36684
                          "009899" 2005  36525
                          "029274" 2005 200742
                          "022308" 2005 343872
                          "182309" 2005  41971
                          "147101" 2005 516078
                          "147101" 2005 516076
                          "142156" 2005 372911
                          "142156" 2005 372903
                          "061253" 2005 141369
                          "061253" 2005 339432
                          "110760" 2005 347143
                          "063643" 2005 350093
                          "063643" 2005  51386
                          "013619" 2005  36560
                          "013619" 2005 140728
                          "121553" 2005 443657
                          "001081" 2005 331921
                          "141846" 2005 203582
                          "141846" 2005 349883
                          "129632" 2005 374956
                          "014999" 2005 324239
                          "014999" 2005 442517
                          "001094" 2005 446200
                          "032459" 2005 510359
                          "032459" 2005 510558
                          "031564" 2005  33949
                          "124374" 2005 346820
                          "230796" 2005 342542
                          "180405" 2005 342489
                          "001706" 2005 331619
                          "001706" 2005 141698
                          "014753" 2005  57662
                          "014753" 2005 310278
                          "001119" 2005  57662
                          "001119" 2005 310278
                          "163628" 2005 142264
                          "163628" 2005 483715
                          "061726" 2005 488605
                          "061726" 2005 492115
                          "061726" 2005 349624
                          "161037" 2005 180752
                          "012540" 2005  37206
                          "012540" 2005  49351
                          "012540" 2005 182352
                          "133724" 2005 271624
                          "003905" 2005  34718
                          "003905" 2005 326472
                          "003905" 2005 440347
                          "124934" 2005 347595
                          "124934" 2005 347606
                          "124934" 2005 347597
                          "124934" 2005 347599
                          "030576" 2005 204208
                          "030576" 2005 490428
                          "030576" 2005 204217
                          "161067" 2005 349769
                          "161067" 2005 451616
                          "145977" 2005 206315
                          "145977" 2005 206327
                          "002831" 2005 596717
                          "002831" 2005 596712
                          "002831" 2005 596719
                          "002831" 2005 596720
                          "163679" 2005  61752
                          "023616" 2005 347841
                          end



                          • #14
                            OK. So here's some code, again starting with a toy data set. Use your actual data and go from "DATA MANAGEMENT BEGINS HERE"--although you may need to modify the -keep- commands so as to retain other important variables. The toy data has 2 firms with 5 years of data, and in each firm-year combination there is 1 picked director matched to 3 unpicked.

                            Code:
                            //    CREATE TOY DATA SET TO DEMONSTRATE CODE
                            clear*
                            tempfile controls
                            set obs 10
                            gen unpicked = _n
                            save `controls'
                            clear
                            set obs 10
                            gen firm_id = (_n <= 5)
                            by firm_id, sort: gen year = 2000+_n
                            gen picked = _n
                            cross using `controls'
                            drop if picked == unpicked
                            set seed 1234
                            gen double shuffle = runiform()
                            by firm_id year (shuffle), sort: keep if _n <= 3
                            drop shuffle
                            
                            //    DATA MANAGEMENT BEGINS HERE
                            preserve
                            keep firm_id year unpicked
                            rename unpicked directorid
                            gen byte selection = 0
                            tempfile unpicked
                            save `unpicked'
                            
                            restore
                            keep firm_id year picked
                            duplicates drop
                            rename picked directorid
                            gen byte selection = 1
                            append using `unpicked'
                            sort firm_id year selection directorid



                            • #15
                              Thank you so much Clyde. I am all set now.

