  • Mansoor Afzali
    started a topic Lag Values Excluding Observation

    Hi,

    First, I am sorry for not being able to find a better title than this, but my problem is a little more complicated. The example data below show directors (with unique director ids) belonging to firms (with unique firm ids). Fiscal years are denoted by fyear, while tsize, ind, city, and perf are, respectively, the size tercile the company belongs to, its industry code, the city where its headquarters are located, and its past performance. I am trying to find the determinants of director hires. Specifically, I want to know how a director's positions at other firms in the previous year influence the probability of his selection by the current company (assuming the director is newly hired). For this reason, I need to compute whether, in the previous year, the director served at a company of the same size, and so on; I need to do that separately for industry and city. For performance, I have to calculate the average performance of all companies the director served at in the previous year.

    Here's the example dataset.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(gvkey fyear directorid) byte(tsize ind) str4 city float perf
    1001 2000 12345 2 12 "Hel"    .2
    1001 2000   123 2 12 "Hel"    .3
    1001 2000  4531 2 12 "Hel"    .1
    1002 2000    98 3 15 "Tamp"   .1
    1002 2000 12345 3 15 "Tamp"  .05
    1002 2000   123 3 15 "Tamp"  .04
    1003 2000   100 1  9 "Hel"   .12
    1003 2000    98 1  9 "Hel"   .02
    1003 2000   109 1  9 "Hel"  -.01
    1001 2001 12345 2 12 "Hel"    .2
    1001 2001   123 2 12 "Hel"    .3
    1001 2001  4531 2 12 "Hel"    .1
    1002 2001    98 3 15 "Tamp"   .1
    1002 2001 12345 3 15 "Tamp"  .05
    1002 2001   123 3 15 "Tamp"  .04
    1003 2001   100 1  9 "Hel"   .12
    1003 2001    98 1  9 "Hel"   .02
    1003 2001   109 1  9 "Hel"  -.01
    end
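
    For reference, a minimal sketch of one way these prior-year measures might be computed, using -rangejoin- (from SSC), which is the approach suggested later in this thread; original_data stands for a saved copy of this data set:

    Code:
    * sketch only: pair each director-firm-year with all of the
    * director's seats in the immediately preceding fiscal year
    rangejoin fyear -1 -1 using original_data, by(directorid)
    
    * optionally keep only seats at *other* firms
    drop if gvkey_U == gvkey & gvkey_U < .
    
    * rangejoin suffixes clashing using-file variables with _U
    gen byte same_tsize = (tsize == tsize_U) if tsize_U < .
    gen byte same_ind   = (ind  == ind_U)    if ind_U  < .
    gen byte same_city  = (city == city_U)   if city_U != ""
    
    * one row per director-firm-year: any same-tercile/industry/city
    * seat last year, plus average prior-year performance
    collapse (max) same_tsize same_ind same_city (mean) lag_perf = perf_U, ///
        by(gvkey fyear directorid)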

  • Clyde Schechter
    replied
    So it sounds like what you need is this:

    Code:
    preserve
    tempfile matches
    rename directorid unpicked
    keep unpicked
    duplicates drop
    save `matches'
    
    restore
    rename directorid picked
    gen filename = `"`matches'"'
    
    set seed 1234
    
    capture program drop one_match
    program define one_match
        local copy = filename[1]
        cross using `copy'
        drop if picked == unpicked
        gen double shuffle = runiform()
        sort shuffle
        keep in 1/1000
        drop shuffle
        exit
    end
    
    runby one_match, by(picked) status    
    drop filename
    This retains the multiple observations of picked directors (across years or firms), but reduces the list of potential unpicked matches to a data set containing each directorid only once.
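
    A quick way to confirm the result (a sketch; it assumes at least 1,000 candidate matches exist for every picked director):

    Code:
    * every picked director's block should contain exactly 1,000 rows
    by picked, sort: assert _N == 1000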


  • Mansoor Afzali
    replied
    Thank you Clyde. I found the source of the duplicates. The file I am starting from in #14 has no duplicates with regard to firm_id, year, and director_id. However, a director could be hired by multiple firms in the same year, which is why there are duplicate director_ids with different firm_ids. I am sorry for not pointing out that firm_id is the other dimension in the data set. Is it possible to do something like #14 with this kind of data set and avoid duplicates?


  • Clyde Schechter
    replied
    The code in #8 and #14 does not introduce any duplicates that weren't in the data already. I think there was a misunderstanding back at #8. Your posts before that had included example data in which there were multiple observations per director, corresponding to different years. The code in #8 is predicated on starting from a file in which each directorid appears only once. If you instead start it from a data set with multiple observations per directorid, it will retain and propagate those duplicates. So you need to first create a file that contains only distinct directorid values and then run the matching in #8 and the reorganization in #14.
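
    That first step is just a de-duplication (a sketch; full_directors.dta is a placeholder name for your real file):

    Code:
    use full_directors, clear
    keep directorid
    duplicates drop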


  • Mansoor Afzali
    replied
    Thank you Clyde. I just realised that #14 results in duplicate director matches to firms in the same year. How do I assign 1,000 unique unpicked directors to every picked director? I could drop the duplicates, but that reduces the number of directors assigned.


  • Clyde Schechter
    replied
    So it sounds like you want to enlarge this data set to include information about directorships in the prior year. That information is in your original data set, which I'll call original_data in the code below:

    Code:
    rangejoin fyear -1 -1 using original_data, by(directorid)
    After that, each observation in the data from #16 will be paired up with all of that director's directorships from the preceding year.
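
    From there, the same-attribute indicators can be built with -egen-. A sketch, assuming the hiring firm's own attributes (e.g., its tsize from the data in #1) have already been merged onto the file from #16, and noting that -rangejoin- suffixes clashing using-file variables with _U:

    Code:
    * flag prior-year seats in the hiring firm's size tercile
    gen byte same_tsize = (tsize == tsize_U) if tsize_U < .
    * any such seat across all of the director's prior-year pairings
    egen byte served_same_tsize = max(same_tsize), by(firmid fyear directorid)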


  • Mansoor Afzali
    replied
    Hi again. Here's the problem I am facing now. I have two data sets. One looks like #1 and has all the information about firms and directors. The other looks like the example below, based on your code in #14. I can calculate whether the director has served in the same size tercile based on #2, but that gives me the information relative to that specific firm only. Suppose I want to know the information for the following example data set; what do I do then? The data set below does not list all the seats a director held in the previous year, so I cannot determine whether the director has served in the same size tercile.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 firmid double(fyear directorid) byte selection
    "001004" 2005  1147 0
    "001004" 2005  2455 0
    "001004" 2005  2615 0
    "001004" 2005  3162 0
    "001004" 2005  3162 0
    "001004" 2005  3335 0
    "001004" 2005  4669 0
    "001004" 2005  5769 0
    "001004" 2005  6098 0
    "001004" 2005  6098 0
    "001004" 2005  6897 0
    "001004" 2005  7209 0
    "001004" 2005  7683 0
    "001004" 2005  7801 0
    "001004" 2005  8036 0
    "001004" 2005  8658 0
    "001004" 2005  8803 0
    "001004" 2005  9960 0
    "001004" 2005 10122 0
    "001004" 2005 10122 0
    "001004" 2005 10339 0
    "001004" 2005 13187 0
    "001004" 2005 14651 0
    "001004" 2005 15482 0
    "001004" 2005 16567 0
    "001004" 2005 16672 0
    "001004" 2005 17103 0
    "001004" 2005 17103 0
    end
    This data set is large since I have 1,000 random (unpicked) directors for every director picked.


  • Mansoor Afzali
    replied
    Thank you so much Clyde. I am all set now.


  • Clyde Schechter
    replied
    OK. So here's some code, again starting with a toy data set. Use your actual data and go from "DATA MANAGEMENT BEGINS HERE"--although you may need to modify the -keep- commands so as to retain other important variables. The toy data has 2 firms with 5 years of data, and in each firm-year combination there is 1 picked director matched to 3 unpicked.

    Code:
    //    CREATE TOY DATA SET TO DEMONSTRATE CODE
    clear*
    tempfile controls
    set obs 10
    gen unpicked = _n
    save `controls'
    clear
    set obs 10
    gen firm_id = (_n <= 5)
    by firm_id, sort: gen year = 2000+_n
    gen picked = _n
    cross using `controls'
    drop if picked == unpicked
    set seed 1234
    gen double shuffle = runiform()
    by firm_id year (shuffle), sort: keep if _n <= 3
    drop shuffle
    
    //    DATA MANAGEMENT BEGINS HERE
    preserve
    keep firm_id year unpicked
    rename unpicked directorid
    gen byte selection = 0
    tempfile unpicked
    save `unpicked'
    
    restore
    keep firm_id year picked
    duplicates drop
    rename picked directorid
    gen byte selection = 1
    append using `unpicked'
    sort firm_id year selection directorid
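
    In the toy data, the result can be checked quickly (a sketch):

    Code:
    * each firm-year should hold 1 picked plus 3 unpicked directors
    by firm_id year, sort: assert _N == 4
    * and exactly one of them should have selection == 1
    by firm_id year: egen byte npicked = total(selection)
    assert npicked == 1
    drop npicked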


  • Mansoor Afzali
    replied
    I have divided the data by fiscal year to make things work more smoothly. Here's an example for 2005.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 gvkey float fyear double directorid
    "001004" 2005 346677
    "001004" 2005 487132
    "001004" 2005 320199
    "164506" 2005  63027
    "164506" 2005  51772
    "164506" 2005  51647
    "164506" 2005 374419
    "164506" 2005  51697
    "164506" 2005 374415
    "164506" 2005 374410
    "164506" 2005 374412
    "164506" 2005   2455
    "065569" 2005 334583
    "021686" 2005 347086
    "061523" 2005 345630
    "011903" 2005  14468
    "001449" 2005 333115
    "025180" 2005  36502
    "011361" 2005 449622
    "004090" 2005  91707
    "004090" 2005 373195
    "065576" 2005  65022
    "065072" 2005 443707
    "144520" 2005  89327
    "061407" 2005 370434
    "061407" 2005 347235
    "162076" 2005  32197
    "162076" 2005 372988
    "029478" 2005 444680
    "029478" 2005 349916
    "030584" 2005 204185
    "030584" 2005 335869
    "009899" 2005  51246
    "009899" 2005  33072
    "009899" 2005  36684
    "009899" 2005  36525
    "029274" 2005 200742
    "022308" 2005 343872
    "182309" 2005  41971
    "147101" 2005 516078
    "147101" 2005 516076
    "142156" 2005 372911
    "142156" 2005 372903
    "061253" 2005 141369
    "061253" 2005 339432
    "110760" 2005 347143
    "063643" 2005 350093
    "063643" 2005  51386
    "013619" 2005  36560
    "013619" 2005 140728
    "121553" 2005 443657
    "001081" 2005 331921
    "141846" 2005 203582
    "141846" 2005 349883
    "129632" 2005 374956
    "014999" 2005 324239
    "014999" 2005 442517
    "001094" 2005 446200
    "032459" 2005 510359
    "032459" 2005 510558
    "031564" 2005  33949
    "124374" 2005 346820
    "230796" 2005 342542
    "180405" 2005 342489
    "001706" 2005 331619
    "001706" 2005 141698
    "014753" 2005  57662
    "014753" 2005 310278
    "001119" 2005  57662
    "001119" 2005 310278
    "163628" 2005 142264
    "163628" 2005 483715
    "061726" 2005 488605
    "061726" 2005 492115
    "061726" 2005 349624
    "161037" 2005 180752
    "012540" 2005  37206
    "012540" 2005  49351
    "012540" 2005 182352
    "133724" 2005 271624
    "003905" 2005  34718
    "003905" 2005 326472
    "003905" 2005 440347
    "124934" 2005 347595
    "124934" 2005 347606
    "124934" 2005 347597
    "124934" 2005 347599
    "030576" 2005 204208
    "030576" 2005 490428
    "030576" 2005 204217
    "161067" 2005 349769
    "161067" 2005 451616
    "145977" 2005 206315
    "145977" 2005 206327
    "002831" 2005 596717
    "002831" 2005 596712
    "002831" 2005 596719
    "002831" 2005 596720
    "163679" 2005  61752
    "023616" 2005 347841
    end


  • Mansoor Afzali
    replied
    Will something like this work?

    Code:
    capture program drop one_match
    program define one_match
        local copy = filename[1]
        cross using `copy'
        gen double shuffle = runiform()
        sort shuffle
        keep in 1/1001
        drop shuffle
        exit
    end
    
    gen dummy = (picked == unpicked)


  • Mansoor Afzali
    replied
    The data has a couple of other variables, namely company code and fiscal year. I want to create a dummy equal to 1 if the firm picks the director and 0 otherwise. By definition the firm will pick a director, and I will have 1,000 unpicked directors for each company.


  • Clyde Schechter
    replied
    I don't follow what you want here. Right now you have a data set where each of the 80,000 picked directorid's appears in one variable in a block of 1000 observations--the other variable in the data set is the id of the matched unpicked director. It appears that you want to somehow stack them into one variable with a new variable that indicates picked vs unpicked status. But if you do that, you will lose the information about which picked director a given unpicked one is actually paired with. What am I missing here? Can you maybe show what the result you want would look like?


  • Mansoor Afzali
    replied
    Thank you so much once again; this works great. What would I have to do to make the picked and unpicked appear as one variable taking the values 1 and 0, respectively?


  • Clyde Schechter
    replied
    So, you will potentially run up against two problems here: memory issues and speed issues. The first thing, then, is to strip down the data set to just the directorids. Later you can merge the results back to the original full data set to get the other variables.

    The following code illustrates how you can do it. I have created a toy data set with just 8000 directors to demonstrate the code. (I don't have the patience to run it with 80,000.) You would actually begin by using your real data set, keeping only the directorid variable, dropping any duplicates and then running this code from where it says "MATCHING BEGINS HERE."

    Code:
    //    CREATE TOY DATA SET
    clear*
    set obs 8000
    
    gen long directorid = _n
    
    //    MATCHING BEGINS HERE
    tempfile copy
    rename directorid unpicked
    save `copy'
    
    rename unpicked picked
    gen filename = `"`copy'"'
    
    set seed 1234
    
    capture program drop one_match
    program define one_match
        local copy = filename[1]
        cross using `copy'
        drop if picked == unpicked
        gen double shuffle = runiform()
        sort shuffle
        keep in 1/1000
        drop shuffle
        exit
    end
    
    runby one_match, by(picked) status    
    drop filename
    At no time will the data in memory be appreciably larger than just 80,000,000 observations, and those observations will each only contain two longs and a double. As for running time, with 8,000 observations it takes a bit under 2 minutes on my nothing-special computer. Since the execution time should go up as the square of the number of observations, I would expect this to take around 3 hours with 80,000 observations. The -status- option on -runby- will cause Stata to give you ongoing progress reports, so you won't have to wonder whether your computer is hung or is still working, and the progress report also gives you an estimate of the time remaining.

    -runby- is written by Robert Picard and me, and is available from SSC.
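
    Both user-written commands that appear in this thread are on SSC; -rangejoin- additionally requires -rangestat-:

    Code:
    ssc install runby
    ssc install rangejoin
    ssc install rangestat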
