  • Mansoor Afzali
    started a topic Lag Values Excluding Observation

    Hi,

    First, I am sorry for not being able to find a better title than this, but my problem is a little more complicated. The example data below show directors (with unique director ids) belonging to firms (with unique firm ids). Fiscal years are denoted by fyear, while tsize, ind, city, and perf are, respectively, the size tercile the company belongs to, its industry code, the city where its headquarters are located, and its past performance. I am trying to find the determinants of director hires. Specifically, I want to know how a director's positions at other firms in the previous year influence the probability of his selection by the current company (assuming the director is newly hired). For this reason, I need to compute whether, in the previous year, the director served at a company of the same size, and so on; I need to do that separately for industry and city. For performance, I have to calculate the average performance of all companies the director served at in the previous year.

    Here's the example dataset.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(gvkey fyear directorid) byte(tsize ind) str4 city float perf
    1001 2000 12345 2 12 "Hel"    .2
    1001 2000   123 2 12 "Hel"    .3
    1001 2000  4531 2 12 "Hel"    .1
    1002 2000    98 3 15 "Tamp"   .1
    1002 2000 12345 3 15 "Tamp"  .05
    1002 2000   123 3 15 "Tamp"  .04
    1003 2000   100 1  9 "Hel"   .12
    1003 2000    98 1  9 "Hel"   .02
    1003 2000   109 1  9 "Hel"  -.01
    1001 2001 12345 2 12 "Hel"    .2
    1001 2001   123 2 12 "Hel"    .3
    1001 2001  4531 2 12 "Hel"    .1
    1002 2001    98 3 15 "Tamp"   .1
    1002 2001 12345 3 15 "Tamp"  .05
    1002 2001   123 3 15 "Tamp"  .04
    1003 2001   100 1  9 "Hel"   .12
    1003 2001    98 1  9 "Hel"   .02
    1003 2001   109 1  9 "Hel"  -.01
    end
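
    For reference, a minimal sketch of one way these prior-year measures might be computed, using -rangejoin- (from SSC), which is the approach suggested later in this thread; original_data stands for a saved copy of this data set:

    Code:
    * sketch only: pair each director-firm-year with all of the
    * director's seats in the immediately preceding fiscal year
    rangejoin fyear -1 -1 using original_data, by(directorid)
    
    * optionally keep only seats at *other* firms
    drop if gvkey_U == gvkey & gvkey_U < .
    
    * rangejoin suffixes clashing using-file variables with _U
    gen byte same_tsize = (tsize == tsize_U) if tsize_U < .
    gen byte same_ind   = (ind  == ind_U)    if ind_U  < .
    gen byte same_city  = (city == city_U)   if city_U != ""
    
    * one row per director-firm-year: any same-tercile/industry/city
    * seat last year, plus average prior-year performance
    collapse (max) same_tsize same_ind same_city (mean) lag_perf = perf_U, ///
        by(gvkey fyear directorid)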

  • Clyde Schechter
    replied
    So it sounds like what you need is this:

    Code:
    preserve
    tempfile matches
    rename directorid unpicked
    keep unpicked
    duplicates drop
    save `matches'
    
    restore
    rename directorid picked
    gen filename = `"`matches'"'
    
    set seed 1234
    
    capture program drop one_match
    program define one_match
        local copy = filename[1]
        cross using `copy'
        drop if picked == unpicked
        gen double shuffle = runiform()
        sort shuffle
        keep in 1/1000
        drop shuffle
        exit
    end
    
    runby one_match, by(picked) status    
    drop filename
    This retains the multiple observations of picked directors (across years or firms), but reduces the list of potential unpicked matches to a data set containing each directorid only once.
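
    A quick way to confirm the result (a sketch; it assumes at least 1,000 candidate matches exist for every picked director):

    Code:
    * every picked director's block should contain exactly 1,000 rows
    by picked, sort: assert _N == 1000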


  • Mansoor Afzali
    replied
    Thank you Clyde. I found the source of the duplicates. The file I am starting from in #14 has no duplicates with regard to firm_id, year, and director_id. However, a director could be hired by multiple firms in the same year, which is why there are duplicate director_ids with different firm_ids. I am sorry for not pointing out that firm_id is the other dimension in the data set. Is it possible to do something like #14 with this kind of data set and avoid duplicates?


  • Clyde Schechter
    replied
    The code in #8 and #14 does not introduce any duplicates that weren't in the data already. I think there was a misunderstanding back at #8. Your posts before that had included example data in which there were multiple observations per director, corresponding to different years. The code in #8 is predicated on starting from a file in which each directorid appears only once. If you instead start it from a data set with multiple observations per directorid, it will retain and propagate those duplicates. So you need to first create a file that contains only distinct directorid values and then run the matching in #8 and the reorganization in #14.
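
    That first step is just a de-duplication (a sketch; full_directors.dta is a placeholder name for your real file):

    Code:
    use full_directors, clear
    keep directorid
    duplicates drop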


  • Mansoor Afzali
    replied
    Thank you Clyde. I just realised that #14 results in duplicate director matches to firms in the same year. How do I assign 1,000 unique unpicked directors to every picked director? I could drop the duplicates, but that reduces the number of directors assigned.


  • Clyde Schechter
    replied
    So it sounds like you want to enlarge this data set to include information about directorships in the prior year. That information is in your original data set, which I'll call original_data in the code below:

    Code:
    rangejoin fyear -1 -1 using original_data, by(directorid)
    After that, each observation in the data from #16 will be paired up with all of that director's directorships from the preceding year.
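
    From there, the same-attribute indicators can be built with -egen-. A sketch, assuming the hiring firm's own attributes (e.g., its tsize from the data in #1) have already been merged onto the file from #16, and noting that -rangejoin- suffixes clashing using-file variables with _U:

    Code:
    * flag prior-year seats in the hiring firm's size tercile
    gen byte same_tsize = (tsize == tsize_U) if tsize_U < .
    * any such seat across all of the director's prior-year pairings
    egen byte served_same_tsize = max(same_tsize), by(firmid fyear directorid)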


  • Mansoor Afzali
    replied
    Hi again. Here's the problem I am facing now. I have two data sets. One looks like #1 and has all the information about firms and directors. The other looks like the example below, based on your code in #14. I can calculate whether the director has served in the same size tercile based on #2, but that gives me the information relative to that specific firm only. Suppose I want to know the information for the following example data set; what do I do then? The data set below does not list all the seats a director held in the previous year, so I cannot determine whether the director has served in the same size tercile.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 firmid double(fyear directorid) byte selection
    "001004" 2005  1147 0
    "001004" 2005  2455 0
    "001004" 2005  2615 0
    "001004" 2005  3162 0
    "001004" 2005  3162 0
    "001004" 2005  3335 0
    "001004" 2005  4669 0
    "001004" 2005  5769 0
    "001004" 2005  6098 0
    "001004" 2005  6098 0
    "001004" 2005  6897 0
    "001004" 2005  7209 0
    "001004" 2005  7683 0
    "001004" 2005  7801 0
    "001004" 2005  8036 0
    "001004" 2005  8658 0
    "001004" 2005  8803 0
    "001004" 2005  9960 0
    "001004" 2005 10122 0
    "001004" 2005 10122 0
    "001004" 2005 10339 0
    "001004" 2005 13187 0
    "001004" 2005 14651 0
    "001004" 2005 15482 0
    "001004" 2005 16567 0
    "001004" 2005 16672 0
    "001004" 2005 17103 0
    "001004" 2005 17103 0
    end
    This data set is large since I have 1,000 random (unpicked) directors for every director picked.


  • Mansoor Afzali
    replied
    Thank you so much Clyde. I am all set now.


  • Clyde Schechter
    replied
    OK. So here's some code, again starting with a toy data set. Use your actual data and go from "DATA MANAGEMENT BEGINS HERE"--although you may need to modify the -keep- commands so as to retain other important variables. The toy data has 2 firms with 5 years of data, and in each firm-year combination there is 1 picked director matched to 3 unpicked.

    Code:
    //    CREATE TOY DATA SET TO DEMONSTRATE CODE
    clear*
    tempfile controls
    set obs 10
    gen unpicked = _n
    save `controls'
    clear
    set obs 10
    gen firm_id = (_n <= 5)
    by firm_id, sort: gen year = 2000+_n
    gen picked = _n
    cross using `controls'
    drop if picked == unpicked
    set seed 1234
    gen double shuffle = runiform()
    by firm_id year (shuffle), sort: keep if _n <= 3
    drop shuffle
    
    //    DATA MANAGEMENT BEGINS HERE
    preserve
    keep firm_id year unpicked
    rename unpicked directorid
    gen byte selection = 0
    tempfile unpicked
    save `unpicked'
    
    restore
    keep firm_id year picked
    duplicates drop
    rename picked directorid
    gen byte selection = 1
    append using `unpicked'
    sort firm_id year selection directorid
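
    In the toy data, the result can be checked quickly (a sketch):

    Code:
    * each firm-year should hold 1 picked plus 3 unpicked directors
    by firm_id year, sort: assert _N == 4
    * and exactly one of them should have selection == 1
    by firm_id year: egen byte npicked = total(selection)
    assert npicked == 1
    drop npicked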


  • Mansoor Afzali
    replied
    I have divided the data by fiscal year to make things work more smoothly. Here's an example for 2005.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 gvkey float fyear double directorid
    "001004" 2005 346677
    "001004" 2005 487132
    "001004" 2005 320199
    "164506" 2005  63027
    "164506" 2005  51772
    "164506" 2005  51647
    "164506" 2005 374419
    "164506" 2005  51697
    "164506" 2005 374415
    "164506" 2005 374410
    "164506" 2005 374412
    "164506" 2005   2455
    "065569" 2005 334583
    "021686" 2005 347086
    "061523" 2005 345630
    "011903" 2005  14468
    "001449" 2005 333115
    "025180" 2005  36502
    "011361" 2005 449622
    "004090" 2005  91707
    "004090" 2005 373195
    "065576" 2005  65022
    "065072" 2005 443707
    "144520" 2005  89327
    "061407" 2005 370434
    "061407" 2005 347235
    "162076" 2005  32197
    "162076" 2005 372988
    "029478" 2005 444680
    "029478" 2005 349916
    "030584" 2005 204185
    "030584" 2005 335869
    "009899" 2005  51246
    "009899" 2005  33072
    "009899" 2005  36684
    "009899" 2005  36525
    "029274" 2005 200742
    "022308" 2005 343872
    "182309" 2005  41971
    "147101" 2005 516078
    "147101" 2005 516076
    "142156" 2005 372911
    "142156" 2005 372903
    "061253" 2005 141369
    "061253" 2005 339432
    "110760" 2005 347143
    "063643" 2005 350093
    "063643" 2005  51386
    "013619" 2005  36560
    "013619" 2005 140728
    "121553" 2005 443657
    "001081" 2005 331921
    "141846" 2005 203582
    "141846" 2005 349883
    "129632" 2005 374956
    "014999" 2005 324239
    "014999" 2005 442517
    "001094" 2005 446200
    "032459" 2005 510359
    "032459" 2005 510558
    "031564" 2005  33949
    "124374" 2005 346820
    "230796" 2005 342542
    "180405" 2005 342489
    "001706" 2005 331619
    "001706" 2005 141698
    "014753" 2005  57662
    "014753" 2005 310278
    "001119" 2005  57662
    "001119" 2005 310278
    "163628" 2005 142264
    "163628" 2005 483715
    "061726" 2005 488605
    "061726" 2005 492115
    "061726" 2005 349624
    "161037" 2005 180752
    "012540" 2005  37206
    "012540" 2005  49351
    "012540" 2005 182352
    "133724" 2005 271624
    "003905" 2005  34718
    "003905" 2005 326472
    "003905" 2005 440347
    "124934" 2005 347595
    "124934" 2005 347606
    "124934" 2005 347597
    "124934" 2005 347599
    "030576" 2005 204208
    "030576" 2005 490428
    "030576" 2005 204217
    "161067" 2005 349769
    "161067" 2005 451616
    "145977" 2005 206315
    "145977" 2005 206327
    "002831" 2005 596717
    "002831" 2005 596712
    "002831" 2005 596719
    "002831" 2005 596720
    "163679" 2005  61752
    "023616" 2005 347841
    end


  • Mansoor Afzali
    replied
    Will something like this work?

    Code:
    capture program drop one_match
    program define one_match
        local copy = filename[1]
        cross using `copy'
        gen double shuffle = runiform()
        sort shuffle
        keep in 1/1001
        drop shuffle
        exit
    end
    
    gen dummy = (picked == unpicked)


  • Mansoor Afzali
    replied
    The data has a couple of other variables, namely company code and fiscal year. I want to create a dummy equal to 1 if the firm picks the director and 0 otherwise. By definition the firm will pick a director, and I will have 1,000 unpicked directors for each company.


  • Clyde Schechter
    replied
    I don't follow what you want here. Right now you have a data set where each of the 80,000 picked directorid's appears in one variable in a block of 1000 observations--the other variable in the data set is the id of the matched unpicked director. It appears that you want to somehow stack them into one variable with a new variable that indicates picked vs unpicked status. But if you do that, you will lose the information about which picked director a given unpicked one is actually paired with. What am I missing here? Can you maybe show what the result you want would look like?


  • Mansoor Afzali
    replied
    Thank you so much once again; this works great. What would I have to do to make the picked and unpicked appear as one variable taking the values 1 and 0, respectively?


  • Clyde Schechter
    replied
    So, you will potentially run up against two problems here: memory issues and speed issues. The first thing, then, is to strip down the data set to just the directorids. Later you can merge the results back to the original full data set to get the other variables.

    The following code illustrates how you can do it. I have created a toy data set with just 8000 directors to demonstrate the code. (I don't have the patience to run it with 80,000.) You would actually begin by using your real data set, keeping only the directorid variable, dropping any duplicates and then running this code from where it says "MATCHING BEGINS HERE."

    Code:
    //    CREATE TOY DATA SET
    clear*
    set obs 8000
    
    gen long directorid = _n
    
    //    MATCHING BEGINS HERE
    tempfile copy
    rename directorid unpicked
    save `copy'
    
    rename unpicked picked
    gen filename = `"`copy'"'
    
    set seed 1234
    
    capture program drop one_match
    program define one_match
        local copy = filename[1]
        cross using `copy'
        drop if picked == unpicked
        gen double shuffle = runiform()
        sort shuffle
        keep in 1/1000
        drop shuffle
        exit
    end
    
    runby one_match, by(picked) status    
    drop filename
    At no time will the data in memory be appreciably larger than just 80,000,000 observations, and those observations will each only contain two longs and a double. As for running time, with 8,000 observations it takes a bit under 2 minutes on my nothing-special computer. Since the execution time should go up as the square of the number of observations, I would expect this to take around 3 hours with 80,000 observations. The -status- option on -runby- will cause Stata to give you ongoing progress reports, so you won't have to wonder whether your computer is hung or is still working, and the progress report also gives you an estimate of the time remaining.

    -runby- is written by Robert Picard and me, and is available from SSC.
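
    Both user-written commands that appear in this thread are on SSC; -rangejoin- additionally requires -rangestat-:

    Code:
    ssc install runby
    ssc install rangejoin
    ssc install rangestat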
