Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying interlocks between string variables in a dataset with 261 872 observations and 3 variables

    Dear all,

    I tried googling this (and search statalist directly), and have found some useful threads, but I still can not solve this challenge, so I am posting here and would be very grateful for any help and advice.

    I have a data set containing three variables:
    year - self explanatory
    fid - firm id by ticker on stock market
    mid - manager id, simply the name of the manager

    My goal is to identify which (if any) firms that are connected through managers. The research is on interlocking directorates.

    Originally the data looked like this:
    year fid mid1 mid2 midx..
    2010 XYZ AA BB ...
    2011 XYZ AA BB ...
    2011 ABC CC AA ...
    2013 BDE DD EE ...

    After some research I found out that it would be beneficial to transform the data to this:

    year fid mid
    2010 XYZ AA
    2010 XYZ BB
    2011 XYZ AA
    2011 XYZ BB
    2011 ABC CC
    2011 ABC AA
    2013 BDE DD
    2013 BDE EE

    Thus, since manager AA sits on the board of firm XYZ and ABC, there is an interlock between the firms.

    I found a useful post here on the forum which I have tried. The code was the following:

    use "bi"

    rename mid mid2
    joinby fid using "bi"
    list fid mid mid2, noobs sepby(mid2)
    >

    * reduce to connected managers list
    sort mid mid2
    by mid mid2: keep if _n == 1
    drop if mid2 <= mid
    list mid mid2, noobs sepby(mid)
    save "connect"
    >

    * to get a complete list of all possibilities
    * form all pairwise combinations of managers
    use "bi", clear
    keep mid
    sort mid
    by mid: keep if _n == 1

    save "mgrs"

    use "mgrs"
    rename mid mid2
    cross using "mgrs"
    *cross is not working, get cross to work, buy a beer*
    sort mid mid2
    list mid mid2, noobs sepby(mid)
    drop if mid2 <= mid
    >

    * final list
    merge 1:1 mid mid2 using "connect"
    gen connect = _merge == 3
    sort mid mid2
    list mid mid2 connect, noobs sepby(mid)


    I tried this code with a small sample and received a desired result. However, when I tried with my real data set. I received the following message:
    sum of expand values exceed 2,147,483,620
    The dataset may not contain more than 2,147,483,620 observations.


    The dataset contains 261 872 observations.

    I have tried to google and look through the forum for any other alternative ways to solve my problem, but I keep banging my head against the wall with this one. There may be a rather obvious solution to this, but I am not that familiar with stata. Thus I reach out to you experts and hope someone can save me.

    If anyone knows how to work around this problem of the number of observations, or have an alternative solution in order to identify interlocks, I would deeply appreciate it.

    Thank you,
    Kim.


  • #2
    Welcome to Statalist, Kim.

    It would be helpful if you were to post the link to the discussion from which you obtained the code you copied, so that we could see whatever discussion surrounded its creation. It is possibly doing more than is needed to address the question you have.

    For that matter, is it correct to understand that you wish to determine which firms have interlocking boards separately in each year? Because the code you cite runs into problems because it does not distinguish between years, so even if the 2011 observations for XYZ were missing, you would still show XYZ and ABC interlocking.

    More generally, please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

    The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

    Comment


    • #3
      Thank you William.

      Thanks for the comments. Here is the link to the code I obtained https://www.stata.com/statalist/arch.../msg00910.html
      I do want for each separate year yes, but I could always just create one dataset per year, so can actually even remove that variable.

      I did actually look at the FAQ, and tried my best to follow the pointers there.

      Comment


      • #4
        Perhaps I am missing something, but it seems to me that all you want is:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int year str3 fid str2 mid
        2010 "XYZ" "AA"
        2010 "XYZ" "BB"
        2011 "XYZ" "AA"
        2011 "XYZ" "BB"
        2011 "ABC" "CC"
        2011 "ABC" "AA"
        2013 "BDE" "DD"
        2013 "BDE" "EE"
        end
        
        by mid year, sort: keep if _N > 1
        by mid year (fid), sort: gen _j = _n
        
        reshape wide fid, i(mid year) j(_j)
        which should run pretty quickly, and shouldn't stress memory at all.

        Comment

        Working...
        X