Identifying interlocks between string variables in a dataset with 261 872 observations and 3 variables

Kim Andre Ophus

Join Date: Apr 2018

Posts: 2
#1

Identifying interlocks between string variables in a dataset with 261 872 observations and 3 variables

09 Apr 2018, 10:31

Dear all,

I tried googling this (and search statalist directly), and have found some useful threads, but I still can not solve this challenge, so I am posting here and would be very grateful for any help and advice.

I have a data set containing three variables:
year - self explanatory
fid - firm id by ticker on stock market
mid - manager id, simply the name of the manager

My goal is to identify which (if any) firms that are connected through managers. The research is on interlocking directorates.

Originally the data looked like this:
year fid mid1 mid2 midx..
2010 XYZ AA BB ...
2011 XYZ AA BB ...
2011 ABC CC AA ...
2013 BDE DD EE ...

After some research I found out that it would be beneficial to transform the data to this:

year fid mid
2010 XYZ AA
2010 XYZ BB
2011 XYZ AA
2011 XYZ BB
2011 ABC CC
2011 ABC AA
2013 BDE DD
2013 BDE EE

Thus, since manager AA sits on the board of firm XYZ and ABC, there is an interlock between the firms.

I found a useful post here on the forum which I have tried. The code was the following:

use "bi"

rename mid mid2
joinby fid using "bi"
list fid mid mid2, noobs sepby(mid2)
>

* reduce to connected managers list
sort mid mid2
by mid mid2: keep if _n == 1
drop if mid2 <= mid
list mid mid2, noobs sepby(mid)
save "connect"
>

* to get a complete list of all possibilities
* form all pairwise combinations of managers
use "bi", clear
keep mid
sort mid
by mid: keep if _n == 1

save "mgrs"

use "mgrs"
rename mid mid2
cross using "mgrs"
*cross is not working, get cross to work, buy a beer*
sort mid mid2
list mid mid2, noobs sepby(mid)
drop if mid2 <= mid
>

* final list
merge 1:1 mid mid2 using "connect"
gen connect = _merge == 3
sort mid mid2
list mid mid2 connect, noobs sepby(mid)

I tried this code with a small sample and received a desired result. However, when I tried with my real data set. I received the following message:
sum of expand values exceed 2,147,483,620
The dataset may not contain more than 2,147,483,620 observations.

The dataset contains 261 872 observations.

I have tried to google and look through the forum for any other alternative ways to solve my problem, but I keep banging my head against the wall with this one. There may be a rather obvious solution to this, but I am not that familiar with stata. Thus I reach out to you experts and hope someone can save me.

If anyone knows how to work around this problem of the number of observations, or have an alternative solution in order to identify interlocks, I would deeply appreciate it.

Thank you,
Kim.
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

09 Apr 2018, 11:06

Welcome to Statalist, Kim.

It would be helpful if you were to post the link to the discussion from which you obtained the code you copied, so that we could see whatever discussion surrounded its creation. It is possibly doing more than is needed to address the question you have.

For that matter, is it correct to understand that you wish to determine which firms have interlocking boards separately in each year? Because the code you cite runs into problems because it does not distinguish between years, so even if the 2011 observations for XYZ were missing, you would still show XYZ and ABC interlocking.

More generally, please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.
Comment
Kim Andre Ophus

Join Date: Apr 2018

Posts: 2
#3

09 Apr 2018, 12:59

Thank you William.

Thanks for the comments. Here is the link to the code I obtained https://www.stata.com/statalist/arch.../msg00910.html
I do want for each separate year yes, but I could always just create one dataset per year, so can actually even remove that variable.

I did actually look at the FAQ, and tried my best to follow the pointers there.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30063

09 Apr 2018, 14:00

Perhaps I am missing something, but it seems to me that all you want is:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year str3 fid str2 mid
2010 "XYZ" "AA"
2010 "XYZ" "BB"
2011 "XYZ" "AA"
2011 "XYZ" "BB"
2011 "ABC" "CC"
2011 "ABC" "AA"
2013 "BDE" "DD"
2013 "BDE" "EE"
end

by mid year, sort: keep if _N > 1
by mid year (fid), sort: gen _j = _n

reshape wide fid, i(mid year) j(_j)

which should run pretty quickly, and shouldn't stress memory at all.

Announcement

Identifying interlocks between string variables in a dataset with 261 872 observations and 3 variables

Comment

Comment

Comment