Finding Pairs within a column

Edgar Goldemer

Join Date: Oct 2019
Posts: 24

Finding Pairs within a column

30 Oct 2019, 07:15

Dear Statalists,

my dataset is about directors and which companies they work for and looks like this:

Company_ID	Director_ID	Term Begin	Term End	new_app
40416	1	1977	1999	0
15456	1	1983	2013	0
76085	1	1994	2005	0
44601	2	1975	2005	0
15720	2	1988	2002	0
86594	2	1999	2005	1
44601	3	1990	2006	0
78172	3	2000	2005	1
87657	3	2000	2009	1
44601	4	1988	2013	0
44601	5	1985	1999	0
40707	5	1991	2003	0
57568	6	1995	2007	0
80316	6	2002	2009	1
52840	7	1987	2009	0
24221	7	1999	2005	1
24221	8	1999	2009	1
52840	9	1970	2005	0
62148	9	1998	2002	1
59459	9	1998	2003	1
24221	9	1999	2004	1
90601	9	2000	2012	1
10137	9	2003	2011	1

The variable "new_app" is 1 whenever the term of a director began between 1996 and 2014, otherwise it is 0.
Now I would like to create "company pairs". That means whenever a director was appointed by a company in the above mentioned time period (new_app = 1), I want to know all other companies he worked for in that year. For example, if we take Director 2, when he was appointed by company 86594 in 1999, he also worked for companies 44601 and 15720, as his terms at those companies ended in 2005/2002. So I need to identify the combinations or pairs of companies. Ideally, the output (only director 2) would look like this:

Company_1	Company_2	Director_ID	Year
86594	44601	2	1999
86594	15720	2	1999

Can anyone help me out with this? I´m willing to thank with a small reward!

Tags: None

Dave Airey

Join Date: Apr 2014

Posts: 396
#2

30 Oct 2019, 07:39

You should give your Stata version, and provide a dataex output, and also state whether you want all pairs, e.g., would company 44601 in your example also be a company 1?
Comment
Edgar Goldemer

Join Date: Oct 2019

Posts: 24
#3

30 Oct 2019, 08:00

Hey Dave,

my Stata version is MP 15.0. I want all pairs, for which one company has appointed the director while he was already working for the other one. It has no relevance for me which company is under company_1/ company_2, but for consistency, it might be useful to list all companies with new_app = 1 in one column and the corresponding pair in the 2nd column. But it also might be the case, that both companies will have new_app = 1.

Company 44601 might be a company_1 for another director, but for director 2, I just care about the two pairs above. For example, companies 44601 and 15720 shouldn´t be displayed as a pair.

What do you mean by dataex output?
Thanks!
Comment
Dave Airey

Join Date: Apr 2014

Posts: 396
#4

30 Oct 2019, 08:10

See FAQ 12.2 or type help dataex. Basically when you give an example dataset it ensures someone can easily get it from Statalist to their Stata to help you. No one wants to spend unnecessary data management time helping Statalist posters.
Comment
Edgar Goldemer

Join Date: Oct 2019

Posts: 24
#5

30 Oct 2019, 08:40

Sorry Dave,

I guess thats what you asked for:

clear
input long permno double didown int dirsince long year_term_ends float(year new_app)
40416 1 1977 1999 1998 0
15456 1 1983 2013 2012 0
76085 1 1994 2005 2004 0
44601 2 1975 2005 2004 0
15720 2 1988 2002 2001 0
86594 2 1999 2005 2002 1
44601 3 1990 2006 2005 0
78172 3 2000 2005 2003 1
87657 3 2000 2009 2007 1
44601 4 1988 2013 2012 0
44601 5 1985 1999 2009 0
40707 5 1991 2003 2002 0
57568 6 1995 2007 2006 0
80316 6 2002 2009 2008 1
52840 7 1987 2009 2006 0
24221 7 1999 2005 2002 1
24221 8 1999 2009 2008 1
52840 9 1970 2005 2002 0
62148 9 1998 2002 2001 1
59459 9 1998 2003 2002 1
24221 9 1999 2004 2001 1
90601 9 2000 2012 2011 1
10137 9 2003 2011 2010 1
65138 12 1983 2000 1999 0
21371 12 1987 2009 2008 0
75789 12 1990 2005 2004 0
65859 12 1991 1999 1998 0
66093 12 1999 2015 2014 1
85255 12 2003 2009 2007 1
21371 13 1971 2009 2007 0
65138 13 1987 2002 2001 0
15368 13 1994 2000 1999 0
59176 13 2002 2015 2014 1
57817 13 2008 2014 2013 1
85348 13 2008 2015 2014 1
51772 14 1987 1999 1998 0
25144 14 1990 1999 1998 0
47159 14 1999 2006 2003 1
87146 14 1999 2011 2010 1
76226 14 2006 2011 2010 1
44206 14 2012 2014 2013 1
end
1 like
Comment
Dave Airey

Join Date: Apr 2014

Posts: 396
#6

30 Oct 2019, 08:57

So I wonder if this is a task for SSC rangejoin?
Comment
Edgar Goldemer

Join Date: Oct 2019

Posts: 24
#7

30 Oct 2019, 10:52

Dave, big thanks!

I managed to generate the pairs and to drop duplicates (a pair of the same company). But for the companies with new_app=1, there are still duplicates. To illustrate that, I provide you with the dataex:

clear
input long permno double didown int dirsince long year_term_ends float new_app long permno_1 int app_year
44601 3 1990 2006 0 78172 2000
44601 3 1990 2006 0 87657 2000
78172 3 2000 2005 1 87657 2000
87657 3 2000 2009 1 78172 2000
end

As you can see for director 3, the companies 78172 and 87657 are displayed as a pair twice, each as company1 (permno) and then as company2 (permno_1). But in fact it´s only one pair.
Any idea to delete the duplicates?
Comment

Dave Airey

Join Date: Apr 2014
Posts: 396

30 Oct 2019, 11:49

I think it has to do with the fact that you are merging to more than you want to. For example here I'm just taking the first and then the second new_app == 1 obs within didown. You don't get the duplication, but this is not a programmatic solution.We need the authors of rangejoin to weigh in on this problem!

Code:

clear
input permno didown dirsince year_term_ends year new_app
40416 1 1977 1999 1998 0
15456 1 1983 2013 2012 0
76085 1 1994 2005 2004 0
44601 2 1975 2005 2004 0
15720 2 1988 2002 2001 0
86594 2 1999 2005 2002 1
44601 3 1990 2006 2005 0
78172 3 2000 2005 2003 1
87657 3 2000 2009 2007 1
44601 4 1988 2013 2012 0
44601 5 1985 1999 2009 0
40707 5 1991 2003 2002 0
57568 6 1995 2007 2006 0
80316 6 2002 2009 2008 1
52840 7 1987 2009 2006 0
24221 7 1999 2005 2002 1
24221 8 1999 2009 2008 1
52840 9 1970 2005 2002 0
62148 9 1998 2002 2001 1
59459 9 1998 2003 2002 1
24221 9 1999 2004 2001 1
90601 9 2000 2012 2011 1
10137 9 2003 2011 2010 1
65138 12 1983 2000 1999 0
21371 12 1987 2009 2008 0
75789 12 1990 2005 2004 0
65859 12 1991 1999 1998 0
66093 12 1999 2015 2014 1
85255 12 2003 2009 2007 1
21371 13 1971 2009 2007 0
65138 13 1987 2002 2001 0
15368 13 1994 2000 1999 0
59176 13 2002 2015 2014 1
57817 13 2008 2014 2013 1
85348 13 2008 2015 2014 1
51772 14 1987 1999 1998 0
25144 14 1990 1999 1998 0
47159 14 1999 2006 2003 1
87146 14 1999 2011 2010 1
76226 14 2006 2011 2010 1
44206 14 2012 2014 2013 1
end

gsort didown -new_app dirsince
by didown: generate obs = _n
list, sepby(didown)
save full.dta, replace

keep if new_app == 1 & obs == 1 // the first new_app within didown
keep permno didown dirsince
save part.dta, replace

use full.dta, clear
drop if new_app == 1 & obs == 1
keep permno didown dirsince year_term_ends

rangejoin dirsince dirsince year_term_ends using part.dta, by(didown)
drop if permno_U == .
list, sepby(didown)

use full.dta, replace
keep if new_app == 1 & obs == 2 // the second new_app within didown
keep permno didown dirsince
save part.dta, replace

use full.dta, clear
drop if new_app == 1 & inlist(obs, 1, 2)
keep permno didown dirsince year_term_ends

rangejoin dirsince dirsince year_term_ends using part.dta, by(didown)
drop if permno_U == .
list, sepby(didown)

Comment

Edgar Goldemer

Join Date: Oct 2019

Posts: 24
#9

30 Oct 2019, 12:14

Again, thank you!

You´re right. I think there are directors working for more than 10 companies. For those cases, it will become a long code
Will keep looking for a solution.
Comment

Dave Airey

Join Date: Apr 2014
Posts: 396

#10

30 Oct 2019, 12:40

How about putting things in a loop?

Code:

clear
input permno didown dirsince year_term_ends year new_app
40416 1 1977 1999 1998 0
15456 1 1983 2013 2012 0
76085 1 1994 2005 2004 0
44601 2 1975 2005 2004 0
15720 2 1988 2002 2001 0
86594 2 1999 2005 2002 1
44601 3 1990 2006 2005 0
78172 3 2000 2005 2003 1
87657 3 2000 2009 2007 1
44601 4 1988 2013 2012 0
44601 5 1985 1999 2009 0
40707 5 1991 2003 2002 0
57568 6 1995 2007 2006 0
80316 6 2002 2009 2008 1
52840 7 1987 2009 2006 0
24221 7 1999 2005 2002 1
24221 8 1999 2009 2008 1
52840 9 1970 2005 2002 0
62148 9 1998 2002 2001 1
59459 9 1998 2003 2002 1
24221 9 1999 2004 2001 1
90601 9 2000 2012 2011 1
10137 9 2003 2011 2010 1
65138 12 1983 2000 1999 0
21371 12 1987 2009 2008 0
75789 12 1990 2005 2004 0
65859 12 1991 1999 1998 0
66093 12 1999 2015 2014 1
85255 12 2003 2009 2007 1
21371 13 1971 2009 2007 0
65138 13 1987 2002 2001 0
15368 13 1994 2000 1999 0
59176 13 2002 2015 2014 1
57817 13 2008 2014 2013 1
85348 13 2008 2015 2014 1
51772 14 1987 1999 1998 0
25144 14 1990 1999 1998 0
47159 14 1999 2006 2003 1
87146 14 1999 2011 2010 1
76226 14 2006 2011 2010 1
44206 14 2012 2014 2013 1
end

gsort didown -new_app dirsince
by didown: generate obs = _n
list, sepby(didown)
save full.dta, replace

summarize obs if new_app == 1

foreach i of numlist 1/5 {
    use full.dta, clear
    keep if new_app == 1 & obs == `i' // the first new_app within didown
    keep permno didown dirsince
    save part.dta, replace

    use full.dta, clear
    drop if new_app == 1 & obs <= `i'
    keep permno didown dirsince year_term_ends

    rangejoin dirsince dirsince year_term_ends using part.dta, by(didown)
    drop if permno_U == .
    list, sepby(didown)
    save part`i'.dta, replace
}

use part1.dta, clear
foreach i of numlist 2/5 {
    append using part`i'.dta
}
gsort didown dirsince
list, sepby(didown)

Comment

Edgar Goldemer

Join Date: Oct 2019

Posts: 24
#11

30 Oct 2019, 13:08

Dave, what about this way?

I multiplied permno and permno_1 (company1 and company2) to get an identifier which should be unique but identical for the duplicate pairs.
By using histogram I check that each identifier appears max. 2 times per director and then I delete one of the two lines with the respect identifier.

Code:

generate duplics = permno*permno_1 sort didown duplics by didown duplics: generate count_duplics = _n by didown duplics: generate total_duplics = _N histogram total_duplics drop if count_duplics == 2 drop count_duplics total_duplics

Thank you again!
Comment
Dave Airey

Join Date: Apr 2014

Posts: 396
#12

30 Oct 2019, 13:23

Hey I think if it works that's great! Do we get the same results for the example file?
Comment
Edgar Goldemer

Join Date: Oct 2019

Posts: 24
#13

30 Oct 2019, 14:17

For director 9 I have definitely more results and that must be right.
But right now I am too tired to search for the errors

Thanks again!
Comment
Dave Airey

Join Date: Apr 2014

Posts: 396
#14

30 Oct 2019, 18:32

Sure thing. I did not have the time to look at the files my code was generating vs what you wanted, so I'm not surprised if things are incorrect. I was using the opportunity to download and try rangejoin again. I've used rangestat once or twice but not rangejoin.
Comment

Announcement

Finding Pairs within a column

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment