Identify overlapping observations between two variables

Frank Zhao

Join Date: Nov 2022

Posts: 5
#1

Identify overlapping observations between two variables

14 Nov 2022, 14:53

Hi everyone,

I have dataset which has event_date, firm_a, firm_b, person_ax (who are in firm_a) and person_by (who are in firm_b).

I'm trying to identify for each event_date, which individual shares position in both firm_a and firm_b.

For example, at event_date "5-Feb-08", individual "d" and "e" share position in both firm_a and firm_b. And I want to keep the observation with shared individual, like the following:

str9 event_date byte(firm_a firm_b) str1(person_ax person_by)
"5-Feb-08" 1 2 "d" "d"
"5-Feb-08" 1 2 "e" "e"

Much appreciated for your time and help.

I'm new the Stata and Statalist, please feel free to ask any questions if I've not make myself clear.

Cheers,
Frank

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str9 event_date byte(firm_a firm_b) str1(person_ax person_by) "5-Feb-08" 1 2 "a" "d" "5-Feb-08" 1 2 "b" "h" "5-Feb-08" 1 2 "c" "j" "5-Feb-08" 1 2 "d" "e" "5-Feb-08" 1 2 "e" "p" "5-Feb-08" 1 2 "f" "k" "27-Feb-08" 3 4 "a" "b" "27-Feb-08" 3 4 "b" "j" "27-Feb-08" 3 4 "j" "o" "27-Feb-08" 3 4 "d" "g" "27-Feb-08" 3 4 "q" "m" "28-Feb-08" 5 6 "d" "f" "28-Feb-08" 5 6 "c" "j" "28-Feb-08" 5 6 "w" "c" "28-Feb-08" 5 6 "t" "w" "28-Feb-08" 5 6 "y" "n" end
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 36053
#2

14 Nov 2022, 15:17

Code:

list if person_ax == person_by

seems to be what you're asking.
Comment
Frank Zhao

Join Date: Nov 2022

Posts: 5
#3

14 Nov 2022, 15:42

Originally posted by Nick Cox View Post

Code:

list if person_ax == person_by

seems to be what you're asking.

Thanks for the quick response Nick.

The code you provided can only capture the observation where person_ax and person_by are on the same row. But I want to match person_ax and person_by across different rows within same event_date group.

Cheers,
Frank
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#4

14 Nov 2022, 17:24

I have the nagging feeling there's a simpler way to do this. But this works in your example data. The hard part is not identifying the people who appear in both firms. The hard part is correctly creating the new observations that show them paired with themselves.

Code:

// SIMPLIFY NAMES THAT WILL GET IN THE WAY OF CODING IF LEFT AS IS rename person_ax person_a rename person_by person_b // MARK ORIGINAL SORT ORDER WITHIN EVENT DATE by event_date, sort: gen seq = _n // VERIFY KEY ASSUMPTIONS by event_date (firm_a), sort: assert firm_a[1] == firm_a[_N] by event_date (firm_b), sort: assert firm_b[1] == firm_b[_N] by event_date person_a, sort: assert _N == 1 by event_date person_b, sort: assert _N == 1 // GO TO LONG LAYOUT reshape long firm_@ person_@, i(event_date seq) j(which) string // IDENTIFY WHEN A PERSON APPEARS IN TWO DIFFERENT FIRMS FOR THE SAME EVENT DATE by event_date person_, sort: gen byte in_both = _N == 2 // CREATE NEW OBSERVATIONS FOR THE PERSON PAIRED WITH SELF summ seq, meanonly local start_at = r(max) + 1 frame put _all if in_both, into(both) frame both: by event_date (person), sort: replace seq = `start_at' + sum(person != person[_n-1]) // ADD THOSE OBSERVATIONS TO THE OTHER DATA frameappend both drop in_both reshape wide firm_ person_ , i(event_date seq) j(which) string by event_date (seq), sort: replace seq = _n

The code relies on certain assumptions which are true in the example data. Specifically, each person appears only once as person_ax and only once as person_by among the observations for any given event_date, and firm_a and firm_b are both constant within any given event date. These assumptions are verified in the code: it will stop and give error messages, without producing results, if they are not true in the full data set.

The new self-paired observations appear last within each batch of observations for a given event_date. The sequence variable seq may or may not be useful for you later. If it isn't, you can just drop it.

As an aside, the event_date variable as a string is likely to be unhelpful with further analysis. It doesn't matter for this particular task because the order of dates or intervals between them don't matter. But it is likely that you will need those things later, and you can't get them right with string variables. So consider converting it to a proper Stata internal format date variable using the -daily()- function.

Last edited by Clyde Schechter; 14 Nov 2022, 17:27.
Comment
Frank Zhao

Join Date: Nov 2022

Posts: 5
#5

14 Nov 2022, 17:33

Originally posted by Clyde Schechter View Post

I have the nagging feeling there's a simpler way to do this. But this works in your example data. The hard part is not identifying the people who appear in both firms. The hard part is correctly creating the new observations that show them paired with themselves.

Code:

// SIMPLIFY NAMES THAT WILL GET IN THE WAY OF CODING IF LEFT AS IS rename person_ax person_a rename person_by person_b // MARK ORIGINAL SORT ORDER WITHIN EVENT DATE by event_date, sort: gen seq = _n // VERIFY KEY ASSUMPTIONS by event_date (firm_a), sort: assert firm_a[1] == firm_a[_N] by event_date (firm_b), sort: assert firm_b[1] == firm_b[_N] by event_date person_a, sort: assert _N == 1 by event_date person_b, sort: assert _N == 1 // GO TO LONG LAYOUT reshape long firm_@ person_@, i(event_date seq) j(which) string // IDENTIFY WHEN A PERSON APPEARS IN TWO DIFFERENT FIRMS FOR THE SAME EVENT DATE by event_date person_, sort: gen byte in_both = _N == 2 // CREATE NEW OBSERVATIONS FOR THE PERSON PAIRED WITH SELF summ seq, meanonly local start_at = r(max) + 1 frame put _all if in_both, into(both) frame both: by event_date (person), sort: replace seq = `start_at' + sum(person != person[_n-1]) // ADD THOSE OBSERVATIONS TO THE OTHER DATA frameappend both drop in_both reshape wide firm_ person_ , i(event_date seq) j(which) string by event_date (seq), sort: replace seq = _n

The code relies on certain assumptions which are true in the example data. Specifically, each person appears only once as person_ax and only once as person_by among the observations for any given event_date, and firm_a and firm_b are both constant within any given event date. These assumptions are verified in the code: it will stop and give error messages, without producing results, if they are not true in the full data set.

The new self-paired observations appear last within each batch of observations for a given event_date. The sequence variable seq may or may not be useful for you later. If it isn't, you can just drop it.

As an aside, the event_date variable as a string is likely to be unhelpful with further analysis. It doesn't matter for this particular task because the order of dates or intervals between them don't matter. But it is likely that you will need those things later, and you can't get them right with string variables. So consider converting it to a proper Stata internal format date variable using the -daily()- function.

Much appreciated for your help and explanation! I'll try it out. Cheers~
Comment
Frank Zhao

Join Date: Nov 2022

Posts: 5
#6

14 Nov 2022, 18:46

Originally posted by Clyde Schechter View Post

I have the nagging feeling there's a simpler way to do this. But this works in your example data. The hard part is not identifying the people who appear in both firms. The hard part is correctly creating the new observations that show them paired with themselves.

Code:

// SIMPLIFY NAMES THAT WILL GET IN THE WAY OF CODING IF LEFT AS IS rename person_ax person_a rename person_by person_b // MARK ORIGINAL SORT ORDER WITHIN EVENT DATE by event_date, sort: gen seq = _n // VERIFY KEY ASSUMPTIONS by event_date (firm_a), sort: assert firm_a[1] == firm_a[_N] by event_date (firm_b), sort: assert firm_b[1] == firm_b[_N] by event_date person_a, sort: assert _N == 1 by event_date person_b, sort: assert _N == 1 // GO TO LONG LAYOUT reshape long firm_@ person_@, i(event_date seq) j(which) string // IDENTIFY WHEN A PERSON APPEARS IN TWO DIFFERENT FIRMS FOR THE SAME EVENT DATE by event_date person_, sort: gen byte in_both = _N == 2 // CREATE NEW OBSERVATIONS FOR THE PERSON PAIRED WITH SELF summ seq, meanonly local start_at = r(max) + 1 frame put _all if in_both, into(both) frame both: by event_date (person), sort: replace seq = `start_at' + sum(person != person[_n-1]) // ADD THOSE OBSERVATIONS TO THE OTHER DATA frameappend both drop in_both reshape wide firm_ person_ , i(event_date seq) j(which) string by event_date (seq), sort: replace seq = _n

The code relies on certain assumptions which are true in the example data. Specifically, each person appears only once as person_ax and only once as person_by among the observations for any given event_date, and firm_a and firm_b are both constant within any given event date. These assumptions are verified in the code: it will stop and give error messages, without producing results, if they are not true in the full data set.

The new self-paired observations appear last within each batch of observations for a given event_date. The sequence variable seq may or may not be useful for you later. If it isn't, you can just drop it.

As an aside, the event_date variable as a string is likely to be unhelpful with further analysis. It doesn't matter for this particular task because the order of dates or intervals between them don't matter. But it is likely that you will need those things later, and you can't get them right with string variables. So consider converting it to a proper Stata internal format date variable using the -daily()- function.

Hi Clyde,

Sorry to bother you again. But I've played with the code, and my full dataset indeed has duplicate person_, and it is more like the following:

Code:

input str9 event_date byte(firm_a firm_b) str1(person_ax person_by) "5-Feb-08" 1 2 "a" "d" "5-Feb-08" 1 2 "b" "h" "5-Feb-08" 1 2 "c" "j" "5-Feb-08" 1 2 "d" "e" "5-Feb-08" 1 2 "e" "p" "5-Feb-08" 1 2 "e" "k" "27-Feb-08" 3 4 "a" "b" "27-Feb-08" 3 4 "b" "j" "27-Feb-08" 3 4 "j" "o" "27-Feb-08" 3 4 "d" "g" "27-Feb-08" 3 4 "q" "m" "28-Feb-08" 5 6 "d" "f" "28-Feb-08" 5 6 "c" "j" "28-Feb-08" 5 6 "w" "c" "28-Feb-08" 5 6 "t" "w" "28-Feb-08" 5 6 "y" "w" end

May I ask how can I deal with the duplicated person please? Thanks again for the help. Cheers, Frank

Last edited by Frank Zhao; 14 Nov 2022, 18:48.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30355

14 Nov 2022, 19:48

OK, it's not terribly different.

Code:

 clear*
 input  str9 event_date byte(firm_a firm_b) str1(person_ax person_by)
"5-Feb-08"  1 2 "a" "d"
"5-Feb-08"  1 2 "b" "h"
"5-Feb-08"  1 2 "c" "j"
"5-Feb-08"  1 2 "d" "e"
"5-Feb-08"  1 2 "e" "p"
"5-Feb-08"  1 2 "e" "k"
"27-Feb-08" 3 4 "a" "b"
"27-Feb-08" 3 4 "b" "j"
"27-Feb-08" 3 4 "j" "o"
"27-Feb-08" 3 4 "d" "g"
"27-Feb-08" 3 4 "q" "m"
"28-Feb-08" 5 6 "d" "f"
"28-Feb-08" 5 6 "c" "j"
"28-Feb-08" 5 6 "w" "c"
"28-Feb-08" 5 6 "t" "w"
"28-Feb-08" 5 6 "y" "w"
end

rename person_ax person_a
rename person_by person_b


by event_date (firm_a), sort: assert firm_a[1] == firm_a[_N]
by event_date (firm_b), sort: assert firm_b[1] == firm_b[_N]

sort event_date firm_a firm_b, stable
by event_date firm_a firm_b, sort: gen long seq = _n
summ seq, meanonly
local start_from = r(max) + 1

frame put _all, into(working)
frame working {
    reshape long firm_ person_, i(event_date seq) j(which) string
    //  REMOVE DUPLICATE APPEARANCE OF THE SAME PERSON IN THE SAME FIRM, SAME EVENT_DATE
    by event_date firm_ person_, sort: keep if _n == 1

    //  NOW REDUCE TO PEOPLE WHO APPEAR IN BOTH FIRMS WITHIN AN EVENT DATE
    by event_date person_, sort: keep if _N > 1

    //  CREATE NEW OBSERVATIONS THAT PAIR THESE PEOPLE WITH THEMSELVES
    by event_date (person_ which), sort: replace seq = `start_from' + sum(person_ != person[_n-1] )
    reshape wide firm_ person_, i(event_date seq) j(which) string
}
frameappend working

by event_date (seq), sort: replace seq = _n

I forgot to mention in the earlier post that -frameappend- is written by Jeremy Freese, and is available from SSC.

This code eliminates the assumption that the same person only occurs once in a firm for a given event_date, but the other assumptions remain in force, and are verified. As before, the newly created self-paired observations appear at the end of the batch of observations for an event_id

Last edited by Clyde Schechter; 14 Nov 2022, 19:51.

Announcement

Identify overlapping observations between two variables

Comment

Comment

Comment

Comment

Comment

Comment