Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identify overlapping observations between two variables

    Hi everyone,

    I have dataset which has event_date, firm_a, firm_b, person_ax (who are in firm_a) and person_by (who are in firm_b).

    I'm trying to identify for each event_date, which individual shares position in both firm_a and firm_b.

    For example, at event_date "5-Feb-08", individual "d" and "e" share position in both firm_a and firm_b. And I want to keep the observation with shared individual, like the following:

    str9 event_date byte(firm_a firm_b) str1(person_ax person_by)
    "5-Feb-08" 1 2 "d" "d"
    "5-Feb-08" 1 2 "e" "e"

    Much appreciated for your time and help.

    I'm new the Stata and Statalist, please feel free to ask any questions if I've not make myself clear.

    Cheers,
    Frank

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
     
    input  str9 event_date byte(firm_a firm_b) str1(person_ax person_by)
    "5-Feb-08"  1 2 "a" "d"
    "5-Feb-08"  1 2 "b" "h"
    "5-Feb-08"  1 2 "c" "j"
    "5-Feb-08"  1 2 "d" "e"
    "5-Feb-08"  1 2 "e" "p"
    "5-Feb-08"  1 2 "f" "k"
    "27-Feb-08" 3 4 "a" "b"
    "27-Feb-08" 3 4 "b" "j"
    "27-Feb-08" 3 4 "j" "o"
    "27-Feb-08" 3 4 "d" "g"
    "27-Feb-08" 3 4 "q" "m"
    "28-Feb-08" 5 6 "d" "f"
    "28-Feb-08" 5 6 "c" "j"
    "28-Feb-08" 5 6 "w" "c"
    "28-Feb-08" 5 6 "t" "w"
    "28-Feb-08" 5 6 "y" "n"
    end

  • #2
    Code:
    list if person_ax == person_by
    seems to be what you're asking.

    Comment


    • #3
      Originally posted by Nick Cox View Post
      Code:
      list if person_ax == person_by
      seems to be what you're asking.
      Thanks for the quick response Nick.

      The code you provided can only capture the observation where person_ax and person_by are on the same row. But I want to match person_ax and person_by across different rows within same event_date group.

      Cheers,
      Frank

      Comment


      • #4
        I have the nagging feeling there's a simpler way to do this. But this works in your example data. The hard part is not identifying the people who appear in both firms. The hard part is correctly creating the new observations that show them paired with themselves.

        Code:
        //  SIMPLIFY NAMES THAT WILL GET IN THE WAY OF CODING IF LEFT AS IS
        rename person_ax person_a
        rename person_by person_b
        
        //  MARK ORIGINAL SORT ORDER WITHIN EVENT DATE
        by event_date, sort: gen seq = _n
        
        //  VERIFY KEY ASSUMPTIONS
        by event_date (firm_a), sort: assert firm_a[1] == firm_a[_N]
        by event_date (firm_b), sort: assert firm_b[1] == firm_b[_N]
        by event_date person_a, sort: assert _N == 1
        by event_date person_b, sort: assert _N == 1
        
        //  GO TO LONG LAYOUT
        reshape long firm_@ person_@, i(event_date seq) j(which) string
        
        //  IDENTIFY WHEN A PERSON APPEARS IN TWO DIFFERENT FIRMS FOR THE SAME EVENT DATE
        by event_date person_, sort: gen byte in_both = _N == 2
        
        //  CREATE NEW OBSERVATIONS FOR THE PERSON PAIRED WITH SELF
        summ seq, meanonly
        local start_at = r(max) + 1
        frame put _all if in_both, into(both)
        frame both: by event_date (person), sort: replace seq = `start_at' + sum(person != person[_n-1])
        
        //  ADD THOSE OBSERVATIONS TO THE OTHER DATA
        frameappend both
        drop in_both
        
        reshape wide firm_ person_ , i(event_date seq) j(which) string
        by event_date (seq), sort: replace seq = _n
        The code relies on certain assumptions which are true in the example data. Specifically, each person appears only once as person_ax and only once as person_by among the observations for any given event_date, and firm_a and firm_b are both constant within any given event date. These assumptions are verified in the code: it will stop and give error messages, without producing results, if they are not true in the full data set.

        The new self-paired observations appear last within each batch of observations for a given event_date. The sequence variable seq may or may not be useful for you later. If it isn't, you can just drop it.

        As an aside, the event_date variable as a string is likely to be unhelpful with further analysis. It doesn't matter for this particular task because the order of dates or intervals between them don't matter. But it is likely that you will need those things later, and you can't get them right with string variables. So consider converting it to a proper Stata internal format date variable using the -daily()- function.
        Last edited by Clyde Schechter; 14 Nov 2022, 17:27.

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          I have the nagging feeling there's a simpler way to do this. But this works in your example data. The hard part is not identifying the people who appear in both firms. The hard part is correctly creating the new observations that show them paired with themselves.

          Code:
          // SIMPLIFY NAMES THAT WILL GET IN THE WAY OF CODING IF LEFT AS IS
          rename person_ax person_a
          rename person_by person_b
          
          // MARK ORIGINAL SORT ORDER WITHIN EVENT DATE
          by event_date, sort: gen seq = _n
          
          // VERIFY KEY ASSUMPTIONS
          by event_date (firm_a), sort: assert firm_a[1] == firm_a[_N]
          by event_date (firm_b), sort: assert firm_b[1] == firm_b[_N]
          by event_date person_a, sort: assert _N == 1
          by event_date person_b, sort: assert _N == 1
          
          // GO TO LONG LAYOUT
          reshape long firm_@ person_@, i(event_date seq) j(which) string
          
          // IDENTIFY WHEN A PERSON APPEARS IN TWO DIFFERENT FIRMS FOR THE SAME EVENT DATE
          by event_date person_, sort: gen byte in_both = _N == 2
          
          // CREATE NEW OBSERVATIONS FOR THE PERSON PAIRED WITH SELF
          summ seq, meanonly
          local start_at = r(max) + 1
          frame put _all if in_both, into(both)
          frame both: by event_date (person), sort: replace seq = `start_at' + sum(person != person[_n-1])
          
          // ADD THOSE OBSERVATIONS TO THE OTHER DATA
          frameappend both
          drop in_both
          
          reshape wide firm_ person_ , i(event_date seq) j(which) string
          by event_date (seq), sort: replace seq = _n
          The code relies on certain assumptions which are true in the example data. Specifically, each person appears only once as person_ax and only once as person_by among the observations for any given event_date, and firm_a and firm_b are both constant within any given event date. These assumptions are verified in the code: it will stop and give error messages, without producing results, if they are not true in the full data set.

          The new self-paired observations appear last within each batch of observations for a given event_date. The sequence variable seq may or may not be useful for you later. If it isn't, you can just drop it.

          As an aside, the event_date variable as a string is likely to be unhelpful with further analysis. It doesn't matter for this particular task because the order of dates or intervals between them don't matter. But it is likely that you will need those things later, and you can't get them right with string variables. So consider converting it to a proper Stata internal format date variable using the -daily()- function.
          Much appreciated for your help and explanation! I'll try it out. Cheers~

          Comment


          • #6
            Originally posted by Clyde Schechter View Post
            I have the nagging feeling there's a simpler way to do this. But this works in your example data. The hard part is not identifying the people who appear in both firms. The hard part is correctly creating the new observations that show them paired with themselves.

            Code:
            // SIMPLIFY NAMES THAT WILL GET IN THE WAY OF CODING IF LEFT AS IS
            rename person_ax person_a
            rename person_by person_b
            
            // MARK ORIGINAL SORT ORDER WITHIN EVENT DATE
            by event_date, sort: gen seq = _n
            
            // VERIFY KEY ASSUMPTIONS
            by event_date (firm_a), sort: assert firm_a[1] == firm_a[_N]
            by event_date (firm_b), sort: assert firm_b[1] == firm_b[_N]
            by event_date person_a, sort: assert _N == 1
            by event_date person_b, sort: assert _N == 1
            
            // GO TO LONG LAYOUT
            reshape long firm_@ person_@, i(event_date seq) j(which) string
            
            // IDENTIFY WHEN A PERSON APPEARS IN TWO DIFFERENT FIRMS FOR THE SAME EVENT DATE
            by event_date person_, sort: gen byte in_both = _N == 2
            
            // CREATE NEW OBSERVATIONS FOR THE PERSON PAIRED WITH SELF
            summ seq, meanonly
            local start_at = r(max) + 1
            frame put _all if in_both, into(both)
            frame both: by event_date (person), sort: replace seq = `start_at' + sum(person != person[_n-1])
            
            // ADD THOSE OBSERVATIONS TO THE OTHER DATA
            frameappend both
            drop in_both
            
            reshape wide firm_ person_ , i(event_date seq) j(which) string
            by event_date (seq), sort: replace seq = _n
            The code relies on certain assumptions which are true in the example data. Specifically, each person appears only once as person_ax and only once as person_by among the observations for any given event_date, and firm_a and firm_b are both constant within any given event date. These assumptions are verified in the code: it will stop and give error messages, without producing results, if they are not true in the full data set.

            The new self-paired observations appear last within each batch of observations for a given event_date. The sequence variable seq may or may not be useful for you later. If it isn't, you can just drop it.

            As an aside, the event_date variable as a string is likely to be unhelpful with further analysis. It doesn't matter for this particular task because the order of dates or intervals between them don't matter. But it is likely that you will need those things later, and you can't get them right with string variables. So consider converting it to a proper Stata internal format date variable using the -daily()- function.
            Hi Clyde,

            Sorry to bother you again. But I've played with the code, and my full dataset indeed has duplicate person_, and it is more like the following:

            Code:
             input  str9 event_date byte(firm_a firm_b) str1(person_ax person_by)
            "5-Feb-08"  1 2 "a" "d"
            "5-Feb-08"  1 2 "b" "h"
            "5-Feb-08"  1 2 "c" "j"
            "5-Feb-08"  1 2 "d" "e"
            "5-Feb-08"  1 2 "e" "p"
            "5-Feb-08"  1 2 "e" "k"
            "27-Feb-08" 3 4 "a" "b"
            "27-Feb-08" 3 4 "b" "j"
            "27-Feb-08" 3 4 "j" "o"
            "27-Feb-08" 3 4 "d" "g"
            "27-Feb-08" 3 4 "q" "m"
            "28-Feb-08" 5 6 "d" "f"
            "28-Feb-08" 5 6 "c" "j"
            "28-Feb-08" 5 6 "w" "c"
            "28-Feb-08" 5 6 "t" "w"
            "28-Feb-08" 5 6 "y" "w"
            end
            May I ask how can I deal with the duplicated person please? Thanks again for the help. Cheers, Frank
            Last edited by Frank Zhao; 14 Nov 2022, 18:48.

            Comment


            • #7
              OK, it's not terribly different.
              Code:
               clear*
               input  str9 event_date byte(firm_a firm_b) str1(person_ax person_by)
              "5-Feb-08"  1 2 "a" "d"
              "5-Feb-08"  1 2 "b" "h"
              "5-Feb-08"  1 2 "c" "j"
              "5-Feb-08"  1 2 "d" "e"
              "5-Feb-08"  1 2 "e" "p"
              "5-Feb-08"  1 2 "e" "k"
              "27-Feb-08" 3 4 "a" "b"
              "27-Feb-08" 3 4 "b" "j"
              "27-Feb-08" 3 4 "j" "o"
              "27-Feb-08" 3 4 "d" "g"
              "27-Feb-08" 3 4 "q" "m"
              "28-Feb-08" 5 6 "d" "f"
              "28-Feb-08" 5 6 "c" "j"
              "28-Feb-08" 5 6 "w" "c"
              "28-Feb-08" 5 6 "t" "w"
              "28-Feb-08" 5 6 "y" "w"
              end
              
              rename person_ax person_a
              rename person_by person_b
              
              
              by event_date (firm_a), sort: assert firm_a[1] == firm_a[_N]
              by event_date (firm_b), sort: assert firm_b[1] == firm_b[_N]
              
              sort event_date firm_a firm_b, stable
              by event_date firm_a firm_b, sort: gen long seq = _n
              summ seq, meanonly
              local start_from = r(max) + 1
              
              frame put _all, into(working)
              frame working {
                  reshape long firm_ person_, i(event_date seq) j(which) string
                  //  REMOVE DUPLICATE APPEARANCE OF THE SAME PERSON IN THE SAME FIRM, SAME EVENT_DATE
                  by event_date firm_ person_, sort: keep if _n == 1
              
                  //  NOW REDUCE TO PEOPLE WHO APPEAR IN BOTH FIRMS WITHIN AN EVENT DATE
                  by event_date person_, sort: keep if _N > 1
              
                  //  CREATE NEW OBSERVATIONS THAT PAIR THESE PEOPLE WITH THEMSELVES
                  by event_date (person_ which), sort: replace seq = `start_from' + sum(person_ != person[_n-1] )
                  reshape wide firm_ person_, i(event_date seq) j(which) string
              }
              frameappend working
              
              by event_date (seq), sort: replace seq = _n
              I forgot to mention in the earlier post that -frameappend- is written by Jeremy Freese, and is available from SSC.

              This code eliminates the assumption that the same person only occurs once in a firm for a given event_date, but the other assumptions remain in force, and are verified. As before, the newly created self-paired observations appear at the end of the batch of observations for an event_id
              Last edited by Clyde Schechter; 14 Nov 2022, 19:51.

              Comment

              Working...
              X