Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to mix the obs of two variables?

    Hello everyone,

    This is my database.
    Each org_uuid is matched with a specific investor_uuid. This are real matches.
    Then each org_uuid has its own employee and each investor_uuid has its own partners.

    Click image for larger version

Name:	Schermata 2021-10-19 alle 09.43.59.png
Views:	1
Size:	164.1 KB
ID:	1632377


    I want to build a counterfactual sample of all matches that did not happen. To do this I'm trying to mix the org_uuid and the investor_uuid, i.e, for the first org_uuid i want to match all the investor_uuid; then for the second org_uuid i want to match all the investor_uuid, and so on.
    After that i want to rematch to each org_uuid, its employees and to each investors_uuid all its partners.

    Each counterfactual march should be marked with 0, while each real match with 1, so that they can be recognised.

    If anyone can help me, it would be really appreciated!

    Thank you in advance!

    Charlotte

  • #2
    The structure of your dataset is not very clear to me, and consequently I don't understand your objective very well, but it sounds like
    Code:
    help joinby
    could be useful for you to create all of the possible counterfactual matches, with you then merging the "true" matches back into the dataset created by joinyb to mark the "real" matches.

    Be aware that depending on your situation joinby could create an extremely large dataset which you will struggle to work with. You may want to consider:
    1) trying on a smaller subset of the data first
    2) for the main dataset, making it smaller by dropping extraneous variables and using encode to convert these string variables into numeric variables, which typically take up much less space.

    Comment


    • #3
      Charlotte, see if this helps you move forward. Also, in the future it would be helpful if you could share your data via the dataex command. Since many posters are not familiar with dataex (it was created specifically to be able to share data here at statalist), I created a short Youtube video explaining it here.

      I think the command you want to use is fillin (which adds all possible combinations). I created some toy data of 5 startups with some different investors to help make it concrete:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input byte startup_id str8 startup_name int investor_id str15 investor_name
      1 "Google"   101 "Sequoia "      
      2 "Facebook" 101 "Sequoia "      
      3 "Uber"     102 "Accel Partners"
      4 "Airbrb"   103 "Kleiner Perkins"
      5 "Dropbox"  102 "Accel Partners"
      end
      
      
      list, noobs abbrev(12)
        +-----------------------------------------------------------+
        | startup_id   startup_name   investor_id     investor_name |
        |-----------------------------------------------------------|
        |          1         Google           101          Sequoia  |
        |          2       Facebook           101          Sequoia  |
        |          3           Uber           102    Accel Partners |
        |          4         Airbrb           103   Kleiner Perkins |
        |          5        Dropbox           102    Accel Partners |
        +-----------------------------------------------------------+
      
      fillin startup_id investor_id
      list, noobs abbrev(12) sepby( startup_id)
      /* Note that fillin creates a variable _fillin that=0 for the originals, _fillin==1 for new observations that it creates  */
      
        +---------------------------------------------------------------------+
        | startup_id   startup_name   investor_id     investor_name   _fillin |
        |---------------------------------------------------------------------|
        |          1         Google           101          Sequoia          0 |
        |          1                          102                           1 |
        |          1                          103                           1 |
        |---------------------------------------------------------------------|
        |          2       Facebook           101          Sequoia          0 |
        |          2                          102                           1 |
        |          2                          103                           1 |
        |---------------------------------------------------------------------|
        |          3                          101                           1 |
        |          3           Uber           102    Accel Partners         0 |
        |          3                          103                           1 |
        |---------------------------------------------------------------------|
        |          4                          101                           1 |
        |          4                          102                           1 |
        |          4         Airbrb           103   Kleiner Perkins         0 |
        |---------------------------------------------------------------------|
        |          5                          101                           1 |
        |          5        Dropbox           102    Accel Partners         0 |
        |          5                          103                           1 |
        +---------------------------------------------------------------------+
      Code:
      gen byte real_match = (_fillin==0)   // real_match==1 if the investment really took place, 0 otherwise
      gsort startup_id -startup_name  // sorting such that the observation with the startup name is in the first position
      by startup_id: replace startup_name=startup_name[1] if startup_name==""    //  carrying forward the startup_name to fill in the blanks
      
      gsort investor_id  -investor_name
      by investor_id:  replace investor_name=investor_name[1] if investor_name==""  // carrying forward the investor_name to fill in blanks
      
      sort startup_id _fillin  // sorting by startup, this puts the real matches that took place in the first position
      list, noobs abbrev(12) sepby( startup_id)
      
        +----------------------------------------------------------------------------------+
        | startup_id   startup_name   investor_id     investor_name   _fillin   real_match |
        |----------------------------------------------------------------------------------|
        |          1         Google           101          Sequoia          0            1 |
        |          1         Google           102    Accel Partners         1            0 |
        |          1         Google           103   Kleiner Perkins         1            0 |
        |----------------------------------------------------------------------------------|
        |          2       Facebook           101          Sequoia          0            1 |
        |          2       Facebook           103   Kleiner Perkins         1            0 |
        |          2       Facebook           102    Accel Partners         1            0 |
        |----------------------------------------------------------------------------------|
        |          3           Uber           102    Accel Partners         0            1 |
        |          3           Uber           101          Sequoia          1            0 |
        |          3           Uber           103   Kleiner Perkins         1            0 |
        |----------------------------------------------------------------------------------|
        |          4         Airbrb           103   Kleiner Perkins         0            1 |
        |          4         Airbrb           102    Accel Partners         1            0 |
        |          4         Airbrb           101          Sequoia          1            0 |
        |----------------------------------------------------------------------------------|
        |          5        Dropbox           102    Accel Partners         0            1 |
        |          5        Dropbox           101          Sequoia          1            0 |
        |          5        Dropbox           103   Kleiner Perkins         1            0 |
        +----------------------------------------------------------------------------------+
      A couple of notes. First, if you have lots of firms and lots of investors, this can become very large, very quickly. Also, there might be more efficient ways to fill in the missing startup names and investor names. There is a community-created command called carryforward that can also carry forward non-missing values (help carryforward).
      Last edited by David Benson; 20 Oct 2021, 18:02. Reason: NOTE: Edited to fix a mistake in the code

      Comment

      Working...
      X