How to mix the obs of two variables?

Charlotte Tomba

Join Date: Oct 2021

Posts: 2
#1

How to mix the obs of two variables?

19 Oct 2021, 01:46

Hello everyone,

This is my database.
Each org_uuid is matched with a specific investor_uuid. This are real matches.
Then each org_uuid has its own employee and each investor_uuid has its own partners.

I want to build a counterfactual sample of all matches that did not happen. To do this I'm trying to mix the org_uuid and the investor_uuid, i.e, for the first org_uuid i want to match all the investor_uuid; then for the second org_uuid i want to match all the investor_uuid, and so on.
After that i want to rematch to each org_uuid, its employees and to each investors_uuid all its partners.

Each counterfactual march should be marked with 0, while each real match with 1, so that they can be recognised.

If anyone can help me, it would be really appreciated!

Thank you in advance!

Charlotte
Tags: None
Matthew Lala

Join Date: Aug 2018

Posts: 37
#2

19 Oct 2021, 06:02

The structure of your dataset is not very clear to me, and consequently I don't understand your objective very well, but it sounds like

Code:

help joinby

could be useful for you to create all of the possible counterfactual matches, with you then merging the "true" matches back into the dataset created by joinyb to mark the "real" matches.

Be aware that depending on your situation joinby could create an extremely large dataset which you will struggle to work with. You may want to consider:
1) trying on a smaller subset of the data first
2) for the main dataset, making it smaller by dropping extraneous variables and using encode to convert these string variables into numeric variables, which typically take up much less space.
1 like
Comment

David Benson

Join Date: Oct 2018
Posts: 489

20 Oct 2021, 17:41

Charlotte, see if this helps you move forward. Also, in the future it would be helpful if you could share your data via the dataex command. Since many posters are not familiar with dataex (it was created specifically to be able to share data here at statalist), I created a short Youtube video explaining it here.

I think the command you want to use is fillin (which adds all possible combinations). I created some toy data of 5 startups with some different investors to help make it concrete:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte startup_id str8 startup_name int investor_id str15 investor_name
1 "Google"   101 "Sequoia "      
2 "Facebook" 101 "Sequoia "      
3 "Uber"     102 "Accel Partners"
4 "Airbrb"   103 "Kleiner Perkins"
5 "Dropbox"  102 "Accel Partners"
end


list, noobs abbrev(12)
  +-----------------------------------------------------------+
  | startup_id   startup_name   investor_id     investor_name |
  |-----------------------------------------------------------|
  |          1         Google           101          Sequoia  |
  |          2       Facebook           101          Sequoia  |
  |          3           Uber           102    Accel Partners |
  |          4         Airbrb           103   Kleiner Perkins |
  |          5        Dropbox           102    Accel Partners |
  +-----------------------------------------------------------+

fillin startup_id investor_id
list, noobs abbrev(12) sepby( startup_id)
/* Note that fillin creates a variable _fillin that=0 for the originals, _fillin==1 for new observations that it creates  */

  +---------------------------------------------------------------------+
  | startup_id   startup_name   investor_id     investor_name   _fillin |
  |---------------------------------------------------------------------|
  |          1         Google           101          Sequoia          0 |
  |          1                          102                           1 |
  |          1                          103                           1 |
  |---------------------------------------------------------------------|
  |          2       Facebook           101          Sequoia          0 |
  |          2                          102                           1 |
  |          2                          103                           1 |
  |---------------------------------------------------------------------|
  |          3                          101                           1 |
  |          3           Uber           102    Accel Partners         0 |
  |          3                          103                           1 |
  |---------------------------------------------------------------------|
  |          4                          101                           1 |
  |          4                          102                           1 |
  |          4         Airbrb           103   Kleiner Perkins         0 |
  |---------------------------------------------------------------------|
  |          5                          101                           1 |
  |          5        Dropbox           102    Accel Partners         0 |
  |          5                          103                           1 |
  +---------------------------------------------------------------------+

Code:

gen byte real_match = (_fillin==0)   // real_match==1 if the investment really took place, 0 otherwise
gsort startup_id -startup_name  // sorting such that the observation with the startup name is in the first position
by startup_id: replace startup_name=startup_name[1] if startup_name==""    //  carrying forward the startup_name to fill in the blanks

gsort investor_id  -investor_name
by investor_id:  replace investor_name=investor_name[1] if investor_name==""  // carrying forward the investor_name to fill in blanks

sort startup_id _fillin  // sorting by startup, this puts the real matches that took place in the first position
list, noobs abbrev(12) sepby( startup_id)

  +----------------------------------------------------------------------------------+
  | startup_id   startup_name   investor_id     investor_name   _fillin   real_match |
  |----------------------------------------------------------------------------------|
  |          1         Google           101          Sequoia          0            1 |
  |          1         Google           102    Accel Partners         1            0 |
  |          1         Google           103   Kleiner Perkins         1            0 |
  |----------------------------------------------------------------------------------|
  |          2       Facebook           101          Sequoia          0            1 |
  |          2       Facebook           103   Kleiner Perkins         1            0 |
  |          2       Facebook           102    Accel Partners         1            0 |
  |----------------------------------------------------------------------------------|
  |          3           Uber           102    Accel Partners         0            1 |
  |          3           Uber           101          Sequoia          1            0 |
  |          3           Uber           103   Kleiner Perkins         1            0 |
  |----------------------------------------------------------------------------------|
  |          4         Airbrb           103   Kleiner Perkins         0            1 |
  |          4         Airbrb           102    Accel Partners         1            0 |
  |          4         Airbrb           101          Sequoia          1            0 |
  |----------------------------------------------------------------------------------|
  |          5        Dropbox           102    Accel Partners         0            1 |
  |          5        Dropbox           101          Sequoia          1            0 |
  |          5        Dropbox           103   Kleiner Perkins         1            0 |
  +----------------------------------------------------------------------------------+

A couple of notes. First, if you have lots of firms and lots of investors, this can become very large, very quickly. Also, there might be more efficient ways to fill in the missing startup names and investor names. There is a community-created command called carryforward that can also carry forward non-missing values (help carryforward).

Last edited by David Benson; 20 Oct 2021, 18:02. Reason: NOTE: Edited to fix a mistake in the code

Announcement

How to mix the obs of two variables?

Comment

Comment