Lost in the proces of matching data

Ali Joser

Join Date: Mar 2021
Posts: 14

Lost in the proces of matching data

01 Mar 2021, 02:25

Hello everyone,

I have learned a lot on the Statalist forum but I am not running into something that I could use your expertise on. I am doing a study of which one part consists of the structure of boards. For this, I am using BoardEx. From BoardEx, I have exported three files, which together contain the information I need. I will be talking about the three files below. The main goal of what I am trying to achieve is to calculate the ratio of female directors in a given firm-year. I want to do this by dividing the number of females in a given firm-year by the total board size.

In the first file, the genders of the board members are stated per DirectorID. Here is a short preview.

Code:

input str21 Title str30 Forename1 str11 DOB str3 Age str1 Gender str36 Nationality long DirectorID int NetworkSize
"Admiral"               "Anna"             "Jan 1972"    "47" "F" ""           1530600    68
"Director"               "Aziz"       "1966"        "54" "M" ""           1356558   245
"Admiral"               "Bobby"            "04 Apr 1931" "89" "M" "American"     33004  4323
"Chairman"               "John"          "10 Jul 1953" "66" "M" "French"     1381271   120
end

In the second file, the education per DirectorID is stated. I will use this later on as a control variable.

Code:

input str64 DirectorName str128 CompanyName str85 Qualification long(DirectorID CompanyID) int AwardDate
"Doctor Christopher Albrecht"        "Universitat Basel (University of Basel)"                                                                                     "Graduated"                        39   62183     .
"Doctor Christopher Albrecht"        "Universitat Basel (University of Basel)"                                                                                     "PhD"                              39   62183  1461
"John Loudon"                        "Yale University"                                                                                                             "BA"                               52   62981 -1095
"John Loudon"                        "Université Paris Sorbonne - Paris IV (Paris Sorbonne University - Paris IV)"                                                "Graduated"                        52   63794     .
end
format %tdnn/dd/CCYY AwardDate

I have successfully managed to merge these two files. However, now comes the part where I am lost. The last export from BoardEx contains the board size per company per year.

Code:

input long(BoardID DirectorID) str4 Year str12 ISIN byte NumberDirectors
1990357 31 "2016" "JE00BD3QJR55"  3
1990357 31 "2015" "JE00BD3QJR55"  3
1990357 31 "2014" "JE00BD3QJR55"  3
1990357 31 "2017" "JE00BD3QJR55"  3
1990357 31 "2019" "JE00BD3QJR55"  4
1990357 31 "2018" "JE00BD3QJR55"  3
  17834 36 "2002" "IE0004906560" 18
  17834 36 "2006" "IE0004906560" 19
  17834 36 "2004" "IE0004906560" 19
  17834 36 "2011" "IE0004906560" 15
  17834 36 "2005" "IE0004906560" 19
  17834 36 "2010" "IE0004906560" 15
  17834 36 "2007" "IE0004906560" 18
  17834 36 "2003" "IE0004906560" 19
  17834 36 "2008" "IE0004906560" 15
  17834 36 "2009" "IE0004906560" 15
  32910 37 "2002" "DE0007664005" 28
  32910 37 "2002" "DE0007664039" 28
   3447 39 "2002" "CH0012410517" 12
  20144 42 "2003" "IT0000062957" 23
   8678 42 "2003" "FR0000120644" 13
  15520 42 "2002" "IT0001353173" 15
end

When I am trying to merge the gender and education file with this file, I get the message

Code:

variable DirectorID does not uniquely identify observations in the master data

. I get that this is because the DirectorID's in the last file is not unique but repeating, but that shouldn't matter if I want to have the gender in it.

So, my question is: how can I realize this? And how can I then calculate the female ratio (number of females/number of directors)?

Thank you for your time and efforts. If something is not clear please let me know. If I can in any way improve my future posts I would gladly like to hear that as well.

Last edited by Ali Joser; 01 Mar 2021, 02:26. Reason: Deleted unrelevant information.

Tags: None

Iuri Gavronski

Join Date: Jun 2018

Posts: 49
#2

01 Mar 2021, 03:52

Hi Ali,

Could you please post the command you used to merge your data in the third file?

It seems to be a problem with the specification of the "1:m" or "m:m" options but I wanted to be sure.

Best,

Iuri.
Comment
Ali Joser

Join Date: Mar 2021

Posts: 14
#3

01 Mar 2021, 04:02

Hello Luri,

Thank you for getting back to me. For the merging between the first and second file (gender and education, respectively), I used the

Code:

merge 1:m using education.dta

, since the gender file had less observations. This was successful since in both files DirectorID only appears once.

Matching with the file that includes the size of the board per respective year and DirectorID, I used both

Code:

merge 1:m using merged.dta

and

Code:

merge m:1 using merged.dta

. Both throw me

Code:

variable DirectorID does not uniquely identify observations in the master data

The information that this file contains is about boards over many years. Since a DirectorID can be on multiple boards or on one board but multiple years, I think I am getting this error. Have I helped you with my comment?

Thank you for your time.

Best regards,
Ali
Comment
Iuri Gavronski

Join Date: Jun 2018

Posts: 49
#4

01 Mar 2021, 04:20

Have you tried?

Code:

merge m:m DirectorID using merged.data, keep(match)
Comment
Ali Joser

Join Date: Mar 2021

Posts: 14
#5

01 Mar 2021, 05:36

Wow, that seems to work! Thank you very much, Luri. I was actually avoiding the m:m command since I read everywhere that it should very rarely be used.

Do you maybe also know how to realize my second question in my opening post? In my final file (the one that has everything merged), I have the gender per DirectorID and which boards they attended in what years. I also have the variable NumberDirectors, which shows the total board size in a given firm-year. How can divide the amount of 'F' (females) by the total board size for each firm-year? I have tried many combinations of plugging the variables with egen and bysort, but it seems like I am doing it completely wrong.
Comment
Iuri Gavronski

Join Date: Jun 2018

Posts: 49
#6

01 Mar 2021, 07:20

Do you mind pasting part of your merged file?
Comment

Ali Joser

Join Date: Mar 2021
Posts: 14

01 Mar 2021, 07:31

Sure thing, thank you. Here is a part of the merged file:

Code:

input long(DirectorID BoardID) str1 Gender byte NumberDirectors long CompanyID str12 ISIN str4 Year
 797026 32430 "M" 9      . "AN9225711045" "2008"
 797030 32430 "M" 9      . "AN9225711045" "2008"
 797036 32430 "F" 9      . "AN9225711045" "2008"
 797010 32430 "M" 9      . "AN9225711045" "2008"
 796988 32430 "M" 9      . "AN9225711045" "2008"
 797004 32430 "M" 9 602911 "AN9225711045" "2008"
1139245 70756 "F" 8  38777 "ANN2879J1070" "2010"
1139247 70756 "M" 8  64366 "ANN2879J1070" "2010"
1139247 70756 "M" 8  64366 "ANN2879J1070" "2011"
1139245 70756 "F" 8  38777 "ANN2879J1070" "2011"
end

Thank you for your time.

Comment

Iuri Gavronski

Join Date: Jun 2018

Posts: 49
#8

01 Mar 2021, 07:47

Is that what you are looking for?

Code:

bys BoardID Gender : egen thisgender = count(1) gen double pct_gen =thisgender / NumberDirectors
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

01 Mar 2021, 08:33

Iuri Gavronski -

I regret to say that the advice to use merge m:m in your post #4 almost certainly produced unhelpful results.

The following is copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.
m:m merges

m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.
So with that advice from the authors of Stata in mind, if one is thinking about using merge m:m, it is a near certainty that at least one of the following is true:

1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

4. You actually need to append your datasets rather than merge them.

5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.

Last edited by William Lisowski; 01 Mar 2021, 08:49.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

01 Mar 2021, 08:47

Ali Joser

In your post #3 the merge commands you display do not include the variables you used as merge keys. It is difficult to advise on what you might need to change in your merges to get the results you need.

Perhaps the following would have done what you needed.

Code:

use gender merge m:1 DirectorID using education joinby DirectorID using board

Alternatively, perhaps you could begin by using reshape or collapse to create a new_education dataset by reducing your education dataset to one observation per DirectorID summarizing the director's education experience, after which

Code:

use gender merge 1:1 DirectorID using new_education merge 1:m DirectorID using board

would do what you need.
1 like
Comment
Ali Joser

Join Date: Mar 2021

Posts: 14
#11

01 Mar 2021, 10:06

William Lisowski

Thank you for the information and the commands. I took a closer look at my data from the previous m:m merge, and I indeed saw some weird things. Had to look better before commenting that it worked.

Your joinby command, on the other hand, worked perfectly. After doing some manual checks I see no mistakes at all. Thank you very much for this.

Do you, by any chance, know how I can calculate the ratio of female directors per firm-year? The command provided above unfortunately does not provide the correct results for some reason.

I have created the total board size myself now, by using

Code:

bysort ISIN Year : gen boardsize = _N

. How can I now count the amount of 'F' per year and firm (by ISIN) and divide it by my new variable 'boardsize'.

Thank you for your time and efforts.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#12

01 Mar 2021, 10:43

You have a difficulty in your final dataset, because for example DirectorID 39 from your board dataset

Code:

  +--------------------------------------------------------------+
  | BoardID   DirectorID   Year           ISIN   NumberDirectors |
  |--------------------------------------------------------------|
  |    3447           39   2002   CH0012410517                12 |
  +--------------------------------------------------------------+

will appear twice in each year for BoardID 3447 because they have two observations in the education dataset

Code:

  +----------------------------------------------------------------------+
  | DirectorID                               CompanyName   Qualification |
  |----------------------------------------------------------------------|
  |         39   Universitat Basel (University of Basel)       Graduated |
  |         39   Universitat Basel (University of Basel)             PhD |
  +----------------------------------------------------------------------+

So director 39 will add 2 to the boardsize variable you calculated in post #1. Should you not instead use NumberDirectors that you already have?

To count the number of "F" by ISIN and year, you will need to ensure that each board member is counted only once. The egen command's tag() function will help with that.

Code:

egen tocount = tag(ISIN Year DirectorID)
bysort ISIN Year: egen NF = total(Gender=="F" & tocount==1)

Comment

Ali Joser

Join Date: Mar 2021

Posts: 14
#13

01 Mar 2021, 11:15

Dear William,

Thank you very much for your insight. I would've missed the double observation otherwise. I used my

I used my own boardsize command because NumberDirectors reported different results for the same firm-year. Starting over and doing the merging and joinby again, however, they seem correct. Maybe I saw different values when I merged m:m. They seem correct now, so I will be using NumberDirectors.

For counting the number of "F" by ISIN and Year, your command works perfectly. Thank you very much. To calculate the ratio, I am using the following command and it seems to report the correct ratio. However, is this a good practice? The commands I see reported online to calculate percentages with Stata are way more complex, so I do not know if I am overlooking something.

Code:

generate femaleratio = NF/NumberDirectors

Thank you for your efforts.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#14

01 Mar 2021, 13:26

I used my own boardsize command because NumberDirectors reported different results for the same firm-year.

The lesson for the future is, if your data - in this case, NumberDirectors - looks wrong after you've done something, there are two possibilities: the data needs correcting, or what you did needs correcting. For me, I am always humbled by the number and variety of ways I can do something wrong or misunderstand what I am seeing: orders of magnitude more frequent than having incorrect input from an authoritative source.

The command you show in post #13 is exactly what I expected would be used.

Added complexity in other examples may be from trying to do it all in a single command, perhaps, or may be the result of a more complex data structure than you have.

Last edited by William Lisowski; 01 Mar 2021, 13:29.
1 like
Comment
Ali Joser

Join Date: Mar 2021

Posts: 14
#15

02 Mar 2021, 01:49

William Lisowski

That is a very good lesson and one that I will keep in mind. I have accomplished what I wanted thanks to your help, and I now know how to do similar tasks in the future. Thank you for your time and efforts to show me how these type of tasks are done.
Comment

Announcement

Lost in the proces of matching data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment