Checking one variable with another variable from different dataset

Isidora Vergara

Join Date: May 2019

Posts: 18
#1

24 Oct 2019, 09:01

Hello,

I have a large dataset, with approximately 2,800,000 observations and 10 variables, and I would like to check whether the values of one string variable appear among the values of a string variable in another dataset.

The idea is to check whether the first names in a list of class participants are actually first names and not last names, by comparing them with a list of actual first names (120,000 observations).

So, this is an abbreviated version of the list of class participants:

Dataset: list of participants.
obs  first_name  last_name
1    john        cohen
2    arthur      williams
3    fox         rachel
4    robert      foster

This is an abbreviated version of the list of names:

Dataset: list of names.
obs  first_name
1    lane
2    david
3    arthur
4    robert
5    rachel
6    john
7    lucy

And this is an example of the result I would like to obtain:

obs  first_name  last_name  first_name_control
1    john        cohen      1
2    arthur      williams   1
3    fox         rachel     0
4    robert      foster     1

I do not have any "wrong results" since I do not know how to proceed, but I would really appreciate your help.

In case this information is important, I am currently using Stata 14.

I hope I have fulfilled all the Statalist forum discussion recommendations, thanks in advance,
Isidora.
Igor Paploski

Join Date: Oct 2014

Posts: 174
#2

24 Oct 2019, 10:43

Hi Isidora, if I understood correctly, you want to see whether the first names that exist in your list of participants are valid, by checking whether they appear in your second dataset, the list of first names.

What I would do is simply merge the list of first names onto the list of participants with merge m:1, using first_name as the key. The m:1 means that one entry in the using dataset (the list of first names) may match many entries in the master dataset (the list of participants). After the merge, the _merge variable tells you what happened. Matched observations (_merge == 3) are participants whose first name appears on your list of first names. Observations found only in the master data (_merge == 1) are participants whose first name is not on the list, which could be because the name was written LAST FIRST instead of FIRST LAST, the case you want to detect. Observations found only in the using data (_merge == 2) are valid first names that simply don't appear among your participants, and you can drop them.

One thing to watch for is capitalization, since Stata is case-sensitive (meaning that John and john will not match). You might want to convert all names in both files to lowercase or uppercase to avoid this kind of error.

If you need help with merge, type help merge. The using dataset needs to be in dta format.
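The steps above could look roughly like the sketch below. It assumes the two files are saved as participants.dta and names.dta (hypothetical file names) and that the name variable is called first_name in both; adjust these to your own data.

```stata
* Sketch only: participants.dta and names.dta are assumed file names.

* 1. Clean the list of valid first names and make it unique.
use names, clear
replace first_name = lower(trim(first_name))   // same case, no stray blanks
keep first_name
duplicates drop first_name, force              // m:1 needs one row per name
tempfile names_clean
save `names_clean'

* 2. Clean the participants' first names the same way.
use participants, clear
replace first_name = lower(trim(first_name))

* 3. Merge; keep every participant and discard unused names (_merge == 2).
merge m:1 first_name using `names_clean', keep(master match)

* 4. Flag participants whose first name is on the list.
generate byte first_name_control = (_merge == 3)
drop _merge
```

Note that merge may change the sort order of the participants, so if the original order matters, create an identifier (for example with generate long obs = _n) before merging and sort on it afterwards.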
Isidora Vergara

Join Date: May 2019

Posts: 18
#3

24 Oct 2019, 12:38

Thanks for answering me Igor!

I hesitated because I thought that in this case I would need to use merge m:m, which I always try to avoid, but I think it worked. Thank you!
Igor Paploski

Join Date: Oct 2014

Posts: 174
#4

24 Oct 2019, 12:40

You should not need to use m:m. Your second dataset (the list of possible first names) should not contain repeated observations (names), meaning that "John" should appear on one and only one line of this list. If that holds, first_name uniquely identifies the observations in the using dataset, and m:1 is the right merge.
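One way to check this before merging is sketched below, again assuming the list of names is saved as names.dta (a hypothetical file name) with the variable first_name.

```stata
* Sketch only: names.dta and names_unique.dta are assumed file names.
use names, clear

* Count how many names appear more than once.
duplicates report first_name

* Remove the surplus copies, keeping one row per name.
duplicates drop first_name, force

* isid exits with an error if first_name still has duplicates.
isid first_name

save names_unique, replace
```

If isid runs without complaint, merge m:1 first_name using names_unique will be accepted by Stata.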
Isidora Vergara

Join Date: May 2019

Posts: 18
#5

24 Oct 2019, 13:13

You were right, thank you