  • Checking one variable with another variable from different dataset

    Hello,

    I have a large dataset, with approximately 2,800,000 observations and 10 variables, and I would like to check whether the values of one string variable appear among the values of a string variable in another dataset.

    The idea is to check whether the first names in a list of class participants are actually first names and not last names, by comparing them against a list of actual first names (120,000 observations).

    So, this is an abbreviated version of the list of class participants:

    Dataset: list of participants.
    obs  first_name  last_name
    1    john        cohen
    2    arthur      williams
    3    fox         rachel
    4    robert      foster

    This is an abbreviated version of the list of names:

    Dataset: list of names.
    obs  first_name
    1    lane
    2    david
    3    arthur
    4    robert
    5    rachel
    6    john
    7    lucy

    And this is an example of the result I would like to obtain:
    obs  first_name  last_name  first_name_control
    1    john        cohen      1
    2    arthur      williams   1
    3    fox         rachel     0
    4    robert      foster     1

    I do not have any "wrong results" since I do not know how to proceed, but I would really appreciate your help.

    In case this information is important, I am currently using Stata 14.


    I hope I have fulfilled all the Statalist forum discussion recommendations, thanks in advance,
    Isidora.







  • #2
    Hi Isidora, if I understood correctly, you want to see whether the first names that exist in your list of participants are valid, by checking whether they appear in your second dataset, the list of first names.

    What I would do is simply merge the second dataset into the first one using merge m:1 on first_name, with the list of participants as the master dataset. m:1 allows each entry in the second dataset (list of first names) to match more than one entry in the first (list of participants). A participant that matches (_merge == 3) has a first name that appears in your list of first names. A participant that does not match (_merge == 1) has a first name that is not in your list, which could be because the name was written LAST FIRST instead of FIRST LAST, the case you want to detect. You will also get unmatched entries from the list of first names (_merge == 2): these are valid first names that simply do not appear in your list of participants, and they can be dropped.
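
    As a minimal sketch of the merge described above: the file names participants.dta and names.dta are assumptions for illustration, not names given in the thread, and the sketch assumes first_name is spelled and cased consistently in both files and appears only once in names.dta.

    ```stata
    * Hypothetical file names: participants.dta (master), names.dta (using)
    use participants, clear

    * m:1 = many participants may share one first name in the name list.
    * keep(master match) discards names that no participant has (_merge == 2).
    merge m:1 first_name using names, keep(master match)

    * 1 if the first name was found in the name list, 0 otherwise
    generate byte first_name_control = (_merge == 3)
    drop _merge

    list first_name last_name first_name_control in 1/4
    ```

    With the abbreviated data from post #1, this should flag "fox" with 0 and the other three participants with 1.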

    One thing to watch for is capitalization, since Stata is case-sensitive (meaning that John and john will not match). You might want to convert all names in both files to lowercase or uppercase to avoid this kind of error.
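
    One possible way to normalize the names before merging is sketched below. The file names are again hypothetical, and trimming blanks is an extra precaution that was not mentioned in the thread. strlower() only changes ASCII letters; for names with accented characters, ustrlower() may be the better choice.

    ```stata
    * Hypothetical file names; the same two steps are applied to each dataset
    use names, clear
    replace first_name = strlower(strtrim(first_name))
    save names_clean, replace

    use participants, clear
    replace first_name = strlower(strtrim(first_name))
    save participants_clean, replace
    ```

    Note that lowercasing can create duplicates in the name list (for example "John" and "john" become the same value), so it is worth checking for repeated names afterwards.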

    If you need help with merge, type help merge. The using dataset needs to be in dta format.



    • #3
      Thanks for answering me Igor!

      I hesitated because I thought that in this case I would need to use merge m:m, which I always try to avoid, but I think it worked. Thank you!



      • #4
        You should not need to use m:m. Your second dataset (list of possible first names) should not contain repeated observations (names), meaning that "John" should appear on one and only one line of this list.
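
        If the list of names does contain repeats, a sketch like the following could be used to reduce it to one line per name before the m:1 merge. The file names names.dta and names_unique.dta are assumptions for illustration.

        ```stata
        * Hypothetical file names
        use names, clear

        * Show how many names occur more than once
        duplicates report first_name

        * Keep one observation per name (force is required when a varlist is given)
        duplicates drop first_name, force

        * Stops with an error if first_name still does not uniquely identify observations
        isid first_name

        save names_unique, replace
        ```

        After this, merge m:1 first_name using names_unique should run without complaining that first_name does not uniquely identify observations in the using data.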



        • #5
          You were right, thank you
