Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do you create unique ids for two datasets with different data points and merge them?

    Hey guys,

    I have two datasets: one with 38,62 data points, and the other 25,000.

    I need to merge the large dataset into the small dataset (1:1).

    Neither data sets have unique ids. However, they have three variables (region, commune, and village) in common. Three variables together make the data point unique, so I would like to create a unique id based on the three variables.

    Now, after I create the unique ids for the smaller dataset, how do I merge these unique ids into the larger dataset, using the 3 key variables?

    In other words, how do I make sure that the data points in my large datasets have the same unique ids asthe data points in the smaller dataset, if they have the same region, commune, and village names?

    In addition, the large dataset has many more data points than the smaller one. I only need a consistent list of unique ids for both datasets, so I can merge the large one into the smaller one.

    Thanks!

    Last edited by Xiao Roland Tan; 08 Sep 2017, 12:00.

  • #2
    You don't need to have a single unique-id variable to do this -merge-. If you are correct in your assertion that the three variables, region, commune, and village are common to both data sets and together uniquely identify the observations in each, then you can just do the -merge- with:

    Code:
    use dataset_4000, clear
    merge 1:1 region commune village using dataset_90000
    Then if you have some other reason for still wanting a unique id variable you can get that with:

    Code:
    gen long unique_id = group(region commune village)
    But if I understand your post, you don't actually need this id once you have merged the data sets. So just forget about it and do the merge.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      You don't need to have a single unique-id variable to do this -merge-. If you are correct in your assertion that the three variables, region, commune, and village are common to both data sets and together uniquely identify the observations in each, then you can just do the -merge- with:

      Code:
      use dataset_4000, clear
      merge 1:1 region commune village using dataset_90000
      Then if you have some other reason for still wanting a unique id variable you can get that with:

      Code:
      gen long unique_id = group(region commune village)
      But if I understand your post, you don't actually need this id once you have merged the data sets. So just forget about it and do the merge.
      Thanks! I didn't see your second part. Thanks!!!
      Last edited by Xiao Roland Tan; 08 Sep 2017, 12:33.

      Comment


      • #4
        The second block of code in #2 shows you how to create a unique identifier after you have merged the data sets. It is the best way to do it, in my opinion.

        If you want to make a unique identifier out of the two data sets before you merge them, then you have to make it some computation from the variables region, commune, and village. Probably the simplest approach is just to concatenate them:

        Code:
        egen unique_id = concat(region commune village), punct("_")
        This will produce a unique identifier that is consistent across the two data sets but can be calculated in each data set separately.

        I don't know your workplace, and perhaps you are not permitted to question your manager, but if I were in this situation I would want to know my manager's reasons for insisting on having a unique_id. If the manager has a good reason, then I will learn something from that. If my manager can't come up with a good reason, then perhaps he/she will change his/her mind. If there is no reason but the manager continues to insist on something that is unnecessary, then I will have learned something about the manager that will be helpful in planning future career moves.

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          You don't need to have a single unique-id variable to do this -merge-. If you are correct in your assertion that the three variables, region, commune, and village are common to both data sets and together uniquely identify the observations in each, then you can just do the -merge- with:

          Code:
          use dataset_4000, clear
          merge 1:1 region commune village using dataset_90000
          Then if you have some other reason for still wanting a unique id variable you can get that with:

          Code:
          gen long unique_id = group(region commune village)
          But if I understand your post, you don't actually need this id once you have merged the data sets. So just forget about it and do the merge.
          Hey I tried to create unique ids using your code, but it returns "group not found?" Should I generate a new var called group?

          Nvm. It should be egen. problem solved. thank you for your help; you really saved me
          Last edited by Xiao Roland Tan; 08 Sep 2017, 13:06.

          Comment


          • #6
            Yes, sorry about that typo. -group()- is an -egen- function.

            Comment

            Working...
            X