Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Issue in Merging Two Data Sets

    Greetings Everyone,

    Hopefully, you all will be fine, I am merging two datasets, under 1:m to many key variables. But I have an issue in merging "variable symbol (companies) does not uniquely identify in the master data

    Regards

  • #2
    try joinby

    you may have duplicates.

    HTML Code:
    https://www.statalist.org/forums/forum/general-stata-discussion/general/87327-variable-date-does-not-uniquely-identify-observations-in-the-master-data-stata/page2

    Comment


    • #3
      There is no "may" about it. You do have duplicates in your master data set. I have never known Stata to be wrong when it says you have duplicates in this context. Using -joinby- instead of -merge- will join the two data sets together, but in a way that may differ from what you have in mind. Anyway, what -joinby- will do is take every observation in the master data set that has a particular set of values of the key variables and pair each one of them up with every observation in the using data set that has the same values of the key variables. If that's what you want, then the advice in #2 is correct.

      But that usually isn't what people want in this situation. Usually they really do expect there to be only one observation for each combination of values of the key variables in the master data set and they really do want a 1:m merge. So the question becomes: why do the key variables not uniquely identify observations. Why are there surplus observations, and where did they come from? There are several possible answers and you need to figure out which applies to your data set.
      1. The list of key variables is incomplete. For example, perhaps your key variables consist of a firm and a date, but there are multiple observations for the same firm and date, each reflecting a subdivision of the firm. In that case, the subdivision variable needs to be added to the -merge- key.
      2. The list of key variables is complete, but there are nevertheless duplicate observations for the keys. So you need to find those duplicates and gain an understanding of how they got there. Start with -duplicates tag key_variables, gen(flag)- and -browse if flag-. Now you will see the offending observations in front of your eyes. There are several possibilities that you can distinguish.
        • The duplicates are pure duplicates, that is, they agree on all variables in the data set. In that case, you can eliminate the pure duplicates with the -duplicates drop- command and no information will be lost.
        • The duplicates sometimes disagree on other variables in the data set. This is a serious problem because it means you have inconsistent data, data that contradicts itself. You then need to figure out how this came about and fix it. This requires a detailed review of the data management that created this data set, all the way back to its sources, to identify where the conflicting observations came from and why they were not somehow reconciled along the way. In the end, you need to figure out which, if any, of the observations is correct and keep only that. Or perhaps none are correct and you need to get rid of them all and perhaps create an entirely new observation with the correct values. Be prepared, by the way, for the possibility that in this process you may discover other errors made in creating the data set--and you should correct those too while you are there.
        • Sometimes data sets have a bunch of "empty" observations (observations where every variable has a missing value). This is particularly common with data that is originally imported from spreadsheets. It is reasonable to just delete such observations, provided you are confident that they don't represent some coding error in the data management that inadvertently wiped out useful information from observations that previously had it.

      Comment

      Working...
      X