I know that the Stata user guide recommends not using the merge m:m command, but believe it may be the best option in this scenario.
I have two datasets of different sizes (DatasetA contains 1,000,000+ observations; DatasetB contains 500,000+ observations).
I'd like to join the two datasets by the merge command but run into problems here. The column common to both datasets is a uniqueID assigned to individuals. However, in each dataset, it is possible that this uniqueID appears once, more than once, or not at all. Take the following example of four individuals:
Dataset1
Dataset2
I would like to merge the two datasets. merge 1:1 is not appropriate since UniqueID does not uniquely identify single observations in both datasets. Neither are merge 1:m or merge m:1 appropriate since in neither dataset do the UniqueID's identify single observations.
I'd assumed therefore that the only appropriate merge was merge m:m. However, this causes additional problems as it adds observations to the merged dataset.
I understand that this process entrains a whole new class of problems with the dataset, but do not see a better way to merge. If anyone has any suggestions or can provide a better understanding of where I'm going wrong please let me know.
Thank you.
I have two datasets of different sizes (DatasetA contains 1,000,000+ observations; DatasetB contains 500,000+ observations).
I'd like to join the two datasets by the merge command but run into problems here. The column common to both datasets is a uniqueID assigned to individuals. However, in each dataset, it is possible that this uniqueID appears once, more than once, or not at all. Take the following example of four individuals:
Dataset1
UniqueID | Variable1 |
1 | x |
1 | y |
2 | z |
2 | h |
2 | j |
3 | k |
4 | m |
Dataset2
UniqueID | Variable2 |
1 | a |
2 | b |
4 | c |
2 | d |
I would like to merge the two datasets. merge 1:1 is not appropriate since UniqueID does not uniquely identify single observations in both datasets. Neither are merge 1:m or merge m:1 appropriate since in neither dataset do the UniqueID's identify single observations.
I'd assumed therefore that the only appropriate merge was merge m:m. However, this causes additional problems as it adds observations to the merged dataset.
I understand that this process entrains a whole new class of problems with the dataset, but do not see a better way to merge. If anyone has any suggestions or can provide a better understanding of where I'm going wrong please let me know.
Thank you.
Comment