Merging Unbalanced Firm Expenditure Datasets

Packard Otto

Join Date: Jun 2022

Posts: 4
#1

Merging Unbalanced Firm Expenditure Datasets

04 Dec 2022, 11:28

Hello everyone. I found firm expenditure data that I was attempting to merge into one master document, but I am having issues along the way. There are multiple documents containing order numbers "PONumber", but the a few issues I have which I am unable to circumvent. The datasets are unbalanced, there are repeating time values from a firm making multiple purchases within the same month in dataset 1, and while dataset 2 has no time determinant. I cut the code short but the months run across a three year period.

Dataset 1:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long PONumber str7 Month 6939958 "2014-01" 7012089 "2014-01" 7015553 "2014-01" 7031109 "2014-01" 7036141 "2014-01" 7039713 "2014-01" 7051564 "2014-01" 7051564 "2014-01" 7054862 "2014-01" 7054955 "2014-01" 7054955 "2014-01"

Dataset 2

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long(PONumber MaterialNo) 6939958 408957 7012089 408957 7015553 408957 7031109 824624 7036141 754692 7039713 754692 7051564 824624 7051564 824624 7054862 408957 7054955 754692 7054955 754692

Every order number in dataset 1 is in dataset 2, but when I attempt to merge m:1, m:m, and 1:1 by the order number I continue to receive "variable PONumber does not uniquely identify observations in the master data". Is there a way I can pair the corresponding Material Number from dataset 2 into dataset 1's orders?

Last edited by Packard Otto; 04 Dec 2022, 11:34.
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

04 Dec 2022, 13:29

In dataset 1 we see

Code:

7054955 "2014-01" 7054955 "2014-01"

and in dataset 2 we see

Code:

7054955 754692 7054955 754692

Why does dataset 2 have repeated observations for the same value of PONumber? In what you show, duplicated PONumber observations in dataset 2 have the same MaterialNo. Is that always the case?

In brief, you seem to have two choices:
1) remove duplicates from dataset 2 and then with dataset 1 as the primary dataset and dataset 2 as the using dataset, merge m:1.
2) use the joinby command

Here's the detailed lecture on merging, since you say you attempted to merge m:m (which however would not have given you the error message that the other options gave).

The following is copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

m:m merges

m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.

If you are thinking about using merge m:m, it is a near certainty that at least one of the following is true:

1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

4. You actually need to append your datasets rather than merge them.

5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.
Comment

Announcement

Merging Unbalanced Firm Expenditure Datasets

Comment