Error merging master data with using data keeping the merge key as city and statefip

Tariq Abdullah

Join Date: Apr 2021
Posts: 366

Error merging master data with using data keeping the merge key as city and statefip

24 Oct 2022, 21:19

I need to convert my master data to county level. For that in my master data I have statefip and city name. In my using data I have city , statefip , county_fip and county name.

The issue arises, since plenty of city in different states have multiple county code/county name as I'm showing a part of duplicates ( from my using data ) data with same city and state but different county name /couty_fip code.

Code:

list city state_id county_name, sepby(city state_id) noobs

  +---------------------------------------+
  |        city   state_id    county_name |
  |---------------------------------------|
  |      Midway         FL        Gadsden |
  |      Midway         FL     Santa Rosa |
  |      Midway         FL       Seminole |
  |---------------------------------------|
  |    Woodbury         NY         Nassau |
  |    Woodbury         NY         Orange |
  |---------------------------------------|
  |     Oakwood         OH       Cuyahoga |
  |     Oakwood         OH     Montgomery |
  |     Oakwood         OH       Paulding |
  |---------------------------------------|
  |    Franklin         PA        Cambria |
  |    Franklin         PA        Venango |
  |---------------------------------------|
  |  Georgetown         PA         Beaver |
  |  Georgetown         PA      Lancaster |
  |  Georgetown         PA        Luzerne |
  |---------------------------------------|

This is how my master data looks like :

dataex id city statefip if city=="Georgetown" & statefip=="PA"

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str14 id str43 city str3 statefip
"11311793" "Georgetown" "PA"
"12054868" "Georgetown" "PA"
"13761470" "Georgetown" "PA"
"63145796" "Georgetown" "PA"
"63058645" "Georgetown" "PA"
"11415903" "Georgetown" "PA"
"62891646" "Georgetown" "PA"
"10933939" "Georgetown" "PA"
"17385483" "Georgetown" "PA"
"10925795" "Georgetown" "PA"
end

This is how my using data looks like

dataex city statefip county_fips county_name if city== "Georgetown" & statefip=="PA"

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str35 city str2 statefip long county_fips str21 county_name
"Georgetown" "PA" 42007 "Beaver"  
"Georgetown" "PA" 42071 "Lancaster"
"Georgetown" "PA" 42079 "Luzerne"  
end

When I'm using m:1

Code:

 
merge m:1 city statefip using "using.dta", keep(master match) nogen
variables city statefip do not uniquely identify observations in the using data
r(459);

end of do-file

r(459);

Whenever I'm using 1:m the following error message shows up

Code:

merge 1:m city statefip using "using.dta", keep(master match) nogen
variables city statefip do not uniquely identify observations in the master data
r(459);

end of do-file

r(459);

What direction should I go to given this particular problem ?

Last edited by Tariq Abdullah; 24 Oct 2022, 21:23.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#2

25 Oct 2022, 13:30

If I understand correctly what you are trying to do, it is simply impossible. The data sets simply do not share any common variable(s) that form a unique identifier in either data set. In terms specific to your problem, your master data has multiple observations with the same city name and state fip. These, presumably, are cities with the same name in different counties. But the master data set has nothing that identifies which of these observations corresponds to which county. The using data set has that information, but it does not have the variable id that appears in the master data and that would be needed to link these properly. There is no way for Stata (or, for that matter, a person looking at the data) to know which observation for Georgetown PA in the master data should be linked to which observation for Georgetown PA in the using data.

What you can do is create a much larger data set in which every observation for Georgetown PA in the master data is linked to every observation for Georgetown PA in the using data. But that probably will not be useful. So what I would do is carefully review the documentation for the master data set. (If you don't have documentation for it, do contact whoever provided that data set for it.) There may be some explanation of the id variable that will let you figure out which county goes with which id, or there may be some other way to figure out the county from other variables in the master data set. But without more information, the task you are attempting cannot be done.
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#3

25 Oct 2022, 15:24

Mr. Schechter,

It's been stumbling me for a while , and this is so so helpful to know that this is the kind of stumbling block which needs something different than the normal procedure.

therefore I went ahead and tried this merge m:m

Code:

merge m:m city statefip using "city_countyfipshort.dta", keep(master match) nogen Result Number of obs ----------------------------------------- Not matched 28,631,789 from master 28,631,789 from using 0 Matched 6,767

I believe the merge m:m is did was matching up all the observations with same cities and statefip from master data to all the same cities , statefip , county code etc from using data.

So, should I drop all the merge data and then try the following command

Code:

merge m:1 city statefip using "using.dta", keep(master match) nogen

Since, I dropped all the observation with same cities ( from merging m:m ) which lead me to multiple counties ,do you think this might work ? After merging then I can save the data.

then I can go ahead and reupload the master data. Do the merge m:m city statefip using "using.dta", keep(master match) nogen - then save the data. This will give me the 6,767 observations like above

After that, I'll append both of this dataset which I got after doing merge m;1 and merge m:m command separately.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#4

25 Oct 2022, 15:48

No, -merge m:m- should never be used. It creates an arbitrary and, in most cases meaningless, association between some observations in the master data and others in the using data. There are almost no real-world circumstances where what it produces is usable, and in your case the results are simply data salad.

Please re-read and carefully consider what I wrote in #2. There is no code wizardry that can solve this problem. The problem lies in the absence of sufficient information in the data sets (in particular in the master data set) to resolve the problem. If there is any solution, it is going to come from finding the information about the county somewhere in the master data set. I think the most likely place for that is in the variable called id, which may somehow encode this. Or perhaps there are other variables in the master data set that you have overlooked from which the county can be deduced. But if the information isn't there, no code, no matter how clever, can create it.
1 like
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#5

25 Oct 2022, 16:00

Ok thanks so much for warning me before diving into this procedure! I appreciate your thoughtful suggestion as usual and will go ahead to find something to merge my data succesfully!
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

25 Oct 2022, 19:17

I'm late to this discussion, but it is a successor to the earlier discussion at

https://www.statalist.org/forums/for...nd-state-names

The point of my post #5 there is the same as the point of Clyde's post #2 here: There is no information in the data you show us that allows one to know, for a given observation identified as "Woodbury NY" whether it is the Woodbury in Orange County or the Woodbury in Nassau County.

If your primary data does not have some other information that allows you to identify the county, you cannot "find something to merge my data succesfully" other than to remove from your using data all the duplicated observations - for example, both observations for Woodbury NY - using the duplicates command with logic similar to mine from the previous discussion, and then merge your data.

The result will have no matches for Woodbury NY but I'm guessing that a dataset constructed such that the cities cannot be correctly identified will likely have other observations - along with the ambiguous ones - that will fail to match to your city data.

And for your benefit, so you never again make the mistake of thinking that merge m:m will solve a problem rather than create worse problems, here's some material from my FQA (Frequent Question Answers) that I post under these circumstances.

The following is copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

m:m merges

m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.

If you are thinking about using merge m:m, it is a near certainty that at least one of the following is true:

1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

4. You actually need to append your datasets rather than merge them.

5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.
2 likes
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#7

26 Oct 2022, 06:50

Thanks for making me understand the problem behind using m:m command - which I wasn't so sure of how effective this is. Now, I know better.

Very humbled again to learn this stuff from both of you , and will try to reevaluate my code and data to come up with an effective solution as per your suggestion !

I appreciate your time !
Comment

Announcement

Error merging master data with using data keeping the merge key as city and statefip

Comment

Comment

Comment

Comment

Comment

Comment