Variables do not uniquely identify observations in the master data

Anthony Gonzalez

Join Date: Oct 2022

Posts: 9
#1

Variables do not uniquely identify observations in the master data

31 Oct 2022, 16:17

Hello,

I keep getting the error "variables providerid year do not uniquely identify observations in the master data". However, I am not sure how else to uniquely identify the master data as I have tried adding more variables to it.

Here is my code:

*Identify 2014 General Acute Hospitals (GAH)
import delimited "/Users/markflores/Desktop/POS files/HOSArchive_Revised_Flatfiles_20141023/Hospital General Information.csv", stringcols(1) clear

gen temp=substr(providerid,3,4)
gen genhospital=0
replace genhospital=1 if real(temp)>=0001 & real(temp)<=0879
keep if genhospital==1
rename providerid oldproviderid

keep oldproviderid address city state zipcode
isid address city state zipcode

save 2014noncahdata.dta, replace

*Identify 2015 Critical Access Hospitals (CAH)
import delimited "/Users/markflores/Desktop/POS files/hos_revised_flatfiles_archive_10_2015/Hospital General Information.csv", stringcols(1) clear

gen temp=substr(providerid,3,4)
gen cah=0
replace cah=1 if real(temp)>=1300 & real(temp)<=1399

keep if cah==1
save 2015cahdata.dta, replace

*Merge To Identify Possible Conversion
merge 1:1 address city state zipcode using 2014noncahdata.dta
keep providerid _merge

//drop if _merge==2

gen year=2015
save 2015cahmergedata.dta, replace

*Identify 2015 General Acute Hospitals (GAH)
import delimited "/Users/markflores/Desktop/POS files/hos_revised_flatfiles_archive_10_2015/Hospital General Information.csv", stringcols(1) clear

gen temp=substr(providerid,3,4)
gen genhospital=0
replace genhospital=1 if real(temp)>=0001 & real(temp)<=0879
keep if genhospital==1
rename providerid oldproviderid

keep oldproviderid address city state zipcode
isid address city state zipcode

save 2015noncahdata.dta, replace

*Identify 2016 Critical Access Hospitals (CAH)
import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2016/hos_revised_flatfiles_archive_11_2016/Hospital General Information.csv", stringcols(1) clear

gen temp=substr(providerid,3,4)
gen cah=0
replace cah=1 if real(temp)>=1300 & real(temp)<=1399

keep if cah==1
save 2016cahdata.dta, replace

*Merge To Identify Possible Conversion
merge 1:1 address city state zipcode using 2015noncahdata.dta
keep providerid _merge

//drop if _merge==2

gen year=2016
save 2016cahmergedata.dta, replace

*Identify 2016 General Acute Hospitals (GAH)
import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2016/hos_revised_flatfiles_archive_11_2016/Hospital General Information.csv", stringcols(1) clear

gen temp=substr(providerid,3,4)
gen genhospital=0
replace genhospital=1 if real(temp)>=0001 & real(temp)<=0879
keep if genhospital==1
rename providerid oldproviderid

keep oldproviderid address city state zipcode
isid address city state zipcode

save 2016noncahdata.dta, replace

*Identify 2017 Critical Access Hospitals (CAH)
import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2017/hos_revised_flatfiles_archive_10_2017/Hospital General Information.csv", stringcols(1) clear

gen temp=substr(providerid,3,4)
gen cah=0
replace cah=1 if real(temp)>=1300 & real(temp)<=1399

keep if cah==1
save 2017cahdata.dta, replace

*Merge To Identify Possible Conversion
merge 1:1 address city state zipcode using 2016noncahdata.dta
keep providerid _merge

//drop if _merge==2

gen year=2017
save 2017cahmergedata.dta, replace

*Identify 2017 General Acute Hospitals (GAH)
import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2017/hos_revised_flatfiles_archive_10_2017/Hospital General Information.csv", stringcols(1) clear

gen temp=substr(providerid,3,4)
gen genhospital=0
replace genhospital=1 if real(temp)>=0001 & real(temp)<=0879
keep if genhospital==1
rename providerid oldproviderid

keep oldproviderid address city state zipcode
isid address city state zipcode

save 2017noncahdata.dta, replace

*Identify 2018 Critical Access Hospitals (CAH)
import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2018/hos_revised_flatfiles_archive_10_2018/Hospital General Information.csv", stringcols(1) clear

gen temp=substr(providerid,3,4)
gen cah=0
replace cah=1 if real(temp)>=1300 & real(temp)<=1399

keep if cah==1
save 2018cahdata.dta, replace

*Merge To Identify Possible Conversion
merge 1:1 address city state zipcode using 2017noncahdata.dta
keep providerid _merge

//drop if _merge==2

gen year=2018
save 2018cahmergedata.dta, replace

*Identify 2018 General Acute Hospitals (GAH)
import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2018/hos_revised_flatfiles_archive_10_2018/Hospital General Information.csv", stringcols(1) clear

gen temp=substr(providerid,3,4)
gen genhospital=0
replace genhospital=1 if real(temp)>=0001 & real(temp)<=0879
keep if genhospital==1
rename providerid oldproviderid

keep oldproviderid address city state zipcode
isid address city state zipcode

save 2018noncahdata.dta, replace

*Identify 2019 Critical Access Hospitals (CAH)
import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2019/hos_revised_flatfiles_archive_10_2019/Hospital General Information.csv", stringcols(1) clear

rename facilityid providerid
gen temp=substr(providerid,3,4)
gen cah=0
replace cah=1 if real(temp)>=1300 & real(temp)<=1399

keep if cah==1
save 2019cahdata.dta, replace

*Merge To Identify Possible Conversion
merge 1:1 address city state zipcode using 2018noncahdata.dta
keep providerid _merge

//drop if _merge==2

gen year=2019
save 2019cahmergedata.dta, replace

*Appending Data for Hospitals That Converted to CAH
use 2015cahmergedata.dta, clear

append using 2016cahmergedata.dta
append using 2017cahmergedata.dta
append using 2018cahmergedata.dta
append using 2019cahmergedata.dta

save cahconversiondata.dta, replace

*Merge to CAH Master List
use cahconversiondata.dta, clear
merge 1:1 providerid year using fullymergedcahdata.dta, gen (_new_merge)

variables providerid year do not uniquely identify observations in the master
data
r(459);

end of do-file

r(459);

Thanks!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

31 Oct 2022, 17:19

Well, it is easy enough to identify the offending observations:

Code:

use cahconversiondata, clear duplicates tag providerid year, gen(flag) browse if flag

Then you have to look at them and figure out what is going wrong. If the surplus observations are pure duplicates, i.e. agree with each other on all variables, then you can just do -duplicates drop- and they will go away. But before you do that, I urge you to revisit the data management leading up to the creation of cahconversiondata, and check at what point the duplicate observations crept in. It is likely that that point reflects a data management error. And where one error is found, others my lurk unseen. So before pushing ahead with -duplicates drop-, look for and correct the errors.

Life is harder for you if the observations that have common values of providerid and year disagree on some other variables. Then you have not only to find out how these surplus, contradictory observations crept into your data, but you must also figure out which, if any, of them is the correct one to retain (or perhaps the data from them needs to be combined in some way to get an accurate picture.)

Here's my hunch (but it is only a hunch) where things go awry. When you create the yearly cahmergedata files, you do that by merge-ing two files on address city state and zipcode. I think it is quite plausible that a large health care center might have two separate facilities, about which it reports separate data into the databases you are working with, located at the same address. For example, in New York City, Morgan Stanley Children's Hospital and New York Presbyterian Hospital are both located in the same large complex and might well be reported as separate hospitals with the same address. I'm sure other similar situations can arise. If I'm right about this, what I cannot tell you is what else might be in your data set that will enable you to disambiguate these.

Or, it might be something else. Hopefully, when you see the offending observations it will be, or soon become, clear to you.
1 like
Comment
Anthony Gonzalez

Join Date: Oct 2022

Posts: 9
#3

01 Nov 2022, 06:12

Thank you very much! If I were to drop duplicates, where would be a good place to place this code at?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#4

01 Nov 2022, 11:05

Probably before the -save cahconversiondata.dta- command.

But let me re-emphasize that it is a bad idea to just -drop- the duplicates, even when they are complete duplicates. It is better to first revisit the creation of that data set to see how the duplicates got in there in the first place. They don't belong there and it is likely that there is some error in the code that created cahconversiondata.dta, or in the original data sets from which it was built. It is far preferable to trace the problem back to its roots and fix it than just "shoot the alligators downstream." We know that there is something wrong with cahconversiondata.dta. We do not yet know if there are other problems with that data set as well. Where one problem is found, there is a good chance others remain. Sooner or later, if there are other errors, they will crop up somewhere. Better to nip that in the bud than wait until a future time when somebody relies to their detriment on your analysis of incorrect data. Never rush to get the wrong answers.
Comment

Announcement

Variables do not uniquely identify observations in the master data

Comment

Comment

Comment