Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Variables do not uniquely identify observations in the master data

    Hello,

    I keep getting the error "variables providerid year do not uniquely identify observations in the master data". However, I am not sure how else to uniquely identify the master data as I have tried adding more variables to it.

    Here is my code:

    *Identify 2014 General Acute Hospitals (GAH)
    import delimited "/Users/markflores/Desktop/POS files/HOSArchive_Revised_Flatfiles_20141023/Hospital General Information.csv", stringcols(1) clear

    gen temp=substr(providerid,3,4)
    gen genhospital=0
    replace genhospital=1 if real(temp)>=0001 & real(temp)<=0879
    keep if genhospital==1
    rename providerid oldproviderid

    keep oldproviderid address city state zipcode
    isid address city state zipcode

    save 2014noncahdata.dta, replace

    *Identify 2015 Critical Access Hospitals (CAH)
    import delimited "/Users/markflores/Desktop/POS files/hos_revised_flatfiles_archive_10_2015/Hospital General Information.csv", stringcols(1) clear

    gen temp=substr(providerid,3,4)
    gen cah=0
    replace cah=1 if real(temp)>=1300 & real(temp)<=1399

    keep if cah==1
    save 2015cahdata.dta, replace

    *Merge To Identify Possible Conversion
    merge 1:1 address city state zipcode using 2014noncahdata.dta
    keep providerid _merge

    //drop if _merge==2

    gen year=2015
    save 2015cahmergedata.dta, replace



    *Identify 2015 General Acute Hospitals (GAH)
    import delimited "/Users/markflores/Desktop/POS files/hos_revised_flatfiles_archive_10_2015/Hospital General Information.csv", stringcols(1) clear

    gen temp=substr(providerid,3,4)
    gen genhospital=0
    replace genhospital=1 if real(temp)>=0001 & real(temp)<=0879
    keep if genhospital==1
    rename providerid oldproviderid

    keep oldproviderid address city state zipcode
    isid address city state zipcode

    save 2015noncahdata.dta, replace

    *Identify 2016 Critical Access Hospitals (CAH)
    import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2016/hos_revised_flatfiles_archive_11_2016/Hospital General Information.csv", stringcols(1) clear

    gen temp=substr(providerid,3,4)
    gen cah=0
    replace cah=1 if real(temp)>=1300 & real(temp)<=1399

    keep if cah==1
    save 2016cahdata.dta, replace

    *Merge To Identify Possible Conversion
    merge 1:1 address city state zipcode using 2015noncahdata.dta
    keep providerid _merge

    //drop if _merge==2

    gen year=2016
    save 2016cahmergedata.dta, replace



    *Identify 2016 General Acute Hospitals (GAH)
    import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2016/hos_revised_flatfiles_archive_11_2016/Hospital General Information.csv", stringcols(1) clear

    gen temp=substr(providerid,3,4)
    gen genhospital=0
    replace genhospital=1 if real(temp)>=0001 & real(temp)<=0879
    keep if genhospital==1
    rename providerid oldproviderid

    keep oldproviderid address city state zipcode
    isid address city state zipcode

    save 2016noncahdata.dta, replace

    *Identify 2017 Critical Access Hospitals (CAH)
    import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2017/hos_revised_flatfiles_archive_10_2017/Hospital General Information.csv", stringcols(1) clear

    gen temp=substr(providerid,3,4)
    gen cah=0
    replace cah=1 if real(temp)>=1300 & real(temp)<=1399

    keep if cah==1
    save 2017cahdata.dta, replace

    *Merge To Identify Possible Conversion
    merge 1:1 address city state zipcode using 2016noncahdata.dta
    keep providerid _merge

    //drop if _merge==2

    gen year=2017
    save 2017cahmergedata.dta, replace



    *Identify 2017 General Acute Hospitals (GAH)
    import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2017/hos_revised_flatfiles_archive_10_2017/Hospital General Information.csv", stringcols(1) clear

    gen temp=substr(providerid,3,4)
    gen genhospital=0
    replace genhospital=1 if real(temp)>=0001 & real(temp)<=0879
    keep if genhospital==1
    rename providerid oldproviderid

    keep oldproviderid address city state zipcode
    isid address city state zipcode

    save 2017noncahdata.dta, replace

    *Identify 2018 Critical Access Hospitals (CAH)
    import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2018/hos_revised_flatfiles_archive_10_2018/Hospital General Information.csv", stringcols(1) clear

    gen temp=substr(providerid,3,4)
    gen cah=0
    replace cah=1 if real(temp)>=1300 & real(temp)<=1399

    keep if cah==1
    save 2018cahdata.dta, replace

    *Merge To Identify Possible Conversion
    merge 1:1 address city state zipcode using 2017noncahdata.dta
    keep providerid _merge

    //drop if _merge==2

    gen year=2018
    save 2018cahmergedata.dta, replace



    *Identify 2018 General Acute Hospitals (GAH)
    import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2018/hos_revised_flatfiles_archive_10_2018/Hospital General Information.csv", stringcols(1) clear

    gen temp=substr(providerid,3,4)
    gen genhospital=0
    replace genhospital=1 if real(temp)>=0001 & real(temp)<=0879
    keep if genhospital==1
    rename providerid oldproviderid

    keep oldproviderid address city state zipcode
    isid address city state zipcode

    save 2018noncahdata.dta, replace

    *Identify 2019 Critical Access Hospitals (CAH)
    import delimited "/Users/markflores/Desktop/hospital data/hospitals_archive_2019/hos_revised_flatfiles_archive_10_2019/Hospital General Information.csv", stringcols(1) clear

    rename facilityid providerid
    gen temp=substr(providerid,3,4)
    gen cah=0
    replace cah=1 if real(temp)>=1300 & real(temp)<=1399

    keep if cah==1
    save 2019cahdata.dta, replace

    *Merge To Identify Possible Conversion
    merge 1:1 address city state zipcode using 2018noncahdata.dta
    keep providerid _merge

    //drop if _merge==2

    gen year=2019
    save 2019cahmergedata.dta, replace



    *Appending Data for Hospitals That Converted to CAH
    use 2015cahmergedata.dta, clear

    append using 2016cahmergedata.dta
    append using 2017cahmergedata.dta
    append using 2018cahmergedata.dta
    append using 2019cahmergedata.dta

    save cahconversiondata.dta, replace



    *Merge to CAH Master List
    use cahconversiondata.dta, clear
    merge 1:1 providerid year using fullymergedcahdata.dta, gen (_new_merge)

    variables providerid year do not uniquely identify observations in the master
    data

    r(459);

    end of do-file

    r(459);

    Thanks!

  • #2
    Well, it is easy enough to identify the offending observations:
    Code:
    use cahconversiondata, clear
    duplicates tag providerid year, gen(flag)
    browse if flag
    Then you have to look at them and figure out what is going wrong. If the surplus observations are pure duplicates, i.e. agree with each other on all variables, then you can just do -duplicates drop- and they will go away. But before you do that, I urge you to revisit the data management leading up to the creation of cahconversiondata, and check at what point the duplicate observations crept in. It is likely that that point reflects a data management error. And where one error is found, others my lurk unseen. So before pushing ahead with -duplicates drop-, look for and correct the errors.

    Life is harder for you if the observations that have common values of providerid and year disagree on some other variables. Then you have not only to find out how these surplus, contradictory observations crept into your data, but you must also figure out which, if any, of them is the correct one to retain (or perhaps the data from them needs to be combined in some way to get an accurate picture.)

    Here's my hunch (but it is only a hunch) where things go awry. When you create the yearly cahmergedata files, you do that by merge-ing two files on address city state and zipcode. I think it is quite plausible that a large health care center might have two separate facilities, about which it reports separate data into the databases you are working with, located at the same address. For example, in New York City, Morgan Stanley Children's Hospital and New York Presbyterian Hospital are both located in the same large complex and might well be reported as separate hospitals with the same address. I'm sure other similar situations can arise. If I'm right about this, what I cannot tell you is what else might be in your data set that will enable you to disambiguate these.

    Or, it might be something else. Hopefully, when you see the offending observations it will be, or soon become, clear to you.

    Comment


    • #3
      Thank you very much! If I were to drop duplicates, where would be a good place to place this code at?

      Comment


      • #4
        Probably before the -save cahconversiondata.dta- command.

        But let me re-emphasize that it is a bad idea to just -drop- the duplicates, even when they are complete duplicates. It is better to first revisit the creation of that data set to see how the duplicates got in there in the first place. They don't belong there and it is likely that there is some error in the code that created cahconversiondata.dta, or in the original data sets from which it was built. It is far preferable to trace the problem back to its roots and fix it than just "shoot the alligators downstream." We know that there is something wrong with cahconversiondata.dta. We do not yet know if there are other problems with that data set as well. Where one problem is found, there is a good chance others remain. Sooner or later, if there are other errors, they will crop up somewhere. Better to nip that in the bud than wait until a future time when somebody relies to their detriment on your analysis of incorrect data. Never rush to get the wrong answers.

        Comment

        Working...
        X