Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating unique ID for use with merge

    Hello All,

    I was wondering whether someone might help me with understanding how to create a unique variable upon which to merge 2 panel datasets in which individuals can have more than one observation per time point?
    Specifically, my 2 datasets, one pharmacy claims, one medical claims are in long format where one individual can use more than 1 drug or use more than one medical service per month. I’d like to produce a variable which would merge the pharmacy and medical claims based on the combo of individual id (“mbr_ckey”) and month of use (“cmonth”). The datasets look something like the following:
    Pharmacy
    mbr_ckey cMonth Drug name
    1 0 Aspirin
    1 0 Oxycodone
    1 1 Aspirin
    1 2 Aspirin
    1 3 Aspirin
    2 0 simvastatin
    2 1 .
    2 2 Simvastatin
    2 3 Simvastatin
    2 3 warfarin
    Medical
    mbr_ckey cMonth CPT Code
    1 0 .
    1 1 10021
    1 2 .
    1 3 99341
    2 0 .
    2 1 .
    2 2 .
    2 3 .
    I’ve tsset and tssfill, full both datasets so all individuals have the entire range of time observations (0-3 here), regardless if they’ve incurred a claim during a particular month.

    What I’m thinking to do is to generate a new variable which would uniquely identify each mbr_ckey and cMonth combo by combing mbr_ckey and cMonth names for the observation and numbering mbr_ckey and cMonth combos sequentially within cmonths. For example, in the pharmacy dataset the new variable might look like:

    Pharmacy
    mbr_ckey cMonth Drug name Mergekey
    1 0 Aspirin 1-0-1
    1 0 Oxycodone 1-0-2
    1 1 Aspirin 1-1-1
    1 2 Aspirin 1-2-1
    1 3 Aspirin 1-3-1
    2 0 simvastatin 2-0-1
    2 1 . 2-1-1
    2 2 Simvastatin 2-2-1
    2 3 Simvastatin 2-3-1
    2 3 warfarin 2-3-2
    After doing so for both datasets, I could merge 1:1 on Mergekey.

    To restate my question:
    1. Is there an even better way to do what I am proposing?
    2. If not, what would be the Stata code to generate the Mergekey?
    Many Thanks!
    Karl
    I am using Stata SE x64 ver 13.1 with Win 7 x64 and with 8 GB of ram.

  • #2
    Appologies for the column widths, seems like Statalist did not like it when I copied and pasted from the table I created in word.
    I am using Stata SE x64 ver 13.1 with Win 7 x64 and with 8 GB of ram.

    Comment


    • #3
      Why don't you just merge on the combination of mbr_ckey and cMonth? (As you'll see in the help file, merge allows you to merge on a varlist, not just a single variable)

      Comment


      • #4
        As I understand your data, you have one observation per person and month in the medical dataset, but you may have more in the pharmacy dataset. This calls for a 1:m merge:
        Code:
        use medical.dta
        merge 1:m mbr_ckey cMonth using pharmacy.dta
        Hope this helps

        Comment


        • #5
          Hi Karl,

          Phil and Svend are right – there's no need in generating an additional merging identifier in your data. By the way, it would be useless unless you could generate it in both data sets.

          Svend is correct on suggesting that you can simply merge 1:m with your existing data. But be cautious when doing so: the resulting data set will not be a "simple" panel data set (with one observation per "mbr_ckey" and "cmonth"), but will contain these observations duplicated where necessary in order to reflect several "drug_name"s per "mbr_ckey" and "cmonth". If you want to avoid this, you should first -reshape- your pharmacy data, and afterwards 1:1 merge it with your medical data.

          For giving you an illustration, I create two simple data sets containing your data:
          Code:
          clear
          input    mbr_ckey    cmonth    str11 drug_name
              1        0    "Aspirin"
              1        0    "Oxycodone"
              1        1    "Aspirin"
              1        2    "Aspirin"
              1        3    "Aspirin"
              2        0    "simvastatin"
              2        1    ""
              2        2    "Simvastatin"
              2        3    "Simvastatin"
              2        3    "warfarin"
          end
          label variable cmonth "cMonth"
          label variable drug_name "Drug name"
          save "pharmacy.dta" , replace
          clear
          input    mbr_ckey    cmonth    cpt_code
              1        0    .
              1        1     10021
              1        2     .
              1        3     99341
              2        0     .
              2        1     .
              2        2     .
              2        3     .
          end
          label variable cmonth "cMonth"
          label variable cpt_code "CPT Code"
          save "medical.dta" , replace
          Now here's a working minimal example for both ways of combining the two data sets; be sure to have a look at the result data with -browse- after each merge procedure:
          Code:
          // Svend's straightforward way: merge 1:m with your data
          use "medical.dta"
          merge 1:m mbr_ckey cmonth using "pharmacy.dta"
          sort mbr_ckey cmonth
          // --> -browse- result: one observation per "drug_name" inside of "mbr_ckey" and "cmonth"
          
          // my cautious way: first -reshape-, then merge 1:1
          use "pharmacy.dta" , clear
          bysort mbr_ckey cmonth : generate entry=_n // this is essential: we generate a running number per given drug_name
          reshape wide drug_name , i(mbr_ckey cmonth) j(entry)
          save "pharmacy_reshaped.dta" , replace
          use "medical.dta"
          merge 1:1 mbr_ckey cmonth using "pharmacy_reshaped.dta"
          // --> -browse- result: one observation per "mbr_ckey" and "cmonth"
          Note the difference between the two data sets generated by the merge procedures: Whilst the first one delivers one observation per given "drug_name" inside any "mbr_ckey" and "cmonth", the second one delivers several given "drug_name"s in several variables.

          Which way fits best is largely determined by the analysis you want to perform at the end. Both may be appropriate.

          Regards
          Bela

          Comment


          • #6
            In the example data you show you only have one medical claim per month. In that case the strategy Svend suggests should work for you.

            However, based on your text and my experience working with claims data, I'm guessing that you actually have multiple medical claims per month for some people.

            I think that if that is the case I would use reshape wide on both data sets to get a single observation per person individual each month. On this list you'll often see advice suggesting against wide form data. However, I think it's the only way to sensibly combine the data if you have multiple claims of each type per month.

            Once you have wide data for each you can do a 1:1 merge using mbr_ckey and cMonth as your merge id.

            The process might look something like this (note I'm guessing a bit about variable names, etc. If you have other variables that vary by month you'll want to include them in the reshape command as well).

            Code:
             use medical
            **create a sequence number to use in the reshape
            bysort mbr_ckey cMonth : gen seq=_n
            
            reshape wide cpt, i(mbr_ckey cMonth) j(seq)
            
            save medical_wide
            
            
            **now do the same with the pharmacy claims
            use pharmacy
            bysort mbr_ckey cMonth : gen seq=_n
            
            reshape wide drugname, i(mbr_ckey cMonth) j(seq)
            
            
            **merge the two sets.  mbr_ckey and cMonth should now uniquely identify observations in both data sets
            merge 1:1 mbr_ckey cMonth using medical_wide
            See the help for reshape and merge for more info on both.

            Comment


            • #7
              Thank you all for your responses. Sarah, you are correct in your observations and I have been trying to implement your suggested code, although it has exposed that some individuals fill >20 different drugs per month (which causes the data set to become impossibly big to fit into working memory) and I am in the process of reviewing these observations for errors.
              I am using Stata SE x64 ver 13.1 with Win 7 x64 and with 8 GB of ram.

              Comment


              • #8
                I tried to take Sarah's advice and reshape the data into wide format. However, I've found that some individuals fill >20 different drugs per month (each of which are currently individual observations in the long format). I have verified that these are legitimate drug fills and not errors) and reshaping the data into wide causes the data set to become impossibly big to fit into working memory since the datasets contain ~16,000 individuals observed monthly over a 4 year period.

                Might there be a different way?

                I am using STATA SE x64 ver 13.1 with Win 7 x64 and with 8 GB of ram.

                Karl
                I am using Stata SE x64 ver 13.1 with Win 7 x64 and with 8 GB of ram.

                Comment


                • #9
                  Well, 16,000 individuals observed monthly for four years comes to 768,000 observations, which, in the general scheme of things, is a large, but not huge amount of data. And it doesn't sound like enough to overwhelm your setup unless each observation is very large (but see 5 below). So to shrink the size of each observation you can try a few things:

                  1. -encode- the drug name variable and then -compress- the resulting variable. Currently they are strings, perhaps long ones, and you can reduce them to integers (2 bytes) , or maybe even to byte (with just a little overhead for the label). Do the same for any other string variable that is being used to represent only a limited range of values (e.g. sex, ethnicity, insurance provider), if there are any such.

                  2. -compress- everything!

                  3. Review the entire list of drugs that appear in your data set: perhaps some of them are not relevant for your purposes and those observations can be dropped. (e.g. do you care about multivitamins or dry-skin lotion for present purposes?)

                  4. Are there variables in your data set that you don't need for this particular analysis? Drop them.

                  5. Your operating system may be restricting the amount of memory you can access to far less than 8GB. If it's not your personal computer, check with your IT department to see if that can be changed.

                  If these steps don't help, perhaps somebody else on the forum has other suggestions.

                  Comment


                  • #10
                    How you approach this task depends a great deal on what your end goals are.
                    The basic logic here is that you can't sensibly do a many to many merge so you need to figure out some way to reduce the data in such a way that both datasets share an identifier (or identifiers) and at least one of the datasets has a single observation per identifier (or unique combination of identifiers). What the right way to do that is will depend a great deal on what you plan to do after merging the data.

                    When I've worked with data of this type in the past I've usually avoided merging claims observations to claims observations entirely. Instead I create summary measures of interest from one or both datasets and then merge together the summary measures. So, for example, if you're interested in how many medical claims per month of a certain type each person has, you would use the medical claims data to create that measure. Then you would have a single observation per person per month that you could merge to your pharmacy data using a 1:m merge on person id and month id. Or, better yet, you can create whatever measures you need from the pharmacy claims and then merge both summary files together using a 1:1 merge.

                    Of course, this kind of strategy won't work if you really need to be able to reference information from individual medical claims and pharmacy claims at the same time to create your measures. As a first step you should outline for yourself exactly what information you need out of each dataset and figure out exactly how the pieces should fit together in your final dataset. The most important part of any merge is figuring out exactly what the data should look like when you're done. Once you have that clearly worked out it's much easier to work out the specifics of the data management tasks.

                    With regard to my original advice, I'm surprised that you're having difficulty with the reshape. Reshaping wide with 20 or so observations per month should be fine (though >20 is not specific and I could see problems developing if some people had many many more fills than 20 in a month). Are you reshaping in a way that gives you a single observation per person per month as I suggested? If you're trying to reshape to a single observation per person I could see the reshape process being too unwieldy to work, but I wouldn't expect that to be an issue reshaping to a person-month dataset.

                    If you want to pursue the reshaping option and Clyde's advice doesn't solve your problem you should give us more information about exactly what variables you have in your data. Do you have many more variable per claim than what you showed? After describing your data your should show us what you typed to do the reshape and then show what information Stata gave you. Without knowing more about what you're doing it's hard to figure out why you're having problems.

                    Comment

                    Working...
                    X