No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging many datasets - conflicts with string and numeric vars

    Hello! This probably isn't the most efficient way to merge multiple databases (suggestions are welcome), however my dilemma is Stata treating my "nonmissing conflict _merge==5" values as missing when I use the replace and force options when merging.

    I'm working with health system data that changes over time, in both added and dropped patients, as well as status changes over time (i.e. "most recent pain assessment rating"). The example code below runs through roughly 25 datasets, each with 3000 observations and 30 variables. For variables like the pain rating above, it is sometimes coded as a "3" on a Likert-type scale or as "Moderate" pain; therefore some databases are numeric/double while others are string/str3. I'm looking for a way to generate a new var in the middle of the merge process, something like "gen newvar = oldvar if _merge==5" that way the values of the oldvar are not overwritten/turned into missing data and the newvar values are retained in their own varlist.

    use "DatabaseA.dta"
    merge 1:1 studyId using "DatabaseB.dta", replace update force
    drop _merge
    save "DatabaseAB.dta"
    use "DatabaseAB.dta"
    merge 1:1 studyId using "DatabaseC.dta", replace update force
    drop _merge
    save "DatabaseABC.dta"
    use "DatabaseABC.dta"
    merge 1:1 studyId using "DatabaseD.dta", replace update force
    drop _merge
    save "DatabaseABCD.dta"
    This post seems similar to what I'm asking, but was not quite resolved:


  • #2
    Hi, Josh,

    The easiest way to get what you want is to first loop over all data sets, converting relevant variables to the same type. Then merge data sets. Do the following separately for data sets where your problem variable(s) is(are) numeric or string. Change file paths and variable names as needed. Variable names should be the same across all files.

    For string (Loop over files to encode string variables)

    foreach file in "C:\DATA\HSD_1.txt" "C:\DATA\HSD_3.txt" "C:\DATA\HSD_5.txt" ... {
    use `file', clear
    encode painrating, gen(newpainrating)
    drop painrating
    rename newpainrating painrating
    save , replace

    For numeric (Loop over files to add value labels to numeric variables)

    foreach file in "C:\DATA\HSD_2.txt" "C:\DATA\HSD_4.txt" "C:\DATA\HSD_6.txt" ... {
    use `file', clear
    label define rating 1 "Very Low" 2 "Low" 3 "Moderate" 4 "High" 5 "Very High"
    label values painrating rating
    save , replace

    Now merge the files.

    The code above is meant as an example only. Consult Stata resources for more on how to encode (help encode) and defining labels (help label). You want to make sure the encoding of strings accurately reflects proper numeric ordering.

    I hope that helps. And I hope it works.

    Last edited by David Torres; 11 Jul 2018, 15:37.


    • #3
      The advice in post #2 to convert all variables to the same type before trying to merge is good. But the techniques suggested are a problem.

      As the output of help encode tells us, encode should not be used to convert numbers stored as strings into strings. Consider the following examples.
      * Example generated by -dataex-. To install: ssc install dataex
      input str16(a b c d)
      "1" "1" "0" "Low"     
      "2" "2" "1" "Mild"    
      "3" "2" "2" "Moderate"
      "4" "4" "3" "Very"    
      "5" "5" "4" "Extreme" 
      encode a, generate(new_a)
      encode b, generate(new_b)
      encode c, generate(new_c)
      encode d, generate(new_d)
      . list *a *b *c *d, clean noobs nolabel
          a   new_a   b   new_b   c   new_c          d   new_d  
          1       1   1       1   0       1        Low       2  
          2       2   2       2   1       2       Mild       3  
          3       3   2       2   2       3   Moderate       4  
          4       4   4       3   3       4       Very       5  
          5       5   5       4   4       5    Extreme       1
      The following code generates more appropriate numeric values.
      destring a, generate(new_a)
      destring b, generate(new_b)
      destring c, generate(new_c)
      label define dlabel 1 "Low" 2 "Mild" 3 "Moderate" 4 "Very" 5 "Extreme"
      encode d, generate(new_d) label(dlabel)
      . list *a *b *c *d, clean noobs nolabel
          a   new_a   b   new_b   c   new_c          d   new_d  
          1       1   1       1   0       0        Low       1  
          2       2   2       2   1       1       Mild       2  
          3       3   2       2   2       2   Moderate       3  
          4       4   4       4   3       3       Very       4  
          5       5   5       5   4       4    Extreme       5


      • #4
        Thanks, William.

        And good luck, Josh.



        • #5
          David and William,

          Thank you both for your responses! I used the foreach loop to label and encode the datasets I knew contained vars as strings, then generated an identical newvar for the set that was already numeric (I suppose I could have just renamed the oldvar as well). All of my flies were successfully merged.

          foreach file in "20171005_Tracking" "20171020_Tracking" ... {
          use `file', clear
          label define conf 1 "Not at all confident" 2 "A little bit confident" 3 "Somewhat confident" 4 "Quite a bit confident" 5 "Extremely confident"
          encode Confident, generate(Confi2) label(conf)
          recast byte Confi2
          label define testever 1 "Yes" 2 "No"
          encode Col, generate(Col2) label(testever)
          encode StoolTestEver, generate(Stool2) label(testever)
          recast byte Col2
          recast byte Stool2
          label define worry 1 "Not at all" 2 "Slightly" 3 "Somewhat" 4 "Moderately" 5 "Extremely"
          encode Worri, generate(Worri2) label(worry)
          recast byte Worri2
          label define likely 1 "Very low" 2 "Somewhat low" 3 "Moderate" 4 "Somewhat high" 5 "Very high"
          encode HowLikely, generate(HowL2) label(likely)
          recast byte HowL2
          label define decision 1 "I prefer to make the decision" 2 "I prefer to make the decision, after seriously considering my doctor’s opinion" 3 "I prefer that my doctor and I share responsibility for the decision" 4 "I prefer that my doctor makes the decision, but seriously considers my opinion" 5 "I prefer to leave the decision to my doctor"
          encode Decision, generate(Decision2) label(decision)
          recast byte Decision2
          label define thoughts 1 "I have had colon cancer testing in the past 6 months" 2 "I have decided to have colon cancer testing in the next 6 months" 3 "I have decided to have colon cancer testing, but not in the next 6 months" 4 "I am undecided about whether to have colon cancer testing" 5 "I have decided against having colon cancer testing"
          encode Thoughts, generate(Thought2) label(thoughts)
          recast byte Thought2
          save, replace
          foreach file in "20180301_Tracking" "20180316_Tracking" "20180323_Tracking"... {
          use `file', clear
          generate Confi2 = Confident
          generate Col2 = Col
          generate Stool2 = StoolTestEver
          generate Worri2 = Worri
          generate HowL2 = HowLikely
          gen Decision2 = Decision
          generate Thought2 = Thought
          save, replace
          Thanks again,