Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge accuracy using str format when most contain only numbers

    Dear stata user,
    I have a question regarding the merge accuracy of str. I have dataset A whose firm_id are in string format, but most of them actually contain only numbers, those not jusr numbers are like the following:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str8 patent_id
    "RE43814"
    "RE43864"
    "RE43868"
    "RE43956"
    "RE43986"
    "RE43997"
    "RE44164"
    "RE44215"
    "RE44861"
    "RE44874"
    "RE44924"
    "RE44930"
    "RE44958"
    "RE44993"
    "RE45248"
    "RE45348"
    "RE45418"
    "RE45473"
    "RE45539"
    "RE45733"
    "RE45782"
    "RE45804"
    "RE45956"
    "RE45962"
    "RE45990"
    "RE46020"
    "RE46089"
    "RE46096"
    "RE46176"
    "RE46193"
    "RE46351"
    "RE46409"
    "RE46436"
    "RE46436"
    "RE46436"
    "RE46473"
    "RE46488"
    "RE46518"
    "RE46558"
    "RE46564"
    "RE46630"
    "RE46686"
    "RE46703"
    "RE46746"
    "RE46850"
    "RE46891"
    "RE47055"
    "RE47257"
    "RE47341"
    "RE47342"
    "RE47351"
    "RE47425"
    "RE47487"
    "RE47553"
    "RE47663"
    "RE47698"
    "RE47715"
    "RE47736"
    "RE47737"
    "RE47761"
    "RE47763"
    "RE47813"
    "RE47857"
    "RE47949"
    "RE48267"
    "RE48274"
    "RE48308"
    "RE48359"
    "RE48378"
    "RE48446"
    "RE48524"
    "RE48532"
    "RE48599"
    "RE48641"
    "RE48695"
    "RE48702"
    "T100501"
    "T958006"
    "T962010"
    "T964006"
    "T965001"
    "T988005"
    end
    And I have dataset B whose firm_id contains only numbers and are long format.
    Now I want to merge them using firm_id as a key, I have 2 options:
    1. turn str to long
    2. turn long to str
    For the 1st one I think using long format to merge will be more accurate, but I have to drop those firms with characters. And I don't know how to test those contain characters and drop them
    For the 2nd one I don't need to give up any observations but I wonder the accuracy of merge using str format, will this be accurate when most of them contain only numbers?

    Thanks!

  • #2
    I think you want to turn your str firm_id to a long firm_id, because if you happen to have any string firm_id with a leading zero "0123" and you try to turn a numeric firm 123 to string, you won't know that the string version needs a leading zero to match.

    Code:
    . * Example generated by -dataex-. For more info, type help dataex
    . clear
    
    . input str7 firm_id
    
           firm_id
      1. "RE43814"
      2. "RE43864"
      3. "1234567"
      4. "0000123"
      5. end
    
    . rename firm_id firm_id_s
    
    . generate long firm_id = real(firm_id_s)
    (2 missing values generated)
    
    . list, clean noobs abbreviate(12)
    
        firm_id_s   firm_id  
          RE43814         .  
          RE43864         .  
          1234567   1234567  
          0000123       123  
    
    .
    Now the firm_id that were numeric will match to those in the other dataset, and the firm_id that had letters will fail to match, and youthe value of _merge will be 1 or 2 for them (and for any numeric firm_id that don't appear in the other dataset).

    Comment


    • #3
      With respect to accuracy, you can merge with any variables (except strLs) and the accuracy will always demand a perfect match. In this sense, merge is perfectly accurate. Where we often see discrepancies from our expectations are when the values look the same by eye (but not by bytes) or if there is round-off error between -float- and -double- (admittedly this is not as a common for merge operations).

      Building slightly off of William's example, you can then create a new firm id as a string, that takes on either the IDs with alphanumeric characters, or the numeric ID (after converting to a number and back to a string). The same transformation can be applied to both datasets so that you have a commonly constructed firm ID to use for merging. This may be useful if you did have datasets with mixed ID formats, but might reduce some work using merge (or frlink).

      Code:
      clear *
      cls
      
      input str7 firm_id
      "RE43814"
      "RE43864"
      "1234567"
      "0000123"
      end
      
      rename firm_id firm_id_s
      generate long firm_id_l = real(firm_id_s)
      gen firm_id = cond(mi(firm_id_l), firm_id_s, strofreal(firm_id_l))
      list

      Comment

      Working...
      X