Merge accuracy using str format when most contain only numbers

Lucrecia Lei

Join Date: Feb 2022
Posts: 35

Merge accuracy using str format when most contain only numbers

25 Jun 2022, 04:36

Dear stata user,
I have a question regarding the merge accuracy of str. I have dataset A whose firm_id are in string format, but most of them actually contain only numbers, those not jusr numbers are like the following:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str8 patent_id
"RE43814"
"RE43864"
"RE43868"
"RE43956"
"RE43986"
"RE43997"
"RE44164"
"RE44215"
"RE44861"
"RE44874"
"RE44924"
"RE44930"
"RE44958"
"RE44993"
"RE45248"
"RE45348"
"RE45418"
"RE45473"
"RE45539"
"RE45733"
"RE45782"
"RE45804"
"RE45956"
"RE45962"
"RE45990"
"RE46020"
"RE46089"
"RE46096"
"RE46176"
"RE46193"
"RE46351"
"RE46409"
"RE46436"
"RE46436"
"RE46436"
"RE46473"
"RE46488"
"RE46518"
"RE46558"
"RE46564"
"RE46630"
"RE46686"
"RE46703"
"RE46746"
"RE46850"
"RE46891"
"RE47055"
"RE47257"
"RE47341"
"RE47342"
"RE47351"
"RE47425"
"RE47487"
"RE47553"
"RE47663"
"RE47698"
"RE47715"
"RE47736"
"RE47737"
"RE47761"
"RE47763"
"RE47813"
"RE47857"
"RE47949"
"RE48267"
"RE48274"
"RE48308"
"RE48359"
"RE48378"
"RE48446"
"RE48524"
"RE48532"
"RE48599"
"RE48641"
"RE48695"
"RE48702"
"T100501"
"T958006"
"T962010"
"T964006"
"T965001"
"T988005"
end

And I have dataset B whose firm_id contains only numbers and are long format.
Now I want to merge them using firm_id as a key, I have 2 options:
1. turn str to long
2. turn long to str
For the 1st one I think using long format to merge will be more accurate, but I have to drop those firms with characters. And I don't know how to test those contain characters and drop them
For the 2nd one I don't need to give up any observations but I wonder the accuracy of merge using str format, will this be accurate when most of them contain only numbers?

Thanks!

Tags: None

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

25 Jun 2022, 07:44

I think you want to turn your str firm_id to a long firm_id, because if you happen to have any string firm_id with a leading zero "0123" and you try to turn a numeric firm 123 to string, you won't know that the string version needs a leading zero to match.

Code:

. * Example generated by -dataex-. For more info, type help dataex . clear . input str7 firm_id firm_id 1. "RE43814" 2. "RE43864" 3. "1234567" 4. "0000123" 5. end . rename firm_id firm_id_s . generate long firm_id = real(firm_id_s) (2 missing values generated) . list, clean noobs abbreviate(12) firm_id_s firm_id RE43814 . RE43864 . 1234567 1234567 0000123 123 .

Now the firm_id that were numeric will match to those in the other dataset, and the firm_id that had letters will fail to match, and youthe value of _merge will be 1 or 2 for them (and for any numeric firm_id that don't appear in the other dataset).
1 like
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2457
#3

25 Jun 2022, 09:37

With respect to accuracy, you can merge with any variables (except strLs) and the accuracy will always demand a perfect match. In this sense, merge is perfectly accurate. Where we often see discrepancies from our expectations are when the values look the same by eye (but not by bytes) or if there is round-off error between -float- and -double- (admittedly this is not as a common for merge operations).

Building slightly off of William's example, you can then create a new firm id as a string, that takes on either the IDs with alphanumeric characters, or the numeric ID (after converting to a number and back to a string). The same transformation can be applied to both datasets so that you have a commonly constructed firm ID to use for merging. This may be useful if you did have datasets with mixed ID formats, but might reduce some work using merge (or frlink).

Code:

clear * cls input str7 firm_id "RE43814" "RE43864" "1234567" "0000123" end rename firm_id firm_id_s generate long firm_id_l = real(firm_id_s) gen firm_id = cond(mi(firm_id_l), firm_id_s, strofreal(firm_id_l)) list
2 likes
Comment

Announcement

Merge accuracy using str format when most contain only numbers

Comment

Comment