Mystery using merge 1:1

Celine Robin

Join Date: Sep 2020

Posts: 12
#1

Mystery using merge 1:1

29 Apr 2021, 12:28

Hello Stata list,

I am facing a situation I can't explain, I'm using the merge command failing to merge only one observation with extacly the same names, after removing all potential blanks. I am using Stata 15.

The first datasets is the publicly available data on COVID by ECDC available here : https://www.ecdc.europa.eu/en/public...-rate-covid-19
I use the excel version of it, saved on my laptop, and keep only the last available week and only cases, so that one country = one observations and I have no duplicates.

Code:

keep if year_week=="2021-16" keep if indicator=="cases"

Then I change some of the upper cases to lower cases for the country names, and I remove any additional blanks (there were actually none):

Code:

replace country=subinstr(country, " And ", " and ", 1) replace country=subinstr(country, " Of ", " of ", 1) replace country=strtrim(stritrim(country))

I merge it with a file I created which has the translation of each countries. I've also deleted all potential additional blanks beforehand, using the same command as above (and again, there were none).

Code:

merge 1:1 country using Translate.dta br country pays if _merge!=3

The browse command give the following output:

country pays _merge
Antigua and Barbuda master only (1)
Antigua and Barbuda Antigua-et-Barbuda using only (2)

I don't see the difference between the country names.
Yet, when I copy paste the country name directly from the browse table, the first one and then the second one (which look completely identical to me), using the following commands,

Code:

br country pays if country=="Antigua and Barbuda"

and

Code:

br country pays if country=="Antigua and Barbuda"

The output yields in one observation for each: the one I copied respectively.

Following the same logic, when I tried to modify the name in my master dataset, it does make a replacement :

replace country="Antigua and Barbuda" if country=="Antigua and Barbuda"
(1 real change made)

Are there any hidden characters we should be wary of?
It's the first time I encountering such situation in Stata.

Thank you in anticipation for enlighting me here.
Tags: None

Sergiy Radyakin

Join Date: Apr 2014
Posts: 1867

29 Apr 2021, 12:54

Here is an example of such a dataset:

Code:

do https://www.radyakin.org/statalist/2021/1606702.do

Code:

Duplicates in terms of country

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |            4             0
--------------------------------------

Code:

     +---------------------+
     |             country |
     |---------------------|
  1. | Antigua and Barbuda |
  2. | Аntigua and Barbuda |
  3. | Antigua and Вarbuda |
  4. | Antiguа and Barbuda |
     +---------------------+

Comment

Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

29 Apr 2021, 13:07

To solve 'mysteries' like these, the ascii() function of mata is very handy:
1 like
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

29 Apr 2021, 13:13

After running Sergiy's code, using the community-contributed chartab command installed from SSC with

Code:

ssc install chartab

tells us that in his example several character are in fact not ASCII characters, even though they appear to be; they are instead Unicode characters from foreign alphabets.

Code:

. help chartab

. chartab *

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+--------------------------------------------
        32       \u0020             |             8    SPACE
        65       \u0041       A     |             3    LATIN CAPITAL LETTER A
        66       \u0042       B     |             3    LATIN CAPITAL LETTER B
        97       \u0061       a     |            15    LATIN SMALL LETTER A
        98       \u0062       b     |             4    LATIN SMALL LETTER B
       100       \u0064       d     |             8    LATIN SMALL LETTER D
       103       \u0067       g     |             4    LATIN SMALL LETTER G
       105       \u0069       i     |             4    LATIN SMALL LETTER I
       110       \u006e       n     |             8    LATIN SMALL LETTER N
       114       \u0072       r     |             4    LATIN SMALL LETTER R
       116       \u0074       t     |             4    LATIN SMALL LETTER T
       117       \u0075       u     |             8    LATIN SMALL LETTER U
     1,040       \u0410       А     |             1    CYRILLIC CAPITAL LETTER A
     1,042       \u0412       В     |             1    CYRILLIC CAPITAL LETTER VE
     1,072       \u0430       а     |             1    CYRILLIC SMALL LETTER A
------------------------------------+--------------------------------------------

                                    freq. count   distinct
ASCII characters              =              73         12
Multibyte UTF-8 characters    =               3          3
Unicode replacement character =               0          0
Total Unicode characters      =              76         15

This explains how the problem can occur; it does not tell us a convenient way to rectify it.

Install and run chartab on each of your two datasets and report back, and someone can likely advise you how to proceed.

Comment

Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#5

29 Apr 2021, 13:21

Code:

merge 1:1 country using Translate.dta keep if _merge!=3 mata ascii(`"`=country[1]'"') mata ascii(`"`=country[2]'"')

Should indicate where the difference is.
Most likely the spelling is non-canonical in the hand-made Translate.dta
Comment
Celine Robin

Join Date: Sep 2020

Posts: 12
#6

29 Apr 2021, 14:31

Thank you to both of you! That is very instructive. Will try both approaches tomorrow when I'm back to work.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10254

29 Apr 2021, 14:47

You can also get rid of the problematic non-ascii characters using standard string functions:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input strL country
"Antigua and Barbuda" 
"Аntigua and Barbuda"
"Antigua and Вarbuda"
"Antiguа and Barbuda"
end

gen country2=ustrto(country, "ascii", 2)

Then replace manually or use a matching mechanism.

Res.:

Code:

. 
. gen country2=ustrto(country, "ascii", 2)

. l

     +-------------------------------------------+
     |             country              country2 |
     |-------------------------------------------|
  1. | Antigua and Barbuda   Antigua and Barbuda |
  2. | Аntigua and Barbuda    ntigua and Barbuda |
  3. | Antigua and Вarbuda    Antigua and arbuda |
  4. | Antiguа and Barbuda    Antigu and Barbuda |
     +-------------------------------------------+

Announcement