Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mystery using merge 1:1

    Hello Stata list,

    I am facing a situation I can't explain, I'm using the merge command failing to merge only one observation with extacly the same names, after removing all potential blanks. I am using Stata 15.

    The first datasets is the publicly available data on COVID by ECDC available here : https://www.ecdc.europa.eu/en/public...-rate-covid-19
    I use the excel version of it, saved on my laptop, and keep only the last available week and only cases, so that one country = one observations and I have no duplicates.

    Code:
    keep if year_week=="2021-16"
    keep if indicator=="cases"
    Then I change some of the upper cases to lower cases for the country names, and I remove any additional blanks (there were actually none):

    Code:
    replace country=subinstr(country, " And ", " and ", 1)
    replace country=subinstr(country, " Of ", " of ", 1)
    replace country=strtrim(stritrim(country))
    I merge it with a file I created which has the translation of each countries. I've also deleted all potential additional blanks beforehand, using the same command as above (and again, there were none).

    Code:
    merge 1:1 country using Translate.dta
    br country pays if _merge!=3
    The browse command give the following output:

    country pays _merge
    Antigua and Barbuda master only (1)
    Antigua and Barbuda Antigua-et-Barbuda using only (2)


    I don't see the difference between the country names.
    Yet, when I copy paste the country name directly from the browse table, the first one and then the second one (which look completely identical to me), using the following commands,

    Code:
     br country pays if country=="Antigua and Barbuda"
    and

    Code:
     br country pays if country=="Antigua and Barbuda"
    The output yields in one observation for each: the one I copied respectively.


    Following the same logic, when I tried to modify the name in my master dataset, it does make a replacement :

    replace country="Antigua and Barbuda" if country=="Antigua and Barbuda"
    (1 real change made)


    Are there any hidden characters we should be wary of?
    It's the first time I encountering such situation in Stata.

    Thank you in anticipation for enlighting me here.



  • #2
    Here is an example of such a dataset:
    Code:
    do https://www.radyakin.org/statalist/2021/1606702.do

    Code:
    Duplicates in terms of country
    
    --------------------------------------
       copies | observations       surplus
    ----------+---------------------------
            1 |            4             0
    --------------------------------------

    Code:
         +---------------------+
         |             country |
         |---------------------|
      1. | Antigua and Barbuda |
      2. | Аntigua and Barbuda |
      3. | Antigua and Вarbuda |
      4. | Antiguа and Barbuda |
         +---------------------+

    Comment


    • #3
      To solve 'mysteries' like these, the ascii() function of mata is very handy:

      Comment


      • #4
        After running Sergiy's code, using the community-contributed chartab command installed from SSC with
        Code:
        ssc install chartab
        tells us that in his example several character are in fact not ASCII characters, even though they appear to be; they are instead Unicode characters from foreign alphabets.
        Code:
        . help chartab
        
        . chartab *
        
           decimal  hexadecimal   character |     frequency    unique name
        ------------------------------------+--------------------------------------------
                32       \u0020             |             8    SPACE
                65       \u0041       A     |             3    LATIN CAPITAL LETTER A
                66       \u0042       B     |             3    LATIN CAPITAL LETTER B
                97       \u0061       a     |            15    LATIN SMALL LETTER A
                98       \u0062       b     |             4    LATIN SMALL LETTER B
               100       \u0064       d     |             8    LATIN SMALL LETTER D
               103       \u0067       g     |             4    LATIN SMALL LETTER G
               105       \u0069       i     |             4    LATIN SMALL LETTER I
               110       \u006e       n     |             8    LATIN SMALL LETTER N
               114       \u0072       r     |             4    LATIN SMALL LETTER R
               116       \u0074       t     |             4    LATIN SMALL LETTER T
               117       \u0075       u     |             8    LATIN SMALL LETTER U
             1,040       \u0410       А     |             1    CYRILLIC CAPITAL LETTER A
             1,042       \u0412       В     |             1    CYRILLIC CAPITAL LETTER VE
             1,072       \u0430       а     |             1    CYRILLIC SMALL LETTER A
        ------------------------------------+--------------------------------------------
        
                                            freq. count   distinct
        ASCII characters              =              73         12
        Multibyte UTF-8 characters    =               3          3
        Unicode replacement character =               0          0
        Total Unicode characters      =              76         15
        This explains how the problem can occur; it does not tell us a convenient way to rectify it.

        Install and run chartab on each of your two datasets and report back, and someone can likely advise you how to proceed.

        Comment


        • #5
          Code:
          merge 1:1 country using Translate.dta
          keep if _merge!=3
          mata ascii(`"`=country[1]'"')
          mata ascii(`"`=country[2]'"')
          Should indicate where the difference is.
          Most likely the spelling is non-canonical in the hand-made Translate.dta

          Comment


          • #6
            Thank you to both of you! That is very instructive. Will try both approaches tomorrow when I'm back to work.

            Comment


            • #7
              You can also get rid of the problematic non-ascii characters using standard string functions:

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input strL country
              "Antigua and Barbuda" 
              "Аntigua and Barbuda"
              "Antigua and Вarbuda"
              "Antiguа and Barbuda"
              end
              
              gen country2=ustrto(country, "ascii", 2)
              Then replace manually or use a matching mechanism.

              Res.:

              Code:
              . 
              . gen country2=ustrto(country, "ascii", 2)
              
              . l
              
                   +-------------------------------------------+
                   |             country              country2 |
                   |-------------------------------------------|
                1. | Antigua and Barbuda   Antigua and Barbuda |
                2. | Аntigua and Barbuda    ntigua and Barbuda |
                3. | Antigua and Вarbuda    Antigua and arbuda |
                4. | Antiguа and Barbuda    Antigu and Barbuda |
                   +-------------------------------------------+

              Comment

              Working...
              X