  • Consistent spelling of names in a list - string functions

    Dear Statalist,

    I am trying to solve a problem with duplicate observations (people) in my sample. I have a column with their first name and a column with their last name. The spelling of the first and the last name of the duplicate observations can differ. By differ I mean that in one case the first name can be spelled ‘BRUCE’ and in the other ‘Bruce’ or ‘bruce’. The same holds for the last name.

    To find how many duplicates I have for each combination of first and last name, I used
    Code:
    duplicates tag Fname Lname, generate(duplicates)
    Then I dropped the tagged duplicates. However, when I checked the new list of names, some duplicate observations remained because Stata had not identified them as such. Most likely this is because Stata does not treat an uppercase spelling of a name as the same as its lowercase or proper-case spelling.

    I have been trying to make the spelling of the names in my list consistent (first letter capital, the rest lowercase) but could not come up with a solution. There are string functions like strupper(s), but there I have to specify the exact string, which means that I would have to do it for every first and last name separately. In Excel the function ‘proper’ would solve my problem, but I would like to do it in Stata if possible. I would be extremely grateful for any suggestions. I am using Stata 14.1.

    Thank you very much in advance for your help.

    Albena

  • #2
    You could try something like
    Code:
    generate str full_name = trim(itrim(strlower(Fname) + " " + strlower(Lname)))
    duplicates tag full_name, generate(duplicates)
    Is the Excel function 'proper' similar to Stata's strproper()?



    • #3
      Just to be clear: duplicates has one and only one idea of what a duplicate is, namely exact identity of stored values. It has precisely no idea of identical meaning, import or essence, or of recognizing what are, to people, evidently different versions of the same thing. So, as Joseph's answer implies, you have to do all the work of translating to a common form.
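
      A minimal sketch of that point, using made-up toy data (the three spellings are hypothetical and Fname_clean is an illustrative variable name, not one from the original dataset). Comparing the two reports should show that the raw spellings are treated as distinct values, while the lowercased copies are treated as the same value:

      ```stata
      * Hypothetical toy data: three spellings of one first name
      clear
      input str10 Fname
      "BRUCE"
      "Bruce"
      "bruce"
      end

      * Raw strings differ in case, so duplicates sees three distinct values
      duplicates report Fname

      * Translate to a common form first (Fname_clean is an illustrative name)
      generate str Fname_clean = strlower(Fname)

      * After normalization the three observations share one stored value
      duplicates report Fname_clean
      ```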



      • #4
        Thank you, Joseph and Nick! I tried your code, Joseph, and it works. The function 'proper' in Excel is similar to strproper() in Stata. However, in Excel you would just type, for example,
        Code:
        PROPER(B2)
        and then the formula can be automatically applied to the other cells in the same column.



        • #5
          In Stata too, putting proper() around a variable name (or indeed an expression) applies it to every observation of that variable.
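
          As a hedged illustration of this, the sketch below uses strproper(), the name under which the function is documented in Stata 14, on invented example names. Fname and Lname are the variable names from the original post; the data values are hypothetical. It assumes the names contain only plain ASCII letters, and that names such as "McDonald" do not need their internal capital preserved, since strproper() will not keep it.

          ```stata
          * Hypothetical toy data; only the variable names come from the thread
          clear
          input str10 Fname str10 Lname
          "BRUCE" "WAYNE"
          "Bruce" "Wayne"
          "bruce" "wayne"
          "Clark" "KENT"
          end

          * The function is evaluated for every observation, with no need
          * to type out each name; strtrim() removes stray outer blanks
          replace Fname = strproper(strtrim(Fname))
          replace Lname = strproper(strtrim(Lname))

          * Keep one observation per normalized first/last name pair
          duplicates drop Fname Lname, force
          list
          ```

          Overwriting the original variables with replace is a choice made for brevity here; generating new variables instead keeps the raw spellings available for checking.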
