  • Augmenting strproper to capitalize letter after Mc or Mac ?

    I am working with an all-caps mailing list provided by a direct mail vendor. The function strproper() does a great job with names containing hyphens (Catherine Zeta-Jones) and apostrophes (Miles O'Brien). However, it does not capitalize the letter following the common Irish prefix Mc (as in John McSorley).

    That is, gen newname = strproper(oldname) renders JOHN MCSORLEY as John Mcsorley.

    I have over 650 such names, so I am hoping not to do this manually. Any suggestions for workarounds or existing fixes would be welcome.

  • #2
    Here is one approach that inserts a hyphen before applying -strproper()- and subsequently deletes it.

    Code:
    clear
    input str50 name
    "john james"
    "bill mcarthur"
    "james doyle"
    "charles mcdonald"
    "James McSILVER"
    end
    
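    * lowercase the name and insert a hyphen after "mc" (only where whitespace precedes it)
    replace name= ustrregexra(lower(name),"(.*)(\s[m][c])([a-z])(.*)", "$1$2\-$3$4")
    * strproper() capitalizes the letter that follows a hyphen
    gen wanted= strproper(name)
    * remove the temporary hyphen again
    replace wanted=ustrregexra(wanted,"(.*)(\s[M][c])([\-])([A-Z])(.*)", "$1$2$4$5")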
    Res.:

    Code:
    . l
    
         +--------------------------------------+
         |              name             wanted |
         |--------------------------------------|
      1. |        john james         John James |
      2. |    bill mc-arthur      Bill McArthur |
      3. |       james doyle        James Doyle |
      4. | charles mc-donald   Charles McDonald |
      5. |   james mc-silver     James McSilver |
         +--------------------------------------+



    • #3
      This is a minefield. Quick impressions:

      1. If a name starts Mc, it's often followed by a capital. So far, so good.

      2. But watch out for roman numerals as McMlii isn't acceptable as a rendering of MCMLII. People interested in cricket will recognise MCC.

      3. If a name starts with Mac, sometimes you want this and sometimes you don't. Macdonald and MacDonald, Macintosh and MacIntosh, Mackay and MacKay?

      That's my guess at why the function doesn't do what you want.

      There must be more discussion of this somewhere.
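
      The three points above could be guarded against along these lines. This is only a rough sketch on made-up data: the variable names, the roman-numeral test, and the short list of Mac names to capitalize are all assumptions for illustration, and the roman-numeral guard is crude (it would also leave a real surname such as DIX untouched).

      ```stata
      clear
      input str20 oldname
      "MCSORLEY"
      "MCMLII"
      "MCC"
      "MACDONALD"
      "MACHADO"
      end

      gen newname = strproper(oldname)

      * Guard 1 (assumption): strings made only of roman-numeral letters stay as they are
      gen byte roman = ustrregexm(oldname, "^[IVXLCDM]+$")
      replace newname = oldname if roman

      * Mc followed by at least one more letter: capitalize the third letter
      replace newname = "Mc" + strproper(substr(newname, 3, .)) ///
          if !roman & substr(newname, 1, 2) == "Mc" & strlen(newname) > 2

      * Guard 2 (illustrative list): capitalize after Mac only for names listed explicitly
      replace newname = "Mac" + strproper(substr(newname, 4, .)) ///
          if inlist(newname, "Macdonald", "Mackay")

      list
      ```

      Keeping the Mac cases on an explicit list means names like Machado are left alone unless someone decides otherwise.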



      • #4
        Nick: Absolutely correct. Luckily, I have a column of surnames, so roman numerals are not an issue. Mac is indeed a minefield, with many impostor Macs like Machado. Luckily, I have just a few dozen and can fix them manually.

        Andrew: Your solution is more elegant than my initial fix, which required identifying all rows containing Mc* names by inspection.

        * Fixing Irish Mc* names (lastname must already be in proper case, e.g. Mcsorley)
        * first sort the data and manually identify the first and last rows of the Mc names
        sort lastname
        gen irish = 0
        replace irish = 1 if _n >=20130 & _n <=20782

        *First split lastname into two parts
        gen firstpart = substr(lastname, 1,2) if irish==1
        gen secondpart = substr(lastname, 3, 14) if irish==1
        replace secondpart = strproper(secondpart) if irish==1

        egen lastname_x = concat(firstpart secondpart) if irish==1
        replace lastname = lastname_x if irish==1
        drop firstpart secondpart lastname_x


        I'll use your approach as it's less dependent on human intervention in each implementation.

        Thanks to you both!
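
        One caveat for a surname-only column: the pattern in #2 expects whitespace before mc, so a surname at the very start of the string would not match. A possible adaptation, sketched here on made-up data (the variable names lastname and wanted are assumptions), lets the prefix match at the start of the string as well:

        ```stata
        clear
        input str20 lastname
        "MCSORLEY"
        "MCDONALD"
        "DOYLE"
        "ZETA-JONES"
        end

        * insert a temporary hyphen after mc at the start of the string or after whitespace
        gen wanted = ustrregexra(lower(lastname), "(^|\s)(mc)([a-z])", "$1$2-$3")
        * strproper() capitalizes the letter after the hyphen
        replace wanted = strproper(wanted)
        * remove the temporary hyphen again
        replace wanted = ustrregexra(wanted, "(^|\s)(Mc)-([A-Z])", "$1$2$3")

        list
        ```

        Unlike the row-range approach, this needs no sorting or manual inspection, though a genuine Mc-Something spelling in the source would lose its hyphen.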



        • #5
          I suspect the reason the mailing list is all caps to begin with is precisely to avoid problems with incorrect capitalization. Have you considered keeping it as is?



          • #6
            Just after my post I saw a false MacMillan for the book publisher, originally British (nay, Scottish) but with strong international presence.

            The British Prime Minister Harold Macmillan was from that family.

            Confusingly, or otherwise, the founders of the company were called MacMillan.
