
  • codes to spell out full name based on initial

    Hello Stata experts, please help me with a recoding challenge. The dataset contains multiple obs per name. For the same person, some obs have the last and first names fully spelled out, while others have the last name spelled out and only the initial of the first name. It looks like the following:

    lastname firstname
    smith michael
    smith m
    smith m
    smith michael
    johnson l
    johnson l
    johnson linda
    johnson l

    How do I recode firstname so that each observation has the first name spelled out? The dataset has about 200 names and 2200 obs.

    Thank you so much!

  • #2
    well, if we can be *certain* that "smith m" is Smith Michael, it could be as simple as:
    Code:
    gen fnlength=length(firstname)
    gsort lastname -fnlength
    replace firstname=firstname[_n-1] if lastname==lastname[_n-1]
    but this makes some heroic assumptions, including that last names are unique.



    • #3
      ps. I hope you have an ID variable apart from the names. In which case,
      Code:
      gen fnlength=length(firstname)
      gsort realID lastname -fnlength 
      replace firstname=firstname[_n-1] if realID==realID[_n-1]
      is better.
      Last edited by ben earnhart; 01 Aug 2014, 16:13.



      • #4
        Here is another heroic approach:

        Code:
        clear all
        input str20 lastname str20 firstname
        smith michael
        smith m
        smith m
        smith michael
        johnson l
        johnson l
        johnson linda
        johnson l
        end
        bysort lastname (firstname): gen newfirst = firstname[_N]
        list
        Only the bysort command is actually needed once you have the data.

        Other spelling inconsistencies could screw up any of these approaches, e.g. Mike, Michael, Mick.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)
        WWW: https://www3.nd.edu/~rwilliam



        • #5
          Ben and Richard, thank you both so much! Yes, all three approaches worked for people with unique last names. There are a few whose last names are identical, and a few whose last name and first-name initial are both identical. Is there any way to handle such situations? There is no ID variable. Given that this dataset is not large, I can do it manually, but I'd love to learn if there are cleverer ways to handle similar situations with a large dataset. Many thanks! Any advice is appreciated!



          • #6
            For the cases where the last name and first initial are identical, how do you know which full first name to use? If you can carefully describe the rules that you use to make the decision when doing the correction manually we can probably help you program it. Without more information, though, it seems like whatever choice you make to assign full names in those instances would be arbitrary.



            • #7
              I would be nervous about using an automated solution under the conditions you describe. Even if, as Sarah says, you could figure out the rules, it might take you far longer to program them than it would take to just fix the data manually.

              Tweaking my earlier code,

              Code:
              clear all
              input str20 lastname str20 firstname
              smith michael
              smith m
              smith m
              smith michael
              johnson l
              johnson l
              johnson linda
              johnson l
              davis r
              davis rich
              davis robert
              end
              gen initial = substr(firstname,1,1)
              bysort lastname initial (firstname): gen newfirst = firstname[_N]
              list, sepby(lastname)
              you see that it works ok when lastname & first initial are unique, but it breaks down when they aren't. I guess you could identify the breakdowns by looking at the listing; but I can't guarantee that there aren't other problems I am overlooking.

              Code:
              . list, sepby(lastname)
              
                   +------------------------------------------+
                   | lastname   firstn~e   initial   newfirst |
                   |------------------------------------------|
                1. |    davis          r         r     robert |
                2. |    davis       rich         r     robert |
                3. |    davis     robert         r     robert |
                   |------------------------------------------|
                4. |  johnson          l         l      linda |
                5. |  johnson          l         l      linda |
                6. |  johnson          l         l      linda |
                7. |  johnson      linda         l      linda |
                   |------------------------------------------|
                8. |    smith          m         m    michael |
                9. |    smith          m         m    michael |
               10. |    smith    michael         m    michael |
               11. |    smith    michael         m    michael |
                   +------------------------------------------+


              • #8
                Hi Sarah, I do that by looking at the courses they teach listed on their CVs (The dataset is about faculty course evaluations). So for example, for two faculty with same last name and same initial, I check the courseid and semester listed on their CV.

                Last   First    Courseid  Semester
                James  Patrick  EDUC100   Fall2010
                James  Patrick  EDUC300   Spring2011
                James  Peter    EDUC100   Spring2011
                James  Peter    EDUC200   Fall2010
                James  P        EDUC600   Spring2011
                James  P        EDUC100   Fall2011

                By checking their CVs, I was able to know the first James P was Patrick and the second was Peter. I almost feel doing this manually is the only viable option. But I hope I am wrong.
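
                If the CV information can be typed into a small lookup file, part of this check could perhaps be done with a merge. The following is only a minimal sketch under assumptions: the lookup dataset, the variable name cvfirst, and the two lookup rows are hypothetical (chosen to match the Patrick/Peter assignment described above), and it only helps where lastname, courseid and semester together point to a single instructor.

                ```stata
                clear all
                * Hypothetical lookup typed in from the CVs: who taught which course, and when
                input str20 lastname str20 cvfirst str10 courseid str12 semester
                james patrick EDUC600 Spring2011
                james peter   EDUC100 Fall2011
                end
                tempfile cv
                save `cv'

                clear
                * Evaluation records from the example above (names lowercased)
                input str20 lastname str20 firstname str10 courseid str12 semester
                james patrick EDUC100 Fall2010
                james patrick EDUC300 Spring2011
                james peter   EDUC100 Spring2011
                james peter   EDUC200 Fall2010
                james p       EDUC600 Spring2011
                james p       EDUC100 Fall2011
                end

                * m:1 requires lastname courseid semester to identify one row in the lookup
                merge m:1 lastname courseid semester using `cv', keep(master match) nogenerate
                * Fill in only bare initials, and only when the CV name starts with that initial
                replace firstname = cvfirst if length(firstname) == 1 & !missing(cvfirst) & substr(cvfirst, 1, 1) == firstname
                drop cvfirst
                list, sepby(lastname)
                ```

                Records with no row in the lookup are left untouched, so any remaining one-letter first names still need a manual look.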



                • #9
                  With my code, P, Patrick and Peter would all get coded as Peter. You could visually identify such cases. Robert and Bob would sneak by you though.

                  The last line of my code would be better as

                  Code:
                  list, sepby(lastname initial)
                  I suppose you could add additional error checking code if you have to do this a lot or have thousands of records.


                  • #10
                    Thank you Richard. What I plan to do is generate the full first name for those with a unique last name and initial, then check manually for errors among those with the same last name and same initial.
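
                    That plan might be sketched roughly as follows. This is an illustration under assumptions, not a tested solution for the real data: the variable names tagfull, nfull, candidate, newfirst and manual are made up here, and a first name of length 1 is assumed to be a bare initial.

                    ```stata
                    clear all
                    input str20 lastname str20 firstname
                    smith michael
                    smith m
                    johnson l
                    johnson linda
                    james patrick
                    james peter
                    james p
                    end
                    gen initial = substr(firstname, 1, 1)
                    * Tag one record per distinct spelled-out first name within lastname/initial
                    egen byte tagfull = tag(lastname initial firstname) if length(firstname) > 1
                    * Number of distinct spelled-out first names in each lastname/initial group
                    bysort lastname initial: egen nfull = total(tagfull)
                    * Last name in sort order within the group; the bare initial sorts first
                    bysort lastname initial (firstname): gen candidate = firstname[_N]
                    gen newfirst = firstname
                    * Fill in a bare initial only when exactly one full name is available
                    replace newfirst = candidate if length(firstname) == 1 & nfull == 1
                    * Bare initials that could not be resolved automatically: check these by hand
                    gen byte manual = length(firstname) == 1 & nfull != 1
                    list lastname firstname newfirst manual, sepby(lastname initial)
                    ```

                    Here "james p" keeps its initial and is flagged in manual, because both patrick and peter are candidates. Spelling variants such as mike and michael would also count as two names and send the group to manual review.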



                    • #11
                      It occurs to me that the above code (or at least mine) will still cause Patrick and Peter to get recoded as Peter. It could probably be fixed though, e.g. only change the name when the first name is only one letter long.


                      • #12
                        Thank you Richard! You are very helpful!



                        • #13
                          This may be slightly better. It only changes the first name if the original first name was only a single letter (but that will create problems if the first name was actually the first two initials). It also creates a variable called namechange that lets you easily see what records were changed.

                          Code:
                          clear all
                          input str20 lastname str20 firstname
                          smith michael
                          smith m
                          smith m
                          smith michael
                          johnson l
                          johnson l
                          johnson linda
                          johnson l
                          davis r
                          davis rich
                          davis robert
                          james patrick
                          james peter
                          james p
                          end
                          * Preserve original ordering of cases in case it is needed
                          gen nrec = _n
                          gen initial = substr(firstname,1,1)
                          bysort lastname initial (firstname): gen newfirst = firstname[_N]
                          * Only change the first name if an initial only was used
                          * This will create different problems if two initials were used!
                          replace newfirst = firstname if length(firstname) > 1
                          gen namechange = firstname != newfirst
                          list, sepby(lastname initial)
                          list if namechange
                          Code:
                          . list, sepby(lastname initial)
                          
                               +------------------------------------------------------------+
                               | lastname   firstn~e   nrec   initial   newfirst   namech~e |
                               |------------------------------------------------------------|
                            1. |    davis          r      9         r     robert          1 |
                            2. |    davis       rich     10         r       rich          0 |
                            3. |    davis     robert     11         r     robert          0 |
                               |------------------------------------------------------------|
                            4. |    james          p     14         p      peter          1 |
                            5. |    james    patrick     12         p    patrick          0 |
                            6. |    james      peter     13         p      peter          0 |
                               |------------------------------------------------------------|
                            7. |  johnson          l      5         l      linda          1 |
                            8. |  johnson          l      8         l      linda          1 |
                            9. |  johnson          l      6         l      linda          1 |
                           10. |  johnson      linda      7         l      linda          0 |
                               |------------------------------------------------------------|
                           11. |    smith          m      3         m    michael          1 |
                           12. |    smith          m      2         m    michael          1 |
                           13. |    smith    michael      4         m    michael          0 |
                           14. |    smith    michael      1         m    michael          0 |
                               +------------------------------------------------------------+
                          
                          . list if namechange
                          
                               +------------------------------------------------------------+
                               | lastname   firstn~e   nrec   initial   newfirst   namech~e |
                               |------------------------------------------------------------|
                            1. |    davis          r      9         r     robert          1 |
                            4. |    james          p     14         p      peter          1 |
                            7. |  johnson          l      5         l      linda          1 |
                            8. |  johnson          l      8         l      linda          1 |
                            9. |  johnson          l      6         l      linda          1 |
                               |------------------------------------------------------------|
                           11. |    smith          m      3         m    michael          1 |
                           12. |    smith          m      2         m    michael          1 |
                               +------------------------------------------------------------+


                          • #14
                            Thanks Richard! I didn't know the bysort command until I read your code. Very helpful. Many thanks.
