  • Most common word from string variable

    Hi all -- probably an easy question, but I can't seem to figure it out. I have one string variable composed of a number of names and would like to generate a new variable containing the most common name from the other variable.

    So if var1 = "John John James John William Sam Sarah John"

    I would like var2 to contain "John"

    Any way to take the word mode from a single variable?

  • #2
    Here is one technique:

    Code:
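    clear
    set obs 2
    
    * example data: one string of names per observation
    gen var1 = "John John James John William Sam Sarah John" in 1
    replace var1 = "Michelle Michelle Martin" in 2
    
    gen long id = _n
    * one variable per word: var11, var12, ...
    split var1
    drop var1
    * one observation per word within each id
    reshape long var1, i(id) j(which)
    drop if missing(var1)
    * frequency of each word within id
    bysort id var1 : gen freq = _N
    * after sorting on freq, the last observation within id holds a most frequent word
    bysort id (freq) : gen mode = var1[_N]
    drop freq
    * back to one observation per id
    reshape wide var1, i(id) j(which)
    
    list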


    • #3
      Beautiful, thanks!


      • #4
        Just one caution before you implement Nick's advice. In situations like this, it commonly happens that in some observations more than one word is tied for most common. For example, if you had an observation with var1 = "John John James John James James William Sam Sarah", John and James would both have "the" most occurrences. Nick's code will select one of those at random, and the choice may not be reproducible when you rerun the code. If that is fine with you, then proceed.

        If not, you may need to come up with a rule for breaking this kind of tie (e.g. the alphabetically first, or last, or some other scheme). In that case, the code needs to be modified accordingly. You can post back for help if that's needed and you don't see how to do it yourself.
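
        A minimal sketch of one such rule, assuming the alphabetically first word should win a tie. The stub w and the helper variable negfreq are names made up for this illustration; the rest follows the same split-and-reshape approach as #2. Treat it as a starting point rather than a definitive solution.

        ```stata
        * Sketch only: the tie-breaking rule (alphabetically first) is an assumption
        clear
        set obs 2
        gen var1 = "John John James John James James William Sam Sarah" in 1
        replace var1 = "Michelle Michelle Martin" in 2

        gen long id = _n
        split var1, generate(w)          // w1, w2, ... hold the individual words
        reshape long w, i(id) j(which)
        drop if missing(w)

        bysort id w : gen freq = _N      // occurrences of each word within id
        gen negfreq = -freq              // so that the most frequent word sorts first
        * within id: highest frequency first, then alphabetical order among ties
        bysort id (negfreq w) : gen mode = w[1]

        by id : keep if _n == 1          // back to one observation per id
        keep id var1 mode
        list
        ```

        Because the sort order within id is fully determined by negfreq and w, the result does not depend on how Stata happens to order tied observations. To prefer the alphabetically last word instead, sort on (freq w) and take w[_N].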


        • #5
          Ah thanks, good point. I think in my situation there should always be a clear winner (the original variable has first and last names for a family, so the last name consistently emerges with Nick's code).


          • #6
            I have never used it seriously, but I think txttool may help you with this.


            • #7
              A simple example using txttool follows. But I guess this might be a problem if you have too many different names in your dataset, since the code below relies on one w_* count variable per distinct word.

              Code:
              clear
              set obs 2
              gen var1 = "John John James John William Sam Sarah John" in 1
              replace var1 = "Michelle Michelle Martin Martin William" in 2
              txttool var1 , replace bagwords
              egen rowmax=rowmax(w_*)
              gen var2=""
              gen var3=""
              foreach x of var w_* {
                  // var2: similar to Nick's code, although it keeps only the last most frequent name found
                  replace var2=substr("`x'",3,.) if `x'==rowmax
                  // var3: keeps all tied most frequent names found
                  replace var3=var3+" "+substr("`x'",3,.) if `x'==rowmax
              }
              list var2 var3 w*, table
              Last edited by Julio Raffo; 17 Jan 2017, 00:56. Reason: mispelling


              • #8
                Julio: Please give the provenance of user-written programs you cite.

                SJ-14-4 dm0077 . . . . . . . . txttool: Utilities for text analysis in Stata
                (help txttool if installed) . . . . . . U. Williams and S. P. Williams
                Q4/14 SJ 14(4):817--829
                provides tools for managing free-form text
