Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Adding a suffix to a string variable under specific conditions

    Hello Statalist,

    Using Stata 17, I need to take a string variable (chid) and clean it up so I can merge it with another dataset. I see two problems in the data below.

    Code:
    input str52 name str128 chid
    "William Juckes"                  "9216678"          
    "Nicholas Charalambous"           "8490515"          
    "Kirsty Senior"                   "8452449"          
    "Oliver Adkins"                   "10966959"          
    "Alec Ramsey"                     "10966959"          
    "Ryan Burton"                     "SC565616"          
    "Flemming Andersen"               "10670173"          
    "Tim Barclay"                     "9417237"          
    "John Matthew Martin"             "11220400"          
    "Fotios Talantzis"                "10313940"          
    "Iraklis Bourantas"               "10313940"          
    "Amanda Harrington"               "11249637"          
    "Amaury De Closset"               "10653083"          
    "Daniel Morton"                   "10653083"          
    "Joy Foster"                      "10418960"          
    "Natasha Elizabeth Helena Thomas" "5729837"          
    "Steven Tyson"                    "9241863"          
    "Chao Liu"                        "9250295"          
    "Alexey Chudnovsky"               "5269210"          
    "Andrew Orrock"                   "11296834, 09235524"
    1. chid is sometimes 7 characters and other times 8, but in the other dataset my equivalent variable adds a suffix 0 in front of observations which are only 7 characters. In other words, row 1 (William Juckes) should be 09216678. But row 4 (Oliver Adkins) should not be altered.
    2. chid sometimes lists two sets of codes - ie in the last row of the toy data, 11296834, 09235524. In this case I want to generate a duplicate row, except each row should only contain a single code in chid. The maximum number of chid codes is 2.

    I'm not sure how to accomplish these tasks. Any advice very much welcome!

    Thanks.

  • #2
    In post #1, problem 1 is adding a prefix. A suffix is whatever occurs at the end.

    Code:
    clear 
    input str52 name str128 chid
    "William Juckes"                  "9216678"          
    "Nicholas Charalambous"           "8490515"          
    "Kirsty Senior"                   "8452449"          
    "Oliver Adkins"                   "10966959"          
    "Alec Ramsey"                     "10966959"          
    "Ryan Burton"                     "SC565616"          
    "Flemming Andersen"               "10670173"          
    "Tim Barclay"                     "9417237"          
    "John Matthew Martin"             "11220400"          
    "Fotios Talantzis"                "10313940"          
    "Iraklis Bourantas"               "10313940"          
    "Amanda Harrington"               "11249637"          
    "Amaury De Closset"               "10653083"          
    "Daniel Morton"                   "10653083"          
    "Joy Foster"                      "10418960"          
    "Natasha Elizabeth Helena Thomas" "5729837"          
    "Steven Tyson"                    "9241863"          
    "Chao Liu"                        "9250295"          
    "Alexey Chudnovsky"               "5269210"          
    "Andrew Orrock"                   "11296834, 09235524"
    end 
    
    split chid, parse(,)
    expand 2 if strpos(chid, ",") 
    bysort chid : replace chid = cond(_n == 1, chid1, chid2) if strpos(chid, ",")
    replace chid = "0" + chid if strlen(chid) == 7 
    list, sepby(chid)
    Some people have the same chid. Is that a problem?

    Comment


    • #3
      Many thanks Nick. Yes - sorry prefix, not suffix.

      It's not a problem that name-chid is 1:m - chid represents companies, so the same company can have multiple named people (founders) attached.

      When I run the code you sent, the prefix problem is solved. But for the second problem, I now have two rows of Andrew Orrock, but each with a chid variable that lists both values. Instead what I need is each row with one distinct value.

      Suggestions?

      Comment


      • #4
        I read #1 as asking for a duplicate row (meaning, observation) to be created (meaning, added to the dataset).

        If you want only one observation, you need a rule for which identifier to be kept, or equivalently which to be dropped. What rule do you have in mind?

        Comment


        • #5
          Yes, I need a duplicate observation - except for chid - which should only contain one of the two chid values. So it should look like

          Code:
           
           "Andrew Orrock"                   "11296834"  
           "Andrew Orrock"                   "09235524"

          Comment


          • #6
            Erm, I think my code already does this.

            Comment


            • #7
              Sorry, not sure why this did not work the first time - you're right!

              Comment

              Working...
              X