Adding a suffix to a string variable under specific conditions

Tom Kemeny

Join Date: Apr 2016
Posts: 52

Adding a suffix to a string variable under specific conditions

31 Jul 2024, 12:54

Hello Statalist,

Using Stata 17, I need to take a string variable (chid) and clean it up so I can merge it with another dataset. I see two problems in the data below.

Code:

input str52 name str128 chid
"William Juckes"                  "9216678"          
"Nicholas Charalambous"           "8490515"          
"Kirsty Senior"                   "8452449"          
"Oliver Adkins"                   "10966959"          
"Alec Ramsey"                     "10966959"          
"Ryan Burton"                     "SC565616"          
"Flemming Andersen"               "10670173"          
"Tim Barclay"                     "9417237"          
"John Matthew Martin"             "11220400"          
"Fotios Talantzis"                "10313940"          
"Iraklis Bourantas"               "10313940"          
"Amanda Harrington"               "11249637"          
"Amaury De Closset"               "10653083"          
"Daniel Morton"                   "10653083"          
"Joy Foster"                      "10418960"          
"Natasha Elizabeth Helena Thomas" "5729837"          
"Steven Tyson"                    "9241863"          
"Chao Liu"                        "9250295"          
"Alexey Chudnovsky"               "5269210"          
"Andrew Orrock"                   "11296834, 09235524"

1. chid is sometimes 7 characters and other times 8, but in the other dataset my equivalent variable adds a suffix 0 in front of observations which are only 7 characters. In other words, row 1 (William Juckes) should be 09216678. But row 4 (Oliver Adkins) should not be altered.
2. chid sometimes lists two sets of codes - ie in the last row of the toy data, 11296834, 09235524. In this case I want to generate a duplicate row, except each row should only contain a single code in chid. The maximum number of chid codes is 2.

I'm not sure how to accomplish these tasks. Any advice very much welcome!

Thanks.

Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35698

31 Jul 2024, 13:16

In post #1, problem 1 is adding a prefix. A suffix is whatever occurs at the end.

Code:

clear 
input str52 name str128 chid
"William Juckes"                  "9216678"          
"Nicholas Charalambous"           "8490515"          
"Kirsty Senior"                   "8452449"          
"Oliver Adkins"                   "10966959"          
"Alec Ramsey"                     "10966959"          
"Ryan Burton"                     "SC565616"          
"Flemming Andersen"               "10670173"          
"Tim Barclay"                     "9417237"          
"John Matthew Martin"             "11220400"          
"Fotios Talantzis"                "10313940"          
"Iraklis Bourantas"               "10313940"          
"Amanda Harrington"               "11249637"          
"Amaury De Closset"               "10653083"          
"Daniel Morton"                   "10653083"          
"Joy Foster"                      "10418960"          
"Natasha Elizabeth Helena Thomas" "5729837"          
"Steven Tyson"                    "9241863"          
"Chao Liu"                        "9250295"          
"Alexey Chudnovsky"               "5269210"          
"Andrew Orrock"                   "11296834, 09235524"
end 

split chid, parse(,)
expand 2 if strpos(chid, ",") 
bysort chid : replace chid = cond(_n == 1, chid1, chid2) if strpos(chid, ",")
replace chid = "0" + chid if strlen(chid) == 7 
list, sepby(chid)

Some people have the same chid. Is that a problem?

Comment

Tom Kemeny

Join Date: Apr 2016

Posts: 52
#3

01 Aug 2024, 05:57

Many thanks Nick. Yes - sorry prefix, not suffix.

It's not a problem that name-chid is 1:m - chid represents companies, so the same company can have multiple named people (founders) attached.

When I run the code you sent, the prefix problem is solved. But for the second problem, I now have two rows of Andrew Orrock, but each with a chid variable that lists both values. Instead what I need is each row with one distinct value.

Suggestions?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

01 Aug 2024, 06:09

I read #1 as asking for a duplicate row (meaning, observation) to be created (meaning, added to the dataset).

If you want only one observation, you need a rule for which identifier to be kept, or equivalently which to be dropped. What rule do you have in mind?
Comment
Tom Kemeny

Join Date: Apr 2016

Posts: 52
#5

01 Aug 2024, 06:31

Yes, I need a duplicate observation - except for chid - which should only contain one of the two chid values. So it should look like

Code:

"Andrew Orrock" "11296834" "Andrew Orrock" "09235524"
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

01 Aug 2024, 06:47

Erm, I think my code already does this.
Comment
Tom Kemeny

Join Date: Apr 2016

Posts: 52
#7

01 Aug 2024, 06:49

Sorry, not sure why this did not work the first time - you're right!
Comment

Announcement

Adding a suffix to a string variable under specific conditions

Comment

Comment

Comment

Comment

Comment

Comment