Delete only part of a string variable

Morten Hans Jensen

Join Date: Apr 2022
Posts: 28

30 Nov 2022, 06:48

Hi Statalist

I want to delete all the brackets and double quotes from the following:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str9 diagnosis
`"["99"]"'   
`"["99"]"'   
`"["5"]"'    
`"["5"]"'    
`"["9"]"'    
`"["5"]"'    
`"["5"]"'    
`"["9"]"'    
`"["5"]"'    
`"["5"]"'    
`"["5"]"'    
`"["5"]"'    
`"["5"]"'    
`"["5"]"'    
`"["5"]"'    
`"["6"]"'    
`"["9"]"'    
`"["6"]"'    
`"["6"]"'    
`"["6"]"'    
`"["6"]"'    
`"["6"]"'    
`"["6"]"'    
`"["6"]"'    
`"["6"]"'    
""           
`"["6"]"'    
`"["1"]"'    
`"["5"]"'    
`"["5","6"]"'
`"["6"]"'    
`"["6"]"'    
end

I want to end up with only a numeric variable. I am unsure whether that is possible when a single observation contains multiple numbers. Each number represents a diagnosis.

William Lisowski

Join Date: Dec 2014
Posts: 10150

30 Nov 2022, 07:38

You cannot have multiple numbers in a single numeric variable. Perhaps this example code will start you in a useful direction.

Code:

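* Keep only digits and commas; every other character is removed
generate diag = ustrregexra(diagnosis,`"[^\d,]"',"")
* Split on commas into diag1, diag2, ... and convert each piece to numeric
split diag, parse(",") destring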
describe *
list in 25/l, clean

Code:

. describe *

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------------
diagnosis       str9    %9s                  
diag            str3    %9s                  
diag1           byte    %10.0g                
diag2           byte    %10.0g                

. list in 25/l, clean

       diagnosis   diag   diag1   diag2  
 25.       ["6"]      6       6       .  
 26.                          .       .  
 27.       ["6"]      6       6       .  
 28.       ["1"]      1       1       .  
 29.       ["5"]      5       5       .  
 30.   ["5","6"]    5,6       5       6  
 31.       ["6"]      6       6       .  
 32.       ["6"]      6       6       .  

.

Added in edit: I simplified the ustrregexra() second argument from what I originally posted. This function - admittedly incomprehensible to the novice - deletes every character that is neither a digit nor a comma. You may find the following code easier to follow, and more instructive in Stata basics.

Code:

generate diag = diagnosis
replace diag = subinstr(diag,"[","",.)
replace diag = subinstr(diag,"]","",.)
replace diag = subinstr(diag,`"""',"",.)
split diag, parse(",") destring
describe *
list in 25/l, clean

The results are effectively identical on your example data. The trickiest part of the new code is the third replace command, where compound double quotes (`" and "') surround a string consisting of a single double quote. For details on quoting in Stata, see the output of

Code:

help quotes
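
As a quick check, both approaches can be tried on a single literal value before running them on the whole variable. This is only a sketch: the local macro name raw is illustrative and not part of the example data, and ustrregexra() is assumed to be available (Stata 14 or later).

```stata
* Illustrative check on one literal value (the macro name "raw" is hypothetical)
local raw `"["5","6"]"'

* Regular-expression approach: drop everything that is not a digit or a comma
display ustrregexra(`"`raw'"', `"[^\d,]"', "")

* subinstr() approach: remove each unwanted character in turn
display subinstr(subinstr(subinstr(`"`raw'"', "[", "", .), "]", "", .), `"""', "", .)
```

Both display commands should show 5,6, which is the value that split then separates into diag1 and diag2.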

Last edited by William Lisowski; 30 Nov 2022, 07:49.
