removing coma decimal seperator from a string variable

Mina Wu

Join Date: Jul 2015

Posts: 79
#1

removing coma decimal seperator from a string variable

05 Nov 2016, 05:02

Dear all,

I would like to destring string variable, which contains comma as a decimal separator .However, when I do that, STATA creates a number which ignores all the values following comma. Hence If I have a string variable containing observation "123,456" and then apply destring command, STATA creates a new variable which has observation 123 for the particular observation.

When I copy/paste the dataset into excel and remove "comma" as a decimator separator, and then create .dta file from such dataset, STATA does the destring job correctly..

Does anyone know how I can remove comma from the string variables?

Thank you!
Tags: None
Mathias Pedersen Heinze

Join Date: Jun 2015

Posts: 78
#2

05 Nov 2016, 05:10

Hi Mina,

Have you tried the dpcomma option? It converts variables with commas as decimals to period-decimal format.

Code:

destring varlist, dpcomma
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35756
#3

05 Nov 2016, 05:47

Matthias is right. If the comma is decimal separator, tell that to destring.

I am nevertheless concerned about the question. The claim in #1 is not supported.

Code:

. clear . set obs 1 obs was 0, now 1 . gen mystring = "123,456" . destring mystring, replace mystring contains nonnumeric characters; no replace . destring mystring, replace force mystring contains nonnumeric characters; replaced as byte (1 missing value generated) . gen mystring2 = "123,456" . destring mystring2, replace ignore(,) mystring2: characters , removed; replaced as long . l +---------------------+ | mystring mystri~2 | |---------------------| 1. | . 123456 | +---------------------+

I find it hard to see that destring would ever transform "123,456" to 123. Even if you instruct it to ignore the comma, that won't happen.

So, please give a reproducible example where (you think) this is happening. Perhaps you are doing something else like using split, destring. If so, that's quite wrong for this problem.

Declaration of interest: I was the original author of destring and split.

Last edited by Nick Cox; 05 Nov 2016, 05:50.
1 like
Comment

Mina Wu

Join Date: Jul 2015
Posts: 79

05 Nov 2016, 06:46

Nick, Mathias-thank you for the answers!

I got the following answers from STATA when appling your suggestions:

. destring var9, replace dpcomma
var9 contains nonnumeric characters; no replace

. destring var9, replace ignore(,)
var9 contains characters not specified in ignore(); no replace

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 var9
"3,978,160"
"3,697,162"
"1,744,745"
"1,650,939"
"669,193"  
"853,853"  
"570,510"  
"426,758"  
"401,937"  
"117,702"  
"640,869"  
"391,752"  
"529,779"  
"355,714"  
"479,347"  
"360,819"  
"299,271"  
"392,697"  
"357,198"  
"249,367"  
"360,346"  
"333,024"  
"264,899"  
"288,910"  
"238,367"  
"258,043"  
"258,667"  
"277,466"  
"249,872"  
"215,375"  
"184,083"  
"189,482"  
"n.a."     
"211,134"  
"229,383"  
"238,283"  
"222,839"  
"211,831"  
"215,869"  
"210,536"  
"224,874"  
"199,937"  
"205,534"  
"205,168"  
"195,808"  
"157,561"  
"182,946"  
"181,374"  
"117,415"  
"173,373"  
"133,319"  
"204,879"  
"196,787"  
"143,232"  
"146,670"  
"171,285"  
"166,019"  
"142,563"  
"163,723"  
"106,769"  
"98,158"   
"2,171"    
"82,394"   
"87,173"   
"35,588"   
"107,793"  
"115,325"  
"152,850"  
"122,731"  
"86,968"   
"101,064"  
"91,657"   
"83,291"   
"114,078"  
"91,757"   
"144,402"  
"113,514"  
"70,145"   
"89,636"   
"88,889"   
"93,438"   
"114,583"  
"128,079"  
"114,372"  
"85,539"   
"101,668"  
"103,125"  
"106,886"  
"98,735"   
"127,510"  
"96,264"   
"85,052"   
"107,921"  
"97,234"   
"88,276"   
"75,317"   
"83,433"   
"71,070"   
"75,741"   
"83,542"   
end

Nick, I have not actually used destring in the "conventional way. What seemed to work better for me was the following command:

Code:

 foreach var of varlist var9-var12 {
destring `var', replace force
}

I am getting back to this dataset after sometime, but if my memory serves me well, using "destring" solely produced many missing if not all.

Comment

Mathias Pedersen Heinze

Join Date: Jun 2015

Posts: 78
#5

05 Nov 2016, 07:11

For your data example in #4, the following command should ignore the commas while replacing the n.a.s with missings:

Code:

destring var9, replace ignore(",") force
Comment
Mina Wu

Join Date: Jul 2015

Posts: 79
#6

05 Nov 2016, 07:26

This finally worked! Thanks!!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35756
#7

05 Nov 2016, 10:47

All you report is exactly as designed and documented.

The option force doesn't make destring smarter; it makes it more brutal. That alone does what you want with "n.a." but the commas must be separately handled, again as Matthias suggested.

Note that there is absolutely no need to call destring within a loop; it takes a varlist.

Code:

destring var0-var12, replace force ignore(,)

The claim in #1 is unsubstantiated and I believe it to be false.

See also http://www.statalist.org/forums/help#spelling on the spelling Stata.
Comment
Mina Wu

Join Date: Jul 2015

Posts: 79
#8

12 Nov 2016, 11:50

Nick, thanks! I will try to detect what went wrong and let you know. I know that after certain version of destring command, maximum value was 999 (the maximum value prior comma). I have to go back to that.
Comment
Brittany Krzyzanowski

Join Date: Apr 2020

Posts: 4
#9

19 Oct 2021, 08:46

I have a similar problem and Nick's code works nearly perfectly for my ICD-9 data, but I wonder if there is something I can do in order to keep the leading zeros in my string? Is it impossible to have a numeric variable with leading zeros? Perhaps there is a way to remove the decimal from a string variable in which case I wouldn't require the destring piece and the leading zeros would remain. EDIT: I figured it out and posted below. I can't figure out how to delete my post, so hopefully this is useful to someone.

replace oldstring = subinstr(oldstring, ".", "",.)

Last edited by Brittany Krzyzanowski; 19 Oct 2021, 08:55.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35756
#10

19 Oct 2021, 09:32

You can't delete posts: please see FAQ Advice.

You can **see** leading zeros with numeric variables or values with an appropriate display format:

Code:

. di %06.0f 123 000123

If it's important that you store leading zeros, you might need to stay string.
1 like
Comment
Brittany Krzyzanowski

Join Date: Apr 2020

Posts: 4
#11

20 Oct 2021, 10:01

That is good to know. I was able to format my string ICD9 variable into a new string var with no decimal while keeping the leading zeros using the code below. This works for ICD-9 codes where the first character is not alpha (E or V for example).

Code:

gen dxcode = ICD_code replace dxcode = subinstr(dxcode, ".", "",.) replace dxcode = dxcode + substr("00000", 1, 5 - length(dxcode))
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35756
#12

20 Oct 2021, 11:03

That code adds trailing zeros.
Comment
tig som

Join Date: Sep 2022

Posts: 58
#13

29 Feb 2024, 08:51

I have a similar question, but I want to remove only the comma from the beginning of my string variable. For example, if one of the values of the string variable is ", 53, 87, 88", I want to change it to "53, 87, 88", removing only the commas at the start of the string values.
How should I rearrange the above command? Do you have a simple suggestion?
Thank you very much!
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10259

#14

29 Feb 2024, 09:17

If the comma is in the beginning of the string, then it is the first character. You can replace the observations with the string minus the first character if the first character is a comma. Regular expressions solutions are also possible.

Code:

clear
input str29 example
", 53, 87, 88"
" ,sun, 53, 87, 88"
"frog, newt, salamander"
end

gen wanted1= cond(substr(trim(example), 1, 1)==",", substr(trim(example), 2, .), trim(example))
gen wanted2= ustrregexra(trim(example), "^,(.*)$", "$1")
assert wanted1== wanted2

Res.:

Code:

. assert wanted1== wanted2

. l

     +--------------------------------------------------------------------------+
     |                example                  wanted1                  wanted2 |
     |--------------------------------------------------------------------------|
  1. |           , 53, 87, 88               53, 87, 88               53, 87, 88 |
  2. |       ,sun, 53, 87, 88          sun, 53, 87, 88          sun, 53, 87, 88 |
  3. | frog, newt, salamander   frog, newt, salamander   frog, newt, salamander |
     +--------------------------------------------------------------------------+

Comment

Mike Lacy

Join Date: Apr 2014

Posts: 2421
#15

29 Feb 2024, 09:24

Let me add to Andrew's useful suggestion the more global note that Stata has a rich and relatively easy to use collection of string functions which, among many other things, will enable substitution or removal of characters in strings. I commonly see questions that make me think people don't realize that string functions exist. They seem to be less known than other kinds of functions in Stata.

So, for anyone who needs to work with string variables, I'd make the suggestion to look through -help string functions-. The object is not to remember the use or even existence of every function in that collection, but rather to give a sense of the kinds of things that can be done so that now and sometime in the future, one can look at them again and see if they will help.
2 likes
Comment

Announcement