Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • removing coma decimal seperator from a string variable

    Dear all,

    I would like to destring string variable, which contains comma as a decimal separator .However, when I do that, STATA creates a number which ignores all the values following comma. Hence If I have a string variable containing observation "123,456" and then apply destring command, STATA creates a new variable which has observation 123 for the particular observation.

    When I copy/paste the dataset into excel and remove "comma" as a decimator separator, and then create .dta file from such dataset, STATA does the destring job correctly..

    Does anyone know how I can remove comma from the string variables?

    Thank you!

  • #2
    Hi Mina,

    Have you tried the dpcomma option? It converts variables with commas as decimals to period-decimal format.
    Code:
    destring varlist, dpcomma

    Comment


    • #3
      Matthias is right. If the comma is decimal separator, tell that to destring.

      I am nevertheless concerned about the question. The claim in #1 is not supported.

      Code:
      . clear
      
      . set obs 1
      obs was 0, now 1
      
      . gen mystring = "123,456"
      
      . destring mystring, replace
      mystring contains nonnumeric characters; no replace
      
      . destring mystring, replace force
      mystring contains nonnumeric characters; replaced as byte
      (1 missing value generated)
      
      . gen mystring2 = "123,456"
      
      . destring mystring2, replace ignore(,)
      mystring2: characters , removed; replaced as long
      
      . l
      
           +---------------------+
           | mystring   mystri~2 |
           |---------------------|
        1. |        .     123456 |
           +---------------------+
      I find it hard to see that destring would ever transform "123,456" to 123. Even if you instruct it to ignore the comma, that won't happen.

      So, please give a reproducible example where (you think) this is happening. Perhaps you are doing something else like using split, destring. If so, that's quite wrong for this problem.

      Declaration of interest: I was the original author of destring and split.
      Last edited by Nick Cox; 05 Nov 2016, 05:50.

      Comment


      • #4
        Nick, Mathias-thank you for the answers!

        I got the following answers from STATA when appling your suggestions:


        . destring var9, replace dpcomma
        var9 contains nonnumeric characters; no replace

        . destring var9, replace ignore(,)
        var9 contains characters not specified in ignore(); no replace

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str9 var9
        "3,978,160"
        "3,697,162"
        "1,744,745"
        "1,650,939"
        "669,193"  
        "853,853"  
        "570,510"  
        "426,758"  
        "401,937"  
        "117,702"  
        "640,869"  
        "391,752"  
        "529,779"  
        "355,714"  
        "479,347"  
        "360,819"  
        "299,271"  
        "392,697"  
        "357,198"  
        "249,367"  
        "360,346"  
        "333,024"  
        "264,899"  
        "288,910"  
        "238,367"  
        "258,043"  
        "258,667"  
        "277,466"  
        "249,872"  
        "215,375"  
        "184,083"  
        "189,482"  
        "n.a."     
        "211,134"  
        "229,383"  
        "238,283"  
        "222,839"  
        "211,831"  
        "215,869"  
        "210,536"  
        "224,874"  
        "199,937"  
        "205,534"  
        "205,168"  
        "195,808"  
        "157,561"  
        "182,946"  
        "181,374"  
        "117,415"  
        "173,373"  
        "133,319"  
        "204,879"  
        "196,787"  
        "143,232"  
        "146,670"  
        "171,285"  
        "166,019"  
        "142,563"  
        "163,723"  
        "106,769"  
        "98,158"   
        "2,171"    
        "82,394"   
        "87,173"   
        "35,588"   
        "107,793"  
        "115,325"  
        "152,850"  
        "122,731"  
        "86,968"   
        "101,064"  
        "91,657"   
        "83,291"   
        "114,078"  
        "91,757"   
        "144,402"  
        "113,514"  
        "70,145"   
        "89,636"   
        "88,889"   
        "93,438"   
        "114,583"  
        "128,079"  
        "114,372"  
        "85,539"   
        "101,668"  
        "103,125"  
        "106,886"  
        "98,735"   
        "127,510"  
        "96,264"   
        "85,052"   
        "107,921"  
        "97,234"   
        "88,276"   
        "75,317"   
        "83,433"   
        "71,070"   
        "75,741"   
        "83,542"   
        end
        Nick, I have not actually used destring in the "conventional way. What seemed to work better for me was the following command:

        Code:
         foreach var of varlist var9-var12 {
        destring `var', replace force
        }
        I am getting back to this dataset after sometime, but if my memory serves me well, using "destring" solely produced many missing if not all.

        Comment


        • #5
          For your data example in #4, the following command should ignore the commas while replacing the n.a.s with missings:
          Code:
          destring var9, replace ignore(",") force

          Comment


          • #6
            This finally worked! Thanks!!

            Comment


            • #7
              All you report is exactly as designed and documented.

              The option force doesn't make destring smarter; it makes it more brutal. That alone does what you want with "n.a." but the commas must be separately handled, again as Matthias suggested.

              Note that there is absolutely no need to call destring within a loop; it takes a varlist.

              Code:
              destring var0-var12, replace force ignore(,)
              The claim in #1 is unsubstantiated and I believe it to be false.

              See also http://www.statalist.org/forums/help#spelling on the spelling Stata.

              Comment


              • #8
                Nick, thanks! I will try to detect what went wrong and let you know. I know that after certain version of destring command, maximum value was 999 (the maximum value prior comma). I have to go back to that.

                Comment


                • #9
                  I have a similar problem and Nick's code works nearly perfectly for my ICD-9 data, but I wonder if there is something I can do in order to keep the leading zeros in my string? Is it impossible to have a numeric variable with leading zeros? Perhaps there is a way to remove the decimal from a string variable in which case I wouldn't require the destring piece and the leading zeros would remain. EDIT: I figured it out and posted below. I can't figure out how to delete my post, so hopefully this is useful to someone.

                  replace oldstring = subinstr(oldstring, ".", "",.)
                  Last edited by Brittany Krzyzanowski; 19 Oct 2021, 08:55.

                  Comment


                  • #10
                    You can't delete posts: please see FAQ Advice.

                    You can **see** leading zeros with numeric variables or values with an appropriate display format:

                    Code:
                    . di %06.0f 123
                    000123

                    If it's important that you store leading zeros, you might need to stay string.

                    Comment


                    • #11
                      That is good to know. I was able to format my string ICD9 variable into a new string var with no decimal while keeping the leading zeros using the code below. This works for ICD-9 codes where the first character is not alpha (E or V for example).

                      Code:
                      gen dxcode = ICD_code
                      replace dxcode = subinstr(dxcode, ".", "",.) 
                      replace dxcode = dxcode + substr("00000", 1, 5 - length(dxcode))

                      Comment


                      • #12
                        That code adds trailing zeros.

                        Comment


                        • #13
                          I have a similar question, but I want to remove only the comma from the beginning of my string variable. For example, if one of the values of the string variable is ", 53, 87, 88", I want to change it to "53, 87, 88", removing only the commas at the start of the string values.
                          How should I rearrange the above command? Do you have a simple suggestion?
                          Thank you very much!

                          Comment


                          • #14
                            If the comma is in the beginning of the string, then it is the first character. You can replace the observations with the string minus the first character if the first character is a comma. Regular expressions solutions are also possible.

                            Code:
                            clear
                            input str29 example
                            ", 53, 87, 88"
                            " ,sun, 53, 87, 88"
                            "frog, newt, salamander"
                            end
                            
                            gen wanted1= cond(substr(trim(example), 1, 1)==",", substr(trim(example), 2, .), trim(example))
                            gen wanted2= ustrregexra(trim(example), "^,(.*)$", "$1")
                            assert wanted1== wanted2
                            Res.:

                            Code:
                            . assert wanted1== wanted2
                            
                            . l
                            
                                 +--------------------------------------------------------------------------+
                                 |                example                  wanted1                  wanted2 |
                                 |--------------------------------------------------------------------------|
                              1. |           , 53, 87, 88               53, 87, 88               53, 87, 88 |
                              2. |       ,sun, 53, 87, 88          sun, 53, 87, 88          sun, 53, 87, 88 |
                              3. | frog, newt, salamander   frog, newt, salamander   frog, newt, salamander |
                                 +--------------------------------------------------------------------------+

                            Comment


                            • #15
                              Let me add to Andrew's useful suggestion the more global note that Stata has a rich and relatively easy to use collection of string functions which, among many other things, will enable substitution or removal of characters in strings. I commonly see questions that make me think people don't realize that string functions exist. They seem to be less known than other kinds of functions in Stata.

                              So, for anyone who needs to work with string variables, I'd make the suggestion to look through -help string functions-. The object is not to remember the use or even existence of every function in that collection, but rather to give a sense of the kinds of things that can be done so that now and sometime in the future, one can look at them again and see if they will help.

                              Comment

                              Working...
                              X