Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Stata 17 destring dpcomma does not work

    Hi, recently I updated my stata16 version to stata17 and now have an issue with the command destring myvar, replace dpcomma. However, my code for stata16 successfully worked before.
    For example I have a var read1_1 that contains mix of numbers integer, with decimal places using a comma, with decimal places using a dot. I need to destring the var read1_1.

    I used a command:
    Code:
    destring read1_1, replace dpcomma
    , and it worked. Since I've switched to stata 17 it does not work now and returns the message : read1_1: contains nonnumeric characters; no replace.
    Fragment of data is here:
    read1_1
    230. | 5.5 |
    231. | 7.5 |
    232. | 5,9 |
    233. | 5,3 |
    234. | 7,9 |
    235. | 4,2 |
    236. | 6,4 |
    237. | 5,4 |
    238. | 5,4 |
    239. | 6,4 |
    240. | 5.7 |
    241. | 5.4 |
    242. | 8.1 |
    243. | 7,1 |
    244. | 6.7 |
    245. | 4.4 |
    246. | 6.4 |
    247. | 4,4 |
    248. | 6,2 |
    249. | 10,2 |
    250. | 4,3 |
    251. | 5,4 |
    252. | 6.5 |
    253. | 5,7 |
    254. | 5,1 |
    255. | 6,0 |
    261. | 5,7 |
    262. | 4,9 |
    263. | 7 |
    264. | 7,6 |


  • #2
    I see vertical bars there which you must ignore too.

    Comment


    • #3
      Thanks Nick, sorry for the confusion but there are no vertical bars in the data. Those bars come from the data output as a separator.

      Comment


      • #4
        Push your variable through chartab (SSC) or charlist (SSC) to see what extra characters exist. ASCII 160, an invisible space, sometimes can confuse.

        Comment


        • #5
          For example, if you did have char(160) present, this should work


          Code:
          destring problem , replace ignore("`=char(160)'", asbytes)

          Comment


          • #6
            dpcomma specifies that variables with commas as decimal values should be converted to have periods
            as decimal values.
            Maybe your previous experience was based on a file with only comma as decimal point? or using the force option? Running destring two times may be one alternative:
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str20 v1  
            " 5.5 "
            " 7.5 "
            " 5,9 "
            " 5,3 "
            end
            
            
            destring v1, gen(v22) force 
            destring v1, dpcomma gen(v21) force 
            
            gen v2 = v21 
            replace v2 = v22 if mi(v21)
            
            list
            Code:
                 +-------------------------+
                 |    v1   v21   v22    v2 |
                 |-------------------------|
              1. |  5.5      .   5.5   5.5 |
              2. |  7.5      .   7.5   7.5 |
              3. |  5,9    5.9     .   5.9 |
              4. |  5,3    5.3     .   5.3 |
                 +-------------------------+

            Comment


            • #7
              Regarding #4, I push through chartab and it seems there are no hidden characters. Here is the output of the command
              Code:
              . chartab read1_1
              
                 decimal  hexadecimal   character |     frequency    unique name
              ------------------------------------+-----------------------------
                      44       \u002c       ,     |            51    COMMA
                      46       \u002e       .     |           277    FULL STOP
                      48       \u0030       0     |            12    DIGIT ZERO
                      49       \u0031       1     |            73    DIGIT ONE
                      50       \u0032       2     |            48    DIGIT TWO
                      51       \u0033       3     |            50    DIGIT THREE
                      52       \u0034       4     |           109    DIGIT FOUR
                      53       \u0035       5     |           128    DIGIT FIVE
                      54       \u0036       6     |           110    DIGIT SIX
                      55       \u0037       7     |            69    DIGIT SEVEN
                      56       \u0038       8     |            63    DIGIT EIGHT
                      57       \u0039       9     |            50    DIGIT NINE
              ------------------------------------+-----------------------------
              
                                                  freq. count   distinct
              ASCII characters              =           1,040         12
              Multibyte UTF-8 characters    =               0          0
              Unicode replacement character =               0          0
              Total Unicode characters      =           1,040         12

              Comment


              • #8
                Hi Bjarte Aagnes, thanks for that. Yes, I know it will work and I use subinstr function to replace comma by dot and then destring it to avoid that problem. But just wondering why it is not working with dpcomma option once I updated the Saata version to 17.

                Comment


                • #9
                  #7 shows that you have commas AND points within this variable. How come?

                  Here's a guess. You have, in some observations, commas as indicating the decimal places and points as separators. If so, what you are doing in #8 won't work as you are just adding another point.

                  Here is an example:


                  Code:
                  . clear
                  
                  . set obs 1
                  Number of observations (_N) was 0, now 1.
                  
                  . gen problem = "1.234,567"
                  
                  . destring problem, dpcomma replace
                  problem: contains nonnumeric characters; no replace
                  
                  . destring problem, dpcomma replace ignore(".")
                  problem: character . removed; replaced as double
                  
                  . l
                  
                       +----------+
                       |  problem |
                       |----------|
                    1. | 1234.567 |
                       +----------+
                  If that doesn't help, show the results of


                  Code:
                  dataex read1_1 if strpos(read1_1, ".") & strpos(read1_1, ",")
                  so that we can see any values with both points and commas.

                  (I doubt this is anything to do with Stata version.)

                  Comment


                  • #10
                    Hi Nick, the data comes from different sites and those observations from EU countries have comma as decimal separators other observations come with a dot. There are no numbers with dots and commas in one observation. Here is the result of the suggested code
                    Code:
                     dataex read1_1 if strpos(read1_1, ".") & strpos(read1_1, ",")
                    no observations; nothing to generate
                    r(2000);

                    Comment


                    • #11
                      OK, so one kind of pre-processing is needed for some observations. As stops and commas never occur together, you should try.

                      Code:
                      clonevar sandbox = read1_1 
                      replace sandbox = subinstr(sandbox, ",", ".", .) if strpos(sandbox, ",") 
                      destring sandbox, replace
                      If that doesn't work you need to show the results of something like

                      Code:
                      tab sandbox if missing(real(sandbox))

                      Comment


                      • #12
                        Thanks Nick, the code from #11 works and I mentioned before that I found the solution by using function subinstrg. I posted this query because everything worked before with one line code (destring var, replace dpcomma) until I've updated the Stata version and was wondering if someone else had this issue with a new stata version. I checked different scenarios and it seems (
                        Code:
                        destring var, replace dpcomma
                        ) works only if all observations contain comma but if we have mix: some observations with comma, some with dot then this command does not work.
                        Last edited by Linda Shuk; 23 Oct 2021, 21:29.

                        Comment


                        • #13
                          Let's back up. I made some wrong guesses here along the way, so sorry for that.

                          Your problem can be restated as follows.

                          Q. With data as below, with a mix of decimal points and decimal commas, destring with the dpcomma option does not work.

                          Code:
                          * Example generated by -dataex-. For more info, type help dataex
                          clear
                          input str4 read1_1
                          "5.5" 
                          "7.5" 
                          "5,9" 
                          "5,3" 
                          "7,9" 
                          "4,2" 
                          "6,4" 
                          "5,4" 
                          "5,4" 
                          "6,4" 
                          "5.7" 
                          "5.4" 
                          "8.1" 
                          "7,1" 
                          "6.7" 
                          "4.4" 
                          "6.4" 
                          "4,4" 
                          "6,2" 
                          "10,2"
                          "4,3" 
                          "5,4" 
                          "6.5" 
                          "5,7" 
                          "5,1" 
                          "6,0" 
                          "5,7" 
                          "4,9" 
                          "7"   
                          "7,6" 
                          end

                          Answer. The option dpcomma means "comma IS the decimal place" which is true only in some of your observations. It does not mean "comma MAY BE the decimal place". Accordingly, destring reports an error because given that option the dots, stops or periods are non-numeric characters.

                          When you have a mixture, you need a more complicated strategy.

                          One work-around is to apply force and then combine results.

                          One work-around is to apply the subinstr() function to replace the commas with stops and then apply destring.

                          Code:
                          replace read1_1 = subinstr(read1_1, ",", ".", 1) 
                          destring read1_1, replace
                          Here is another work-around. It has no advantages here but may be of interest for related problems.


                          Code:
                          separate read1_1, by(strpos(read1_1, ",") > 0)
                          destring read1_10, replace
                          destring read1_11, dpcomma replace
                          gen wanted = min(read1_10, read1_11)
                          
                          list 
                          
                               +----------------------------------------+
                               | read1_1   read1_10   read1_11   wanted |
                               |----------------------------------------|
                            1. |     5.5        5.5          .      5.5 |
                            2. |     7.5        7.5          .      7.5 |
                            3. |     5,9          .        5.9      5.9 |
                            4. |     5,3          .        5.3      5.3 |
                            5. |     7,9          .        7.9      7.9 |
                               |----------------------------------------|
                            6. |     4,2          .        4.2      4.2 |
                            7. |     6,4          .        6.4      6.4 |
                            8. |     5,4          .        5.4      5.4 |
                            9. |     5,4          .        5.4      5.4 |
                           10. |     6,4          .        6.4      6.4 |
                               |----------------------------------------|
                           11. |     5.7        5.7          .      5.7 |
                           12. |     5.4        5.4          .      5.4 |
                           13. |     8.1        8.1          .      8.1 |
                           14. |     7,1          .        7.1      7.1 |
                           15. |     6.7        6.7          .      6.7 |
                               |----------------------------------------|
                           16. |     4.4        4.4          .      4.4 |
                           17. |     6.4        6.4          .      6.4 |
                           18. |     4,4          .        4.4      4.4 |
                           19. |     6,2          .        6.2      6.2 |
                           20. |    10,2          .       10.2     10.2 |
                               |----------------------------------------|
                           21. |     4,3          .        4.3      4.3 |
                           22. |     5,4          .        5.4      5.4 |
                           23. |     6.5        6.5          .      6.5 |
                           24. |     5,7          .        5.7      5.7 |
                           25. |     5,1          .        5.1      5.1 |
                               |----------------------------------------|
                           26. |     6,0          .          6        6 |
                           27. |     5,7          .        5.7      5.7 |
                           28. |     4,9          .        4.9      4.9 |
                           29. |       7          7          .        7 |
                           30. |     7,6          .        7.6      7.6 |
                               +----------------------------------------+
                          However, I see no grounds for thinking that your problem is a side-effect of an upgrade (not update) from Stata 16 to Stata 17.

                          The code of destring hasn't changed since 2016

                          Code:
                          . which destring
                          C:\Program Files\Stata17\ado\base\d\destring.ado
                          *! version 2.4.4  15sep2016
                          so any difference in behaviour is due to differences in data.

                          As the original author of destring I retain a parental interest in the command. My view, and more importantly, I suspect the view of StataCorp, is that it's not destring's fault if the data are more complicated than your syntax implies. A simple analogue is that daily dates often arrive in a mix of styles that don't match a single pattern for daily(), in which case two or more steps are often needed to produce numeric daily dates.

                          Comment


                          • #14
                            Thank you so much Nick for such a comprehension explanation.

                            Comment

                            Working...
                            X