Stata 17 destring dpcomma does not work

Linda Shuk

Join Date: Oct 2021

Posts: 7
#1

Stata 17 destring dpcomma does not work

15 Oct 2021, 00:36

Hi, recently I updated my stata16 version to stata17 and now have an issue with the command destring myvar, replace dpcomma. However, my code for stata16 successfully worked before.
For example I have a var read1_1 that contains mix of numbers integer, with decimal places using a comma, with decimal places using a dot. I need to destring the var read1_1.

I used a command:

Code:

destring read1_1, replace dpcomma

, and it worked. Since I've switched to stata 17 it does not work now and returns the message : read1_1: contains nonnumeric characters; no replace.
Fragment of data is here:
read1_1
230. | 5.5 |
231. | 7.5 |
232. | 5,9 |
233. | 5,3 |
234. | 7,9 |
235. | 4,2 |
236. | 6,4 |
237. | 5,4 |
238. | 5,4 |
239. | 6,4 |
240. | 5.7 |
241. | 5.4 |
242. | 8.1 |
243. | 7,1 |
244. | 6.7 |
245. | 4.4 |
246. | 6.4 |
247. | 4,4 |
248. | 6,2 |
249. | 10,2 |
250. | 4,3 |
251. | 5,4 |
252. | 6.5 |
253. | 5,7 |
254. | 5,1 |
255. | 6,0 |
261. | 5,7 |
262. | 4,9 |
263. | 7 |
264. | 7,6 |
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

15 Oct 2021, 01:50

I see vertical bars there which you must ignore too.
Comment
Linda Shuk

Join Date: Oct 2021

Posts: 7
#3

15 Oct 2021, 23:58

Thanks Nick, sorry for the confusion but there are no vertical bars in the data. Those bars come from the data output as a separator.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

16 Oct 2021, 03:18

Push your variable through chartab (SSC) or charlist (SSC) to see what extra characters exist. ASCII 160, an invisible space, sometimes can confuse.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

16 Oct 2021, 05:25

For example, if you did have char(160) present, this should work

Code:

destring problem , replace ignore("`=char(160)'", asbytes)
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

17 Oct 2021, 14:51

dpcomma specifies that variables with commas as decimal values should be converted to have periods
as decimal values.

Maybe your previous experience was based on a file with only comma as decimal point? or using the force option? Running destring two times may be one alternative:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str20 v1  
" 5.5 "
" 7.5 "
" 5,9 "
" 5,3 "
end


destring v1, gen(v22) force 
destring v1, dpcomma gen(v21) force 

gen v2 = v21 
replace v2 = v22 if mi(v21)

list

Code:

     +-------------------------+
     |    v1   v21   v22    v2 |
     |-------------------------|
  1. |  5.5      .   5.5   5.5 |
  2. |  7.5      .   7.5   7.5 |
  3. |  5,9    5.9     .   5.9 |
  4. |  5,3    5.3     .   5.3 |
     +-------------------------+

Comment

Linda Shuk

Join Date: Oct 2021
Posts: 7

17 Oct 2021, 17:41

Regarding #4, I push through chartab and it seems there are no hidden characters. Here is the output of the command

Code:

. chartab read1_1

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+-----------------------------
        44       \u002c       ,     |            51    COMMA
        46       \u002e       .     |           277    FULL STOP
        48       \u0030       0     |            12    DIGIT ZERO
        49       \u0031       1     |            73    DIGIT ONE
        50       \u0032       2     |            48    DIGIT TWO
        51       \u0033       3     |            50    DIGIT THREE
        52       \u0034       4     |           109    DIGIT FOUR
        53       \u0035       5     |           128    DIGIT FIVE
        54       \u0036       6     |           110    DIGIT SIX
        55       \u0037       7     |            69    DIGIT SEVEN
        56       \u0038       8     |            63    DIGIT EIGHT
        57       \u0039       9     |            50    DIGIT NINE
------------------------------------+-----------------------------

                                    freq. count   distinct
ASCII characters              =           1,040         12
Multibyte UTF-8 characters    =               0          0
Unicode replacement character =               0          0
Total Unicode characters      =           1,040         12

Comment

Linda Shuk

Join Date: Oct 2021

Posts: 7
#8

17 Oct 2021, 17:47

Hi Bjarte Aagnes, thanks for that. Yes, I know it will work and I use subinstr function to replace comma by dot and then destring it to avoid that problem. But just wondering why it is not working with dpcomma option once I updated the Saata version to 17.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#9

18 Oct 2021, 05:57

#7 shows that you have commas AND points within this variable. How come?

Here's a guess. You have, in some observations, commas as indicating the decimal places and points as separators. If so, what you are doing in #8 won't work as you are just adding another point.

Here is an example:

Code:

. clear . set obs 1 Number of observations (_N) was 0, now 1. . gen problem = "1.234,567" . destring problem, dpcomma replace problem: contains nonnumeric characters; no replace . destring problem, dpcomma replace ignore(".") problem: character . removed; replaced as double . l +----------+ | problem | |----------| 1. | 1234.567 | +----------+

If that doesn't help, show the results of

Code:

dataex read1_1 if strpos(read1_1, ".") & strpos(read1_1, ",")

so that we can see any values with both points and commas.

(I doubt this is anything to do with Stata version.)
Comment
Linda Shuk

Join Date: Oct 2021

Posts: 7
#10

21 Oct 2021, 21:15

Hi Nick, the data comes from different sites and those observations from EU countries have comma as decimal separators other observations come with a dot. There are no numbers with dots and commas in one observation. Here is the result of the suggested code

Code:

dataex read1_1 if strpos(read1_1, ".") & strpos(read1_1, ",") no observations; nothing to generate r(2000);
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#11

22 Oct 2021, 04:02

OK, so one kind of pre-processing is needed for some observations. As stops and commas never occur together, you should try.

Code:

clonevar sandbox = read1_1 replace sandbox = subinstr(sandbox, ",", ".", .) if strpos(sandbox, ",") destring sandbox, replace

If that doesn't work you need to show the results of something like

Code:

tab sandbox if missing(real(sandbox))
Comment
Linda Shuk

Join Date: Oct 2021

Posts: 7
#12

23 Oct 2021, 20:56

Thanks Nick, the code from #11 works and I mentioned before that I found the solution by using function subinstrg. I posted this query because everything worked before with one line code (destring var, replace dpcomma) until I've updated the Stata version and was wondering if someone else had this issue with a new stata version. I checked different scenarios and it seems (

Code:

destring var, replace dpcomma

) works only if all observations contain comma but if we have mix: some observations with comma, some with dot then this command does not work.

Last edited by Linda Shuk; 23 Oct 2021, 21:29.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

#13

24 Oct 2021, 03:31

Let's back up. I made some wrong guesses here along the way, so sorry for that.

Your problem can be restated as follows.

Q. With data as below, with a mix of decimal points and decimal commas, destring with the dpcomma option does not work.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4 read1_1
"5.5" 
"7.5" 
"5,9" 
"5,3" 
"7,9" 
"4,2" 
"6,4" 
"5,4" 
"5,4" 
"6,4" 
"5.7" 
"5.4" 
"8.1" 
"7,1" 
"6.7" 
"4.4" 
"6.4" 
"4,4" 
"6,2" 
"10,2"
"4,3" 
"5,4" 
"6.5" 
"5,7" 
"5,1" 
"6,0" 
"5,7" 
"4,9" 
"7"   
"7,6" 
end

Answer. The option dpcomma means "comma IS the decimal place" which is true only in some of your observations. It does not mean "comma MAY BE the decimal place". Accordingly, destring reports an error because given that option the dots, stops or periods are non-numeric characters.

When you have a mixture, you need a more complicated strategy.

One work-around is to apply force and then combine results.

One work-around is to apply the subinstr() function to replace the commas with stops and then apply destring.

Code:

replace read1_1 = subinstr(read1_1, ",", ".", 1) 
destring read1_1, replace

Here is another work-around. It has no advantages here but may be of interest for related problems.

Code:

separate read1_1, by(strpos(read1_1, ",") > 0)
destring read1_10, replace
destring read1_11, dpcomma replace
gen wanted = min(read1_10, read1_11)

list 

     +----------------------------------------+
     | read1_1   read1_10   read1_11   wanted |
     |----------------------------------------|
  1. |     5.5        5.5          .      5.5 |
  2. |     7.5        7.5          .      7.5 |
  3. |     5,9          .        5.9      5.9 |
  4. |     5,3          .        5.3      5.3 |
  5. |     7,9          .        7.9      7.9 |
     |----------------------------------------|
  6. |     4,2          .        4.2      4.2 |
  7. |     6,4          .        6.4      6.4 |
  8. |     5,4          .        5.4      5.4 |
  9. |     5,4          .        5.4      5.4 |
 10. |     6,4          .        6.4      6.4 |
     |----------------------------------------|
 11. |     5.7        5.7          .      5.7 |
 12. |     5.4        5.4          .      5.4 |
 13. |     8.1        8.1          .      8.1 |
 14. |     7,1          .        7.1      7.1 |
 15. |     6.7        6.7          .      6.7 |
     |----------------------------------------|
 16. |     4.4        4.4          .      4.4 |
 17. |     6.4        6.4          .      6.4 |
 18. |     4,4          .        4.4      4.4 |
 19. |     6,2          .        6.2      6.2 |
 20. |    10,2          .       10.2     10.2 |
     |----------------------------------------|
 21. |     4,3          .        4.3      4.3 |
 22. |     5,4          .        5.4      5.4 |
 23. |     6.5        6.5          .      6.5 |
 24. |     5,7          .        5.7      5.7 |
 25. |     5,1          .        5.1      5.1 |
     |----------------------------------------|
 26. |     6,0          .          6        6 |
 27. |     5,7          .        5.7      5.7 |
 28. |     4,9          .        4.9      4.9 |
 29. |       7          7          .        7 |
 30. |     7,6          .        7.6      7.6 |
     +----------------------------------------+

However, I see no grounds for thinking that your problem is a side-effect of an upgrade (not update) from Stata 16 to Stata 17.

The code of destring hasn't changed since 2016

Code:

. which destring
C:\Program Files\Stata17\ado\base\d\destring.ado
*! version 2.4.4  15sep2016

so any difference in behaviour is due to differences in data.

As the original author of destring I retain a parental interest in the command. My view, and more importantly, I suspect the view of StataCorp, is that it's not destring's fault if the data are more complicated than your syntax implies. A simple analogue is that daily dates often arrive in a mix of styles that don't match a single pattern for daily(), in which case two or more steps are often needed to produce numeric daily dates.

Comment

Linda Shuk

Join Date: Oct 2021

Posts: 7
#14

24 Oct 2021, 17:55

Thank you so much Nick for such a comprehension explanation.
Comment

Announcement