Destring: Including ascii coded characters in the ignore() option

Noemie Lecourt

Join Date: May 2021
Posts: 7

Destring: Including ascii coded characters in the ignore() option

09 May 2023, 07:46

Good morning,

I have a string variable (name : massenumsq1_sq1) which represents amounts in euros, imported from a .csv file. The kind of data in the cells are: 1045 or 29 568,43 or 32678 € for instance. There are also blank cells. I want to turn it into a numeric variable but there are non-numeric characters that I don't see. So to find them, I ran:

Code:

 charlist massenumsq1_sq1 if missing(real(massenumsq1_sq1))
 ,01234579�����

. return list

macros:
              r(chars) : ",01234579�����"
           r(sepchars) : ", 0 1 2 3 4 5 7 9 � � � � � "
              r(ascii) : "32 44 48 49 50 51 52 53 55 57 128 130 172 175 226 "

I need to remove the last 5 characters. How can I specify a list of characters into the ignore() option, only using their ascii code? (It is the only think I know about them).
I tried with:

Code:

* without quotation marks
destring massenumsq1_sq1, dpcomma ignore(`char(128)'`=char(130)'`=char(172)'`=char(175)'`=char(226)') replace
invalid Unicode characters may not be ignored when removing aschars
r(198);

* one couple of quotation marks
destring massenumsq1_sq1, dpcomma ignore("`char(128)'`=char(130)'`=char(172)'`=char(175)'`=char(226)'") replace
invalid Unicode characters may not be ignored when removing aschars
r(198);

* quotation marks for each character
destring massenumsq1_sq1, dpcomma ignore("`=char(128)'","`=char(130)'","`=char(172)'","`=char(175)'","`=char(226)'") replace
invalid '"�' 
r(198);

But any of these attempts is working...
Also I use Stata 17.
Thanks in advance for any help!

Tags: None

daniel klein

Join Date: Mar 2014

Posts: 3860
#2

09 May 2023, 08:02

How did you import the data? More specifically, wich encoding did you specify?
Comment
Noemie Lecourt

Join Date: May 2021

Posts: 7
#3

09 May 2023, 08:08

Thank you for answering. Here is my code:

Code:

import delimited data_V3.csv (encoding automatically selected: UTF-8) (451 vars, 58 obs)

So I guess that the encoding is UTF-8 (but I am not familiar with encoding issues)

Last edited by Noemie Lecourt; 09 May 2023, 08:36.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3860
#4

09 May 2023, 08:53

Try

Code:

import delimited data_V3.csv , encoding(windows-1252)

or, depending on where the csv file comes from: Western Europe, Linux: encoding(iso-8859-1), Mac OSX: encoding(macroman) ...
Comment

Noemie Lecourt

Join Date: May 2021
Posts: 7

09 May 2023, 09:09

I still get invisible non-numeric characters, however the list is not exactly the same:

Code:

import delimited data_V3.csv, encoding(windows-1252)
(451 vars, 58 obs)

destring massenumsq1_sq1, dpcomma ignore(`=char(128)'`=char(130)'`=char(172)'`=char(175)'`=char(226)') replace
invalid Unicode characters may not be ignored when removing aschars
r(198);

. charlist massenumsq1_sq1 if missing(real(massenumsq1_sq1))
 ,01234579���������

. return list

macros:
              r(chars) : ",01234579���������"
           r(sepchars) : ", 0 1 2 3 4 5 7 9 � � � � � � � � � "
              r(ascii) : "32 44 48 49 50 51 52 53 55 57 128 130 154 162 172 175 194 195 226 "

clear 

 import delimited data_V3.csv, encoding(iso-8859-1)
(451 vars, 58 obs)

. destring massenumsq1_sq1, dpcomma ignore(`=char(128)'`=char(130)'`=char(172)'`=char(175)'`=char(226)') replace
invalid Unicode characters may not be ignored when removing aschars
r(198);

. charlist massenumsq1_sq1 if missing(real(massenumsq1_sq1))
 ,01234579�������

. return list

macros:
              r(chars) : ",01234579�������"
           r(sepchars) : ", 0 1 2 3 4 5 7 9 � � � � � � � "
              r(ascii) : "32 44 48 49 50 51 52 53 55 57 128 130 162 172 175 194 195 "

The data is the answers of an online survey administered using Lyme Survey. I extracted the data in a .csv file, turned it into an .odt file (UTF-8 encoding) to perform some adjustements, and converted it into a .csv file. The survey is administered in French, and I use Windows 10.

Comment

daniel klein

Join Date: Mar 2014

Posts: 3860
#6

09 May 2023, 12:45

LimeSurvey supports UTF-8 encoding for .csv-files if I remember correctly. So my guess is that something went wrong during the Export from OpenOffice to csv (the default might not be UTF-8) there. I would still try to fix the problem at its source but if you want to stick with destring, the ignore() option has an invalid sub-option, which seems relevant here.
Comment

Noemie Lecourt

Join Date: May 2021
Posts: 7

09 May 2023, 15:08

I tried to export my data from OpenOffice to .xlsx format, but I got the same result than with my .csv file. However I tried to chek my encoding with unicode analyze command and here is the result (data_stata.dta is the same dataset than data_V3.csv, but in a dta format):

Code:

unicode analyze data_stata.dta
  (Directory ./bak.stunicode created; please do not delete)

  File summary (before starting):
        1  file(s) specified
        1  file(s) to be examined ...

  File data_stata.dta (Stata dataset)
          --------------------------------------------------------------------------------------------------------
          File does not need translation, except ...
          The file appears to be UTF-8 already.  Sometimes files that need translating can look like UTF-8.
          Look at these example(s):
              contents of str# variable nom
          Do they look okay to you?
          If not, the file needs translating or retranslating with the transutf8 option.  Type
              . unicode   translate "data_stata.dta", transutf8
              . unicode retranslate "data_stata.dta", transutf8

  File summary:
      all files okay

When I visualy check my string variables, they look ok. Is an encoding problem the only thing that could cause my issue?

Comment

daniel klein

Join Date: Mar 2014

Posts: 3860
#8

09 May 2023, 15:48

I am not sure what is going on, partly because I cannot replicate the in-between steps of creating a .odt-file, performing ad-hoc adjustments, and then exporting to .csv (or any other format).

What I can tell is that (extended) ASCII code 128 is the Euro symbol and 130 resembles a comma in Windows-1252 encoding

Code:

display ustrfrom(char(128), "windows-1252", 1) display ustrfrom(char(130), "windows-1252", 1)

hence my original guess.

Either import the .csv file from LimeSurvey directly into Stata and do the adjustments there or try the illegal sub-option in destring's ignore() option. An alternative brute-force fix is

Code:

generate massenumsq1_sq1_clean = ustrregexra(massenumsq1_sq1, "[^,\.0-9]", "")

The code strips all characters except the digits 0-9, the comma, and the dot; the latter two might act as a thousand separator or decimal point.

Last edited by daniel klein; 09 May 2023, 15:53.
Comment
Noemie Lecourt

Join Date: May 2021

Posts: 7
#9

09 May 2023, 16:08

I tried the ustrregextra command you suggested and it worked! I was able to destring the new variable massenumsq1_sq1_clean. It does not solve the encoding issue, but I only need to perform basic descriptive statistics, so I think this solution will be enough. Thank you very much for helping me!
Comment

Announcement

Destring: Including ascii coded characters in the ignore() option

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment