Cannot destring numbers--hidden characters?

Masood Sadat

Join Date: Aug 2014

Posts: 38
#1

Cannot destring numbers--hidden characters?

18 Aug 2018, 06:14

Dear all,

I am importing data from CSV that comes from ODK into Stata (I tried on 15/IC and 14/IC). I have multiple variables that are numeric but import as string. The issue is, I cannot destring these variables. There appears to be some hidden string, that `strtrim()` or `stritrim()` cannot help. If I use force with destring, they all go missing.

I am attaching a snippet of the data in CSV file which will not destring after importing:

Code:

import delimited "~/Downloads/cannot_destring.csv", varnames(1) encoding(utf-8) case(preserve) clear

Note that I need the data to import with utf-8 encoding because there are some Arabic/Persian text.

Thanks in advance for your help.
Attached Files

cannot_destring.csv (22 Bytes, 1 view)
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4461
#2

18 Aug 2018, 07:28

first, to show data examples, please use -dataex- (if you have an up-to-date Stata, -dataex- is built-in; if an old version, which you should tell us about, see the FAQ)

second, see the "ignore" option for -destring-;

Code:

help destring
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

18 Aug 2018, 07:40

I'm not an expert in Unicode (UTF-8 or otherwise) character sets, let me start with that. And I'll add that "data ... that comes from ODK" means absolutely nothing to me.

What I did accomplish was to hex dump your sample file using the unix od command from the Terminal window on my Mac.

Code:

% od -x cannot_destring.csv 0000000 bbef 78bf 0a0d bbef 31bf 0a0d bbef 31bf 0000020 0a0d bbef 31bf 0000026

From this, we see that your lines are separated by 0a0d - that is, a return followed by a newline, the standard Windows "CRLF" line termination sequence.

So your four lines contain

Code:

bb ef 78 bf bb ef 31 bf bb ef 31 bf bb ef 31 bf

We can see the characters for ASCII "x" (78) and "1" (31). What the rest of it connotes is beyond my meagre knowledge of Unicode, and my nonexistent knowledge of "ODF" output.

Perhaps armed with this, someone more knowledgable with Unicode can enlighten us further.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

18 Aug 2018, 07:53

What I wrote in post #3 above is misleading.

I forgot about Stata's hexdump command.

Code:

. hexdump "~/Downloads/cannot_destring.csv"
                 |                                         |    character
                 |           hex representation            |  representation
         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
-----------------+-----------------------------------------+-----------------
               0 | efbb bf78 0d0a efbb bf31 0d0a efbb bf31 | x.. 1.. 1 
              10 | 0d0a efbb bf31                          | .. 1

So it appears to me that od is inverting the byte order. I still have no ability to make sense out of this.

Comment

Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#5

18 Aug 2018, 08:33

Building on William's detective work: what youre seeing is this: https://en.wikipedia.org/wiki/Byte_order_mark
Ive tried just now to open the csv in notepad++, encode to UTF-8 without BOM and open that in Stata, but that somehow doesnt result in any difference
Also tried opening in excel and saving as xls, which changed things a bit but not for the better.
If possible, ask your ODK program to export to UTF-8 without BOM
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

18 Aug 2018, 08:37

You have a non-numeric utf-8 character that's preventing destring from doing its thing. To track down the issue, you can use the ustrtohex() string function to create escaped hex strings for up to 200 Unicode characters:

Code:

. import delimited "cannot_destring.csv", varnames(1) encoding(utf-8) case(preserve) clear
(1 var, 3 obs)

. 
. gen s = ustrtohex(x)

. list

     +--------------------+
     |   x              s |
     |--------------------|
  1. | 1   \ufeff\u0031 |
  2. | 1   \ufeff\u0031 |
  3. | 1   \ufeff\u0031 |
     +--------------------+

. dis ustrtohex("1")
\u0031

.

A "1" is "\u0031" in hex and "\ufeff" is a BOM (byte order mark) character that certainly does not belong there. There are several ways to remove this unwanted character. Here are a few examples:

Code:

. gen x_ascii = ustrto(ustrnormalize(x, "nfd"), "ascii", 2)

. gen x_utf = ustrregexra(x, ustrunescape("\ufeff"), "")

. gen x_utf2 = ustrregexra(x, "[^-\.0-9]", "")

. 
. destring x_*, replace
x_ascii: all characters numeric; replaced as byte
x_utf: all characters numeric; replaced as byte
x_utf2: all characters numeric; replaced as byte

. list

     +-----------------------------------------------+
     |   x              s   x_ascii   x_utf   x_utf2 |
     |-----------------------------------------------|
  1. | 1   \ufeff\u0031         1       1        1 |
  2. | 1   \ufeff\u0031         1       1        1 |
  3. | 1   \ufeff\u0031         1       1        1 |
     +-----------------------------------------------+

.

Comment

Chris Larkin

Join Date: Apr 2016

Posts: 296
#7

18 Aug 2018, 08:59

The egenmore (SSC) sieve function takes care of non-alpha and non-numeric characters pretty easily. You can run it straight away and generate a new variable or use ereplace (SSC) to remove the unwanted characters from the original variable. E.g.

Code:

clear input str7 v1 "ï»¿x" "ï»¿1" "ï»¿1" "ï»¿1" end *OPTION 1 ssc install egenmore //if you don't already have it installed egen v2 = sieve(v1), keep(a n) *OPTION 2 ssc install ereplace //if you don't already have it installed ereplace v1 = sieve(v1), keep(a n)
2 likes
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

18 Aug 2018, 09:33

First, my thanks to Robert and Chris; I've decreased my ignorance about Unicode and increased my knowledge of techniques for dealing with problems.

I will point out something in retrospect was important: when I opened the sample data in BBEdit, allowing it to deduce the encoding, it described the encoding as "UTF-8, with BOM". Wikipedia has something to say about this at this link. In particular, it tells us that the BOM mark only belongs at the beginning of the file, so I'd guess that whatever ODK is, it's doing something incorrect in placing it at the start of each line.
Comment
Masood Sadat

Join Date: Aug 2014

Posts: 38
#9

19 Aug 2018, 23:53

Thank you Chris and Robert, I tried all your methods and they worked. Thanks so much!
Comment

Announcement