Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cannot destring numbers--hidden characters?

    Dear all,

    I am importing data from CSV that comes from ODK into Stata (I tried on 15/IC and 14/IC). I have multiple variables that are numeric but import as string. The issue is, I cannot destring these variables. There appears to be some hidden string, that `strtrim()` or `stritrim()` cannot help. If I use force with destring, they all go missing.

    I am attaching a snippet of the data in CSV file which will not destring after importing:

    Code:
    import delimited "~/Downloads/cannot_destring.csv", varnames(1) encoding(utf-8) case(preserve) clear
    Note that I need the data to import with utf-8 encoding because there are some Arabic/Persian text.

    Thanks in advance for your help.
    Attached Files

  • #2
    first, to show data examples, please use -dataex- (if you have an up-to-date Stata, -dataex- is built-in; if an old version, which you should tell us about, see the FAQ)

    second, see the "ignore" option for -destring-;
    Code:
    help destring

    Comment


    • #3
      I'm not an expert in Unicode (UTF-8 or otherwise) character sets, let me start with that. And I'll add that "data ... that comes from ODK" means absolutely nothing to me.

      What I did accomplish was to hex dump your sample file using the unix od command from the Terminal window on my Mac.

      Code:
      % od -x cannot_destring.csv
      0000000      bbef    78bf    0a0d    bbef    31bf    0a0d    bbef    31bf
      0000020      0a0d    bbef    31bf                                        
      0000026
      From this, we see that your lines are separated by 0a0d - that is, a return followed by a newline, the standard Windows "CRLF" line termination sequence.

      So your four lines contain
      Code:
      bb ef 78 bf
      bb ef 31 bf
      bb ef 31 bf
      bb ef 31 bf
      We can see the characters for ASCII "x" (78) and "1" (31). What the rest of it connotes is beyond my meagre knowledge of Unicode, and my nonexistent knowledge of "ODF" output.

      Perhaps armed with this, someone more knowledgable with Unicode can enlighten us further.

      Comment


      • #4
        What I wrote in post #3 above is misleading.

        I forgot about Stata's hexdump command.
        Code:
        . hexdump "~/Downloads/cannot_destring.csv"
                         |                                         |    character
                         |           hex representation            |  representation
                 address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
        -----------------+-----------------------------------------+-----------------
                       0 | efbb bf78 0d0a efbb bf31 0d0a efbb bf31 | x.. 1.. 1 
                      10 | 0d0a efbb bf31                          | .. 1
        So it appears to me that od is inverting the byte order. I still have no ability to make sense out of this.

        Comment


        • #5
          Building on William's detective work: what youre seeing is this: https://en.wikipedia.org/wiki/Byte_order_mark
          Ive tried just now to open the csv in notepad++, encode to UTF-8 without BOM and open that in Stata, but that somehow doesnt result in any difference
          Also tried opening in excel and saving as xls, which changed things a bit but not for the better.
          If possible, ask your ODK program to export to UTF-8 without BOM

          Comment


          • #6
            You have a non-numeric utf-8 character that's preventing destring from doing its thing. To track down the issue, you can use the ustrtohex() string function to create escaped hex strings for up to 200 Unicode characters:
            Code:
            . import delimited "cannot_destring.csv", varnames(1) encoding(utf-8) case(preserve) clear
            (1 var, 3 obs)
            
            . 
            . gen s = ustrtohex(x)
            
            . list
            
                 +--------------------+
                 |   x              s |
                 |--------------------|
              1. | 1   \ufeff\u0031 |
              2. | 1   \ufeff\u0031 |
              3. | 1   \ufeff\u0031 |
                 +--------------------+
            
            . dis ustrtohex("1")
            \u0031
            
            .
            A "1" is "\u0031" in hex and "\ufeff" is a BOM (byte order mark) character that certainly does not belong there. There are several ways to remove this unwanted character. Here are a few examples:
            Code:
            . gen x_ascii = ustrto(ustrnormalize(x, "nfd"), "ascii", 2)
            
            . gen x_utf = ustrregexra(x, ustrunescape("\ufeff"), "")
            
            . gen x_utf2 = ustrregexra(x, "[^-\.0-9]", "")
            
            . 
            . destring x_*, replace
            x_ascii: all characters numeric; replaced as byte
            x_utf: all characters numeric; replaced as byte
            x_utf2: all characters numeric; replaced as byte
            
            . list
            
                 +-----------------------------------------------+
                 |   x              s   x_ascii   x_utf   x_utf2 |
                 |-----------------------------------------------|
              1. | 1   \ufeff\u0031         1       1        1 |
              2. | 1   \ufeff\u0031         1       1        1 |
              3. | 1   \ufeff\u0031         1       1        1 |
                 +-----------------------------------------------+
            
            .

            Comment


            • #7
              The egenmore (SSC) sieve function takes care of non-alpha and non-numeric characters pretty easily. You can run it straight away and generate a new variable or use ereplace (SSC) to remove the unwanted characters from the original variable. E.g.

              Code:
              clear
              input str7 v1
              "x"
              "1"
              "1"
              "1"
              end
              
              *OPTION 1
              ssc install egenmore //if you don't already have it installed
              egen v2 = sieve(v1), keep(a n)
              
              *OPTION 2
              ssc install ereplace //if you don't already have it installed
              ereplace v1 = sieve(v1), keep(a n)

              Comment


              • #8
                First, my thanks to Robert and Chris; I've decreased my ignorance about Unicode and increased my knowledge of techniques for dealing with problems.

                I will point out something in retrospect was important: when I opened the sample data in BBEdit, allowing it to deduce the encoding, it described the encoding as "UTF-8, with BOM". Wikipedia has something to say about this at this link. In particular, it tells us that the BOM mark only belongs at the beginning of the file, so I'd guess that whatever ODK is, it's doing something incorrect in placing it at the start of each line.

                Comment


                • #9
                  Thank you Chris and Robert, I tried all your methods and they worked. Thanks so much!

                  Comment

                  Working...
                  X