Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Destring: Including ascii coded characters in the ignore() option

    Good morning,

    I have a string variable (name : massenumsq1_sq1) which represents amounts in euros, imported from a .csv file. The kind of data in the cells are: 1045 or 29 568,43 or 32678 € for instance. There are also blank cells. I want to turn it into a numeric variable but there are non-numeric characters that I don't see. So to find them, I ran:

    Code:
     charlist massenumsq1_sq1 if missing(real(massenumsq1_sq1))
     ,01234579�����
    
    . return list
    
    macros:
                  r(chars) : ",01234579�����"
               r(sepchars) : ", 0 1 2 3 4 5 7 9 � � � � � "
                  r(ascii) : "32 44 48 49 50 51 52 53 55 57 128 130 172 175 226 "
    I need to remove the last 5 characters. How can I specify a list of characters into the ignore() option, only using their ascii code? (It is the only think I know about them).
    I tried with:
    Code:
    * without quotation marks
    destring massenumsq1_sq1, dpcomma ignore(`char(128)'`=char(130)'`=char(172)'`=char(175)'`=char(226)') replace
    invalid Unicode characters may not be ignored when removing aschars
    r(198);
    
    * one couple of quotation marks
    destring massenumsq1_sq1, dpcomma ignore("`char(128)'`=char(130)'`=char(172)'`=char(175)'`=char(226)'") replace
    invalid Unicode characters may not be ignored when removing aschars
    r(198);
    
    * quotation marks for each character
    destring massenumsq1_sq1, dpcomma ignore("`=char(128)'","`=char(130)'","`=char(172)'","`=char(175)'","`=char(226)'") replace
    invalid '"�' 
    r(198);
    But any of these attempts is working...
    Also I use Stata 17.
    Thanks in advance for any help!

  • #2
    How did you import the data? More specifically, wich encoding did you specify?

    Comment


    • #3
      Thank you for answering. Here is my code:
      Code:
      import delimited data_V3.csv
      (encoding automatically selected: UTF-8)
      (451 vars, 58 obs)
      So I guess that the encoding is UTF-8 (but I am not familiar with encoding issues)
      Last edited by Noemie Lecourt; 09 May 2023, 08:36.

      Comment


      • #4
        Try

        Code:
        import delimited data_V3.csv , encoding(windows-1252)
        or, depending on where the csv file comes from: Western Europe, Linux: encoding(iso-8859-1), Mac OSX: encoding(macroman) ...

        Comment


        • #5
          I still get invisible non-numeric characters, however the list is not exactly the same:
          Code:
          import delimited data_V3.csv, encoding(windows-1252)
          (451 vars, 58 obs)
          
          destring massenumsq1_sq1, dpcomma ignore(`=char(128)'`=char(130)'`=char(172)'`=char(175)'`=char(226)') replace
          invalid Unicode characters may not be ignored when removing aschars
          r(198);
          
          . charlist massenumsq1_sq1 if missing(real(massenumsq1_sq1))
           ,01234579���������
          
          . return list
          
          macros:
                        r(chars) : ",01234579���������"
                     r(sepchars) : ", 0 1 2 3 4 5 7 9 � � � � � � � � � "
                        r(ascii) : "32 44 48 49 50 51 52 53 55 57 128 130 154 162 172 175 194 195 226 "
          
          clear 
          
           import delimited data_V3.csv, encoding(iso-8859-1)
          (451 vars, 58 obs)
          
          . destring massenumsq1_sq1, dpcomma ignore(`=char(128)'`=char(130)'`=char(172)'`=char(175)'`=char(226)') replace
          invalid Unicode characters may not be ignored when removing aschars
          r(198);
          
          . charlist massenumsq1_sq1 if missing(real(massenumsq1_sq1))
           ,01234579�������
          
          . return list
          
          macros:
                        r(chars) : ",01234579�������"
                     r(sepchars) : ", 0 1 2 3 4 5 7 9 � � � � � � � "
                        r(ascii) : "32 44 48 49 50 51 52 53 55 57 128 130 162 172 175 194 195 "
          The data is the answers of an online survey administered using Lyme Survey. I extracted the data in a .csv file, turned it into an .odt file (UTF-8 encoding) to perform some adjustements, and converted it into a .csv file. The survey is administered in French, and I use Windows 10.

          Comment


          • #6
            LimeSurvey supports UTF-8 encoding for .csv-files if I remember correctly. So my guess is that something went wrong during the Export from OpenOffice to csv (the default might not be UTF-8) there. I would still try to fix the problem at its source but if you want to stick with destring, the ignore() option has an invalid sub-option, which seems relevant here.

            Comment


            • #7
              I tried to export my data from OpenOffice to .xlsx format, but I got the same result than with my .csv file. However I tried to chek my encoding with unicode analyze command and here is the result (data_stata.dta is the same dataset than data_V3.csv, but in a dta format):

              Code:
              unicode analyze data_stata.dta
                (Directory ./bak.stunicode created; please do not delete)
              
                File summary (before starting):
                      1  file(s) specified
                      1  file(s) to be examined ...
              
                File data_stata.dta (Stata dataset)
                        --------------------------------------------------------------------------------------------------------
                        File does not need translation, except ...
                        The file appears to be UTF-8 already.  Sometimes files that need translating can look like UTF-8.
                        Look at these example(s):
                            contents of str# variable nom
                        Do they look okay to you?
                        If not, the file needs translating or retranslating with the transutf8 option.  Type
                            . unicode   translate "data_stata.dta", transutf8
                            . unicode retranslate "data_stata.dta", transutf8
              
                File summary:
                    all files okay
              When I visualy check my string variables, they look ok. Is an encoding problem the only thing that could cause my issue?

              Comment


              • #8
                I am not sure what is going on, partly because I cannot replicate the in-between steps of creating a .odt-file, performing ad-hoc adjustments, and then exporting to .csv (or any other format).

                What I can tell is that (extended) ASCII code 128 is the Euro symbol and 130 resembles a comma in Windows-1252 encoding

                Code:
                display ustrfrom(char(128), "windows-1252", 1)
                display ustrfrom(char(130), "windows-1252", 1)
                hence my original guess.

                Either import the .csv file from LimeSurvey directly into Stata and do the adjustments there or try the illegal sub-option in destring's ignore() option. An alternative brute-force fix is
                Code:
                generate massenumsq1_sq1_clean = ustrregexra(massenumsq1_sq1, "[^,\.0-9]", "")
                The code strips all characters except the digits 0-9, the comma, and the dot; the latter two might act as a thousand separator or decimal point.
                Last edited by daniel klein; 09 May 2023, 15:53.

                Comment


                • #9
                  I tried the ustrregextra command you suggested and it worked! I was able to destring the new variable massenumsq1_sq1_clean. It does not solve the encoding issue, but I only need to perform basic descriptive statistics, so I think this solution will be enough. Thank you very much for helping me!

                  Comment

                  Working...
                  X