  • Stata 13/SE: Looping through +/- 4500 .txt files cleaning them and keep Unicode characters

    Hello everybody,

    For the last few weeks I have been struggling with the problem that Stata 13/SE does not read Unicode (specifically the Latin-1 supplement).

    The task:
    I have 4500 text files from which I want to parse certain rows (i.e., names). With the syntax below I managed to do this: I let Stata loop through the 4500 text files and keep the row after the row in which "Vriend" is mentioned. What I get is an appended list of rows (names from these text files), where the identifier in front of each row states which text file it comes from. Finally, I delete all of the roughly 16,000 intermediate files that the loop creates and save one long appended list. (I use the capture command a lot, because the text files are named 20001_friends.txt to 36872_friends.txt, while only about 4500 of those ~16,000 possible files actually exist.)

    The problem:
    So far so good. However, the appended list of names now contains a lot of errors, because Stata does not read Unicode. Special characters in names, such as Ä, Ú, î, and õ, are replaced by ??? or /, which I do not want. I want to keep the original names, since I want to do some additional analyses on them.

    My question:
    My question is: is there a way in Stata 13/SE to keep the original names, WITH the uncommon characters in them?

    I hope you can help me out, thank you in advance,

    Bas Hofstra

    Syntax used:

    cd "users/mypath/"

    forvalue x = 20001(1)36872 { // All respondents where we downloaded .txt files from

    clear
    capture import delimited using "`x'_friends.txt" // Force import, because from 20001-36872 only 4500 files present

    capture gen firstword = word(v1,1)
    capture gen x = firstword == "Vriend" // Force: set x to 1 if Vriend
    capture gen name`x' = v1 if x[_n-1]==1 // If previous is 1, then pick name

    capture keep name`x' // Keep only the variable name
    capture gen userID = `x' // Make userID

    capture drop if missing(name`x') // Drop missings
    capture save "Friend data/names`x'.dta", replace // Save loose files: ~16,000 now, because of the forced save
    }

    clear
    cd "users/mypath/" // Apparently have to set cd again..

    use "names20001.dta"
    gen name = name20001 // Gen name variable for all datasets

    forvalue x = 20002(1)36872 { // For all available files

    append using "names`x'" // Append all other files
    capture replace name = name`x' if !missing(name`x') // Force to replace
    capture drop name`x' // Drop useless variable
    }


    drop name20001 // Drop name variable from first file

    save "fds14_friendlists.dta", replace // Save these data

    forvalue x = 20001(1)36872 { // Delete intermediate files --> they only take space and are not necessary

    erase "names`x'.dta"
    }


  • #2
    The answer to your question is no: you can't process files with non-ASCII characters in Stata.

    If you are on Windows, you can rely on SHORT names.

    In general, rename the files to 1,2,3... and then feed them to Stata.
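    That renaming step could be scripted outside Stata. Here is a minimal Python sketch (a hypothetical helper — the function name, folder layout, and mapping-file name are my own), which also records a mapping so the original file names can be recovered later:

```python
import csv
import os

def rename_sequentially(folder, mapping_csv="mapping.csv"):
    """Rename every .txt file in `folder` to 1.txt, 2.txt, ... and
    record old vs. new names in a CSV for later recovery.

    Assumes the original names are not themselves plain numbers
    like "1.txt", which could collide with the new names.
    """
    # Collect the .txt files up front, so the mapping file itself and
    # already-renamed files don't interfere with the listing.
    txt_files = sorted(n for n in os.listdir(folder) if n.endswith(".txt"))
    with open(os.path.join(folder, mapping_csv), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["new_name", "old_name"])
        for i, name in enumerate(txt_files, start=1):
            new_name = f"{i}.txt"
            os.rename(os.path.join(folder, name),
                      os.path.join(folder, new_name))
            writer.writerow([new_name, name])
```

    The Stata loop would then simply run over 1/N instead of the original file names.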

    It looks like you are processing social network data. Remove any and all names early in your analysis and process by IDs only. This should save a lot of time and frustration.

    Best, Sergiy



    • #3
      Dear Sergiy,

      Thank you very much for your answer, even though the outcome is disappointing. It seems I should invest time in learning other programming languages in order to solve my problem.

      I do indeed process social network data; in the next step I attach attributes to names conditional on those names. After this, I anonymize and change names to IDs.

      Kind regards,

      Bas



      • #4
        You may also want to consider looking at the file and/or filefilter commands. If you are familiar with any tools in any other languages that can manage translation of those character sets to ASCII characters you could always shell out and use those processing tools to get the same/similar results.
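        As an illustration of that "shell out" route, here is a hypothetical Python sketch (the function is my own, not part of any package) that strips a Latin-1 file down to plain ASCII via Unicode decomposition — note that this loses information, so it only fits workflows where the exact original names are not needed:

```python
import unicodedata

def latin1_to_ascii(src, dst):
    """Read a Latin-1 encoded text file and write an ASCII-only copy.

    Accented characters are decomposed (e.g. Ä -> A + combining mark)
    and the combining marks, which ASCII cannot represent, are dropped.
    Characters with no decomposition (e.g. þ) are silently removed.
    """
    with open(src, encoding="latin-1") as f:
        text = f.read()
    ascii_text = (unicodedata.normalize("NFKD", text)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    with open(dst, "w", encoding="ascii") as f:
        f.write(ascii_text)
```

        The converted file could then be imported in the existing loop in place of the original.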



        • #5

          Originally posted by bashofstra View Post
          I do indeed process social network data; in the next step I attach attributes to names conditional on those names. After this, I anonymize and change names to IDs.
          That seems to me the wrong way around. My strategy would be to start by creating the IDs, merge those in, and afterwards never touch the names again and always work with the IDs. The reason is that the same ID is guaranteed to refer to the same person, and there is no such guarantee with names. Names also tend to contain a lot of errors, as typing them in is very error-prone (spelling mistakes, especially in foreign names; switching between common variants of the same name (Jakob, Jacob); capitalization (Van der Vaart, Van Der Vaart, van der Vaart); etc.). When using ID numbers, you only have to resolve these problems once, to create the IDs, and afterwards you just don't use the names anymore.
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            I am aware that names contain a lot of errors, and this might also be the case for my data. However, the attributes that I attach to the names are specific to each name. For instance, socioeconomic status corresponds with names: the Dutch name "Josephine" might indicate higher socioeconomic status than "Kevin". I attach attributes by matching my names to name-based register data for an entire population, and in order to match to these data I need the exact names (including uncommon characters), so converting to ASCII is unfortunately not an option.



            • #7
              Originally posted by bashofstra View Post
              However, the attributes that I attach to the names are specific to each name. For instance, socioeconomic status corresponds with names: the Dutch name "Josephine" might indicate higher socioeconomic status than "Kevin"
              That makes sense.

              That reminds me of the mean quote from a teacher in Germany who was asked to evaluate names of fictional students: "Kevin ist kein Name, sondern eine Diagnose". Translated: "Kevin isn't a name, it's a diagnosis".
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------



              • #8
                Preface: I have little knowledge of what ASCII vs Unicode means, so this may not be very helpful.

                That said, I don't have any trouble parsing the character Ä in Stata (code 196 in the extended ASCII / Latin-1 table, from http://www.ascii-code.com/):
                clear all
                set obs 1

                //
                // Testing whether Stata can display Ä
                //

                // 1) Can Stata display Ä?
                di "Ä" // Ä is alt-0-196 on Windows (use numeric keypad)
                // Success!

                // 2) Can Ä be stored as a string variable?
                gen x = "Ä"
                di x[1]
                // Success!


                //
                // Testing whether Stata can file read/write Ä
                //

                tempfile myfile
                tempname fh

                // 1) Try writing a file with Ä
                file open `fh' using `myfile', write
                file write `fh' "first line normal" _n
                file write `fh' "second line normal" _n
                file write `fh' "third line has non-ascii Ä. Is this a problem? " _n
                file write `fh' "fourth line normal" _n
                file close `fh'
                // Success!

                // 2) Try reading a file with Ä
                local linenum = 0
                file open `fh' using `myfile', read
                file read `fh' line
                while r(eof)==0 {
                    local linenum = `linenum' + 1
                    display %4.0f `linenum' _asis `" `macval(line)'"'
                    file read `fh' line
                }
                file close `fh'
                // Success!

                If I haven't understood this correctly, and ASCII and Unicode are encodings, where an ASCII Ä is encoded into binary differently from a Unicode Ä, then two options come to mind if you want to remain in Stata (someone correct me if either is infeasible or impractical):
                1. Use Stata to convert the binary to ASCII, then read that. This would require mapping the Unicode binary representation of Ä to its ASCII binary, then reading the mapped value as ASCII. Stata's file read (file write) and Mata's fread() and fwrite() support binary.
                2. Find a third-party converter, launchable from the command line, that converts a file from Unicode to ASCII. You would then just need to add a line in your first loop to shell out and convert each file before reading it.
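                If the files turn out to be UTF-8, a variant of option 2 is to re-encode them to Latin-1 rather than ASCII, so that a non-Unicode tool sees exactly one byte per accented character and no information is lost. A hypothetical Python sketch (the function name and paths are my own):

```python
def utf8_to_latin1(src, dst):
    """Re-encode a UTF-8 text file as Latin-1.

    Every character in the Latin-1 supplement (Ä, Ú, î, õ, ...) becomes
    a single byte; characters outside Latin-1 raise UnicodeEncodeError
    instead of being silently mangled.
    """
    with open(src, encoding="utf-8") as f:
        text = f.read()
    with open(dst, "w", encoding="latin-1") as f:
        f.write(text)
```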



                • #9
                  Actually, Unicode is not an encoding. ASCII, Latin-1, and UTF-8 are encodings, but Unicode is just an abstract form of the text.
                  Thinking of Unicode as an encoding is a very common source of errors, and causes many mistakes in e.g. Perl/Python:
                  http://stackoverflow.com/questions/3...951740#3951740
                  http://www.joelonsoftware.com/articles/Unicode.html
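                  The distinction is easy to see in a few lines of Python: one abstract character, three different byte sequences depending on the encoding chosen:

```python
# U+00C4 ("Ä") is a single abstract Unicode code point; each encoding
# maps it to a different byte sequence.
s = "\u00c4"
assert s.encode("latin-1") == b"\xc4"          # one byte
assert s.encode("utf-8") == b"\xc3\x84"        # two bytes
assert s.encode("utf-16-be") == b"\x00\xc4"    # two bytes, big-endian
```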

                  That said, if his text files are really in Latin-1, then Stata *should* support it. E.g.:
                  Code:
                  clear
                  set obs 1
                  gen x = "çæäãÆþ"
                  di strpos(x,"þ")
                  list
                  outsheet using foo.raw
                  insheet using foo, clear
                  What may be happening is that Bas is just using a font that doesn't support Latin-1, so he sees ?? instead of the real characters. Bas: try Ubuntu Mono / Consolas / a newish monospace font, and see if the error persists.



                  • #10
                    Hello everybody,

                    Thank you all very much; I think all the suggestions might work!

                    I chose the easy way out and parsed the rows out of my text files via R, saving the appended list as .txt. I wasn't able to immediately save the correct identifiers next to my names. So I opened the .txt file and just copy/pasted the column into Stata, next to the identifiers, with which I did not have any problems (I checked and double-checked that everything pasted correctly, and I am aware that this is very error-prone). It seems Sergiy is indeed correct, because as soon as I copy/paste the appended list into Stata, the characters stay as they should! So all of my text files might be saved in an encoding that Stata does not recognize.



                    • #11
                      Another option is to use Python, R or whatever tool to encode the text into their underlying UTF-16 hexadecimals, then enclose these in strings in Stata. There would be no loss of information.
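                      A sketch of that round trip in Python (the sample name is invented): the hex form is plain ASCII, so it survives any ASCII-only pipeline, and decoding restores the original text exactly:

```python
# Encode a name to UTF-16 (big-endian) hex digits and decode it back.
name = "André Õst"  # invented sample with non-ASCII characters
hex_form = name.encode("utf-16-be").hex()      # ASCII-only hex digits
restored = bytes.fromhex(hex_form).decode("utf-16-be")
assert restored == name  # no loss of information
```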

