  • Stata 13/SE: Looping through +/- 4500 .txt files cleaning them and keep Unicode characters

    Hello everybody,

    For the last few weeks I have been struggling with the problem that Stata 13/SE does not read Unicode (specifically the Latin-1 supplement).

    The task:
    I have 4500 text files from which I want to parse certain rows (i.e., names). With the syntax below I managed to do this: I let Stata loop through the 4500 text files and keep the row after the row in which "Vriend" is mentioned. What I get is an appended list of rows (names from these text files), where the identifier in front of each row states which text file it comes from. Finally, I delete all of the roughly 16,000 intermediate files that the loop creates and save one long appended list. (I use the capture command a lot, because the text files are named 20001_friends.txt to 36872_friends.txt, while only about 4500 of those ~16,000 possible files actually exist.)

    The problem:
    So far so good. However, the appended list of names now contains a lot of errors, because Stata does not read Unicode. Special characters in names, such as Ä, Ú, î, and õ, are replaced by ??? or /, which I do not want. I want to keep the original names, since I want to do some additional analyses on them.

    My question:
    My question is: is there a way in Stata 13/SE to keep the original names, WITH the uncommon characters in them?

    I hope you can help me out, thank you in advance,

    Bas Hofstra

    Syntax used:

    cd "users/mypath/"

    forvalue x = 20001(1)36872 { // All respondents where we downloaded .txt files from

    clear
    capture import delimited using "`x'_friends.txt" // Force import, because from 20001-36872 only 4500 files present

    capture gen firstword = word(v1,1)
    capture gen x = firstword == "Vriend" // Force: set x to 1 if Vriend
    capture gen name`x' = v1 if x[_n-1]==1 // If previous is 1, then pick name

    capture keep name`x' // Keep only the variable name
    capture gen userID = `x' // Make userID

    capture drop if missing(name`x') // Drop missings
    capture save "Friend data/names`x'.dta", replace // Save loose files: ~16,000 now, because of the forced save
    }

    clear
    cd "users/mypath/" // Apparently have to set cd again..

    use "names20001.dta"
    gen name = name20001 // Gen name variable for all datasets

    forvalue x = 20002(1)36872 { // For all available files

    append using "names`x'" // Append all other files
    capture replace name = name`x' if !missing(name`x') // Force to replace
    capture drop name`x' // Drop useless variable
    }


    drop name20001 // Drop name variable from first file

    save "fds14_friendlists.dta", replace // Save these data

    forvalue x = 20001(1)36872 { // Delete intermediate files --> they only take space and are not necessary

    erase "names`x'.dta"
    }


  • #2
    The answer to your question is no: you can't process files with non-ASCII characters in Stata.

    If you are on Windows, you can rely on SHORT names.

    In general, rename the files to 1,2,3... and then feed them to Stata.
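    That renaming step could be scripted outside Stata. Here is a minimal Python sketch (a hypothetical helper — the function name, folder layout, and mapping-file name are my own), which also records a mapping so the original file names can be recovered later:

```python
import csv
import os

def rename_sequentially(folder, mapping_csv="mapping.csv"):
    """Rename every .txt file in `folder` to 1.txt, 2.txt, ... and
    record old vs. new names in a CSV for later recovery.

    Assumes the original names are not themselves plain numbers
    like "1.txt", which could collide with the new names.
    """
    # Collect the .txt files up front, so the mapping file itself and
    # already-renamed files don't interfere with the listing.
    txt_files = sorted(n for n in os.listdir(folder) if n.endswith(".txt"))
    with open(os.path.join(folder, mapping_csv), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["new_name", "old_name"])
        for i, name in enumerate(txt_files, start=1):
            new_name = f"{i}.txt"
            os.rename(os.path.join(folder, name),
                      os.path.join(folder, new_name))
            writer.writerow([new_name, name])
```

    The Stata loop would then simply run over 1/N instead of the original file names.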

    It looks like you are processing social network data. Remove any and all names early in your analysis and process by IDs only. This should save a lot of time and frustration.

    Best, Sergiy



    • #3
      Dear Sergiy,

      Thank you very much for your answer, even though the outcome is disappointing. It seems I should invest time in learning other programming languages in order to solve my problem.

      I do indeed process social network data; in the next step I attach attributes to names conditional on those names. After this, I anonymize and change names to IDs.

      Kind regards,

      Bas



      • #4
        You may also want to consider looking at the file and/or filefilter commands. If you are familiar with any tools in any other languages that can manage translation of those character sets to ASCII characters you could always shell out and use those processing tools to get the same/similar results.
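        As an illustration of that "shell out" route, here is a hypothetical Python sketch (the function is my own, not part of any package) that strips a Latin-1 file down to plain ASCII via Unicode decomposition — note that this loses information, so it only fits workflows where the exact original names are not needed:

```python
import unicodedata

def latin1_to_ascii(src, dst):
    """Read a Latin-1 encoded text file and write an ASCII-only copy.

    Accented characters are decomposed (e.g. Ä -> A + combining mark)
    and the combining marks, which ASCII cannot represent, are dropped.
    Characters with no decomposition (e.g. þ) are silently removed.
    """
    with open(src, encoding="latin-1") as f:
        text = f.read()
    ascii_text = (unicodedata.normalize("NFKD", text)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    with open(dst, "w", encoding="ascii") as f:
        f.write(ascii_text)
```

        The converted file could then be imported in the existing loop in place of the original.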



        • #5

          Originally posted by bashofstra View Post
          I do indeed process social network data; in the next step I attach attributes to names conditional on those names. After this, I anonymize and change names to IDs.
          That seems to me the wrong way around. My strategy would be to start by creating the IDs, merge those in, and afterwards never touch the names again and always work with the IDs. The reason is that the same ID is guaranteed to refer to the same person, and there is no such guarantee with names. Names also tend to contain a lot of errors, as typing them in is very error-prone (spelling mistakes, especially in foreign names; switching between common variants of the same name (Jakob, Jacob); capitalization (Van der Vaart, Van Der Vaart, van der Vaart); etc.). When using ID numbers, you only have to resolve these problems once, to create the IDs, and afterwards you just don't use the names anymore.
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            I am aware that names contain a lot of errors, and this might also be the case for my data. However, the attributes that I attach to the names are specific to each name. For instance, socioeconomic status corresponds with names: the Dutch name "Josephine" might indicate higher socioeconomic status than "Kevin". I attach attributes by matching my names to name-based register data for an entire population, and in order to match to these data I need the exact names (including uncommon characters), so converting to ASCII is unfortunately not an option.



            • #7
              Originally posted by bashofstra View Post
              However, the attributes that I attach to the names are specific to each name. For instance, socioeconomic status corresponds with names: the Dutch name "Josephine" might indicate higher socioeconomic status than "Kevin"
              That makes sense.

              That reminds me of the mean quote from a teacher in Germany who was asked to evaluate names of fictional students: "Kevin ist kein Name, sondern eine Diagnose". Translated: "Kevin isn't a name, it's a diagnosis".
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------



              • #8
                Preface: I have little knowledge of what ASCII vs Unicode means, so this may not be very helpful.

                That said, I don't have any trouble parsing the character Ä in Stata (code 196 in the extended ASCII / Latin-1 table, from http://www.ascii-code.com/):
                clear all
                set obs 1

                //
                // Testing whether Stata can display Ä
                //

                // 1) Can Stata display Ä?
                di "Ä" // Ä is alt-0-196 on Windows (use numeric keypad)
                // Success!

                // 2) Can Ä be stored as a string variable?
                gen x = "Ä"
                di x[1]
                // Success!


                //
                // Testing whether Stata can file read/write Ä
                //

                tempfile myfile
                tempname fh

                // 1) Try writing a file with Ä
                file open `fh' using `myfile', write
                file write `fh' "first line normal" _n
                file write `fh' "second line normal" _n
                file write `fh' "third line has non-ascii Ä. Is this a problem? " _n
                file write `fh' "fourth line normal" _n
                file close `fh'
                // Success!

                // 2) Try reading a file with Ä
                local linenum = 0
                file open `fh' using `myfile', read
                file read `fh' line
                while r(eof)==0 {
                    local linenum = `linenum' + 1
                    display %4.0f `linenum' _asis `" `macval(line)'"'
                    file read `fh' line
                }
                file close `fh'
                // Success!

                If I haven't understood this correctly, and ASCII and Unicode are encodings, where an ASCII Ä is encoded into binary differently from a Unicode Ä, then two options come to mind if you want to remain in Stata (someone correct me if either is infeasible or impractical):
                1. Use Stata to convert the binary to ASCII, then read that. This would require mapping the Unicode binary representation of Ä to its ASCII binary, then reading the mapped value as ASCII. Stata's file read (file write) and Mata's fread() and fwrite() support binary.
                2. Find a third-party converter, launchable from the command line, that converts a file from Unicode to ASCII. You would then just need to add a line in your first loop to shell out and convert each file before reading it.
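                If the files turn out to be UTF-8, a variant of option 2 is to re-encode them to Latin-1 rather than ASCII, so that a non-Unicode tool sees exactly one byte per accented character and no information is lost. A hypothetical Python sketch (the function name and paths are my own):

```python
def utf8_to_latin1(src, dst):
    """Re-encode a UTF-8 text file as Latin-1.

    Every character in the Latin-1 supplement (Ä, Ú, î, õ, ...) becomes
    a single byte; characters outside Latin-1 raise UnicodeEncodeError
    instead of being silently mangled.
    """
    with open(src, encoding="utf-8") as f:
        text = f.read()
    with open(dst, "w", encoding="latin-1") as f:
        f.write(text)
```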



                • #9
                  Actually, Unicode is not an encoding. ASCII, Latin-1, and UTF-8 are encodings, but Unicode is just an abstract form of the text.
                  Thinking of Unicode as an encoding is a very common source of errors, and causes many mistakes in e.g. Perl/Python:
                  http://stackoverflow.com/questions/3...951740#3951740
                  http://www.joelonsoftware.com/articles/Unicode.html
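                  The distinction is easy to see in a few lines of Python: one abstract character, three different byte sequences depending on the encoding chosen:

```python
# U+00C4 ("Ä") is a single abstract Unicode code point; each encoding
# maps it to a different byte sequence.
s = "\u00c4"
assert s.encode("latin-1") == b"\xc4"          # one byte
assert s.encode("utf-8") == b"\xc3\x84"        # two bytes
assert s.encode("utf-16-be") == b"\x00\xc4"    # two bytes, big-endian
```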

                  That said, if his text files are really in Latin-1, then Stata *should* support it. E.g.:
                  Code:
                  clear
                  set obs 1
                  gen x = "çæäãÆþ"
                  di strpos(x,"þ")
                  list
                  outsheet using foo.raw
                  insheet using foo, clear
                  What may be happening is that Bas is just using a font that doesn't support Latin-1, so he sees ?? instead of the real characters. Bas: try Ubuntu Mono / Consolas / a newish monospace font, and see if the error persists.



                  • #10
                    Hello everybody,

                    Thank you all very much; I think all the suggestions might work!

                    I chose the easy way out and parsed the rows out of my text files via R, saving the appended list as .txt. I wasn't able to immediately save the correct identifiers next to my names. So I opened the .txt file and just copy/pasted the column into Stata, next to the identifiers, with which I did not have any problems (I checked and double-checked that everything pasted correctly, and I am aware that this is very error-prone). It seems Sergiy is indeed correct, because as soon as I copy/paste the appended list into Stata, the characters stay as they should! So all of my text files might be saved in an encoding that Stata does not recognize.



                    • #11
                      Another option is to use Python, R or whatever tool to encode the text into their underlying UTF-16 hexadecimals, then enclose these in strings in Stata. There would be no loss of information.
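                      A sketch of that round trip in Python (the sample name is invented): the hex form is plain ASCII, so it survives any ASCII-only pipeline, and decoding restores the original text exactly:

```python
# Encode a name to UTF-16 (big-endian) hex digits and decode it back.
name = "André Õst"  # invented sample with non-ASCII characters
hex_form = name.encode("utf-16-be").hex()      # ASCII-only hex digits
restored = bytes.fromhex(hex_form).decode("utf-16-be")
assert restored == name  # no loss of information
```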

