Hello everybody,
For the past few weeks I have been struggling with the problem that Stata 13/SE does not read Unicode (specifically the Latin-1 supplement).
The task:
I have 4,500 text files from which I want to parse certain rows (i.e., names). With the syntax below I managed to do this: I let Stata loop over the 4,500 text files and keep the row that follows each row in which "Vriend" is mentioned. What I get is an appended list of rows (names from these text files), with an identifier in front of each row stating which text file it comes from. Finally, I delete all of the intermediate files that the loop created and save one long appended list. (I use the capture command a lot because the text files are named 20001_friends.txt to 36872_friends.txt, while only about 4,500 of the roughly 16,000 possible files actually exist.)
The problem:
So far so good. However, the appended list of names now contains a lot of errors, because Stata does not read Unicode: accented characters in names, such as Ä, Ú, î, and õ, are replaced by ??? or /, which I do not want. I want to keep the original names, since I want to run additional analyses on them.
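If it helps with diagnosis: I suspect the .txt files are UTF-8 encoded, while pre-Unicode Stata treats strings as single-byte text, so the multi-byte sequence of each accented character gets mangled. A quick sketch of the suspected mismatch in Python (outside Stata; the name is made up):

```python
# Sketch: what a UTF-8 encoded "Ä" turns into when its bytes are
# (mis)read as single-byte text, as a pre-Unicode program would do.
name = "Äbel"                    # hypothetical name with a Latin-1 supplement character
raw = name.encode("utf-8")       # the bytes actually stored in the .txt file

# Read as the Windows code page (one byte = one character): the name is garbled.
print(raw.decode("cp1252"))      # -> "Ã„bel"

# Read as plain ASCII with replacement: the character becomes "?"-style junk.
print(raw.decode("ascii", errors="replace"))
```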
My question:
My question is: is there a way in Stata 13/SE to keep the original names, WITH the uncommon characters in them?
I hope you can help me out, thank you in advance,
Bas Hofstra
Syntax used:
cd "users/mypath/"
forvalues x = 20001(1)36872 { // All respondents for whom we downloaded .txt files
    clear
    capture import delimited using "`x'_friends.txt" // capture: of 20001-36872, only ~4,500 files exist
    capture gen firstword = word(v1,1) // First word of each row
    capture gen x = firstword == "Vriend" // Flag rows whose first word is "Vriend"
    capture gen name`x' = v1 if x[_n-1]==1 // If the previous row is flagged, pick up the name
    capture keep name`x' // Keep only the name variable
    capture gen userID = `x' // Make userID
    capture drop if missing(name`x') // Drop missings
    capture save "Friend data/names`x'.dta", replace // Save loose files, one per existing input file
}
clear
cd "users/mypath/Friend data/" // Move into the folder where the loop saved the files
use "names20001.dta"
gen name = name20001 // Gen one name variable across all datasets
forvalues x = 20002(1)36872 { // For all available files
    capture append using "names`x'" // capture: append only the files that exist
    capture replace name = name`x' if !missing(name`x') // Fold each file's names into the single name column
    capture drop name`x' // Drop the now-redundant variable
}
drop name20001 // Drop the name variable from the first file
save "fds14_friendlists.dta", replace // Save these data
forvalues x = 20001(1)36872 { // Delete the loose files --> they only take space and are not necessary
    capture erase "names`x'.dta" // capture: only ~4,500 of these exist
}
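One workaround I am considering (a sketch, not tested on the real data): re-encode every .txt file from UTF-8 to the Windows code page (cp1252, which covers the Latin-1 supplement) before running the import loop, on the assumption that the files really are UTF-8. The directory and file range below mirror the Stata syntax above:

```python
# Sketch: convert the input files from UTF-8 to cp1252 so that a
# single-byte Stata keeps accented characters like Ä, Ú, î, õ.
from pathlib import Path

src_dir = Path("users/mypath")          # same folder as in the Stata syntax
for x in range(20001, 36873):           # same range as the import loop
    f = src_dir / f"{x}_friends.txt"
    if not f.exists():                  # only ~4,500 of the ~16,000 names exist
        continue
    text = f.read_text(encoding="utf-8")
    # Characters outside cp1252 (if any) are replaced rather than crashing:
    f.write_text(text, encoding="cp1252", errors="replace")
```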