Hi all,
I am having trouble reading a fixed-format text file whose string variables contain Unicode characters (specifically the German letters ÄÖÜäöü and ß). The problem seems to be that -infile- determines the columns in the text file by byte length, while my dictionary documents the columns in numbers of characters.
Here is a minimal example (note that in practice I have many of these files and they are large):
My text file 0188ANO.TXT contains:
Code:
8705Ebersbach 1210Ind. 1210R0051 42 3 32 880704880811 . . 8809018808300
    Bad Dürrenberg 0811Ind. 0811E0181 52 3 33 880704880815 . . 8809228809200
    Berlin 1515Handel 1515E0184 62 3 32 880704881001 . . 8810128810030
8900Görlitz 1232Handel 1232E0054 72 3 31 880704880815 . . 8808108808090
6500Gera 1031Handel 1031E0184 82 3 32 880704 . . . . 8808248808180
4900Zeitz 0820Handel 0820 4 92 880704 0
My dictionary to read the file is:
Code:
dictionary {
    str4  F2  %4s   "Postleitzahl"
    str18 F3  %18s  "Wohnort"
    str4  F4  %4s   "Wohnbezirk bzw. Kreis-ID"
    str24 F5  %24s  "Problemschlüssel (sachlich)"
    str4  F6  %4s   "Problemschlüssel (territoriale Zuordnung)"
    str4  F7  %4s   "Problemschlüssel (sachlich) (Kurzbezeichnung)"
    str1  F8  %1s   "Adressat"
    str1  F9  %1s   "Gerichtliche Nachprüfung"
    str4  FB  %4s   "laufende Nummer"
    str1  FC  %1s   "Form der Eingabe"
    str1  FD  %1s   "Bezugsinhalt der Eingabe"
    str2  FE  %2s   "Charakter der Eingabe"
    str1  FF  %1s   "Art der Bearbeitung"
    str1  FG  %1s   "Abgabe/Rückantwort durch"
    str3  FH  %3s   "Soziale Merkmale des Einreichers"
    str6  FL  %6s   "Eingangsdatum"
    str6  FM  %6s   "Entscheidungstermin"
    str6  FN  %6s   "Wiedervorlagedatum"
    str6  FP  %6s   "Reserve-Datumsfeld"
    str6  FQ  %6s   "Zu-den-Akten-Datum"
    str6  FR  %6s   "Datum des Bezugsschreibens"
    str1  FS  %1s   "Leerfeld 1 Zeichen"
}
I invoke the dictionary with:
Code:
. infile using DC20MD.dct, using(ANOTXT/0188ANO.TXT)
which gives me:
Code:
. list F2 F3 F4 F5 in 4/9

     +----------------------------------------+
     |   F2               F3     F4        F5 |
     |----------------------------------------|
  4. | 8705        Ebersbach   1210      Ind. |
  5. |        Bad Dürrenberg    081     1Ind. |
  6. |                Berlin   1515    Handel |
  7. | 8900          Görlitz    123   2Handel |
  8. | 6500             Gera   1031    Handel |
     |----------------------------------------|
  9. | 4900            Zeitz   0820    Handel |
     +----------------------------------------+

The entries in F4, F5, ... are incorrect for the communities whose names contain Unicode characters.
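To illustrate what I think is going on in rows 5 and 7 of the listing, here is a small Python sketch (not Stata code, and only my guess at the mechanism). It cuts one record once by characters and once by UTF-8 bytes, using the widths of F2 to F5 from my dictionary. The record is a hypothetical one that I padded to those widths myself, and the helper name `cut` is my own.

```python
# Widths of the first four dictionary fields: F2, F3, F4, F5.
WIDTHS = [4, 18, 4, 24]

def cut(record, widths):
    """Cut a sequence (str or bytes) into consecutive fixed-width pieces."""
    pieces, start = [], 0
    for w in widths:
        pieces.append(record[start:start + w])
        start += w
    return pieces

# Hypothetical record, padded to the character widths 4 + 18 + 4 + 24.
line = " " * 4 + "Bad Dürrenberg".ljust(18) + "0811" + "Ind.".ljust(24)

# Cutting by characters gives the fields as the dictionary documents them.
by_chars = [p.strip() for p in cut(line, WIDTHS)]

# Cutting by bytes: "ü" takes two bytes in UTF-8, so the 18-byte Wohnort
# field ends one character early and the later fields shift left by one.
by_bytes = [p.decode("utf-8", errors="replace").strip()
            for p in cut(line.encode("utf-8"), WIDTHS)]

print(by_chars)  # ['', 'Bad Dürrenberg', '0811', 'Ind.']
print(by_bytes)  # ['', 'Bad Dürrenberg', '081', '1Ind.']
```

The byte-based cut reproduces the "081" and "1Ind." that I see in F4 and F5, which is why I suspect that -infile- counts bytes.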
I have also tried to specify the start columns of the variables using -_column(#)- in the dictionary, but with no success. -help infile2- says something about EBCDIC encoding, but I do not understand what this means. If I specify -ebcdic-, Stata does not import any data.
I realized that one solution would be to translate the Unicode-encoded files into extended-ASCII-encoded files, read those in, and then translate the resulting Stata files back to Unicode. I can do that "by hand" in my editor, but as I have many files I would prefer a programmatic solution (either a Stata command or a Unix command line). Correct me if I am wrong, but I cannot use -unicode translate- to translate from -utf8- to, say, -latin1-, right?
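For the re-encoding step, something like the following Python sketch is roughly what I have in mind. The function names `reencode_file` and `reencode_folder`, the `*.TXT` pattern, and the separate output folder are placeholders of mine. It assumes the input files are valid UTF-8 and that every character exists in Latin-1 (true for ÄÖÜäöüß); anything else raises an error instead of being silently replaced.

```python
from pathlib import Path

def reencode_file(src, dst, src_enc="utf-8", dst_enc="latin-1"):
    """Rewrite src in dst_enc so that each character occupies one byte."""
    # newline="" leaves the original line endings untouched.
    with open(src, "r", encoding=src_enc, newline="") as fin, \
         open(dst, "w", encoding=dst_enc, newline="") as fout:
        for line in fin:
            # Raises UnicodeEncodeError if a character is not in dst_enc.
            fout.write(line)

def reencode_folder(in_dir, out_dir, pattern="*.TXT"):
    """Convert every matching file in in_dir; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(in_dir).glob(pattern)):
        dst = out_dir / src.name  # same file name, different folder
        reencode_file(src, dst)
        written.append(dst)
    return written
```

After that, byte and character positions coincide, so the dictionary widths should line up; the imported string variables would then still be Latin-1 and need translating back to Unicode in Stata.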
Any insights highly appreciated.