Reading fixed format text file with unicode string variables

Ulrich Kohler

Join Date: May 2014
Posts: 89

30 Nov 2016, 07:42

Hi all,

I am having trouble reading a fixed-format text file whose string variables contain Unicode characters (specifically the German letters ÄÖÜäöü and ß). The problem seems to be that -infile- determines the columns by byte length, whereas my dictionary specifies the columns in numbers of characters. In UTF-8 each of these letters occupies two bytes, so every field after such a letter is shifted by one byte.

Here is a minimal example (note that in practice I have many of these files and they are large):

My text file 0188ANO.TXT contains:

Code:

8705Ebersbach         1210Ind.                    1210R0051    42 3 32   880704880811  .  .      8809018808300
    Bad Dürrenberg    0811Ind.                    0811E0181    52 3 33   880704880815  .  .      8809228809200
    Berlin            1515Handel                  1515E0184    62 3 32   880704881001  .  .      8810128810030
8900Görlitz           1232Handel                  1232E0054    72 3 31   880704880815  .  .      8808108808090
6500Gera              1031Handel                  1031E0184    82 3 32   880704  .  .  .  .      8808248808180
4900Zeitz             0820Handel                  0820    4    92        880704                              0

The dictionary I use to read the file is:

Code:

dictionary {
str4 F2 %4s "Postleitzahl"
str18 F3 %18s "Wohnort"
str4 F4 %4s "Wohnbezirk bzw. Kreis-ID"
str24 F5 %24s "Problemschlüssel (sachlich)"
str4 F6 %4s "Problemschlüssel (territoriale Zuordnung)"
str4 F7 %4s "Problemschlüssel (sachlich) (Kurzbezeichnung)"
str1 F8 %1s "Adressat"
str1 F9 %1s "Gerichtliche Nachprüfung"
str4 FB %4s "laufende Nummer"
str1 FC %1s "Form der Eingabe"
str1 FD %1s "Bezugsinhalt der Eingabe"
str2 FE %2s "Charakter der Eingabe"
str1 FF %1s "Art der Bearbeitung"
str1 FG %1s "Abgabe/Rückantwort durch"
str3 FH %3s "Soziale Merkmale des Einreichers"
str6 FL %6s "Eingangsdatum"
str6 FM %6s "Entscheidungstermin"
str6 FN %6s "Wiedervorlagedatum"
str6 FP %6s "Reserve-Datumsfeld"
str6 FQ %6s "Zu-den-Akten-Datum"
str6 FR %6s "Datum des Bezugsschreibens"
str1 FS %1s "Leerfeld 1 Zeichen"
}

I invoke the dictionary with

Code:

. infile using DC20MD.dct, using(ANOTXT/0188ANO.TXT)

which gives me

Code:

. list F2 F3 F4 F5 in 4/9

     +----------------------------------------+
     |   F2               F3     F4        F5 |
     |----------------------------------------|
  4. | 8705        Ebersbach   1210      Ind. |
  5. |        Bad Dürrenberg    081     1Ind. |
  6. |                Berlin   1515    Handel |
  7. | 8900          Görlitz    123   2Handel |
  8. | 6500             Gera   1031    Handel |
     |----------------------------------------|
  9. | 4900            Zeitz   0820    Handel |
     +----------------------------------------+

The entries in F4, F5, ... are shifted for the places whose names contain Unicode characters (Bad Dürrenberg and Görlitz).
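To make the shift concrete, here is a minimal Python sketch (not Stata) that cuts the second sample record once by characters and once by UTF-8 bytes. The helper `split_fixed` and the rebuilt, truncated record are illustrative assumptions; only the four field widths come from the dictionary above. Cutting by bytes reproduces the faulty F4 and F5 values from the listing.

```python
# Widths (in characters) of the first four dictionary fields: F2, F3, F4, F5.
WIDTHS = [4, 18, 4, 24]

def split_fixed(record, widths):
    """Cut a str or bytes record into consecutive fields of the given widths."""
    fields, start = [], 0
    for width in widths:
        fields.append(record[start:start + width])
        start += width
    return fields

# Second sample record, truncated after F5 and rebuilt from the field widths
# (an assumption: the forum rendering may not preserve the exact spacing).
line = "    " + "Bad Dürrenberg".ljust(18) + "0811" + "Ind.".ljust(24)

# Slicing the decoded text counts characters, as the dictionary intends.
by_chars = [f.strip() for f in split_fixed(line, WIDTHS)]

# Slicing the UTF-8 bytes counts bytes; "ü" takes two, so later fields shift.
# (Decoding works here because no slice happens to cut through the "ü".)
by_bytes = [f.decode("utf-8").strip()
            for f in split_fixed(line.encode("utf-8"), WIDTHS)]

print(by_chars[2:])  # ['0811', 'Ind.']
print(by_bytes[2:])  # ['081', '1Ind.']
```

The record is 50 characters but 51 bytes long, which is exactly the one-byte offset visible in rows 5 and 7 of the listing.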

I have also tried to specify the start columns of the variables using -_column(#)- in the dictionary, but without success. -help infile2- says something about EBCDIC encoding, but I do not understand what this means. If I specify -ebcdic-, Stata does not import any data.

I realize that one solution would be to convert the UTF-8 file to an extended-ASCII encoding, read in the converted file, and then translate the resulting Stata dataset back to Unicode. I can do that by hand in my editor, but as I have many files I would prefer a programmatic solution (either a Stata command or a Unix command line). Correct me if I am wrong, but I cannot use -unicode translate- to convert from -utf8- to, say, -latin1-, right?

Any insights highly appreciated.

Bjarte Aagnes

Join Date: Apr 2014

Posts: 785
#2

30 Nov 2016, 13:49

-unicode convertfile- will convert a text file from one encoding to another (one file at a time).
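For many files, the same re-encoding could also be scripted outside Stata. The following is a minimal Python sketch, assuming the sources are valid UTF-8 and contain only characters that exist in Latin-1 (true for ÄÖÜäöüß); the function names, the `*.TXT` pattern, and the output folder are hypothetical.

```python
from pathlib import Path

def convert_to_latin1(src, dst):
    """Re-encode one UTF-8 text file as Latin-1 (ISO-8859-1)."""
    # newline="" leaves the original line endings untouched on read and write.
    with open(src, encoding="utf-8", newline="") as fin:
        text = fin.read()
    # Raises UnicodeEncodeError if a character has no Latin-1 equivalent.
    with open(dst, "w", encoding="latin-1", newline="") as fout:
        fout.write(text)

def convert_folder(src_dir, dst_dir, pattern="*.TXT"):
    """Convert every matching file in src_dir, writing copies to dst_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(Path(src_dir).glob(pattern)):
        convert_to_latin1(src, out / src.name)
        converted.append(src.name)
    return converted
```

A call such as `convert_folder("ANOTXT", "ANOTXT_latin1")` would write one-byte-per-character copies, so byte columns and character columns coincide again. The dataset read from those copies would still need translating back to Unicode in Stata, as described in the question.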