Using "infix" command to read in fixed format ASCII file when each line of observation has different variables

Lu Jinks

Join Date: Sep 2017

Posts: 2
#1

Using "infix" command to read in fixed format ASCII file when each line of observation has different variables

28 Sep 2017, 18:13

Hello Statalist friends,

I have an ASCII file that is in fixed format, and I know the layout of the variables. I've written a dictionary to read in the observations to Stata.

Here is an example of an observation in the ASCII file (I skipped the variables that are unavailable, which was originally just empty space, and this does not affect the meaning of my question):

Claim2094IFNWC201406030745MOCA20140603201406032014 06032014071800000000FLFL09002227005200340000003333

I know exactly the layout of the variables, so I create a dictionary for Stata to read in the data, below is an example of my dictionary:

str5 record 1-5
int client 4-7
str3 vendor 40-42
str2 line 46-47
long date_of_incident 48-55

Many of the observations in my ASCII file contain different variables. So for example, below are three different observations in the ASCII file (one observation for each line):

Claim2094IFNWC201406030745MOCA20140603201406032014 06032014071800000000FLFL09002227005200340000003333
Payment2094IFNC384411842015042420150424PNN00000000 YN1500000920FSYS649
Status2094IFN20140718113018CAN000000281841.700

I only need the observations that with a record type "Claim", and don't need the observations starting with "payment" or "status". I add a command at the end of the dictionary: if record=="claim". However, I am not able to run my infix command and Stata gives me error message "dictionary invalid". I suspect this is because the observations have different variables/columns, and Stata is not able to skip other observations and only read in the ones that I need?

Is my suspicion correct? How can I ask Stata to only read in the observations that only start with "Claim"?

Any suggestions are appreciated!

Last edited by Lu Jinks; 28 Sep 2017, 18:32.
Tags: None
Robert Picard

Join Date: Mar 2014

Posts: 1536
#2

29 Sep 2017, 10:24

You don't show exactly what you used so it's hard to say for sure what's wrong but you can't have an if qualifier in a dictionary. Also, the variable specifications you show do not align with the data example you show.

With this type of data, it's almost easier to read the whole line as a string and use substr() to pluck out values. When I use infix, I typically skip the dictionary and just specify the location and variable type directly. The following works with your example, it's up to you do correctly align the column range for each variable:

Code:

quietly infix /// str5 record 1-5 /// int client 6-9 /// str3 vendor 40-42 /// str2 line 46-47 /// long date_of_incident 52-60 /// using raw_data.txt if record == "Claim", clear
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

29 Sep 2017, 10:44

On second though, I think it's better to input all variables as string and then convert them to numeric as needed. That way, you do not need to prefix the command with quietly (to avoid type conversions for non-claim records). Something like:

Code:

infix ///
    str5 record 1-5 ///
    str4 client 6-9 ///
    str3 vendor 40-42 ///
    str2 line   46-47 ///
    str8 date_of_incident 52-59 ///
    using raw_data.txt if record == "Claim", clear

destring client, replace
gen doi = daily(date_of_incident, "MDY")
format %td doi

and the results:

Code:

. list

     +--------------------------------------------------------+
     | record   client   vendor   line   date_o~t         doi |
     |--------------------------------------------------------|
  1. |  Claim     2094      014     32   06032014   03jun2014 |
     +--------------------------------------------------------+

Comment

Mike Lacy

Join Date: Apr 2014

Posts: 2411
#4

29 Sep 2017, 11:30

Robert's suggestion, which is the approach I would take, gives me an idea for an ado-file I don't have the energy (ability?) to write: Interpreting and processing each data record differently depending on some characteristic of the record is a standard feature of a number of language/packages. Going back to the days when lava was still freezing into rock, I saw this feature in COBOL ("redefines" or something like that), and SPSS ("reread,", if I recall correctly). One Stata-ish approach I can think of would be something like a "-filter_record-" command that, for each record type in a file, would read each line as a string, detect its record type, and write it out to a record-type specific text file, which then could be conventionally processed with infix. Having some front end for the situation of multiple record types in a file, which at least in the past was not uncommon as a data format, would be useful.
Comment
Lu Jinks

Join Date: Sep 2017

Posts: 2
#5

29 Sep 2017, 12:34

Thank you Robert and Mike! The commend "using raw_data.txt if record == "Claim", clear" really worked, and I got the observations that I wanted! Thanks again!
Comment

Announcement

Using "infix" command to read in fixed format ASCII file when each line of observation has different variables

Comment

Comment

Comment

Comment