Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using "infix" command to read in fixed format ASCII file when each line of observation has different variables

    Hello Statalist friends,

    I have an ASCII file that is in fixed format, and I know the layout of the variables. I've written a dictionary to read in the observations to Stata.

    Here is an example of an observation in the ASCII file (I skipped the variables that are unavailable, which was originally just empty space, and this does not affect the meaning of my question):

    Claim2094IFNWC201406030745MOCA20140603201406032014 06032014071800000000FLFL09002227005200340000003333

    I know exactly the layout of the variables, so I create a dictionary for Stata to read in the data, below is an example of my dictionary:

    str5 record 1-5
    int client 4-7
    str3 vendor 40-42
    str2 line 46-47
    long date_of_incident 48-55

    Many of the observations in my ASCII file contain different variables. So for example, below are three different observations in the ASCII file (one observation for each line):

    Claim2094IFNWC201406030745MOCA20140603201406032014 06032014071800000000FLFL09002227005200340000003333
    Payment2094IFNC384411842015042420150424PNN00000000 YN1500000920FSYS649
    Status2094IFN20140718113018CAN000000281841.700

    I only need the observations that with a record type "Claim", and don't need the observations starting with "payment" or "status". I add a command at the end of the dictionary: if record=="claim". However, I am not able to run my infix command and Stata gives me error message "dictionary invalid". I suspect this is because the observations have different variables/columns, and Stata is not able to skip other observations and only read in the ones that I need?

    Is my suspicion correct? How can I ask Stata to only read in the observations that only start with "Claim"?

    Any suggestions are appreciated!
    Last edited by Lu Jinks; 28 Sep 2017, 18:32.

  • #2
    You don't show exactly what you used so it's hard to say for sure what's wrong but you can't have an if qualifier in a dictionary. Also, the variable specifications you show do not align with the data example you show.

    With this type of data, it's almost easier to read the whole line as a string and use substr() to pluck out values. When I use infix, I typically skip the dictionary and just specify the location and variable type directly. The following works with your example, it's up to you do correctly align the column range for each variable:

    Code:
    quietly infix ///
        str5 record 1-5 ///
        int  client 6-9 ///
        str3 vendor 40-42 ///
        str2 line   46-47 ///
        long date_of_incident 52-60 ///
        using raw_data.txt if record == "Claim", clear

    Comment


    • #3
      On second though, I think it's better to input all variables as string and then convert them to numeric as needed. That way, you do not need to prefix the command with quietly (to avoid type conversions for non-claim records). Something like:

      Code:
      infix ///
          str5 record 1-5 ///
          str4 client 6-9 ///
          str3 vendor 40-42 ///
          str2 line   46-47 ///
          str8 date_of_incident 52-59 ///
          using raw_data.txt if record == "Claim", clear
      
      destring client, replace
      gen doi = daily(date_of_incident, "MDY")
      format %td doi
      and the results:
      Code:
      . list
      
           +--------------------------------------------------------+
           | record   client   vendor   line   date_o~t         doi |
           |--------------------------------------------------------|
        1. |  Claim     2094      014     32   06032014   03jun2014 |
           +--------------------------------------------------------+

      Comment


      • #4
        Robert's suggestion, which is the approach I would take, gives me an idea for an ado-file I don't have the energy (ability?) to write: Interpreting and processing each data record differently depending on some characteristic of the record is a standard feature of a number of language/packages. Going back to the days when lava was still freezing into rock, I saw this feature in COBOL ("redefines" or something like that), and SPSS ("reread,", if I recall correctly). One Stata-ish approach I can think of would be something like a "-filter_record-" command that, for each record type in a file, would read each line as a string, detect its record type, and write it out to a record-type specific text file, which then could be conventionally processed with infix. Having some front end for the situation of multiple record types in a file, which at least in the past was not uncommon as a data format, would be useful.

        Comment


        • #5
          Thank you Robert and Mike! The commend "using raw_data.txt if record == "Claim", clear" really worked, and I got the observations that I wanted! Thanks again!

          Comment

          Working...
          X