Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing delimited text file

    I'm importing data into Stata from text files, and have found that some of the cases are shifted in the dataset once everything is in Stata. For example, I am importing 6 columns/variables labeled: ID, ZIP, DOB, SEX, LIC_ID, YEAR. Most cases imported perfectly, but some are shifted so that when I use the code:

    tab SEX ---- > the results shows that some of the LIC_ID and DOB values have shifted into this column. This carries over to the other variables too - for ~3,000 / 1 million cases the values have shifted over.

    The import code I've tried using:

    First way:
    import delimited "filepathname", delimiter(space, collapse) bindquote(strict) varnames(1) asfloat clear

    Second way:
    import delimited "filepathname", delimiter(space, collapse) bindquote(strict) varnames(1) numericcols(1 2 5 6) asfloat clear

    I was using the second way to specify certain variables imported as numerics, not strings, but besides this result both files imported exactly the same with the "bumped over" values.

    When I go back and look over the original text file, there are some missing cases (for example - "ZIP" has been left blank for some records) so I see how this could be happening, but I'm not sure how to remedy it.

  • #2
    Try turning off the collapse option. It "forces import delimited to treat multiple consecutive delimiters as just one delimiter," which seems to be your problem.
    Alternatively, could the data actually be in fixed-length format?

    Comment


    • #3
      Can you post a couple of rows of example data? My guess: the fact that the data are space-delimited with missing values is what's causing the problem. I'm not sure if excluding the collapse suboption of delimiter would help you to that end. If you have any control over the source you might try using a different delimiter.

      Comment


      • #4
        I'm attaching an example clipped from one of the text files. Yes it looks like the columns are separated by a space. I will try to run without the collapse command and see if that helps, don't have Stata access right now because I'm sharing computers.

        David - What do you mean "If you have any control over the source you might try using a different delimiter."?

        One thing I am starting to try is opening the .txt files one at a time in Excel (I have about 2 dozen .txt files) and using the spaces as the columns separators, which seems to be working, but I have a lot of files that exceed the maximum no. of rows in Excel so that's a whole different problem to work with and that I would like to avoid.

        Comment


        • #5
          I am probably one of the few who can't tell a difference between a space and a tab character in the screenshot that Erin Burkett has shown, so I will assume that it is a space.

          produces
          http://www.radyakin.org/statalist/20...11_1305697.txt

          Notice also that the header doesn't correspond to the data, as there is an extra DATE column in the data. I called in CUSTOMER_BORN

          Best, Sergiy.
          Last edited by Sergiy Radyakin; 11 Aug 2015, 10:58. Reason: changed invalid link on the do file

          Comment


          • #6
            What you have appears to be "fixed width" data, not delimited data. Check out the infix command.

            Comment


            • #7
              ben earnhart , true, but given the nature of the values, the data can also be read in as space-delimited, collapsing the delimiters, as E. David Aja has mentioned above and what Erin was trying to begin with.

              I've just noticed that Erin had DOB in her list of vars. She can substitute her desired var names instead of the ones contained in the file of course.

              Erin mentions that something is "shifted" in the file. It would be helpful to see the source file around that place. Use the text attachment. It's impossible to distinguish between different delimiters in the screenshots.

              Best, Sergiy.

              Comment


              • #8
                Sergiy Radyakin -- if the data is missing on a variable, then treating it as space delimited with the collapse option will cause the "shift" she described, since it doesn't recognize there's anything missing.

                Comment


                • #9
                  Looking at Erin's screen capture, and considering that in post #1 she tells us some of the ZIP values are missing (and presumably are represented as spaces in the file, although her screen capture does not allow us to confirm this), I agree with the Ben's comments that this should be read as fixed width data and that, although it could possibly be read as space delimited, the missing values for ZIP will cause problems.

                  Erin could open a file in Excel, telling Excel to treat the file as fixed width, and then look at the screen where it displays the data with vertical lines where it guesses the fields should be separated, and use that to determine the necessary column specifications for Stata's infix command. Alternatively, in her screen shot the second line of the input file has dashes at the top of each column, separated by a single space between the dashes, so she could use that to count columns and create the infix data.

                  Comment


                  • #10
                    Agree with William Lisowski .
                    If any content is missing, then reading the file as space-delimited would be wrong.
                    However, if any content in the file is shifted, then reading the file as fixed may be wrong.
                    Sergiy

                    Comment


                    • #11
                      Here's an example of the top of the data set:


                      HTML Code:
                      CUSTOMER_ID                             CUSTOMER_ZIP            CUSTOMER_GENDER TXNA_APPR_TYPE TXNA_YEAR
                      --------------------------------------- ------------ ---------- --------------- -------------- ---------
                      11445                                   54984        12/24/1932 M               132            1999
                      11577                                   29575        12/18/1951 M               100            1999
                      11833                                   53051        10/30/1939 M               132            1999
                      12088                                   53050        12/17/1969 M               100            1999
                      10868                                   54615        10/15/1953 M               132            1999
                      10629                                   54401        11/01/1944 M               132            1999
                      10520                                   53575        10/02/1962 M               132            1999
                      10249                                   53575        01/11/1957 M               100            1999
                      Hi,


                      And an example of a missing ZIP code value (it's just a blank):

                      HTML Code:
                      1161405                                 54157        04/08/1967 M               100            1999
                      1190792                                 60195        01/05/1959 M               200            1999
                      1168368                                 54940        10/16/1978 F               100            1999
                      1191055                                              12/31/1914 U               200            1999
                      1168905                                 55038        05/27/1974 M               200            1999
                      1173285                                 54656        12/11/1972 M               100            1999
                      1173764                                 53511        10/29/1971 F               110            1999
                      1173988                                 53032        05/01/1976 M               100            1999
                      1175074                                 53012        03/01/1978 M               100            1999
                      1175652                                 54661        07/08/1965 M               100            1999

                      Looks like the column widths go like this: 39 space 12 space 10 space 15 space 14 space 9

                      When I was using the following import syntax and ran tab on my GENDER column it was obvious things shifted upon import because the tabulation included M, F, U, but also included values for the 2 columns on either side (DOB & Lic_Type):

                      HTML Code:
                      import delimited "H:\RA\Stata\Wisconsin\WI_License_ORIGINAL\Text_files\1999 WI Fishing Customers.txt", delimiter(space, collapse) bindquote(strict) varnames(1) numericcols(1 2 5 6) asfloat clear
                      Yes these original headers are wrong, but that is the least of my problems since I know how to change those later. This is complicating things a little when explaining the issue since my .txt file clips above show the original/undesired headers, and the header for the DOB column is missing to begin with. There are essentially 6 columns of data to be imported.

                      I've spent some time reading about and attempting to use infile and infix without success. Does the data need to come with a "dictionary file" to work with in order to use these?

                      Comment


                      • #12
                        This site is more helpful than the one I was previously using - I'll try these tips. http://www.stata.com/support/faqs/da...d-format-data/

                        Comment


                        • #13
                          Using infile without a dictionary - Looks like my data sets do not meet the requirements for using infile (http://www.stata.com/support/faqs/da.../using-infile/). I have those blanks without anything denoting them.

                          Using infix - I tried to create my own dictionary from the original text file using:
                          HTML Code:
                          dictionary using 1999.txt {
                          _column(1)    str39 ID %39s
                          _column(41)    int    ZIP %12f
                          _column(45)    str10 DOB    %10s
                          _column(57)    str15    %15s
                          _column(74)    str14 %14s
                          _column(90)    int    YEAR %4f 
                          }
                          and get "unrecognized command: _column" in the results window.

                          Comment


                          • #14
                            try
                            Code:
                            clear
                            infix 3 first CUSTOMER_ID 1-40 CUSTOMER_ZIP 41-53 str DOB 54-64 str CUSTOMER_GENDER 65-80 TXNA_APPR_TYPE 81-95 TXNA_YEAR 96-105 using example1.txt

                            Comment


                            • #15
                              Excel has fixed width import option as well. Tried it with your sample data and it automatically recognizes the correct column width. Excel 2010 upwards has 1 million plus rows as a maximum

                              Comment

                              Working...
                              X