Importing delimited text file

Erin Burkett

Join Date: Aug 2015

Posts: 16
#1

Importing delimited text file

11 Aug 2015, 07:59

I'm importing data into Stata from text files, and have found that some of the cases are shifted in the dataset once everything is in Stata. For example, I am importing 6 columns/variables labeled: ID, ZIP, DOB, SEX, LIC_ID, YEAR. Most cases imported perfectly, but some are shifted so that when I use the code:

tab SEX ---- > the results shows that some of the LIC_ID and DOB values have shifted into this column. This carries over to the other variables too - for ~3,000 / 1 million cases the values have shifted over.

The import code I've tried using:

First way:
import delimited "filepathname", delimiter(space, collapse) bindquote(strict) varnames(1) asfloat clear

Second way:
import delimited "filepathname", delimiter(space, collapse) bindquote(strict) varnames(1) numericcols(1 2 5 6) asfloat clear

I was using the second way to specify certain variables imported as numerics, not strings, but besides this result both files imported exactly the same with the "bumped over" values.

When I go back and look over the original text file, there are some missing cases (for example - "ZIP" has been left blank for some records) so I see how this could be happening, but I'm not sure how to remedy it.
Tags: None
ben earnhart

Join Date: May 2014

Posts: 1027
#2

11 Aug 2015, 08:08

Try turning off the collapse option. It "forces import delimited to treat multiple consecutive delimiters as just one delimiter," which seems to be your problem.
Alternatively, could the data actually be in fixed-length format?
Comment
E. David Aja

Join Date: Mar 2014

Posts: 58
#3

11 Aug 2015, 08:10

Can you post a couple of rows of example data? My guess: the fact that the data are space-delimited with missing values is what's causing the problem. I'm not sure if excluding the collapse suboption of delimiter would help you to that end. If you have any control over the source you might try using a different delimiter.
Comment
Erin Burkett

Join Date: Aug 2015

Posts: 16
#4

11 Aug 2015, 10:11

I'm attaching an example clipped from one of the text files. Yes it looks like the columns are separated by a space. I will try to run without the collapse command and see if that helps, don't have Stata access right now because I'm sharing computers.

David - What do you mean "If you have any control over the source you might try using a different delimiter."?

One thing I am starting to try is opening the .txt files one at a time in Excel (I have about 2 dozen .txt files) and using the spaces as the columns separators, which seems to be working, but I have a lot of files that exceed the maximum no. of rows in Excel so that's a whole different problem to work with and that I would like to avoid.

1 Photo
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#5

11 Aug 2015, 10:55

I am probably one of the few who can't tell a difference between a space and a tab character in the screenshot that Erin Burkett has shown, so I will assume that it is a space.

Code:

do http://www.radyakin.org/statalist/20...11_1305697.do

produces
http://www.radyakin.org/statalist/20...11_1305697.txt

Notice also that the header doesn't correspond to the data, as there is an extra DATE column in the data. I called in CUSTOMER_BORN

Best, Sergiy.

Last edited by Sergiy Radyakin; 11 Aug 2015, 10:58. Reason: changed invalid link on the do file
Comment
ben earnhart

Join Date: May 2014

Posts: 1027
#6

11 Aug 2015, 11:02

What you have appears to be "fixed width" data, not delimited data. Check out the infix command.
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#7

11 Aug 2015, 11:17

ben earnhart , true, but given the nature of the values, the data can also be read in as space-delimited, collapsing the delimiters, as E. David Aja has mentioned above and what Erin was trying to begin with.

I've just noticed that Erin had DOB in her list of vars. She can substitute her desired var names instead of the ones contained in the file of course.

Erin mentions that something is "shifted" in the file. It would be helpful to see the source file around that place. Use the text attachment. It's impossible to distinguish between different delimiters in the screenshots.

Best, Sergiy.
Comment
ben earnhart

Join Date: May 2014

Posts: 1027
#8

11 Aug 2015, 11:58

Sergiy Radyakin -- if the data is missing on a variable, then treating it as space delimited with the collapse option will cause the "shift" she described, since it doesn't recognize there's anything missing.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

11 Aug 2015, 12:10

Looking at Erin's screen capture, and considering that in post #1 she tells us some of the ZIP values are missing (and presumably are represented as spaces in the file, although her screen capture does not allow us to confirm this), I agree with the Ben's comments that this should be read as fixed width data and that, although it could possibly be read as space delimited, the missing values for ZIP will cause problems.

Erin could open a file in Excel, telling Excel to treat the file as fixed width, and then look at the screen where it displays the data with vertical lines where it guesses the fields should be separated, and use that to determine the necessary column specifications for Stata's infix command. Alternatively, in her screen shot the second line of the input file has dashes at the top of each column, separated by a single space between the dashes, so she could use that to count columns and create the infix data.
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#10

11 Aug 2015, 12:32

Agree with William Lisowski .
If any content is missing, then reading the file as space-delimited would be wrong.
However, if any content in the file is shifted, then reading the file as fixed may be wrong.
Sergiy
Comment

Erin Burkett

Join Date: Aug 2015
Posts: 16

#11

12 Aug 2015, 06:55

Here's an example of the top of the data set:

HTML Code:

CUSTOMER_ID                             CUSTOMER_ZIP            CUSTOMER_GENDER TXNA_APPR_TYPE TXNA_YEAR
--------------------------------------- ------------ ---------- --------------- -------------- ---------
11445                                   54984        12/24/1932 M               132            1999
11577                                   29575        12/18/1951 M               100            1999
11833                                   53051        10/30/1939 M               132            1999
12088                                   53050        12/17/1969 M               100            1999
10868                                   54615        10/15/1953 M               132            1999
10629                                   54401        11/01/1944 M               132            1999
10520                                   53575        10/02/1962 M               132            1999
10249                                   53575        01/11/1957 M               100            1999

Hi,

And an example of a missing ZIP code value (it's just a blank):

HTML Code:

1161405                                 54157        04/08/1967 M               100            1999
1190792                                 60195        01/05/1959 M               200            1999
1168368                                 54940        10/16/1978 F               100            1999
1191055                                              12/31/1914 U               200            1999
1168905                                 55038        05/27/1974 M               200            1999
1173285                                 54656        12/11/1972 M               100            1999
1173764                                 53511        10/29/1971 F               110            1999
1173988                                 53032        05/01/1976 M               100            1999
1175074                                 53012        03/01/1978 M               100            1999
1175652                                 54661        07/08/1965 M               100            1999

Looks like the column widths go like this: 39 space 12 space 10 space 15 space 14 space 9

When I was using the following import syntax and ran tab on my GENDER column it was obvious things shifted upon import because the tabulation included M, F, U, but also included values for the 2 columns on either side (DOB & Lic_Type):

HTML Code:

import delimited "H:\RA\Stata\Wisconsin\WI_License_ORIGINAL\Text_files\1999 WI Fishing Customers.txt", delimiter(space, collapse) bindquote(strict) varnames(1) numericcols(1 2 5 6) asfloat clear

Yes these original headers are wrong, but that is the least of my problems since I know how to change those later. This is complicating things a little when explaining the issue since my .txt file clips above show the original/undesired headers, and the header for the DOB column is missing to begin with. There are essentially 6 columns of data to be imported.

I've spent some time reading about and attempting to use infile and infix without success. Does the data need to come with a "dictionary file" to work with in order to use these?

Comment

Erin Burkett

Join Date: Aug 2015

Posts: 16
#12

12 Aug 2015, 07:08

This site is more helpful than the one I was previously using - I'll try these tips. http://www.stata.com/support/faqs/da...d-format-data/
Comment
Erin Burkett

Join Date: Aug 2015

Posts: 16
#13

12 Aug 2015, 07:37

Using infile without a dictionary - Looks like my data sets do not meet the requirements for using infile (http://www.stata.com/support/faqs/da.../using-infile/). I have those blanks without anything denoting them.

Using infix - I tried to create my own dictionary from the original text file using:

HTML Code:

dictionary using 1999.txt { _column(1) str39 ID %39s _column(41) int ZIP %12f _column(45) str10 DOB %10s _column(57) str15 %15s _column(74) str14 %14s _column(90) int YEAR %4f }

and get "unrecognized command: _column" in the results window.
Comment

ben earnhart

Join Date: May 2014
Posts: 1027

#14

12 Aug 2015, 07:57

try

Code:

clear
infix 3 first CUSTOMER_ID 1-40 CUSTOMER_ZIP 41-53 str DOB 54-64 str CUSTOMER_GENDER 65-80 TXNA_APPR_TYPE 81-95 TXNA_YEAR 96-105 using example1.txt

Comment

Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#15

12 Aug 2015, 08:07

Excel has fixed width import option as well. Tried it with your sample data and it automatically recognizes the correct column width. Excel 2010 upwards has 1 million plus rows as a maximum
Comment

Announcement