Working With A Pesky Text File

Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#1

Working With A Pesky Text File

18 Mar 2023, 06:25

Hey everyone. Consider the following dataset (may be worth looking at the text file here in your browser).

Code:

import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", /// rowrange(130) clear

As we can see, this imports a text file with some columns that're intended to be variables for year, state FIPS code, type of beverage, and various statistics related to alcohol consumption. The trouble is getting everything in a format that's useful (i.e., the 14 variable columns we see here). Consider:

Code:

import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", /// delimiter(space) rowrange(130) clear

Well..... this sort of works. The year it imports fine. The fips codes (the 'updated' column), kinda sorta. But what about the other columns!! There are now over 40 other columns that are likely treated different due to spacing reasons or other qualities of the dataset. When I change the delimiter to

Code:

import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", /// delimiter(whitespace) rowrange(130) clear

or

Code:

import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", /// delimiter(".") rowrange(130) clear

we barely get better results. My original idea was tot concatenate some of the columns, but I strongly suspect if I did that I'd run into values from different variables that appear in the same cell, which would give me incorrect results. I could use the split command to select on spaces

Code:

import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", /// rowrange(130) clear split v1, p(" ")

but this arguably the task even more hellish. I guess all I'd really want is to have what appears to be a 14 column dataset of all the variables they have here- the rest (state codes, etc) I can work with just fine. How might I begin to tackle this interesting text file?
Tags: None
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#2

18 Mar 2023, 06:33

I’m not at my computer right now, but you have fixed-width data, such that dates are laid out in specific columns of text. For this you should use -infix-.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#3

18 Mar 2023, 06:38

User contributed? Don't think I've ever heard of it. I'll check the help now.

EDIT: Fascinating. I've never needed to use this before.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10214

18 Mar 2023, 06:40

You can deal with the uneven spacing. See highlighted.

Code:

import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", rowrange(130)  clear
replace v1= trim(itrim(v1))
split v1, p(" ") g(var) destring
drop v1

Res.:

Code:

. l in 1/5, sep(0)

     +--------------------------------------------------------------------------------------------------------------------+
     | var1   var2   var3       var4      var5      var6    var7   var8      var9   var10   var11   var12   var13   var14 |
     |--------------------------------------------------------------------------------------------------------------------|
  1. | 1970      1      1    3863000   1738350   2499000    6956      9   2020000    8606       9       .       .       . |
  2. | 1970      1      2    1412000    225920   2499000     904      9   2020000    1118       9       .       .       . |
  3. | 1970      1      3   33098000   1489410   2499000    5960     10   2020000    7373      10       .       .       . |
  4. | 1970      1      4          .   3453680   2499000   13820     10   2020000   17097      10       .       .       . |
  5. | 1970      2      1     945000    425250    205000   20744      1    165000   25773       1       .       .       . |
     +--------------------------------------------------------------------------------------------------------------------+

Comment

Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#5

18 Mar 2023, 06:43

Indeed this does work. Thanks a lot. I wonder, though, what the "infix" solution would be. If someone doesn't beat me to it, I'll post the infix solution.
Comment

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2403

18 Mar 2023, 18:22

Originally posted by Jared Greathouse View Post

Indeed this does work. Thanks a lot. I wonder, though, what the "infix" solution would be. If someone doesn't beat me to it, I'll post the infix solution.

-infile- and -infix- are two predecessors of the -import- and are not used as often. I learned of them using some survey products that used fixed-column (I mistakenly called these fixed-width) data files. The two two are similar in that they both use dictionaries, but the syntax and flexibility differ. In this case, they're both fairly straightforward, as I demonstrate below, and compare to the example provided by Andrew Musau.

Code:

clear *
cls
cd c:\tmp\test

tempfile ex2 ex3

mkf ex1
cwf ex1
import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", rowrange(130)  clear
replace v1= trim(itrim(v1))
split v1, p(" ") g(var) destring
drop v1
rename(var*) (year fips beverage_type gal_bev gal_etoh pop_14u galpercap_14u decile_14u pop_21u galpercap_21u decile_21u source tv_abv tv_gal_etoh)
list in 1/5
list in -5/l

mkf ex2
cwf ex2
infix using "dict1.dct", using("https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt") clear
list in 1/5
list in -5/l
save `ex2', replace

cwf ex1
cf _all using `ex2', all verbose


mkf ex3
cwf ex3
infile using "dict2.dct", using("https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt") clear
list in 1/5
list in -5/l
save `ex3', replace

cwf ex1
cf _all using `ex3', all verbose

Dictionaries:

Code:

-------------------- beginning of dict1.dct ----------
infix dictionary {
  130 first
  int year                1-4
  int fips                6-7
  int beverage_type       9
  double gal_bev          11-20
  double gal_etoh         22-30
  double pop_14u          32-41
  double galpercap_14u    43-47
  int decile_14u          49-50
  double pop_21u          52-61
  double galpercap_21u    63-67
  int decile_21u          69-70
  int source              72
  double tv_abv           74-76
  double tv_gal_etoh      78-86
}
-------------------- end of dict1.dct ------------------

-------------------- beginning of dict12dct ----------
infile dictionary {
  _firstlineoffile(130)
  int year
  int fips
  int beverage_type
  double gal_bev
  double gal_etoh
  double pop_14u
  double galpercap_14u
  int decile_14u
  long pop_21u
  double galpercap_21u
  int decile_21u
  int source
  double tv_abv
  double tv_gal_etoh
}
-------------------- end of dict2.dct ------------------

Result

Code:

. cf _all using `ex2', all verbose
            year:  match
            fips:  match
   beverage_type:  match
         gal_bev:  match
        gal_etoh:  match
         pop_14u:  match
   galpercap_14u:  match
      decile_14u:  match
         pop_21u:  match
   galpercap_21u:  match
      decile_21u:  match
          source:  match
          tv_abv:  match
     tv_gal_etoh:  match


. cf _all using `ex3', all verbose
            year:  match
            fips:  match
   beverage_type:  match
         gal_bev:  match
        gal_etoh:  match
         pop_14u:  match
   galpercap_14u:  match
      decile_14u:  match
         pop_21u:  match
   galpercap_21u:  match
      decile_21u:  match
          source:  match
          tv_abv:  match
     tv_gal_etoh:  match

Announcement