Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Working With A Pesky Text File

    Hey everyone. Consider the following dataset (may be worth looking at the text file here in your browser).
    Code:
    import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", ///
    rowrange(130)  clear
    As we can see, this imports a text file with some columns that're intended to be variables for year, state FIPS code, type of beverage, and various statistics related to alcohol consumption. The trouble is getting everything in a format that's useful (i.e., the 14 variable columns we see here). Consider:
    Code:
    import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", ///
    delimiter(space) rowrange(130) clear
    Well..... this sort of works. The year it imports fine. The fips codes (the 'updated' column), kinda sorta. But what about the other columns!! There are now over 40 other columns that are likely treated different due to spacing reasons or other qualities of the dataset. When I change the delimiter to
    Code:
    import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", ///
    delimiter(whitespace) rowrange(130) clear
    or
    Code:
    import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", ///
    delimiter(".") rowrange(130) clear
    we barely get better results. My original idea was tot concatenate some of the columns, but I strongly suspect if I did that I'd run into values from different variables that appear in the same cell, which would give me incorrect results. I could use the split command to select on spaces
    Code:
    import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", ///
    rowrange(130) clear
    split v1, p(" ")
    but this arguably the task even more hellish. I guess all I'd really want is to have what appears to be a 14 column dataset of all the variables they have here- the rest (state codes, etc) I can work with just fine. How might I begin to tackle this interesting text file?

  • #2
    I’m not at my computer right now, but you have fixed-width data, such that dates are laid out in specific columns of text. For this you should use -infix-.

    Comment


    • #3
      User contributed? Don't think I've ever heard of it. I'll check the help now.

      EDIT: Fascinating. I've never needed to use this before.

      Comment


      • #4
        You can deal with the uneven spacing. See highlighted.

        Code:
        import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", rowrange(130)  clear
        replace v1= trim(itrim(v1))
        split v1, p(" ") g(var) destring
        drop v1
        Res.:

        Code:
        . l in 1/5, sep(0)
        
             +--------------------------------------------------------------------------------------------------------------------+
             | var1   var2   var3       var4      var5      var6    var7   var8      var9   var10   var11   var12   var13   var14 |
             |--------------------------------------------------------------------------------------------------------------------|
          1. | 1970      1      1    3863000   1738350   2499000    6956      9   2020000    8606       9       .       .       . |
          2. | 1970      1      2    1412000    225920   2499000     904      9   2020000    1118       9       .       .       . |
          3. | 1970      1      3   33098000   1489410   2499000    5960     10   2020000    7373      10       .       .       . |
          4. | 1970      1      4          .   3453680   2499000   13820     10   2020000   17097      10       .       .       . |
          5. | 1970      2      1     945000    425250    205000   20744      1    165000   25773       1       .       .       . |
             +--------------------------------------------------------------------------------------------------------------------+

        Comment


        • #5
          Indeed this does work. Thanks a lot. I wonder, though, what the "infix" solution would be. If someone doesn't beat me to it, I'll post the infix solution.

          Comment


          • #6
            Originally posted by Jared Greathouse View Post
            Indeed this does work. Thanks a lot. I wonder, though, what the "infix" solution would be. If someone doesn't beat me to it, I'll post the infix solution.
            -infile- and -infix- are two predecessors of the -import- and are not used as often. I learned of them using some survey products that used fixed-column (I mistakenly called these fixed-width) data files. The two two are similar in that they both use dictionaries, but the syntax and flexibility differ. In this case, they're both fairly straightforward, as I demonstrate below, and compare to the example provided by Andrew Musau.

            Code:
            clear *
            cls
            cd c:\tmp\test
            
            tempfile ex2 ex3
            
            mkf ex1
            cwf ex1
            import delimited "https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt", rowrange(130)  clear
            replace v1= trim(itrim(v1))
            split v1, p(" ") g(var) destring
            drop v1
            rename(var*) (year fips beverage_type gal_bev gal_etoh pop_14u galpercap_14u decile_14u pop_21u galpercap_21u decile_21u source tv_abv tv_gal_etoh)
            list in 1/5
            list in -5/l
            
            mkf ex2
            cwf ex2
            infix using "dict1.dct", using("https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt") clear
            list in 1/5
            list in -5/l
            save `ex2', replace
            
            cwf ex1
            cf _all using `ex2', all verbose
            
            
            mkf ex3
            cwf ex3
            infile using "dict2.dct", using("https://pubs.niaaa.nih.gov/publications/surveillance119/pcyr1970-2020.txt") clear
            list in 1/5
            list in -5/l
            save `ex3', replace
            
            cwf ex1
            cf _all using `ex3', all verbose
            Dictionaries:

            Code:
            -------------------- beginning of dict1.dct ----------
            infix dictionary {
              130 first
              int year                1-4
              int fips                6-7
              int beverage_type       9
              double gal_bev          11-20
              double gal_etoh         22-30
              double pop_14u          32-41
              double galpercap_14u    43-47
              int decile_14u          49-50
              double pop_21u          52-61
              double galpercap_21u    63-67
              int decile_21u          69-70
              int source              72
              double tv_abv           74-76
              double tv_gal_etoh      78-86
            }
            -------------------- end of dict1.dct ------------------
            
            -------------------- beginning of dict12dct ----------
            infile dictionary {
              _firstlineoffile(130)
              int year
              int fips
              int beverage_type
              double gal_bev
              double gal_etoh
              double pop_14u
              double galpercap_14u
              int decile_14u
              long pop_21u
              double galpercap_21u
              int decile_21u
              int source
              double tv_abv
              double tv_gal_etoh
            }
            -------------------- end of dict2.dct ------------------
            Result

            Code:
            . cf _all using `ex2', all verbose
                        year:  match
                        fips:  match
               beverage_type:  match
                     gal_bev:  match
                    gal_etoh:  match
                     pop_14u:  match
               galpercap_14u:  match
                  decile_14u:  match
                     pop_21u:  match
               galpercap_21u:  match
                  decile_21u:  match
                      source:  match
                      tv_abv:  match
                 tv_gal_etoh:  match
            
            
            . cf _all using `ex3', all verbose
                        year:  match
                        fips:  match
               beverage_type:  match
                     gal_bev:  match
                    gal_etoh:  match
                     pop_14u:  match
               galpercap_14u:  match
                  decile_14u:  match
                     pop_21u:  match
               galpercap_21u:  match
                  decile_21u:  match
                      source:  match
                      tv_abv:  match
                 tv_gal_etoh:  match

            Comment

            Working...
            X