Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • use Regex in filename/ looping over files

    Hello,

    Could you please help me on this :

    I have a folder with different files named this way :
    • test_report_day_month_year.xlsx and test_reporte_day_month_year.xlsx
    • The day can be encoded on two digits or just one, for example I have : reporte_1_05_1999.xlsx and reporte_02_05_1999 and report_18_05_1999 in the same folder
    I would like to loop over these files. I want to import each file, clean it and create 3 variables day, month and year, that I would fill from the filename.

    knowing that there are almost all days in the folder, I can create a local day = 1 and loop over files by doing something like :

    Code:
    import excel "`path'\report*_`++day'_`month'_1999.xlsx", sheet("report") clear
    Or, do you think it will be better if I just loop over all the files and split them in order to have the date ?

    Thank you


  • #2
    You can use NJC's -fs- to store a list of relevant xlsx files at some path in a local, and then loop over the elements of that local to import, clean, and save each one.

    Then you can use -fs- to get a list of Stata files to append together:

    Code:
    /* Create Fake Data */
    set more off
    sysuse auto, clear
    foreach stub in 1_05_1999 02_05_1999 1_5_2002 {
        export excel test_report_`stub'.xlsx, replace
        export excel test_reporte_`stub'.xlsx, replace
    }
    
    /* Import, Clean, and Convert to dta Format */
    capture ssc install fs
    fs test_report*.xlsx
    foreach f in `r(files)' {
        import excel `f', clear
        gen d = real(regexs(1)) if regexm("`f'","^test_report[e]?_([0-9]+)_")
        assert inrange(d,1,31)
        gen m = real(regexs(1)) if regexm("`f'","_([0-9]+)_")
        assert inrange(m,1,12)
        gen y = real(regexs(1)) if regexm("`f'","_([0-9]+)\.xlsx$")
        assert inrange(y,1990,2002)
        compress
        save stata_`=regexr("`f'",".xlsx",".dta")', replace
    }
    /* Merge Stata Files & Clean Up */
    clear
    fs stata_*.dta
    append using `r(files)'
    save "merged_files.dta", replace
    //!rm test_report*.xlsx
    !rm stata_*.dta

    Comment


    • #3
      Also see xls2dta (SSC).

      Code:
      cd where_my_excel_files_are
      
      xls2dta , generate(filename) : import excel . , sheet("report")
      xls2dta : xeq ///
          split filename , parse("_") ;
          rename (filename2-filename4)(day month year) ;
          replace day = "0" + day if (strlen(day) == 1) ;
          drop filename*
      xls2dta , save(reports_complete.dta) : append
      *xls2dta , eraseok : erase
      No regexm() needed here.

      Best
      Daniel
      Last edited by daniel klein; 24 May 2016, 01:26. Reason: Added leading 0 if necessary

      Comment


      • #4
        I messed up the code. The middle part should be

        Code:
        xls2dta : xeq ///
            split filename , parse("_") ; ///
            rename (filename2-filename4)(day month year) ; ///
            replace day = "0" + day if (strlen(day) == 1) ; ///
            drop filename*
        note the extra ///.

        The syntax given is unnecessary complicated, anyway. The same result is obtained by collapsing the converting and appending part and then omit the xeq part altogether

        Code:
        cd where_my_excel_files_are
        
        xls2dta , clear importopts(sheet("report")) generate(filename) : ///
            append using .
        
        split filename , parse("_")
        rename (filename2-filename4)(day month year)
        replace day = "0" + day if (strlen(day) == 1)
        drop filename*
        save reports_complete.dta
        Best
        Daniel
        Last edited by daniel klein; 24 May 2016, 04:48.

        Comment


        • #5
          That's great, thank you so much Daniel and Dimitriy.
          Last edited by lamya kejji; 24 May 2016, 09:33.

          Comment


          • #6
            What about renaming the file?

            You might also implement a check whether a file can be opened and the skip the respective file, if it cannot. But I assume you want the data from this file, too.

            Best
            Daniel

            Comment

            Working...
            X