
  • Change Stata file names based on specific patterns of pre-existing files

    Dear Statalist members,

    Assume the following problem. A folder contains many files that have names in the following format:

    A_C.dta, A_D.dta, A_E.dta, B_A.dta, B_C.dta, B_D.dta, B_F.dta, E_H.dta, E_G.dta, E_K.dta.

    The names may also consist of numbers in the same form: 2568975_112565.dta, 2568975_130520.dta, 2568975_999980.dta. The key point is that the part before the underscore (_) denotes a "family," a common component.

    The files are in no particular order, and there can be hundreds of them; how many is not known beforehand.

    What I would like to do is append all datasets that share the same component before the underscore (_), that is, all datasets in the same family. For example, append A_C.dta, A_D.dta, A_E.dta, or append 2568975_112565.dta, 2568975_130520.dta, 2568975_999980.dta.

    After the append is done, I'd like the result to be saved with a name such as firm_number.dta: for example, firm_1.dta, firm_2.dta, firm_3.dta, etc. Each number stands for a specific family. For example, 1 might indicate the "A_" files, while 2 might indicate the "B_" files. That is, the file names should be ordered.

    Also, the files that have been used for the append must be deleted, so that only the appended files (firm_1.dta, firm_2.dta, firm_3.dta, etc.) remain in the end.

    Is there a way to deal with filenames in the way described above?

  • #2
    Code:
    clear*
    filelist, pattern(*_*.dta) norecursive
    
    split filename, parse("_") gen(firm)
    
    capture program drop one_family
    program define one_family
        local family = firm1[1]
        forvalues i = 1/`=_N' {
            frame building: clear
            local nextfile = filename[`i']
            frame building: append using `"`nextfile'"'
            erase `"`nextfile'"'
        }
        frame building: save `"firm_`family'"', replace
        exit
    end
    
    frame create building
    
    runby one_family, by(firm1) verbose
    Notes: As this uses frames, version 16.0 or later is required.
    -filelist- is written by Robert Picard and is available from SSC.
    -runby- is written by Robert Picard and me. It is also available from SSC.
    This program assumes that all of the files to be handled are in the same directory, not organized into subdirectories. It also assumes that any file whose name matches *_*.dta in that directory is part of the process.
    The Stata command -erase- does not move files to the Recycle Bin. It actually erases them from the storage device. So you need to be really sure you will not need to recover these files (or that you have already backed them up somewhere). Use at your own risk.
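
    Since -erase- is permanent, one cautious option is to copy the matching files into a backup folder before running the code above. A minimal sketch, assuming a subfolder named backup (a hypothetical name) inside the current working directory:

    ```stata
    * Copy every file matching *_*.dta into ./backup before anything is erased.
    * "backup" is a hypothetical folder name; change it as needed.
    capture mkdir backup
    local files : dir "." files "*_*.dta", respectcase
    foreach fn of local files {
        copy `"`fn'"' `"backup/`fn'"', replace
    }
    ```

    The copies can be removed once the appended files have been checked.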



    • #3
      In reply to Clyde Schechter's post #2:
      Dear Clyde,

      Thanks for the code. Fortunately, I work with Stata 16.1, so frames can be used. Unfortunately, the code does not do what I had in mind; maybe I did not explain it well. Here is what I mean (with figures). Suppose there are two distinct datasets in the folder, named a_c.dta and a_d.dta, with the following contents.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str1 case_firm str10 firm_id int year str10 shareholder_id double(SIC3_code rivals beta_firm_s) float(numerator denominator) str2 matched_pair
      "A" "A" 2000 "s1" 251 2    .05 .01046875 .003725 "AC"
      "A" "A" 2000 "s4" 251 2   .035         .       . "AC"
      "A" "C" 2000 "s1" 251 2  .1875         .       . "AC"
      "A" "C" 2000 "s4" 251 2 .03125         .       . "AC"
      end
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str1 case_firm str10 firm_id int year str10 shareholder_id double(SIC3_code rivals beta_firm_s) float(numerator denominator) str2 matched_pair
      "A" "A" 2000 "s1" 251 2        .05 .033414286 .013329 "AD"
      "A" "A" 2000 "s4" 251 2       .035          .       . "AD"
      "A" "A" 2000 "s5" 251 2       .098          .       . "AD"
      "A" "D" 2000 "s1" 251 2 .114285714          .       . "AD"
      "A" "D" 2000 "s4" 251 2 .071428571          .       . "AD"
      "A" "D" 2000 "s5" 251 2 .257142857          .       . "AD"
      end
      [Image: capture1.PNG]

      [Image: cap2.PNG]

      What I want from the code is to find those two files and append them. The names of the datasets that must be appended always share a common part before the underscore (_); in this case it happens to be "a". (There can be several files that start with the same prefix before the underscore.) In other cases, the prefix might be a number. Ideally, the code would create a new dataset by appending the two files above, which would look like this:

      [Image: cap3.PNG]



      The code you provided just keeps the second dataset and saves it as firm_a.dta. For some reason, the files are not appended, so the desired outcome is not produced.

      In addition, since there are many files in the folder, I would like those to be named as firm_number.dta. This number should be ordered so that it can be used in another loop later on. For example, for firm_a.dta it could become firm_1.dta. The precise number does not matter, as long as there is order in file naming, starting from 1 to a maximum N (which is unknown).

      I hope it is clearer now.



      • #4
        Sorry, my error. With regard to the created files being incorrect, I see the problem. The following code corrects that:

        Code:
        clear*
        filelist, pattern(*_*.dta) norecursive
        
        split filename, parse("_") gen(firm)
        
        capture program drop one_family
        program define one_family
            local family = firm1[1]
            frame building: clear
            forvalues i = 1/`=_N' {
                local nextfile = filename[`i']
                frame building: append using `"`nextfile'"'
                erase `"`nextfile'"'
            }
            frame building: save `"firm_`family'"', replace
            exit
        end
        
        frame create building
        
        runby one_family, by(firm1) verbose
        The only change is that the -frame building: clear- command has been moved out of the loop. That will allow the files to accumulate and append.

        As for the preference to use numbers in the filenames, I'm going to bow out of that. Although it is not that hard to program, I have no idea what you mean by "as long as there is order in file naming" and I suspect that the thread will extend to many posts clarifying that. Moreover, although I'm sure you have your reasons for this, it would, at a minimum, also require maintaining some sort of crosswalk between the numbering scheme and the family of firms each file represents, so this approach just strikes me as unnecessarily complicated. If you feel strongly enough that you must do it that way, I'll leave it to you to implement on your own.
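
        For readers who do want numbered files, one possible reading of the request is to number the families 1 to N in sorted order of the prefix and to save a crosswalk alongside. The following is only a sketch under that assumption; the names firm_no, one_family_numbered, and crosswalk.dta are hypothetical, and -erase- is deliberately left out:

        ```stata
        clear*
        filelist, pattern(*_*.dta) norecursive
        split filename, parse("_") gen(firm)

        * Number the families 1..N in sorted order of the prefix (assumed scheme)
        egen long firm_no = group(firm1)

        * Save a crosswalk between each number and the prefix it stands for
        preserve
        keep firm_no firm1
        duplicates drop
        save crosswalk, replace
        restore

        capture program drop one_family_numbered
        program define one_family_numbered
            local n = firm_no[1]
            frame building: clear
            forvalues i = 1/`=_N' {
                local nextfile = filename[`i']
                frame building: append using `"`nextfile'"'
            }
            frame building: save `"firm_`n'"', replace
        end

        frame create building
        runby one_family_numbered, by(firm_no) verbose
        ```

        Note that output names of the form firm_1.dta themselves match *_*.dta, so a second run in the same folder would pick them up as input unless they are moved elsewhere first.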



        • #5
          Much obliged Clyde. Your help has been crucial.



          • #6
            I have a point to make about the code in #4.

            I tried it in a folder with lots of data and noticed something. The code ran for most of the files in the folder, creating filenames in the format firm_family.dta. I saved these files in another folder to compare later, because the code had worked on the toy example I provided and I was curious about what was going on. However, I saw that some files had not been touched by the code, so I had to run it another time. The remaining files were then created, again in the form firm_family.dta.

            I observed that one of the files created in the second run had the name firm_490.dta, which had already been created in the first run (partially, combining some 40-odd files out of the 89). I wonder why this might have happened, and why the code did not pick up all the files initially. I assume something broke along the way, although the code did not stop running.

            I also tried to not include file names 409_*.dta and re-run it. The problem is still there.
            Last edited by Pantelis Kazakis; 09 Jan 2021, 10:31.



            • #7
              I can't think of any explanation for the problem you are describing. While I can think of reasons why a file might be missed by the code, such as if its filename does not conform to the pattern *_*.dta, it remains inexplicable why it would be picked up in a later rerun of the code.

              I would also point out that if you are running this on a Mac, filenames are case sensitive, so if a_C and A_D were originally created, on a Mac, they will not be brought together by this code, whereas in Windows they will.

              Other than those things, I have no ideas.

              And I don't think I can troubleshoot this over Statalist because it clearly relies on the files themselves, and there is no practical way to recreate your files and directory structure here. So I will have to suggest that you find somebody who can work with you in person directly on your machine. I'm sorry.
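
              One way to rule out case differences in the prefix, whatever the operating system, is to compare a lowercased copy of the prefix with the original. A short sketch, in which the variable names family and mixed_case are hypothetical:

              ```stata
              clear*
              filelist, pattern(*_*.dta) norecursive
              split filename, parse("_") gen(firm)

              * Lowercased prefix: a_C.dta and A_D.dta share the same value
              generate family = lower(firm1)

              * Flag families whose prefixes differ only in case
              bysort family (firm1): generate byte mixed_case = firm1[1] != firm1[_N]
              list filename family if mixed_case
              ```

              If nothing is listed, case differences are not what is splitting the families.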



              • #8
                In reply to Clyde Schechter's post #7:
                Hi Clyde. Sure, thanks for your assistance so far. I am on Windows, so case sensitivity should not be a problem.

                After looking at the files, I found that some of them were empty: they had only variable names, but no data. When I removed those, the code performed better, although I can still see that, for one case only, it appended part of the family firm and its matched pair, but some files in that family firm and firm pair did not match. I will investigate this further.



                • #9
                  Here's a thought. It may be that some of the files that you want to put together in a single "family" are not completely append-compatible. For example, there could be a variable that is numeric in one of the files, but the variable of the same name is string in another. That will cause an error in the -append- command, and probably leads to the family file failing to be created. If you add the -verbose- option to the -runby- command you will see any error messages generated along the way. (Unfortunately, you will also see a lot of other output, and since this is a large-scale project, it may be difficult to sort through all of the output to find the error messages. Isolating a group of files that exhibits the problem would reduce this difficulty.)
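
                  To isolate such a group, one option is to loop over a single family and record, for each file, whether it is empty and whether -append- fails, without saving or erasing anything. A sketch, assuming the family prefix is a (hypothetical; substitute the prefix of the problematic family):

                  ```stata
                  clear
                  * "a_*.dta" is a placeholder pattern for one family
                  local files : dir "." files "a_*.dta", respectcase
                  foreach fn of local files {
                      * r(N) is the number of observations stored in the file
                      quietly describe using `"`fn'"', short
                      if r(N) == 0 {
                          display as text `"`fn': no observations"'
                      }
                      * Try the append and report the return code if it fails
                      capture append using `"`fn'"'
                      if _rc {
                          display as error `"`fn': -append- failed with return code "' _rc
                      }
                  }
                  ```

                  Files that fail can then be inspected one at a time, for example with -describe using-, to see which variable has a conflicting type.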



                  • #10
                    Hopefully the file issues will be solved. Below is a tentative solution not using frames or any user written commands:
                    Code:
                    ********************************************************************************
                    clear

                    local files : dir "." files "*_*.dta", respectcase
                    * sort the list so that files with the same stub are adjacent
                    local files : list sort files

                    foreach fn of local files {

                        local stub = substr("`fn'", 1, strpos("`fn'", "_")-1)

                        if ( inlist("`prev'", "", "`stub'") ) { // first file, or same stub as prev

                            append using `"`fn'"'
                        }

                        else { // different stub

                            save `"`prev'"'   // appended file is named after the stub
                            clear
                            append using `"`fn'"'
                        }

                        local prev `stub'
                        * erase `"`fn'"' // uncomment if you want to permanently delete files
                    }

                    save `"`prev'"'   // the very last family; same stub as prev

                    clear
                    ********************************************************************************



                    • #11
                      In reply to Bjarte Aagnes's post #10:
                      Hi Bjarte,

                      Thanks for the code. Indeed this works too. Instead of keeping the faulty case in the folder, it removes it (after uncommenting part of your code above).

                      Now, the code informs me which firm might be problematic. There must be some data inconsistency for this specific firm. I will need to look at this in detail.
