Change Stata file names based on specific patterns of pre-existing files

Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#1

07 Jan 2021, 15:41

Dear Statalist members,

Assume the following problem. A folder contains many files that have names in the following format:

A_C.dta, A_D.dta, A_E.dta, B_A.dta, B_C.dta, B_D.dta, B_F.dta, E_H.dta, E_G.dta, E_K.dta.

Apart from letters, those could be numbers, again with similar form: 2568975_112565.dta, 2568975_130520.dta, 2568975_999980.dta. The key point is that what is before the (_) denotes a "family," a common component.

There is no particular order in the above forms, and there can be hundreds of such files. This is not known beforehand.

What I would like to do is append all datasets that share the same component before the underscore (_), that is, all datasets in the same family. For example, append A_C.dta, A_D.dta, A_E.dta, or append 2568975_112565.dta, 2568975_130520.dta, 2568975_999980.dta.

After the append is done, I'd like the result to be saved with a name of the form firm_number.dta, for example firm_1.dta, firm_2.dta, firm_3.dta, etc. Each number stands for a specific family: 1 might indicate the "A_" files, while 2 might indicate the "B_" files. That is, there should be order in the file names.

Also, the files that have been used for the append must be deleted, so that only the appended files (firm_1.dta, firm_2.dta, firm_3.dta, etc.) remain in the end.

Is there a way to deal with filenames in the way described above?
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#2

07 Jan 2021, 17:14

Code:

clear*
filelist, pattern(*_*.dta) norecursive
split filename, parse("_") gen(firm)

capture program drop one_family
program define one_family
    local family = firm1[1]
    forvalues i = 1/`=_N' {
        frame building: clear
        local nextfile = filename[`i']
        frame building: append using `"`nextfile'"'
        erase `"`nextfile'"'
    }
    frame building: save `"firm_`family'"', replace
    exit
end

frame create building
runby one_family, by(firm1) verbose

Notes: As this uses frames, version 16.0 or later is required.
-filelist- is written by Robert Picard and is available from SSC.
-runby- is written by Robert Picard and me. It is also available from SSC.
This program assumes that all of the files to be handled are in the same directory, not organized into subdirectories. It also assumes that any file whose name matches *_*.dta in that directory is part of the process.
The Stata command -erase- does not move files to the Recycle Bin. It actually erases them from the storage device. So you need to be really sure you will not need to recover these files (or that you have already backed them up somewhere). Use at your own risk.
Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#3

08 Jan 2021, 02:44

Dear Clyde,

Thanks for the code provided. Fortunately, I work with Stata 16.1, so frames could be utilized. Unfortunately, the code provided does not do what I had in mind. Maybe I did not explain well. Here is what I mean (with figures). Suppose that you have two distinct databases in the folder. For the sake of the example, assume their names are: a_c.dta and a_d.dta. Let's suppose these have the following elements.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 case_firm str10 firm_id int year str10 shareholder_id double(SIC3_code rivals beta_firm_s) float(numerator denominator) str2 matched_pair
"A" "A" 2000 "s1" 251 2 .05 .01046875 .003725 "AC"
"A" "A" 2000 "s4" 251 2 .035 . . "AC"
"A" "C" 2000 "s1" 251 2 .1875 . . "AC"
"A" "C" 2000 "s4" 251 2 .03125 . . "AC"
end

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 case_firm str10 firm_id int year str10 shareholder_id double(SIC3_code rivals beta_firm_s) float(numerator denominator) str2 matched_pair
"A" "A" 2000 "s1" 251 2 .05 .033414286 .013329 "AD"
"A" "A" 2000 "s4" 251 2 .035 . . "AD"
"A" "A" 2000 "s5" 251 2 .098 . . "AD"
"A" "D" 2000 "s1" 251 2 .114285714 . . "AD"
"A" "D" 2000 "s4" 251 2 .071428571 . . "AD"
"A" "D" 2000 "s5" 251 2 .257142857 . . "AD"
end

What I want from the code is to find those two datasets and append them. The names of the datasets that must be appended always have a common part before the underscore (_); in this case it happens to be "a". (There can be several files that start with the same prefix before the underscore.) In other cases, the prefix might be a number. Ideally, I would have liked the code to make a new dataset based on the appending of the above two components that would look like this:

The code you provided just keeps the second dataset and saves it as firm_a.dta. For some reason, the two are not appended, so the preferred outcome is not produced.

In addition, since there are many files in the folder, I would like those to be named as firm_number.dta. This number should be ordered so that it can be used in another loop later on. For example, for firm_a.dta it could become firm_1.dta. The precise number does not matter, as long as there is order in file naming, starting from 1 to a maximum N (which is unknown).

I hope it is clearer now.
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#4

08 Jan 2021, 11:18

Sorry, my error. With regard to the created files being incorrect, I see the problem. The following code corrects that:

Code:

clear*
filelist, pattern(*_*.dta) norecursive
split filename, parse("_") gen(firm)

capture program drop one_family
program define one_family
    local family = firm1[1]
    frame building: clear
    forvalues i = 1/`=_N' {
        local nextfile = filename[`i']
        frame building: append using `"`nextfile'"'
        erase `"`nextfile'"'
    }
    frame building: save `"firm_`family'"', replace
    exit
end

frame create building
runby one_family, by(firm1) verbose

The only change is that the -frame building: clear- command has been moved out of the loop. That will allow the files to accumulate and append.

As for the preference to use numbers in the filenames, I'm going to bow out of that. Although it is not that hard to program, I have no idea what you mean by "as long as there is order in file naming" and I suspect that the thread will extend to many posts clarifying that. Moreover, although I'm sure you have your reasons for this, it would, at a minimum, also require maintaining some sort of crosswalk between the numbering scheme and which family of firms is represented in the files, so this approach just strikes me as unnecessarily complicated. If you feel strongly enough that you must do it that way, I'll leave it to you to implement on your own.
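
For readers who do want numbered output files, here is a minimal sketch of one way it might be done. It is not part of the code posted in this thread. It assumes that "order" simply means numbering the prefixes 1 to N in their sorted (string) order, which is what -egen, group()- produces, and the names family_id and crosswalk.dta are hypothetical. Note that output names such as firm_1.dta themselves match *_*.dta, so rerunning this in the same folder would pick them up as inputs.

```stata
clear*
filelist, pattern(*_*.dta) norecursive      // -filelist- is from SSC
split filename, parse("_") gen(firm)

* Number the families 1..N in sorted (string) order of the prefix.
egen long family_id = group(firm1)

* Keep a crosswalk so each number can be traced back to its prefix.
* The name crosswalk.dta is hypothetical; it deliberately has no underscore.
preserve
keep firm1 family_id
duplicates drop
save crosswalk, replace
restore

capture program drop one_family
program define one_family
    local id = family_id[1]
    frame building: clear
    forvalues i = 1/`=_N' {
        local nextfile = filename[`i']
        frame building: append using `"`nextfile'"'
        erase `"`nextfile'"'                // permanent deletion, as in the code above
    }
    frame building: save `"firm_`id'"', replace
end

frame create building
runby one_family, by(family_id) verbose     // -runby- is from SSC
```

Because the prefixes are strings, "10" sorts before "2"; if a numeric ordering matters, the grouping variable would need to be built differently.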
Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#5

08 Jan 2021, 11:43

Much obliged Clyde. Your help has been crucial.
Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#6

09 Jan 2021, 10:24

I have a point to make about the code in #4.

I tried it in a folder with lots of data and observed something. The code ran for most of the files in the folder, creating filenames in the format firm_family.dta. I saved these files in another folder to compare later, because the code had worked on the toy example I provided and I was curious about what was going on. However, I saw that some files were not affected by the code, and I had to run it a second time. Then the rest of the files were created, again in the form firm_family.dta.

I observed that one of the files created in the second run was named firm_490.dta, which had already been created (partially) in the first run, combining some 40-something files out of the 89. I wonder why this might have happened, and why the code did not pick up those files initially. I assume something happened that broke it, although it did not stop running.

I also tried to not include file names 409_*.dta and re-run it. The problem is still there.

Last edited by Pantelis Kazakis; 09 Jan 2021, 10:31.
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#7

09 Jan 2021, 13:38

I can't think of any explanation for the problem you are describing. While I can think of reasons why a file might be missed by the code, such as if its filename does not conform to the pattern *_*.dta, it remains inexplicable why it would be picked up in a later rerun of the code.

I would also point out that if you are running this on a Mac, filenames are case sensitive, so if a_C and A_D were originally created, on a Mac, they will not be brought together by this code, whereas in Windows they will.

Other than those things, I have no ideas.

And I don't think I can troubleshoot this over Statalist because it clearly relies on the files themselves, and there is no practical way to recreate your files and directory structure here. So I will have to suggest that you find somebody who can work with you in person directly on your machine. I'm sorry.
Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#8

09 Jan 2021, 13:50

Hi Clyde. Sure, thanks for your assistance so far. I am on Windows, so case sensitivity should not be a problem.

After looking at the files, I found that some of them were empty; they had only variable names, but no content inside (i.e., no data). So, when I removed those, the code performed better, albeit I can still see that for one case only, it appended part of the family firm and its matched pair, but some files in that family-firm and firm pair did not match. I will investigate more on this.
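
A minimal sketch of how such empty files might be found without opening each one by hand. It assumes the files sit in the current directory and relies on -describe using- leaving the number of observations in r(N); nothing is deleted.

```stata
* Flag every *_*.dta file in the current directory that has no observations.
local files : dir "." files "*_*.dta", respectcase

foreach fn of local files {
    quietly describe using `"`fn'"'
    if r(N) == 0 {
        display as text `"`fn' has no observations"'
    }
}
```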
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#9

09 Jan 2021, 14:12

Here's a thought. It may be that some of the files that you want to put together in a single "family" are not completely append-compatible. For example, there could be a variable that is numeric in one of the files, while the variable of the same name is string in another. That will cause an error in the -append- command, and probably leads to the family file failing to be created. If you add the -verbose- option to the -runby- command you will see any error messages generated along the way. (Unfortunately, you will also see a lot of other output, and since this is a large-scale project, it may be difficult to sort through all of the output to find the error messages. Isolating a group of files that exhibits the problem would reduce this difficulty.)
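
One way to isolate such a group might look like the sketch below. It is only an illustration: the prefix "490" is taken from the earlier post as an example, nothing is erased or saved, and -capture noisily- is used so that each failing -append- shows its error message and the loop carries on.

```stata
* Try to append every file of one family and report which ones fail.
local prefix "490"                          // illustrative family prefix
local files : dir "." files "`prefix'_*.dta", respectcase

clear
foreach fn of local files {
    capture noisily append using `"`fn'"'
    if _rc {
        display as error "append failed for `fn' (return code " _rc ")"
    }
}
```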

Bjarte Aagnes

Join Date: Apr 2014
Posts: 786

#10

09 Jan 2021, 14:16

Hopefully the file issues will be solved. Below is a tentative solution not using frames or any user written commands:

Code:

********************************************************************************
clear

local files : dir "." files "*_*.dta", respectcase
local files : list sort files   // keep files with the same stub next to each other

foreach fn of local files {

    local stub = substr("`fn'", 1, strpos("`fn'", "_") - 1)

    if ( inlist("`prev'", "", "`stub'") ) { // first file, or same stub as prev

        append using `"`fn'"'
    }

    else { // different stub: save the finished family, then start the next one

        save `"`prev'"'
        clear
        append using `"`fn'"'
    }

    local prev `stub'
    * erase `"`fn'"' // uncomment if you want to permanently delete files
}

save `"`prev'"'   // the very last family; same stub as prev

clear
********************************************************************************


Pantelis Kazakis

Join Date: Aug 2014
Posts: 123

#11

09 Jan 2021, 15:08

Hi Bjarte,

Thanks for the code. Indeed this works too. Instead of keeping the faulty case in the folder, it removes it (after uncommenting part of your code above).

Now, the code informs me which firm might be problematic. There must be some data inconsistency for this specific firm. I will need to look at this in detail.
