does Stata store the filename in memory?

Kyle Smith

Join Date: Mar 2017

Posts: 117
#1

07 Oct 2017, 12:31

I would like to create variables within a dataset using parts of the filename. Is this possible? Does STATA store the filename in memory somewhere?

The filename is in the following format:

identifier_date_form.txt

I would like to create 3 variables in the file: identifier, date, form.

This will be part of a loop that loops over thousands of files, thus I will not be calling/loading each file by hand.

I don't know if this is helpful or not, but separately I am able to define the variables using the filelist command.

The code to do that:

filelist, dir("C:\Users\dir\temp") pat("*.txt") save("txt_datasets.dta")

use txt_datasets.dta, clear

split filename, p(_)
gen id=filename1
gen date=filename2
gen form=filename3

keep filename date form id

save txt_datasets, replace

I just do not know how to load/call/identify the filename once the file is in use within the loop.

Thanks in advance.
daniel klein

Join Date: Mar 2014

Posts: 3845
#2

07 Oct 2017, 12:40

See

Code:

help creturn

The filename last specified with use (or save) is stored in c(filename).

I do not know whether that helps, as you state neither whether you wish to get the contents of these .txt files into Stata nor, if so, how you plan to do it.
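A minimal sketch of how c(filename) can be inspected after save and use. The file name cfilename_demo.dta is a hypothetical scratch file, not something from the original question:

```stata
* minimal sketch; cfilename_demo.dta is a hypothetical scratch file
clear
set obs 1
generate x = 1
save "cfilename_demo.dta", replace

use "cfilename_demo.dta", clear
* show the name last specified with -use- (or -save-)
display "`c(filename)'"
* copy it into a string variable
generate source = c(filename)
```

Run this from a do-file in a directory where creating a small scratch dataset is acceptable.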

Best
Daniel
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 117
#3

07 Oct 2017, 12:57

Daniel,

Thanks for the help. Unfortunately, that must only work for dta files. I am loading the files using import delimited. The c(filename) is empty in the case of .txt files.

I haven't worked out all the steps yet. I was trying to figure this out one step at a time. First, how to do this for one individual file before building the loop to do it for all files in the directory.

The entire project is to count the number of times specific words are mentioned within thousands of txt files, then save those word counts as one row in a new data file. But I also need the ID, date, form in the same row as the word counts.

Each add'l txt file would add 1 row to the master file and would contain (ID, date, form, word counts).
Comment
daniel klein

Join Date: Mar 2014

Posts: 3845
#4

07 Oct 2017, 13:37

I am still not fully getting what you want. Reading your initial post once more, I even fail to see why you would want the filename from memory, when you seem to get the names via filelist (from SSC, I suppose). Looking through the help file for filelist, I believe the last example gives the basic setup that you want.

Best
Daniel
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#5

07 Oct 2017, 15:50

Why not use the extended macro functions to get a list of file names and use the words from the macro to create your filename variable?
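A rough sketch of that idea, assuming the identifier_date_form.txt naming described in the first post and that the .txt files sit in the current directory. The local names stub and parts are made up for illustration:

```stata
* sketch only: list the .txt files and pull the name apart with macro functions
local txtfiles : dir . files "*.txt"
foreach f of local txtfiles {
    * drop the extension, then turn underscores into spaces
    local stub : subinstr local f ".txt" ""
    local parts : subinstr local stub "_" " ", all
    * tokenize puts the pieces into `1', `2', `3'
    tokenize `parts'
    display "file `f': id=`1' date=`2' form=`3'"
}
```

Inside the loop, the same macros could be used to fill string variables after the file has been read in, for example generate id = "`1'".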
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 117
#6

07 Oct 2017, 16:39

I am getting closer.

This is the first part that reads in .txt and saves .dta files.

This part runs/works fine.

************************************************** *********
clear
cd "C:\Users\temp"
clear
set more off
local myfilelist : dir . files "*.txt"
foreach file of local myfilelist {
insheet using `file', clear
local outfile = subinstr("`file'",".txt","",.)
save "geo_`outfile'", replace
}
************************************************** *

This is the second part, which creates the variables based on the filenames. I need to split twice: first to get rid of the directory ("\"), and then to separate the parts I want ("_").

This part works on any individual file (if I open a file and then run the code within the loop, but don't run the loop; I tried this on 10 different files), but not in the loop. I get an error

variable temp110 not found
r(111);

I think it runs the first time (first file) and then gives the error. I am not a loop expert.

**************************************************

clear
cd "C:\Users\dir"
clear
local allfiles: dir . files "*.dta"
set more off
foreach file of local allfiles {
use `file', clear

gen temp1=c(filename)
split temp1, p(\)
split temp110, p(_)
gen date=temp1102
gen form=temp1103
gen ID=temp1106

save `file', replace
}

************************************************

Any idea why it is not looping? Do you understand what I am trying to do?
Comment
Daniel Bela

Join Date: Apr 2014

Posts: 246
#7

08 Oct 2017, 06:10

Hi Kyle,

there are several issues with your code that may prevent the loop from continuing if the filenames you're looping over are not homogeneous.

Anyhow, please have a look at the FAQ on how to use code delimiters here in order to make your code examples readable.
As I understand your question, you want to batch-convert a bunch of plain text data files to Stata datasets, and each of them should include a variable with the source file name.

Am I correct? Then there is no need for the second loop, or for Stata saving any filename internally: you're already looping over these filenames in the first place.

I put up a minimal example to illustrate how to do what I assume you try to do:

Code:

clear
// set up example file 1
input contentvar1
1
2
end
export delimited id1_date1_form1.txt , replace
clear
// set up example file 2
input contentvar1
3
4
end
export delimited id2_date1_form1.txt , replace
clear

// loop over txt-files, and save each as dta
local myfilelist : dir . files "*.txt"
foreach file of local myfilelist {
    import delimited using `file', clear
    local sourcefile=subinstr("`file'",".txt","",.)
    // why not generate the filename-variable right here?
    generate sourcefile=`"`sourcefile'"'
    local outfile `"geo_`sourcefile'.dta"'
    save "`outfile'", replace
    * you can now save a growing list of all output datasets, and -append- them at the end, if you want to:
    * see -help macrolists-
    local outfilelist : list outfilelist | outfile
}

// append all saved Stata files onto each other, if you wish
clear
append using `outfilelist'

// finally, split up the source file name into the parts you're interested in the result dataset
split sourcefile , parse("_")
rename (`r(varlist)') (id date form)
order id date form sourcefile , first

What I want to illustrate is: As soon as you loop over the original .txt-files, you have each name as a local macro anyways; use this to populate your file name variable.

I added some code at the end to show how to auto-append all the files together, if this is desired. I guess it's easier to split up the source file names after combining all the data together, not before.

Regards
Bela
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 117
#8

08 Oct 2017, 08:43

Bela,

I wish I was as smart with STATA as you.

(The reason I didn't create the names inside the loop previously is that I didn't know how to get at the names: c(filename) was empty. I didn't know your way.)

I also apologize for not using the delimiters. I use them now.

Before I append, I need to count the number of times specific words appear in the txt files. (These files are like word documents, not excel sheets, structurally speaking.)

The data in the text files is all text. So when I read it in, it enters as v1, v2, etc. Some of the files have 15 variables (v1-v15) of text and some only have 6 variables (v1-v6). Some are populated with missing (.) which means STATA thinks they are numerical, while most are read in as string (as they should be). Is there a way to know how many of these variables are in each file so I can create a variable that includes all of the text variables?

I can do something like:

Code:

local nvars = c(k)
unab allvars : *
loc lastvar: word `c(k)' of `r(varlist)'
tostring v1-`lastvar', force replace
foreach var of varlist {
    gen text=v1-`lastvar'
    drop v*
}

Except I get some error in the tostring part. Also I suspect I am not creating "text" variable correctly?

I will run the above code before the filename sourcefile code. That way I can delete the extra text variables and keep things clean.

I am not sure what your last line in the loop does?

Code:

local outfilelist : list outfilelist | outfile
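For what it is worth, that line is a macro list operation (see help macrolists): `|` forms the union of two lists, so the name held in local outfile is added to local outfilelist unless it is already there. A small sketch with made-up file names:

```stata
* sketch only: geo_a.dta and geo_b.dta are made-up names
local outfilelist
foreach outfile in "geo_a.dta" "geo_b.dta" "geo_a.dta" {
    * union of the list so far and the current name
    local outfilelist : list outfilelist | outfile
}
* the repeated name is not added a second time
display `"`outfilelist'"'
```

Because the list is built up one file per pass, anything that uses the complete list belongs after the closing brace of the loop.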

Also I am not sure if the part below the loop needs to go inside the loop?

Code:

// append all saved Stata files onto each other, if you wish
clear
append using `outfilelist'

After this I would like to save and append a separate datafile with only the word counts, form, ID, date. Then this huge problem (huge for me) will be solved.

The word counts will be like this:

Code:

moss text, match("positive") prefix(pos)
moss text, match("negative") prefix(neg)
egen total_pos=total(poscount)
egen total_neg=total(negcount)
keep total* ID date form
duplicates drop

Thanks very much for your help.
Comment

Daniel Bela

Join Date: Apr 2014
Posts: 246
#9

08 Oct 2017, 14:10

Hi again,

I feel you're trying to solve problems that can be avoided directly upon import of the text files. Have a look at the options in -help import delimited-, and you will see that it is quite easy to (1) import everything as string and (2) import everything into a single variable. If you had shown an example of the data you're working with, there would have been a quicker solution, I guess.

That said, I think all modifications necessary to my previous code are two more options to the -import delimited- statement. You can count words afterwards, as you wish:

Code:

clear
// set up example file 1
input str30(contentvar1)
 `"this is some text"'
 `"this is more text"'
end
outfile using id1_date1_form1.txt , replace wide noquote
clear
// set up example file 2
input str30(contentvar1)
 `"even more text is here"'
 `"this is even "quoted" text"'
end
outfile using id2_date1_form1.txt , replace wide noquote
clear

// loop over txt-files, and save each as dta
local myfilelist : dir . files "*.txt"
foreach file of local myfilelist {
    import delimited using `file', clear stringcols(_all) varnames(nonames)
    local sourcefile=subinstr("`file'",".txt","",.)
    // why not generate the filename-variable right here?
    generate sourcefile=`"`sourcefile'"'
    local outfile `"geo_`sourcefile'.dta"'
    save "`outfile'", replace
    * you can now save a growing list of all output datasets, and -append- them at the end, if you want to:
    * see -help macrolists-
    local outfilelist : list outfilelist | outfile
}

// append all saved Stata files onto each other, if you wish
clear
append using `outfilelist'

// finally, split up the source file name into the parts you're interested in the result dataset
split sourcefile , parse("_")
rename (`r(varlist)') (id date form)
order id date form , first
drop sourcefile

Regards
Bela

PS: If something does not work with this update, please add an example of the text files you're importing to the next post, ideally using -input- as I did for the minimal example. Otherwise, readers can not see what's producing problems.

Comment

Kyle Smith

Join Date: Mar 2017
Posts: 117

#10

13 Dec 2017, 15:42

I am revisiting this code and have a problem that I do not understand.

Code:

clear
// set up example file 1
input str50(contentvar1)
 `"this is some text"'
 `"name: joey"'
end
outfile using id1_date1_form1.txt , replace wide noquote
clear
// set up example file 2
input str50(contentvar1)
 `"even more text is here"'
 `"this is even "quoted" text"'
 `"name: billy"'
end
outfile using id2_date1_form1.txt , replace wide noquote
clear

// loop over txt-files, and save each as dta
local myfilelist : dir . files "*.txt"
foreach file of local myfilelist {
    import delimited using `file', clear stringcols(_all) varnames(nonames)
    local sourcefile=subinstr("`file'",".txt","",.)
    // why not generate the filename-variable right here?
    generate sourcefile=`"`sourcefile'"'
    local outfile `"geo_`sourcefile'.dta"'
    

    // this is the new part of the code
    gen temp1=strpos(v1,"name:")
    drop if temp1==0
    gen name=substr(v1,temp1+6,10)


    keep sourcefile name
    save "`outfile'", replace
    * you can now save a growing list of all output datasets, and -append- them at the end, if you want to:
    * see -help macrolists-
    local outfilelist : list outfilelist | outfile
}

// append all saved Stata files onto each other, if you wish
clear
append using `outfilelist'

// finally, split up the source file name into the parts you're interested in the result dataset
split sourcefile , parse("_")
rename (`r(varlist)') (id date form)
order id date form , first
drop sourcefile

However, I get the error:

no observations
r(2000);

right after

Code:

    local outfilelist : list outfilelist | outfile
}

This needs to loop over tens of thousands of .txt files. I have tried it on a subset of 30 files and it runs fine. I don't know why it fails when I try it on the larger sample?

It is possible that some of the files do not have "name:" in them. But I don't think that is the problem because if I change one of the files above so that it does not include "name:" the code still runs fine.

Any ideas/help would be much appreciated.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#11

13 Dec 2017, 16:04

I can't replicate your problem. The code runs without error messages or breaks on my installation.

One possibility: your code has a lot of local macros in it. You need to run this all in one fell swoop. If you run "chunks" of the code separately, the local macro definitions in one chunk are annulled before you get to the next chunk. If you are running this code in pieces, that is probably the source of your difficulty.
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 117
#12

13 Dec 2017, 16:25

Thanks for your reply Clyde. And big congrats on 10000 posts.

I am running the code in one fell swoop. It works on 30 files, but not on the larger sample.

Is there another way to do what I am trying to do? Or is there a way to diagnose the problem? Run it "noisily"?
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#13

13 Dec 2017, 16:37

I prefer to use filelist and runby (both from SSC) to do this type of work. Note that Stata has a limit of 10,000 files per directory for the Mata function that filelist uses so you'll have to split the files into subdirectories if you have more than that.

The following example assumes that your two sample files are in a subdirectory called "text_files" within Stata's current directory (help cd). The first step is to use filelist to create a dataset of files to process:

Code:

clear all
filelist, dir("text_files")
keep if strmatch(filename, "*.txt")

If you list the results, you get:

Code:

. list

     +------------------------------------------+
     | dirname      filename              fsize |
     |------------------------------------------|
  1. | text_files   id1_date1_form1.txt     106 |
  2. | text_files   id2_date1_form1.txt     159 |
     +------------------------------------------+

When you are satisfied that the list of files you want to process is complete, you can use runby to import each file.

Code:

program import_txt
  local dsource = dirname
  local fsource = filename
  import delimited using "`dsource'/`fsource'", clear stringcols(_all) varnames(nonames)
  generate sourcefile =`"`fsource'"'
  generate sourcedir  =`"`dsource'"'

  keep if strpos(v1,"name:")
  gen name = subinstr(v1,"name:","",1)
end

runby import_txt, by(dirname filename) verbose

With runby, what's left in memory when your import_txt program terminates is considered results and is stored. These results accumulate and replace the data in memory when all the by-groups have been processed. Here's what's left in memory when the above code has run:

Code:

. list

     +-----------------------------------------------------------------------------------------+
  1. |                                                   v1 |          sourcefile |  sourcedir |
     | name: joey                                           | id1_date1_form1.txt | text_files |
     |-----------------------------------------------------------------------------------------|
     |                                                                name                     |
     |                      joey                                                               |
     +-----------------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------------+
  2. |                                                   v1 |          sourcefile |  sourcedir |
     | name: billy                                          | id2_date1_form1.txt | text_files |
     |-----------------------------------------------------------------------------------------|
     |                                                                name                     |
     |                      billy                                                              |
     +-----------------------------------------------------------------------------------------+

Comment

Kyle Smith

Join Date: Mar 2017

Posts: 117
#14

13 Dec 2017, 17:06

Robert,

Thank you for the new suggestion. That is a very slick way of doing it. And fast as well. This method cruised right through the red "no observations" problem/error. It gave 4 such red error notes but kept on going until the end.

Regarding the 10,000 file number limit. Is that just *txt files or all files?

Also, is it possible to only investigate a subset of the files in the folder based on their filename? Some of my folders have 100,000 files in them so creating lots of subfolders would be costly. In the previous example the filename has id, date, form. Would it be possible to only run the program on files with form="form1"?

Much thanks,
Kyle
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 117
#15

13 Dec 2017, 17:17

Robert,

If I only wanted form1 and form3 (from the .txt filename), what about something like:

Code:

filelist, dir("text_files")
foreach j in form1 form3 {
    keep if strmatch(filename, "*`j*.txt")
}

It seems reasonable, but doesn't work.
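Two things appear to stop the snippet above from working: the macro reference `j is missing its closing quote ('), and each pass of the loop runs keep if on what the previous pass left behind, so a file would have to match both form1 and form3 to survive. A sketch of one alternative that combines the conditions in a single keep, assuming the identifier_date_form.txt naming from the start of the thread, exactly three underscore-separated parts per name, and the text_files directory from post #13. The variable names stub and part1 to part3 are made up for illustration:

```stata
* sketch only; requires -filelist- from SSC
filelist, dir("text_files")
keep if strmatch(filename, "*.txt")

* split identifier_date_form.txt into its three parts
generate stub = subinstr(filename, ".txt", "", 1)
split stub, parse("_") generate(part)
rename (part1 part2 part3) (id date form)

* keep only the forms of interest in one step
keep if inlist(form, "form1", "form3")
```

The resulting dataset of file names can then be passed on to the same processing step as before.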
Comment