
  • does Stata store the filename in memory?

    I would like to create variables within a dataset using parts of the filename. Is this possible? Does STATA store the filename in memory somewhere?

    The filename is in the following format:

    identifier_date_form.txt

    I would like to create 3 variables in the file: identifier, date, form.

    This will be part of a loop that loops over thousands of files, thus I will not be calling/loading each file by hand.


    I don't know if this is helpful or not, but separately I am able to define the variables using the filelist command.

    The code to do that:

    filelist, dir("C:\Users\dir\temp") pat("*.txt") save("txt_datasets.dta")

    use txt_datasets.dta, clear

    split filename, p(_)
    gen id=filename1
    gen date=filename2
    gen form=filename3

    keep filename date form id

    save txt_datasets, replace


    I just do not know how to load/call/identify the filename once the file is in use within the loop.


    Thanks in advance.

  • #2
    See

    Code:
    help creturn
    The filename last specified with use (or save) is stored in c(filename).

    I do not know whether that helps, as you state neither whether you wish to get the contents of these .txt files into Stata nor, if so, how you plan to do it.

    Best
    Daniel
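
    As a minimal sketch of how c(filename) could be put to use, the stored name can be copied into a local macro and taken apart. The file name id1_date1_form1.dta and the variable x are hypothetical and only serve to make the example self-contained:

    Code:
    * create and save a small example dataset (hypothetical file name)
    clear
    set obs 1
    generate x = 1
    save "id1_date1_form1.dta", replace
    use "id1_date1_form1.dta", clear
    * c(filename) now holds the name last given to -use- or -save-
    local fname "`c(filename)'"
    display "`fname'"
    * drop the extension, then break the name at the underscores
    local stub : subinstr local fname ".dta" ""
    tokenize "`stub'", parse("_")
    * with parse("_") the underscores are tokens too: `1' id, `3' date, `5' form
    generate id   = "`1'"
    generate date = "`3'"
    generate form = "`5'"
    list
    If the name was given with a directory, the path would have to be stripped before tokenizing.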



    • #3
      Daniel,

      Thanks for the help. Unfortunately, that must only work for dta files. I am loading the files using import delimited. The c(filename) is empty in the case of .txt files.

      I haven't worked out all the steps yet. I was trying to figure this out one step at a time. First, how to do this for one individual file before building the loop to do it for all files in the directory.

      The entire project is to count the number of times specific words are mentioned within 1000s of txt files, then save those word counts as 1 row in a new data file. But I also need the ID, date, form in the same row as the word counts.

      Each add'l txt file would add 1 row to the master file and would contain (ID, date, form, word counts).



      • #4
        I am still not fully getting what you want. Reading your initial post once more, I even fail to see why you would want the filename from memory, when you seem to get the names via filelist (from SSC, I suppose). Looking through the help file for filelist, I believe the last example gives the basic setup that you want.

        Best
        Daniel



        • #5
          Why not use the extended macro functions to get a list of file names and use the words from the macro to create your filename variable?
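
          A minimal sketch of that idea, assuming the .txt files sit in the current directory and follow the identifier_date_form.txt pattern from the first post (the macro names files, f, and stub are made up for illustration):

          Code:
          * collect the names of all .txt files in the current directory
          local files : dir . files "*.txt"
          foreach f of local files {
              * remove the extension from the file name
              local stub : subinstr local f ".txt" ""
              * split at underscores; the underscores themselves become tokens 2 and 4
              tokenize "`stub'", parse("_")
              display "file: `f'   id: `1'   date: `3'   form: `5'"
          }
          Inside such a loop, the pieces in `1', `3', and `5' could be written into variables with generate right after each file is read.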



          • #6
            I am getting closer.

            This is the first part that reads in .txt and saves .dta files.

            This part runs/works fine.


            ***********************************************************
            clear
            cd "C:\Users\temp"
            clear
            set more off
            local myfilelist : dir . files "*.txt"
            foreach file of local myfilelist {
            insheet using `file', clear
            local outfile = subinstr("`file'",".txt","",.)
            save "geo_`outfile'", replace
            }
            ***********************************************************

            This is the second part that creates the variables based on filenames. I need to split twice: the first split gets rid of the directory ("\") and the second splits out the variables I want ("_").

            This part works on any individual file (if I open a file and run the code inside the loop without running the loop itself; I tried this on 10 different files), but not in the loop. I get an error:

            variable temp110 not found
            r(111);



            I think it runs the first time (first file) and then gives the error. I am not a loop expert.

            **************************************************

            clear
            cd "C:\Users\dir"
            clear
            local allfiles: dir . files "*.dta"
            set more off
            foreach file of local allfiles {
            use `file', clear

            gen temp1=c(filename)
            split temp1, p(\)
            split temp110, p(_)
            gen date=temp1102
            gen form=temp1103
            gen ID=temp1106

            save `file', replace
            }

            ************************************************


            Any idea why it is not looping? Do you understand what I am trying to do?



            • #7
              Hi Kyle,

              there are several issues with your code that may prevent the loop from continuing if the filenames you're looping over are not homogeneous.

              Anyhow, please have a look at the FAQ on how to use code delimiters here in order to make your code examples readable.
              As I interpret your question, you want to batch-convert a bunch of plain text data files to Stata datasets, and each of them should include a variable with the source file name.

              Am I correct? Then there is no need for the second loop, or for Stata saving any filename internally---you're already looping over these filenames in the first place.

              I put up a minimal example to illustrate how to do what I assume you try to do:
              Code:
              clear
              // set up example file 1
              input contentvar1
               1
               2
              end
              export delimited id1_date1_form1.txt , replace
              clear
              // set up example file 2
              input contentvar1
               3
               4
              end
              export delimited id2_date1_form1.txt , replace
              clear
              
              // loop over txt-files, and save each as dta
              local myfilelist : dir . files "*.txt"
              foreach file of local myfilelist {
                  import delimited using `file', clear
                  local sourcefile=subinstr("`file'",".txt","",.)
                  // why not generate the filename-variable right here?
                  generate sourcefile=`"`sourcefile'"'
                  local outfile `"geo_`sourcefile'.dta"'
                  save "`outfile'", replace
                  * you can now save a growing list of all output datasets, and -append- them at the end, if you want to:
                  * see -help macrolists-
                  local outfilelist : list outfilelist | outfile
              }
              
              // append all saved Stata files onto each other, if you wish
              clear
              append using `outfilelist'
              
              // finally, split up the source file name into the parts you're interested in the result dataset
              split sourcefile , parse("_")
              rename (`r(varlist)') (id date form)
              order id date form sourcefile , first
              What I want to illustrate is: as soon as you loop over the original .txt files, you have each name as a local macro anyway; use this to populate your file name variable.

              I added some code at the end to show how to auto-append all the files together, if this is desired. I guess it's easier to split up the source file names after combining all the data together, not before.
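
              As a small sketch of what the -list- line in the loop does: the | operator forms the union of two macro lists, so each new file name is added to outfilelist once, and a name that is already in the list is not added again. The file names here are hypothetical:

              Code:
              * start with an empty list
              local outfilelist
              foreach outfile in geo_a.dta geo_b.dta geo_a.dta {
                  * union: keep what is in outfilelist, add outfile if it is new
                  local outfilelist : list outfilelist | outfile
              }
              * should display: geo_a.dta geo_b.dta
              display "`outfilelist'"
              Because the list is only complete after the last pass, the -append using- step belongs after the loop, not inside it.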

              Regards
              Bela



              • #8
                Bela,

                I wish I was as smart with STATA as you.

                (The reason I didn't create the names inside the loop previously is that I didn't know how to call the names; c(filename) was empty. I didn't know your way.)

                I also apologize for not using the delimiters. I use them now.

                Before I append, I need to count the number of times specific words appear in the txt files. (These files are like word documents, not excel sheets, structurally speaking.)

                The data in the text files is all text. So when I read it in, it enters as v1, v2, etc. Some of the files have 15 variables (v1-v15) of text and some only have 6 variables (v1-v6). Some are populated with missing (.) which means STATA thinks they are numerical, while most are read in as string (as they should be). Is there a way to know how many of these variables are in each file so I can create a variable that includes all of the text variables?

                I can do something like:

                Code:
                 local nvars = c(k)
                 unab allvars : *
                 loc lastvar: word `c(k)' of `r(varlist)'
                 tostring v1-`lastvar', force replace
                 
                 foreach var of varlist {
                    gen text=v1-`lastvar'
                 drop v*
                }
                Except I get some error in the tostring part. Also I suspect I am not creating "text" variable correctly?

                I will run the above code before the filename sourcefile code. That way I can delete the extra text variables and keep things clean.

                I am not sure what your last line in the loop does?
                Code:
                local outfilelist : list outfilelist | outfile
                Also I am not sure if the part below the loop needs to go inside the loop?
                Code:
                // append all saved Stata files onto each other, if you wish
                clear
                append using `outfilelist'
                After this I would like to save and append a separate datafile with only the word counts, form, ID, date. Then this huge problem (huge for me) will be solved.

                The word counts will be like this:
                Code:
                moss text, match("positive") prefix(pos)
                moss text, match("negative") prefix(neg)
                
                egen total_pos=total(poscount)
                egen total_neg=total(negcount)
                
                keep total* ID date form
                duplicates drop
                Thanks very much for your help.





                • #9
                  Hi again,

                  I feel you're trying to solve problems that can be avoided directly upon import of the text files. Have a look at the options in -help import delimited-, and you will see that it is quite easy to (1) import everything as string and (2) import everything into a single variable. If you had shown an example of the data you're working with, there would have been a quicker solution, I guess.

                  That said, I think the only modifications necessary to my previous code are two more options to the -import delimited- statement. You can count words afterwards, as you wish:
                  Code:
                  clear
                  // set up example file 1
                  input str30(contentvar1)
                   `"this is some text"'
                   `"this is more text"'
                  end
                  outfile using id1_date1_form1.txt , replace wide noquote
                  clear
                  // set up example file 2
                  input str30(contentvar1)
                   `"even more text is here"'
                   `"this is even "quoted" text"'
                  end
                  outfile using id2_date1_form1.txt , replace wide noquote
                  clear
                  
                  // loop over txt-files, and save each as dta
                  local myfilelist : dir . files "*.txt"
                  foreach file of local myfilelist {
                      import delimited using `file', clear stringcols(_all) varnames(nonames)
                      local sourcefile=subinstr("`file'",".txt","",.)
                      // why not generate the filename-variable right here?
                      generate sourcefile=`"`sourcefile'"'
                      local outfile `"geo_`sourcefile'.dta"'
                      save "`outfile'", replace
                      * you can now save a growing list of all output datasets, and -append- them at the end, if you want to:
                      * see -help macrolists-
                      local outfilelist : list outfilelist | outfile
                  }
                  
                  // append all saved Stata files onto each other, if you wish
                  clear
                  append using `outfilelist'
                  
                  // finally, split up the source file name into the parts you're interested in the result dataset
                  split sourcefile , parse("_")
                  rename (`r(varlist)') (id date form)
                  order id date form , first
                  drop sourcefile
                  Regards
                  Bela

                  PS: If something does not work with this update, please add an example of the text files you're importing to the next post, ideally using -input- as I did for the minimal example. Otherwise, readers cannot see what is causing the problem.
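
                  For the word counting itself, here is a minimal sketch that uses only built-in string functions instead of -moss-. It assumes the text sits in one string variable v1, and it counts substring matches, so "positive" inside a longer word would also be counted. The example lines and the names n_pos and n_neg are made up:

                  Code:
                  clear
                  input str40 v1
                  "a positive and positive note"
                  "nothing here"
                  "negative, then positive"
                  end
                  * occurrences = characters removed by deleting the word, divided by the word length
                  generate n_pos = (strlen(v1) - strlen(subinstr(v1, "positive", "", .))) / strlen("positive")
                  generate n_neg = (strlen(v1) - strlen(subinstr(v1, "negative", "", .))) / strlen("negative")
                  * totals over all lines of the file: 3 for positive and 1 for negative here
                  egen total_pos = total(n_pos)
                  egen total_neg = total(n_neg)
                  list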



                  • #10
                    I am revisiting this code and have a problem that I do not understand.

                    Code:
                    clear
                    // set up example file 1
                    input str50(contentvar1)
                     `"this is some text"'
                     `"name: joey"'
                    end
                    outfile using id1_date1_form1.txt , replace wide noquote
                    clear
                    // set up example file 2
                    input str50(contentvar1)
                     `"even more text is here"'
                     `"this is even "quoted" text"'
                     `"name: billy"'
                    end
                    outfile using id2_date1_form1.txt , replace wide noquote
                    clear
                    
                    // loop over txt-files, and save each as dta
                    local myfilelist : dir . files "*.txt"
                    foreach file of local myfilelist {
                        import delimited using `file', clear stringcols(_all) varnames(nonames)
                        local sourcefile=subinstr("`file'",".txt","",.)
                        // why not generate the filename-variable right here?
                        generate sourcefile=`"`sourcefile'"'
                        local outfile `"geo_`sourcefile'.dta"'
                        
                    
                        // this is the new part of the code
                        gen temp1=strpos(v1,"name:")
                        drop if temp1==0
                        gen name=substr(v1,temp1+6,10)
                    
                    
                        keep sourcefile name
                        save "`outfile'", replace
                        * you can now save a growing list of all output datasets, and -append- them at the end, if you want to:
                        * see -help macrolists-
                        local outfilelist : list outfilelist | outfile
                    }
                    
                    // append all saved Stata files onto each other, if you wish
                    clear
                    append using `outfilelist'
                    
                    // finally, split up the source file name into the parts you're interested in the result dataset
                    split sourcefile , parse("_")
                    rename (`r(varlist)') (id date form)
                    order id date form , first
                    drop sourcefile
                    However, I get the error:

                    no observations
                    r(2000);

                    right after
                    Code:
                        local outfilelist : list outfilelist | outfile
                    }
                    This needs to loop over tens of thousands of .txt files. I have tried it on a subset of 30 files and it runs fine. I don't know why it fails when I try it on the larger sample?

                    It is possible that some of the files do not have "name:" in them. But I don't think that is the problem because if I change one of the files above so that it does not include "name:" the code still runs fine.

                    Any ideas/help would be much appreciated.



                    • #11
                      I can't replicate your problem. The code runs without error messages or breaks on my installation.

                      One possibility: your code has a lot of local macros in it. You need to run this all in one fell swoop. If you run "chunks" of the code separately, the local macro definitions in one chunk are annulled before you get to the next chunk. If you are running this code in pieces, that is probably the source of your difficulty.
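
                      If the code is run in one go and still stops, one way to find the file responsible is to wrap the import in -capture noisily- and report the file name whenever something goes wrong. This is only a diagnostic sketch under the assumption that the failure is tied to particular files; the messages are made up:

                      Code:
                      local myfilelist : dir . files "*.txt"
                      foreach file of local myfilelist {
                          * show any error message, but do not stop the loop
                          capture noisily import delimited using "`file'", clear stringcols(_all) varnames(nonames)
                          if _rc {
                              display as error "import failed for `file' with return code " _rc
                              continue
                          }
                          * flag files that came in empty or without a v1 column
                          capture confirm variable v1
                          if _rc | _N == 0 {
                              display as error "`file' has no v1 or no observations"
                              continue
                          }
                          * ... the rest of the loop body goes here ...
                      }
                      Running the do-file after -set trace on- is another option, since the trace shows the last command executed before the error.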



                      • #12
                        Thanks for your reply Clyde. And big congrats on 10000 posts.

                        I am running the code in one fell swoop. It works on 30 files, but not on the larger sample.

                        Is there another way to do what I am trying to do? Or is there a way to diagnose the problem? Run it "noisily"?



                        • #13
                          I prefer to use filelist and runby (both from SSC) to do this type of work. Note that Stata has a limit of 10,000 files per directory for the Mata function that filelist uses so you'll have to split the files into subdirectories if you have more than that.

                          The following example assumes that your two sample files are in a subdirectory called "text_files" within Stata's current directory (help cd). The first step is to use filelist to create a dataset of files to process:
                          Code:
                          clear all
                          filelist, dir("text_files")
                          keep if strmatch(filename, "*.txt")
                          If you list the results, you get:
                          Code:
                          . list
                          
                               +------------------------------------------+
                               | dirname      filename              fsize |
                               |------------------------------------------|
                            1. | text_files   id1_date1_form1.txt     106 |
                            2. | text_files   id2_date1_form1.txt     159 |
                               +------------------------------------------+
                          When you are satisfied that the list of files you want to process is complete, you can use runby to import each file.
                          Code:
                          program import_txt
                            local dsource = dirname
                            local fsource = filename
                            import delimited using "`dsource'/`fsource'", clear stringcols(_all) varnames(nonames)
                            generate sourcefile =`"`fsource'"'
                            generate sourcedir  =`"`dsource'"'
                          
                            keep if strpos(v1,"name:")
                            gen name = subinstr(v1,"name:","",1)
                          end
                          
                          runby import_txt, by(dirname filename) verbose
                          With runby, what's left in memory when your import_txt program terminates is considered results and is stored. These results accumulate and replace the data in memory when all the by-groups have been processed. Here's what's left in memory when the above code has run:
                          Code:
                          . list
                          
                               +-----------------------------------------------------------------------------------------+
                            1. |                                                   v1 |          sourcefile |  sourcedir |
                               | name: joey                                           | id1_date1_form1.txt | text_files |
                               |-----------------------------------------------------------------------------------------|
                               |                                                                name                     |
                               |                      joey                                                               |
                               +-----------------------------------------------------------------------------------------+
                          
                               +-----------------------------------------------------------------------------------------+
                            2. |                                                   v1 |          sourcefile |  sourcedir |
                               | name: billy                                          | id2_date1_form1.txt | text_files |
                               |-----------------------------------------------------------------------------------------|
                               |                                                                name                     |
                               |                      billy                                                              |
                               +-----------------------------------------------------------------------------------------+



                          • #14
                            Robert,

                            Thank you for the new suggestion. That is a very slick way of doing it. And fast as well. This method cruised right through the red "no observations" problem/error. It gave 4 such red error notes but kept on going until the end.

                            Regarding the 10,000 file number limit. Is that just *txt files or all files?

                            Also, is it possible to only investigate a subset of the files in the folder based on their filename? Some of my folders have 100,000 files in them so creating lots of subfolders would be costly. In the previous example the filename has id, date, form. Would it be possible to only run the program on files with form="form1"?

                            Much thanks,
                            Kyle



                            • #15
                              Robert,

                              If I only wanted form1 and form3 (from the .txt filename), what about something like:

                              Code:
                              filelist, dir("text_files")
                                foreach j in form1 form3 {
                              keep if strmatch(filename, "*`j*.txt")
                              }
                              It seems reasonable, but doesn't work.
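
                              Two likely reasons, offered as a sketch rather than a tested fix: the macro reference `j is missing its closing quote, and each pass of -keep if- drops the files that the other pass would need, so nothing is left after the second pass. Assuming the filenames follow the identifier_date_form.txt pattern, a flag variable (here called wanted, a made-up name) avoids both problems:

                              Code:
                              filelist, dir("text_files")
                              * mark a file as wanted if it matches any of the listed forms
                              generate byte wanted = 0
                              foreach j in form1 form3 {
                                  replace wanted = 1 if strmatch(filename, "*_`j'.txt")
                              }
                              keep if wanted
                              drop wanted
                              With only two forms, a single line such as keep if strmatch(filename, "*_form1.txt") | strmatch(filename, "*_form3.txt") would do the same.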
