
  • Loop over observations to open a large file

    Dear statalists,

    I am very new to Stata, and I would like to ask for your help because I cannot create a loop that repeatedly opens a very large dataset in smaller portions and saves each portion as a separate file, so that I can work with each part separately.
    Suppose I have a large dataset file (dataset.dta) that I cannot open all at once because it is too large for my PC. I would like to:
    1) open it in the range 1/1000
    2) save the file as dataset_1.dta
    3) close it.
    4) open 1001/2000 and restart the process until the entire dataset has been saved.

    My idea was the following, but I think that it is impossible to use `i' in the range

    Code:
    forval i = 1/100 {
        use "C:\Stata\dataset.dta" in [(`i'-1)*1000+1]/(1000*`i')
        save "C:\Stata\dataset_`i'.dta"
    }
    Many thanks in advance for your help, and I am sorry if this is a silly question.

    Paolo

  • #2
    I am not sure what you are going to do with all these datasets, but I think this answers your question. Note, as explained at http://www.statalist.org/forums/help#stata, that CODE delimiters make code easier to read.

    Code:
    forval i = 1/100 { 
       local i2 = 1000 * `i' 
       local i1 = `i2' - 999 
       use "C:\Stata\dataset.dta" in `i1'/`i2'  
       save "C:\Stata\dataset_`i'.dta" 
    }
    There is a syntax for evaluating functions of locals on the fly, but I recommend something like the above while you are new to Stata.
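    For reference, the on-the-fly syntax mentioned above is `=exp' macro evaluation, which computes an expression inline where the result is substituted. A sketch of the same loop written that way (same file paths as above):

    ```stata
    forval i = 1/100 {
        // `=exp' evaluates the arithmetic inline, so no helper locals are needed
        use "C:\Stata\dataset.dta" in `=(`i'-1)*1000 + 1'/`=1000*`i''
        save "C:\Stata\dataset_`i'.dta"
    }
    ```

    The two-local version in the code above is easier to read and debug, which is why it is the better habit while learning.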



    • #3
      Thank you very much, it works perfectly! Next time I will do my best to make my posts more readable.
      Thanks again



      • #4
        Paolo Rossi given the expense of I/O operations, you might be better off finding some number of observations that can comfortably fit in memory (e.g., 500,000). Then open the file once and take a random sample of the largest size you can work with on the machine. It will take a little longer initially if the system starts to swap memory, but you will reduce the clutter on disk, and you will have a single dataset that best represents the larger file to develop your code against. If you really want to split the file into multiple "shards", I would still load the whole file once and then use export excel or export delimited to write the pieces, rather than reading the data back into memory several times, since both commands allow the if and in qualifiers.
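        A sketch of both suggestions, assuming the full file fits in memory at least once (the paths and the 1,000-observation shard size are carried over from earlier in the thread):

        ```stata
        use "C:\Stata\dataset.dta", clear        // read the source file once

        * One representative random sample to develop code against
        preserve
        sample 500000, count                     // keep a random 500,000 observations
        save "C:\Stata\dataset_sample.dta", replace
        restore

        * Write shards to disk without re-reading the .dta file;
        * export delimited accepts the in qualifier
        forval i = 1/100 {
            export delimited using "C:\Stata\dataset_`i'.csv" in `=(`i'-1)*1000+1'/`=1000*`i'', replace
        }
        ```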
