  • How to split up stata files

    Hi all, I have a dataset with 800000 observations, and I want to split it into files of 8000 observations each (~100 files). I have tried the approach below, but I don't think it is a smart way to do it. I would very much appreciate any suggestions. Thanks all!

    Code:
        preserve
        keep if _n >= 1 & _n <= 8000
        save "$name/dn_mst_1.dta", replace
        restore
        
        preserve
        keep if _n >= 8001 & _n <= 16000
        save "$name/dn_mst_2.dta", replace
        restore
        
        preserve
        keep if _n >= 16001 & _n <= 24000
        save "$name/dn_mst_3.dta", replace
        restore
    .
    .
    .
    .
    .
        preserve
        keep if _n >= 792001 & _n <= 800000
        save "$name/dn_mst_100.dta", replace
        restore
    Last edited by Tan Nguyen; 21 Feb 2023, 05:25.

  • #2
    As you have noticed, -save- doesn't allow an -if- or -in- qualifier. That makes things complicated. This might work (untested):

    Code:
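    gen long recno = _n
    save tmp, replace
    local i = 0
    while 1 {
       * -use- with a qualifier needs the -using- form
       use if recno > `i'*8000 & recno <= (`i'+1)*8000 using tmp, clear
       count
       * leave the loop once no observations remain
       if r(N) == 0 continue, break
       drop recno
       local i = `i' + 1
       save "$name/dn_mst_`i'", replace
    }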
    You may need to experiment. I don't think -preserve/restore- is going to be much faster than just rereading the file, since the -preserve- must be repeated for each output file (a plain -restore- discards the preserved copy, so you can't restore it a second time). Frames might be faster; I don't know. We have to define recno because _n in the -use- statement refers to the observation number in the workspace, not the input file.
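
    A possible variant, also untested on your data: a sketch that avoids recno altogether by giving -use- an -in- range, which selects observations by their position in the file on disk. It assumes tmp.dta is the temporary copy saved above and that the global $name holds your output folder, as in the original post. The locals chunk, N, nfiles, first and last are names I made up for illustration.

```stata
* Sketch: split tmp.dta into files of 8000 observations using -in- ranges.
* Assumes tmp.dta exists and the global $name holds the output folder.
local chunk = 8000

* -describe using- reads only the file header and leaves r(N) behind
quietly describe using tmp
local N = r(N)
local nfiles = ceil(`N'/`chunk')

forvalues i = 1/`nfiles' {
    local first = (`i' - 1)*`chunk' + 1
    local last  = min(`i'*`chunk', `N')
    * load only observations first..last from the file on disk
    use in `first'/`last' using tmp, clear
    * drop the helper variable if tmp.dta still carries it
    capture drop recno
    save "$name/dn_mst_`i'", replace
}
```

    Because the number of files is computed up front, the loop needs no exit test, and the last file simply gets whatever is left over if the total is not a multiple of 8000.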

    I expect someone will comment that there is no need to divide the file into pieces: 800,000 records isn't really a lot, and you are better off processing them all together, perhaps using the -by- prefix to get separate results for each block. I only post this code in the hope of reminding any Stata staff reading this that -save- needs the -if- qualifier.
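
    To show what the block-wise idea might look like, here is a small illustration on the auto dataset that ships with Stata (74 observations), with a block size of 20 standing in for your 8000. The names chunk and block are mine, and price is just a convenient numeric variable from that dataset; substitute your own.

```stata
* Illustration on the shipped auto data: tag consecutive blocks of
* observations and get separate results for each block with -by-.
sysuse auto, clear
local chunk = 20

* observations 1-20 get block 1, 21-40 get block 2, and so on
gen long block = ceil(_n/`chunk')

* separate summary statistics for each block
bysort block: summarize price
```

    With 74 observations and a block size of 20 this gives four blocks, the last one holding the remaining 14 observations. Nothing is written to disk; the data stay in one file.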


    • #3
      @daniel: Thank you for your response. The reason I need to split my Stata file is that I will process these files in R, and 800000 observations is very large for web scraping.
