Subsetting a single CSV file into multiple CSV files

SeungYong Han

Join Date: Jul 2015

Posts: 53
#1

Subsetting a single CSV file into multiple CSV files

30 Jun 2021, 15:14

Hello,

I am trying to subset one CSV file into multiple CSV files. I wrote the codes below utilizing preserve and restore, and it's for one CSV file with 1150 records. It works.

Code:

import delimited "10_Exercise session.csv", clear scalar file_num=ceil(r(N)/100) preserve drop if _n>100 export delimited "10_Exercise session1.csv", replace restore preserve drop if 100<=_n & _n<200 export delimited "10_Exercise session2.csv", replace restore ... preserve drop if 1100<=_n export delimited "10_Exercise session12.csv", replace restore

However, I have other CSV files with different numbers of records (e.g., 500 records or 700 records) so the number of CSV files at the end will be different from file to file.
I tried to use scalar for r(N) then use foreach command but failed. I don't think I have enough coding skills for this yet, so I would appreciate it if you can advise on how to write codes for the task.

Thank you.
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30083

30 Jun 2021, 17:07

I haven't tested this code, but barring some typos or minor logic errors, this is the gist of it:

Code:

import delimited myfile.csv, clear
local n_obs `r(N)'

local n_subfiles = ceil(`n_obs'/100)


preserve
forvalues i = 1/`n_subfiles' {
    local from = (`i'-1)*100
    local to = max(`n_obs', `i'*100)
    export delimited in `from'/`to' using myfile`i'.csv, replace
    restore, preserve
}

Comment

SeungYong Han

Join Date: Jul 2015
Posts: 53

30 Jun 2021, 21:03

Thank you so much! I understand the main idea and revised your codes a bit as below.

One problem I had was saving a CSV file for the records 1-100. I wrote the code for that at the beginning then `i' starts from 2, not 1. If it start from 1, then "from" local macro starts from 0 and there is no record#0, so I get an error message.
I also changed "max" to "min".

Please let me know if there is a better way!

Code:

import delimited "C:\Users\mroie\Downloads\10_Exercise session.csv", clear
export delimited in 1/100 using "C:\Users\mroie\Downloads\10_Exercis session0.csv", replace

local n_obs `r(N)'
local n_subfiles=ceil(`n_obs'/100)

display "`n_obs'"
display "`n_subfiles'"

forvalues i=2/`n_subfiles' {
    preserve
    local from=(`i'-1)*100+1
    local to=min(`n_obs', `i'*100)
    local j=`i'-1

    export delimited in `from'/`to' using "C:\Users\mroie\Downloads\10_Exercis session`j'.csv", replace
    restore
}

Comment

Vivek Gupta

Join Date: Sep 2016
Posts: 25

30 Jun 2021, 21:50

If your CSV has variables= names in first row, you may use varnames(1) option to import the first row as variables

Code:

  import delimited "C:\Users\mroie\Downloads\10_Exercise session.csv", clear varnames(1)

If this is for a student exercise, you may also consider the approach of taking a random sample of 100 records

Code:

cd "C:\Users\mroie\Downloads\" // Jump into the working directory
import delimited "10_Exercise session.csv", clear varnames(1)

set seed 2947982 // for reproducible sampling
forvalues i=1/10 { // Takes 10 sample
  preserve
  sample 100, count
  export delimited using "10_Exercis session`i'.csv", replace // creates CSV with variable names at top
  restore
}

Stata 15.1 (MP 2 core)
https://www.epidemiology.tech/category/stata/
Google Scholar Profile

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30083
#5

01 Jul 2021, 09:09

Actually, with your pointing out of the problem when i = 1, I realized that there is another problem with the code shown: observations 100, 200, ..., 1100 will all be put into two files, instead of appearing in only one. Both of them are easily fixed with a single change in the code:

Code:

import delimited myfile.csv, clear local n_obs `r(N)' local n_subfiles = ceil(`n_obs'/100) preserve forvalues i = 1/`n_subfiles' { local from = (`i'-1)*100 + 1 local to = max(`n_obs', `i'*100) export delimited in `from'/`to' using myfile`i'.csv, replace restore, preserve }

No need to do the first 100 separately. Sorry for the error in the original code.
1 like
Comment

SeungYong Han

Join Date: Jul 2015
Posts: 53

01 Jul 2021, 11:19

It's not for students, and I don't want to randomly select 100 cases for each file, so Clyde's codes work for me. Still, thank you Vivek for your time and efforts.
I just changed max to min so that each file has 100 records (e.g., 1-100, 101-200, till the end).

Here is the final version. Thank you all!

Code:

capture program drop redcapcsv
program define redcapcsv
    import delimited "`1'.csv", clear

    local n_obs `r(N)'
    local n_subfiles=ceil(`n_obs'/100)

    display "`n_obs'"
    display "`n_subfiles'"
    
    preserve
    forvalues i=1/`n_subfiles' {
        local from=(`i'-1)*100+1
        local to=min(`n_obs', `i'*100)
        export delimited in `from'/`to' using "`1'`i'.csv", replace
        restore, preserve
    }
end

Thank you so much, all!

Announcement