Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Subsetting a single CSV file into multiple CSV files

    Hello,

    I am trying to subset one CSV file into multiple CSV files. I wrote the codes below utilizing preserve and restore, and it's for one CSV file with 1150 records. It works.

    Code:
    import delimited "10_Exercise session.csv", clear
    scalar file_num=ceil(r(N)/100)
    
    preserve
    drop if _n>100
    export delimited "10_Exercise session1.csv", replace
    restore
    
    preserve
    drop if 100<=_n & _n<200
    export delimited "10_Exercise session2.csv", replace
    restore
    
    ...
    
    preserve
    drop if 1100<=_n 
    export delimited "10_Exercise session12.csv", replace
    restore
    However, I have other CSV files with different numbers of records (e.g., 500 records or 700 records) so the number of CSV files at the end will be different from file to file.
    I tried to use scalar for r(N) then use foreach command but failed. I don't think I have enough coding skills for this yet, so I would appreciate it if you can advise on how to write codes for the task.

    Thank you.



  • #2
    I haven't tested this code, but barring some typos or minor logic errors, this is the gist of it:

    Code:
    import delimited myfile.csv, clear
    local n_obs `r(N)'
    
    local n_subfiles = ceil(`n_obs'/100)
    
    
    preserve
    forvalues i = 1/`n_subfiles' {
        local from = (`i'-1)*100
        local to = max(`n_obs', `i'*100)
        export delimited in `from'/`to' using myfile`i'.csv, replace
        restore, preserve
    }

    Comment


    • #3
      Thank you so much! I understand the main idea and revised your codes a bit as below.

      One problem I had was saving a CSV file for the records 1-100. I wrote the code for that at the beginning then `i' starts from 2, not 1. If it start from 1, then "from" local macro starts from 0 and there is no record#0, so I get an error message.
      I also changed "max" to "min".

      Please let me know if there is a better way!

      Code:
      import delimited "C:\Users\mroie\Downloads\10_Exercise session.csv", clear
      export delimited in 1/100 using "C:\Users\mroie\Downloads\10_Exercis session0.csv", replace
      
      local n_obs `r(N)'
      local n_subfiles=ceil(`n_obs'/100)
      
      display "`n_obs'"
      display "`n_subfiles'"
      
      forvalues i=2/`n_subfiles' {
          preserve
          local from=(`i'-1)*100+1
          local to=min(`n_obs', `i'*100)
          local j=`i'-1
      
          export delimited in `from'/`to' using "C:\Users\mroie\Downloads\10_Exercis session`j'.csv", replace
          restore
      }

      Comment


      • #4
        If your CSV has variables= names in first row, you may use varnames(1) option to import the first row as variables
        Code:
          import delimited "C:\Users\mroie\Downloads\10_Exercise session.csv", clear varnames(1)
        If this is for a student exercise, you may also consider the approach of taking a random sample of 100 records

        Code:
        cd "C:\Users\mroie\Downloads\" // Jump into the working directory
        import delimited "10_Exercise session.csv", clear varnames(1)
        
        set seed 2947982 // for reproducible sampling
        forvalues i=1/10 { // Takes 10 sample
          preserve
          sample 100, count
          export delimited using "10_Exercis session`i'.csv", replace // creates CSV with variable names at top
          restore
        }
        Stata 15.1 (MP 2 core)
        https://www.epidemiology.tech/category/stata/
        Google Scholar Profile

        Comment


        • #5
          Actually, with your pointing out of the problem when i = 1, I realized that there is another problem with the code shown: observations 100, 200, ..., 1100 will all be put into two files, instead of appearing in only one. Both of them are easily fixed with a single change in the code:

          Code:
          import delimited myfile.csv, clear
          local n_obs `r(N)'
          
          local n_subfiles = ceil(`n_obs'/100)
          
          
          preserve
          forvalues i = 1/`n_subfiles' {
              local from = (`i'-1)*100 + 1
              local to = max(`n_obs', `i'*100)
              export delimited in `from'/`to' using myfile`i'.csv, replace
              restore, preserve
          }
          No need to do the first 100 separately. Sorry for the error in the original code.

          Comment


          • #6
            It's not for students, and I don't want to randomly select 100 cases for each file, so Clyde's codes work for me. Still, thank you Vivek for your time and efforts.
            I just changed max to min so that each file has 100 records (e.g., 1-100, 101-200, till the end).

            Here is the final version. Thank you all!

            Code:
            capture program drop redcapcsv
            program define redcapcsv
                import delimited "`1'.csv", clear
            
                local n_obs `r(N)'
                local n_subfiles=ceil(`n_obs'/100)
            
                display "`n_obs'"
                display "`n_subfiles'"
                
                preserve
                forvalues i=1/`n_subfiles' {
                    local from=(`i'-1)*100+1
                    local to=min(`n_obs', `i'*100)
                    export delimited in `from'/`to' using "`1'`i'.csv", replace
                    restore, preserve
                }
            end
            Thank you so much, all!

            Comment

            Working...
            X