
  • Very Large CSV file

    Good afternoon,

    I am a new Stata user and am working with a very large .csv file (58 million observations and 24 variables). I am trying to figure out the best way to analyze these data. The statistics I am looking for from this dataset are not complicated (primarily just summary statistics). I was thinking that the best way would be to somehow import the file piece by piece into Stata, run the same code on each piece, and then aggregate the results. Any help would be greatly appreciated!

    Thanks,
    Beckie

  • #2
    Well, I'm not convinced that breaking the data into subsets and then combining the descriptive statistics is really the best way to go here. I think the time it will take you to code the re-combination of the descriptive statistics (and the risk of doing it incorrectly) will outweigh the time it will take you to just run the descriptive statistics on the whole data set. The largest dataset I currently have on my computer has 14,342,307 observations and contains 10 numeric variables. It took Stata 37 seconds to calculate descriptive statistics on all 10 variables.

    Anyway if you want to just break your data set up into, say, 6 chunks of about 10,000,000 observations each, you could do this:

    Code:
    capture program drop save_one_chunk
    program define save_one_chunk
        local chunk_num = chunk[1]
        drop chunk
        save chunk_`chunk_num', replace
        exit
    end
    
    import delimited my_big_csv_file.csv, clear
    gen int chunk = ceil(_n/10000000)
    runby save_one_chunk, by(chunk)
    To use this, you must install the -runby- command, written by Robert Picard and me, available from SSC. This code will create 6 Stata data sets, called chunk_1.dta through chunk_6.dta, each containing 10,000,000 observations (except the last, which will be a bit smaller).
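
    If you do go the chunked route, one way to aggregate afterwards is to loop over the chunk files, accumulate the observation count and sum of each variable from -summarize-, and compute the overall mean at the end. Here is a minimal sketch for a single variable (the variable name myvar is just a placeholder, and the 6 chunks match the example above); note that statistics such as medians and other percentiles cannot be recombined this simply:

    Code:
    scalar total_N = 0
    scalar total_sum = 0
    forvalues i = 1/6 {
        use chunk_`i', clear
        quietly summarize myvar
        scalar total_N = total_N + r(N)
        scalar total_sum = total_sum + r(sum)
    }
    display "overall mean of myvar = " total_sum/total_N
    Scalars are used here rather than locals so that the accumulated sums keep full double precision.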

    But frankly, for the purpose you describe, I'd just save the whole thing as a single data file and run my analyses on that. Unless you had a way to enforce parallel processing on the separate chunks, I don't see any gain from breaking it up.
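
    For that single-file workflow, a minimal sketch (the .dta filename is just illustrative) is to import the csv once, -compress- it to shrink the storage types, and save it as a Stata dataset, so the slow csv import only has to happen once; all later analyses then run from the .dta file:

    Code:
    import delimited my_big_csv_file.csv, clear
    compress
    save my_big_csv_file.dta, replace
    
    use my_big_csv_file.dta, clear
    summarize, detail
    Loading the saved .dta file is typically much faster than re-importing the csv each session.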




    • #3
      Even with Stata/IC, you can have ~2 billion observations, so I don't see a technical reason preventing Stata from analyzing the data all at once, per se. I have not tried to import such a large file into Stata, so I tested the scenario with dummy data to see whether the import facility would have issues. My code was:

      Code:
      set obs 58000000
      gen id = _n
      gen x = rnormal(0, 1)
      export delimited test.csv
      This took about 10 minutes to write out a ~1.0 GB file with the 58M observations and 2 variables. I then imported the dataset using:

      Code:
      drop _all
      import delimited id x using test.csv
      On my reasonably fast machine, this took a few minutes to read in all the data. So that shows there's no issue with importing a large file.

      Summary statistics (and other commands) would then be used as normal, though given the size of the dataset, I expect they will take some time (for example, -summ x, detail- took a minute).
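
      If you want to see how long a particular command takes on your own machine, one option is Stata's built-in -timer- facility. A minimal sketch, using the dummy variable x generated above:

      Code:
      timer clear 1
      timer on 1
      summarize x, detail
      timer off 1
      timer list 1
      The -timer list- output reports the elapsed time in seconds.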
      Last edited by Leonardo Guizzetti; 02 Apr 2018, 12:48. Reason: fixed grammar



      • #4
        Originally posted by Leonardo Guizzetti View Post
        Even with Stata/IC, you can have ~2 billion observations, so I don't see a technical reason preventing Stata from analyzing the data all at once, per se. I have not tried to import such a large file into Stata, so I tested the scenario with dummy data to see whether the import facility would have issues. My code was:

        Code:
        set obs 58000000
        gen id = _n
        gen x = rnormal(0, 1)
        export delimited test.csv
        This took about 10 minutes to write out a ~1.0 GB file with the 58M observations and 2 variables. I then imported the dataset using:

        Code:
        drop _all
        import delimited id x using test.csv
        On my reasonably fast machine, this took a few minutes to read in all the data. So that shows there's no issue with importing a large file.

        Summary statistics (and other commands) would then be used as normal, though given the size of the dataset, I expect they will take some time (for example, -summ x, detail- took a minute).
        Hi,

        I am working with large csv files, some of which are 1 to 2 GB in size. But I cannot import them into Stata in anything like such a short time; for example, with one 1.53 GB file I waited 2 hours, and in the end the process did not complete and Stata stopped responding. I would appreciate your comments on this and on why you think this is happening.

        Best,
        Shadi



        • #5
          Originally posted by Clyde Schechter View Post
          Well, I'm not convinced that breaking the data into subsets and then combining the descriptive statistics is really the best way to go here. I think the time it will take you to code the re-combination of the descriptive statistics (and the risk of doing it incorrectly) will outweigh the time it will take you to just run the descriptive statistics on the whole data set. The largest dataset I currently have on my computer has 14,342,307 observations and contains 10 numeric variables. It took Stata 37 seconds to calculate descriptive statistics on all 10 variables.

          Anyway if you want to just break your data set up into, say, 6 chunks of about 10,000,000 observations each, you could do this:

          Code:
          capture program drop save_one_chunk
          program define save_one_chunk
              local chunk_num = chunk[1]
              drop chunk
              save chunk_`chunk_num', replace
              exit
          end
          
          import delimited my_big_csv_file.csv, clear
          gen int chunk = ceil(_n/10000000)
          runby save_one_chunk, by(chunk)
          To use this, you must install the -runby- command, written by Robert Picard and me, available from SSC. This code will create 6 Stata data sets, called chunk_1.dta through chunk_6.dta, each containing 10,000,000 observations (except the last, which will be a bit smaller).

          But frankly, for the purpose you describe, I'd just save the whole thing as a single data file and run my analyses on that. Unless you had a way to enforce parallel processing on the separate chunks, I don't see any gain from breaking it up.

          Hi,

          I am working with large csv files (500 MB to 2 GB) and I have trouble importing them into Stata. I first tried the loop below, which drops only the unwanted variables and observations and saves the results in separate .dta files, but Stata keeps working and then crashes after several hours, even when the directory contains only two of the csv files (there are over 50 files in total). Here is the loop I tried for importing the files from the directory:



          Code:
          global path "E:/Comtradet1"
          cd "$path"
          local datafiles: dir "." files "*.csv"
          dir
          foreach file of local datafiles {
              import delimited "`file'", clear
              some commands
              save "$path/`file'.dta", replace
          }
          As I said, I cannot even import the data successfully. Please let me know if I have made mistakes in my script.
          To check whether the code works, I even tried this with only one file in the directory, and still the same thing happened.

          Then I tried importing a single csv file using the code below:

          Code:
          global path "E:/Comtradet"
          import delimited "E:\Comtrade\type-C_r-ALL_ps-1977_freq-A_px-S2_pub-20041207_fmt-csv_ex-20190512", clear
          cd "$path"
          some commands
          compress
          save "$path/1977.dta", replace
          This actually worked, but it is my smallest file (280 MB), and when I run the same code on another csv file (1.50 GB) Stata keeps working and then crashes after some hours. After reading the posts here, however, I found that it is not normal for this to take even one hour. So please let me know if you can think of a solution.
          Best,
          Shadi
          Last edited by Shadi Shad; 07 Sep 2021, 03:59.

