  • Maximum CSV size in Stata SE and what to do about 'fat' data

    I have a dataset with 4600 observations of 3500 variables. The .xlsx file is 90 MB and the .csv version of it is 75 MB.

    I'm currently running Stata 13 IC, which has served me fine in the past. I'm running into an issue trying to import my data, however, since the maximum number of variables in Stata IC is 2048. I also ran into the maximum import size for .xlsx files, which is 40 MB.

    I'm thinking about purchasing Stata 16 SE to allow me to work with this bigger data. I just wanted to check with people here to make sure Stata 16 SE would handle my dataset, at least in .csv format; is this correct? Would it handle that dataset in .xlsx format?

  • #2
    I think there's a version 13 solution. If I recall correctly, the -import delimited- command existed in version 13. If so, you can read the first (say) 2000 variables using the colrange() option as follows:
    Code:
    import delimited "YourFile.csv", colrange(1:2000)
    You could then drop the variables not needed, create an observation number variable using -gen obsnum = _n-, and save the file. Then you can read in the remaining variables from the original file (e.g., colrange(2001:3500)), drop unwanted variables, -gen obsnum = _n-, and merge with the previous partial data set. I have not done this, but I can't think of a reason offhand why it would not work, so I'd be curious to hear.
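
    To make that concrete, here is a rough, untested sketch of the two-pass import and merge. The file names (part1.dta) and the variable names in the -keep- lines are placeholders I made up; substitute whatever your CSV header actually contains. It assumes both passes read the rows in the same order, which they should since they read the same file.
    Code:
    * Pass 1: first 2000 columns; keep only what is needed
    import delimited "YourFile.csv", varnames(1) colrange(1:2000) clear
    * hypothetical variable names
    keep id income age
    gen long obsnum = _n
    save "part1.dta", replace

    * Pass 2: remaining columns; again keep only what is needed
    import delimited "YourFile.csv", varnames(1) colrange(2001:3500) clear
    * hypothetical variable names
    keep score1 score2
    gen long obsnum = _n

    * Combine on the observation number; every row should match
    merge 1:1 obsnum using "part1.dta"
    assert _merge == 3
    drop _merge
    As long as the two -keep- lists together stay under the 2048-variable limit, the merged data set should fit in Stata IC.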

    If you think you might need all the variables at some time or another (unlikely, I'd say), you can store all the variables into separate files of 1000 variables or so along with an id variable, and then obtain them via -merge- onto a master file as needed.
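
    A loop along these lines might do the chunking. Again, this is an untested sketch: the chunk file names are my own invention, and obsnum stands in for a real id variable if your data lack one.
    Code:
    * Split the 3500 columns into files of at most 1000 variables each
    forvalues start = 1(1000)3001 {
        local end = min(`start' + 999, 3500)
        import delimited "YourFile.csv", varnames(1) colrange(`start':`end') clear
        gen long obsnum = _n
        save "chunk_`start'.dta", replace
    }
    You could then -use- one chunk as the master file and pull individual variables from another chunk as needed with -merge 1:1 obsnum using ..., keepusing(varlist)-.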

    There isn't a maximum CSV file size per se (I don't think), but there is a limit on the number of variables, as you say. Neither _N = 4600 nor a file size of 75 MB is particularly large in the 21st century.



    • #3
      Wouldn't Stata prevent me from merging in additional variables once I got past 2048?

      And good to hear re: no max .csv size. Thanks!



      • #4
        Also, I seem to be unable to open the .csv file in Stata. It just crashes, similar to this issue: https://www.statalist.org/forums/for...ting-csv-files, but without any window managers running.



        • #5
          You said: "Wouldn't Stata prevent me from merging in additional variables once I got past 2048?"

          With what I'm suggesting, you would *not* be merging more than 2048 variables. That's the purpose of importing only a selection of variables: there would never be more than 2048 (likely far fewer) variables in the data set resident in Stata. Note that my suggestion did not involve merging in all the variables at once, but rather dropping the likely hundreds (1,000s?) of variables not immediately relevant before merging in others, or "chunking" the file into smaller subsets. (In that regard, you might experiment with a very small selection, e.g. -import delimited ..., colrange(1:5)-, and see what happens.)

          However, it *is* possible that Stata would refuse to go beyond column 2048 with the colrange() option, but that would most likely lead to an error message rather than a crash.

          The crashing you describe is not a file size issue per se, I don't think. (See -help limits- in any event.) The description at the URL you cited sounded like some kind of Mac-specific dialog box problem. I'd suggest you try to import the file as I suggest, *not using dialog boxes* but just using the command line window, and then report back *verbatim* any error messages you get. Perhaps you've already tried the command line and perhaps there was no error message you could have reported, in which case my advice would be moot.
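
          If it helps with reporting back, typing something like the following in the Command window should show which Stata you are running and the variable limit it enforces. This is just a suggestion for gathering context; I am assuming these c() values are all available in version 13.
          Code:
          * Version, edition, and license details
          about
          display c(stata_version)
          display c(flavor)
          * Maximum number of variables this edition can hold
          display c(max_k_theory)
          * Operating system Stata thinks it is running on
          display c(os)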



          • #6
            Ah okay I see, thank you.

            Yes, the crashing issue seems to be something else. There's no error message to report, though - Stata just crashes immediately when I put that code in the command line. I've also tried copying and pasting my data into the data editor manually, but Stata glitches and doesn't copy anything.



            • #7
              I still find this puzzling, and what occurs to me is that either there is an unfixed bug in Stata 13, or your version of Stata 13 is somehow defective. (If the latter is the case, I would guess that Stata tech support can give you a good copy of the software, so I'd try that angle.) If I were in your situation, I'd try some experiments, trying to import just a few observations and a few variables, e.g.
              Code:
              import delimited "YourFile.csv", varnames(1) rowrange(1:11) colrange(1:3)
              I'd also try making some "practice" csv files and see at what size of input file the problem occurs. (For what it's worth, I'm running Stata v. 15.1, but I -set maxvar 2048- and tried importing a subset from a csv file the size of yours (e.g., vars 3001:3005), and didn't have any problem.)

              Code:
              // Make a practice CSV file
              clear
              set obs 4600
              forval i = 1/3500 {   // you could work your way up to 2048
                 gen x`i' = runiform()
              }
              export delimited using "c:/temp/test.csv", delim(",") replace
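              // Try to import a subset of it.
              // I used -set maxvar 2048- here, so only a few columns are read.
              // Note: -import delimited- takes -clear-, not -replace-.
              clear
              import delimited using "c:/temp/test.csv", delim(",") colrange(3001:3005) clear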
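
              If you want to find the size at which things go wrong, a loop like the one below could step through progressively wider practice files. It is an untested sketch: the sizes and the test_*.csv file names are arbitrary choices of mine, and all sizes stay under the 2048-variable IC limit. If Stata crashes, the last message printed tells you the largest width that imported cleanly.
              Code:
              foreach k of numlist 10 100 500 1000 2000 {
                  * Build a practice data set with `k' variables
                  clear
                  set obs 4600
                  forvalues i = 1/`k' {
                      gen x`i' = runiform()
                  }
                  export delimited using "test_`k'.csv", replace
                  * Read it back in and report success
                  import delimited using "test_`k'.csv", clear
                  display "imported `k' variables OK"
              }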



              • #8
                Thanks for this Mike! I'll experiment with it a bit and see what happens.
