Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Error Message on Exceeding Max Width When Importing .csv

    Hello Statalist,

    This is my first post, and I am looking for some insight on an error message I receive (return code 902) when trying to import a .csv file into Stata using the import delimited command. Any help would be much appreciated.

    I am using Stata/MP 15.1 and am working with a dataset that contains 455 observations and approximately 50,000 variables. The reason the dataset is so large is because we have a very long survey with incoming data collected at multiple levels of observations (e.g. individual level, plot level, household level), and the data is currently in wide format with observations unique at the household level, so any sub-household level variables were automatically reshaped wide.

    When running the import do file a few days ago on an older batch of data, I had no issues (there were approximately 400 observations at the time). I have set both the maximum variable and memory limit. When I import, I do so from a .csv file that comes directly from the programming platform in which it was collected - SurveyCTO. The platform generates the import do file automatically, so I've written minimal code to tinker with it. Attached is that do file - line 209 is where the code is currently breaking.

    Yet, when running it on the most updated version of the dataset, I receive an error message which reads "no room to add more variables because of width. Width refers to the number of bytes required to store a single observation; it is the sum of the widths of the individual variables. The maximum width allowed is 1,048,576 bytes. You just attempted to exceed that."

    Is it possible that although I am under the variable limit of 120,000 I could still be brushing up against some dataset width limit? Is there any way to work around this? Thank you very much and please let me know if sharing any other information would be of help.

    Best,
    Sarina

    Attached Files

  • #2
    Well, in addition to the limit on the number of variables, Stata also has a limit on the total width of an observation. It appears you have exceeded it. I infer from this that the data set has a large number of variables that are being imported as strings, and fairly long ones at that. If all of your variables were numeric, even all double precision (8 bytes each), 50,000 variables would amount to 400,000 bytes per observation, well under the limit.

    To be honest, I cannot imagine how anyone proposes to analyze a data set with 50,000 variables. I suspect that many of them are series of variables that, for purposes of management and analysis in Stata, would be best rearranged into long data layout. Be that as it may, you can't accomplish any of that until you bring in the data.

    Here's what I would do. -import delimited- has a -colrange()- option that will allow you to import the data in chunks, each chunk consisting of a group of consecutive columns in the .csv file. You can then -compress- each of those chunks. Also, take a look carefully at the string variables to see if they can be shortened by the use of the -trim()- and -itrim()- functions or other string manipulations. You might also consider -encode-ing them, espeically those that contain only a small number of distinct but long values. Once you have shrunk each chunk as much as you can, you can then put them together with -merge 1:1 _n-.

    If you can't shrink the chunks enough, you might consider just omitting some of the variables. As I said, it is difficult to imagine that you need to have, or even could possibly use, 50,000 variables, so my guess is that a very large number of them are unnecessary for your purposes.

    Comment


    • #3
      Welcome to Statalist, Sarina.

      First to address Clyde's comments.

      The naive layout of the file provided by SurveyCTO causes an explosion in the number of variables in the CSV. Suppose you have 2500 variables per individual and a maximum of 20 individuals per household. Since the data is organized at the household level, you have the same 2500 variables repeated 20 times, for 50,000 variables.

      The do-file imports the CSV with the varn(nonames) option, so the SurveyCTO "variable names" in the first row of the CSV, which consequently become the first observation of each variable, cause everything to be read in as string variables, and fairly long strings at that, since a comment tells us some of the SurveyCTO variable names are 32 characters or more. The do-file uses the SurveyCTO variable names in the first observation to construct Stata variable names, then drops the first observation and converts string data to numeric as appropriate. But until that point, the dataset is indeed constructed of strings of substantial length.

      Having looked at the 17,000 line do-file, it is clear that it is crucial to the correct construction of a Stata dataset from the SurveyCTO data, so you cannot reliably use the SurveyCTO data without applying the accompanying do-file.

      I see two approaches to resolving the width problem. Both involve reducing the number of variables in the SurveyCTO output, so you're going to have to work with the people responsible for SurveyCTO output.

      One is to have SurveyCTO output the data at the individual level.

      The other, which would be my approach if possible, is to have SurveyCTO separately output the household data, 1 observation per household; the individual data, 1 observation per individual; and so forth; each CSV including the identifiers of all the higher levels within which each observation is nested.

      In either case, existing code that you may have written to use the data as it is now laid out will need to be rewritten. That's a good thing, because the layout currently provided by SurveyCTO is totally unsuitable for analysis in Stata at the individual level.
      Last edited by William Lisowski; 12 Jul 2018, 06:26.

      Comment


      • #4
        Sarina,

        according to SurveyCTO documentation it supports export to both long and wide formats:
        https://www.surveycto.com/support/release-notes/

        Perhaps it would be simpler to go back to the source and re-export the data in the long format.
        Whatever the nature of the problem is, the current layout necessitating such monstrous data and script looks difficult to work with.

        For detecting the nature of the problem it would help if you ran this code with your data file and report what it says:
        Code:
        clear all
        set more off
        
        program ltype
                        version 13
                        local 0 `"using `0'"'
                        syntax using/
                        tempname fh
                        file open `fh' using `"`using'"', read
                        file read `fh' line
                        local m=0
                        while r(eof)==0 {
                                local m1=strlen(`"`line'"')
                                if (`m1'>`m') local m=`m1'
                                file read `fh' line
                        }
                        file close `fh'
                        display "{text}Max length of line: {result:`m'}"
        end
        
        ltype "import_wave_3_household_instrument.do"
        (refer to your data file in the last command). For the example (the code file you've sent) it reports 4,249 which means some lines are long, but nowhere near the limits of Stata. So it should be possible to import it (even though it would not make any sense).

        If the value reported by this code is above the Stata's limit that you saw in the message, then you need to break your file into several blocks and then re-import each block separately (with a guaranteed fun of manually editing the 17000+ lines of code do file).

        Best, Sergiy


        Comment


        • #5
          Hello,

          Thank you to all for you responses. I definitely agree that it would ideal to work with the data disaggregated by the unit level at which it was collected. I don't think this is something SurveyCTO currently offers, but I think after this experience it would be a good idea for me to reach out to them and suggest it. Alternatively, it could have been easier if they had allowed for export by survey module.

          You were correct in that I was triggering the error message because of the fact that I had tried to import variable names as the first row of observations. The total string length of all the names was what was exceeding Stata's maximum width. When I imported with the first row denoted as variable names, there was no issue.

          Because I needed to rename some of the variables before importing, I came up with another way of renaming by looping through each variable, determining whether its length exceeded the maximum character limit, and then updating the name accordingly. I have included the code for this in the attached do file (lines 207-266).

          My only other concern is whether this same error could potentially crop up further down the line as more observations are collected and the dataset bumps up against Stata's memory limit. If that happens, is there a workaround (i.e. something similar to compress that could be done upon importing)? I only care about keeping the dataset wide for the purpose of running aggregate data checks; once we check the data then the next step is to break it into smaller, more manageable sections.

          Best,
          Sarina
          Attached Files

          Comment

          Working...
          X