Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Stata gives “too many variables specified” error when importing csv file

    I have datasets in csv format containing more than 1 million lines. For example, one of my datasets contains 1950929 lines plus the header line, and when I tried to import this file into Stata using the command: insheet using "...", Stata gives me the following error message "too many variables specified, error in line 1950930 of file, r(103)". Any idea what could be causing the problem?

    Many thanks!

  • #2
    If you're using Stata 13 you should be using the command import delimited rather than insheet. I don't think I would necessarily expect that to fix the problem, though.

    Stata appears to be saying that there's an error at the end of the file. I'd look carefully at the last line of the file. Maybe there's something weird there that Stata is trying to import as separate variables causing it to go over the limit.

    Comment


    • #3
      First, you haven't shown us the exact command you gave Stata: "..." may hide a problem that we can only wonder about.

      Second, the error message seems a bit strange, since "too many variables specified", which is error message 103, would refer to your command syntax and would not refer to any particular line in the data file. It is possible, however, that the "two many variables specified" part was issued by some command that was called (directly or indirectly) by -insheet- and that the problem arose while processing line 1950930.

      Third, -insheet- is no longer a current command in Stata; it has been superseded by -import delimited-. See if you can get your file properly read in using the latter.

      If you get the same error message with -import delimited-, since the line number being reported is that of the last line of the file, you should inspect the end of the file in a text editor or some other application to see if there is something wrong with it. For example, there might be a string variable that begins with quotes but has no close quote, so you have a huge dangling string at the end of the file. If you can't get it to work with -import delimited- and you can't see anything wrong with the contents of the file, try posting again. In doing so:

      1. Copy and paste a specimen of the data into your post, particularly some data from the end of the file.
      2. Copy and paste (do not retype) the exact command you gave and the exact response you got from Stata.
      3. Do the above in code blocks. (Click on the underlined A button. Then click on the # button. Two code-block delimters will appear. Paste your information between those.

      Comment


      • #4
        Thanks for both of your replies.

        I just put the location of the file in the double quotation marks so nothing fancy there. The problem does not seem to lie in the last line of the file - I tried to redownload the data in .txt format with tab delimiters, only this time I could actually import the data into Stata now, but I am missing a significant chunk of the data (for exmple for the dataset with 1950929 observations, only 647866 are imported). Also with the .csv format files, some of the observations have commas in their values for the string variable Product Name, but I guess this would not be an actual problem because the values are put within quotation marks so that Stata can still distinguish bewteen the delimiter commas and the value commas. I am pasting a segment of the start of the same data file in both .csv and .txt format below, hopefully this will help you get a better idea of what I am trying to cope with.

        Thanks!


        .csv format:

        "Selected Nomen","Native Nomen","Reporter","Reporter Name","Product","Product Name","Partner","Partner Name","Tariff Year","Trade Year","Trade Source","DutyType","Simple Average","Weighted Average","Standard Deviation","Minimum Rate","Maximum Rate","Nbr of Total Lines","Nbr of DomesticPeaks","Nbr of InternationalPeaks","Imports Value in 1000 USD","Binding Coverage"
        "HS","H0","356","India","010111","(-2001) - - Pure-bred breeding animals","000"," World","1990","1990","CMT","MFN","55.00","",".00", "55.00","55.00","1","0","1","",""
        "HS","H0","356","India","010111","(-2001) - - Pure-bred breeding animals","000"," World","1992","1992","CMT","MFN","60.00","",".00", "60.00","60.00","1","0","1","",""
        "HS","H0","356","India","010119","(-2001) - - Other","000"," World","1990","1990","CMT","MFN","55.00","",".00", "55.00","55.00","1","0","1","",""
        "HS","H0","356","India","010119","(-2001) - - Other","000"," World","1992","1992","CMT","MFN","60.00","",".00", "60.00","60.00","1","0","1","",""


        .txt format:

        "Selected Nomen" "Native Nomen" "Reporter" "Reporter Name" "Product" "Product Name" "Partner" "Partner Name" "Tariff Year" "Trade Year" "Trade Source" "DutyType" "Simple Average" "Weighted Average" "Standard Deviation" "Minimum Rate" "Maximum Rate" "Nbr of Total Lines" "Nbr of DomesticPeaks" "Nbr of InternationalPeaks" "Imports Value in 1000 USD" "Binding Coverage"
        "HS" "H0" "356" "India" "010111" "(-2001) - - Pure-bred breeding animals" "000" " World" "1990" "1990" "CMT" "MFN" "55.00" "" ".00" "55.00" "55.00" "1" "0" "1" "" ""
        "HS" "H0" "356" "India" "010111" "(-2001) - - Pure-bred breeding animals" "000" " World" "1992" "1992" "CMT" "MFN" "60.00" "" ".00" "60.00" "60.00" "1" "0" "1" "" ""
        "HS" "H0" "356" "India" "010119" "(-2001) - - Other" "000" " World" "1990" "1990" "CMT" "MFN" "55.00" "" ".00" "55.00" "55.00" "1" "0" "1" "" ""
        "HS" "H0" "356" "India" "010119" "(-2001) - - Other" "000" " World" "1992" "1992" "CMT" "MFN" "60.00" "" ".00" "60.00" "60.00" "1" "0" "1" "" ""
        Last edited by Xing Lan; 11 Oct 2014, 12:10.

        Comment


        • #5
          These are vexing types of problems. There is nothing obviously wrong with the excerpts of the data you have provided. And the issue is not total memory available if the remaining records look like this one. The newest clue is that when you imported the tab-delimited version, only 647866 records came in. That would suggest that there might be something anomalous in the 647866th or 647867th record that caused the importation to terminate. If you can locate that part of the text file and see what is there, you may get a lead as to what to do next.

          Another thought is to write a loop where you run the text file repeatedly through -filefilter- to remove any weird ASCII codes, and see if it loads properly after that.

          If none of this helps, you might need to take this up with Stata technical support.

          Comment


          • #6
            Hi Clyde, thank you so very much for the prompt feedback! I now think that the problem is not just with part of the data file, it somehow is very prevalent across the file to cause Stata to import incorrectly. The reason behind this is that I have a variable for years in the data, and I do know for sure that I should have the years 1988-2014, but after I tab the year variable in Stata, this is what shows up:


            Tariff Year | Freq. Percent Cum.
            ------------+-----------------------------------
            1990 | 65,203 10.06 10.06
            1992 | 66,907 10.33 20.39
            1997 | 154,708 23.88 44.27
            1999 | 164,775 25.43 69.70
            2001 | 196,205 30.28 99.99
            2004 | 31 0.00 99.99
            2005 | 36 0.01 100.00
            ------------+-----------------------------------
            Total | 647,865 100.00

            and it is clear that most of the years that I should have are missing in the imported data.

            Any idea why this could happen?

            Thanks!

            Comment


            • #7
              Before -insheet- or -import delimited- you *might* try modifying the header line so that it has legitimate Stata variable names.

              Comment


              • #8
                Hi Ben,

                How would you suggest I should modify? Is there any specific code that you would advise using?

                Thanks!

                Comment


                • #9
                  Oh, I was just figuring opening it in a text editor and doing it manually. There are ways Stata can manipulate raw text files without truly importing it and making a dataset, and if you had 100 files or 1000 variables, worth looking into. But for a single relatively small (width-wise) dataset, manually editing isn't bad, and sometimes despite delimiters/quotation marks, I've seen it do weird things when you have spaces and such in the variable names. A couple of things to be careful of is that your text editor doesn't make quotation marks into "smart quotes" (don't use Word) and that it respects the width of the lines. UltraEdit and Notepad+ are a couple of pure-text-editors that come to mind..

                  Comment

                  Working...
                  X