
  • Unable to load csv file into Database - problem with Stata or the data?

Hi.
I have been given three csv files to analyse.

One of them opens without issues (628,178 KB).
Code used in Stata:
import delimited "filename1"

I am now trying to load the second file (2,537,900 KB) in Stata using the same code.
- It keeps loading forever until it freezes and I have to force quit (see picture, highlighted in yellow).

I tried loading the file in Excel (as I thought the problem may be that the data is all in one row):
Excel > Browse > All Files > Select the File > Open

At the bottom right it begins to load, but when the loading bar completes and Excel says 'Ready' at the bottom left, it fails to show any data at all.

    What's wrong?
    Is there a problem with the data?

    System:
    Stata SE15

    Thanks

Martin Borg - don't worry, Statalist is updating my name.
    Attached Files
    Last edited by Martin Imelda Borg; 10 Feb 2022, 09:10.

  • #2
When two robust programs that can read csv files choke, yes, there's almost certainly a problem with the data. Two reasonable guesses would be that there are problems with the end-of-line markers, or with the length of the lines. I'd suggest you start a diagnosis with Stata's -hexdump- command, as follows, and post the output here if you don't know how to interpret or use it:
    Code:
     hexdump "YourBadFile.csv", analyze results

    Comment


    • #3
      Dear Dr Lacy,

      1. I've run the hexdump code - picture attached


      2. To note, the data is stored on a remote desktop of the organisation and I am given remote access to analyse the data

3. I have contacted them to check whether the data is corrupt, as I can load the 628,178 KB file but not the 2,537,900 KB file
      Attached Files

      Comment


      • #4
        Update -

I tried to open the data through the menus rather than using code:

File > Import > Import delimited text data > Open: Selected the File >

In the preview section I can actually see the data sorted into appropriate columns and rows.
However, when I click OK, the same thing happens as in the first post (09:53 AM): it keeps loading until it freezes.

        Comment


        • #5
You will notice that the error message tells you that, in the process of importing, Stata had already occupied 8 GB of memory, and that exhausted what the computer would supply.
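(If you want to see how much memory Stata is currently allowed to take, and raise the cap where the operating system permits, something like the following would work; the 16g figure is only an illustration, not a recommendation:)
Code:
query memory          // show current memory settings
set max_memory 16g    // illustrative value; only works if the OS can supply it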

Based on the output of hexdump, I am going to guess that, because it is much longer than the next five rows, the first row has column headers/variable names/descriptions. You should either (a) start your import with the second row or (b) tell import to use your first row for variable names. It seems likely to me that import chose to import all the columns into string variables whose length was determined by the longer-than-average contents of the first row, and that led to running out of memory.
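For concreteness, here is a minimal sketch of both options, with a placeholder file name:
Code:
* (b) treat the first row as variable names
import delimited "YourBadFile.csv", varnames(1) clear
* (a) or skip the first row entirely
import delimited "YourBadFile.csv", rowrange(2) clear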

Perhaps you did that already, but you chose to use a yellow rectangle to obscure the actual command you used, or the one Stata generated. So you are asking us for help while choosing to obscure information that might help us. Do you see the asymmetry there?


          Comment


          • #6
            (Crossed with William Lisowski and partly duplicating him.)

            Your experience with the preview aside, and relying on what -hexdump- revealed, I have a few thoughts, but I hope some other people will join in, too, particularly regarding the encoding issues, which I mention below and about which I'm ignorant:

            1) The report that there are Windows line-end characters and nothing else is encouraging, as it indicates that this does appear to be a text file.

2) Your long first line (6,996 characters) suggests that there are variable names on the first line. (You could verify this by going to the Windows command line and using -type filename.csv- on this file. I suspect the file is too large to fit into most word or text processing programs, but that, too, would be worth a try as a way to inspect it.) Presuming variable names are there, simply putting the -varnames(1)- option on your -import- command might help.

3) You have many tabs as well as spaces listed among your separator characters, and few commas. On that evidence, I'd guess that this is a tab-delimited file, with commas and spaces merely appearing within text variables in the file. Using the -delimiters("\t")- option would be a good thing to try.

            4) Importing only part of each data line as an experiment would be worthwhile, just to see if that works, so you might want to try the -colrange- option, with perhaps columns 1:10 specified.

5) It might be that part of the file is ok, but there's a bad line somewhere. On that hypothesis, specifying the -rowrange- option would be worthwhile. In fact, I might try this first, and I'd iterate through, first trying, for example, the first 10 rows, then the first 1000, etc. You could even try importing the first half of the rows, and if there's a problem, then try the first one-fourth, etc., and by doing so, you might discover where the problem is.

6) Your file has a lot of extended characters, opening the possibility that you may need to specify the encoding as something other than the default Latin1 encoding. I have almost no knowledge about encodings, but others do. In this regard, can you tell us anything about the file, i.e., should it contain a lot of text material, and if so, what language/alphabet might have been used? It's also possible that extended characters are present because the file is indeed damaged or badly prepared in some way, so knowing what to expect would help.

            You can get more documentation on what I said here by checking out -help import delimited- in Stata.
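To make points 2) through 5) concrete, here is a minimal sketch combining them; the tab delimiter and the ranges are guesses that you would adjust, and the file name is a placeholder:
Code:
* a first experiment: variable names from row 1, tab delimiter,
* and deliberately modest column/row ranges
import delimited "YourBadFile.csv", varnames(1) delimiters("\t") ///
    colrange(1:10) rowrange(1:1000) clear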


So, try these things and report back. *Before you do report back,* though, please re-read the StataList FAQ (tab at the top of the screen). A point of particular interest in its advice is the suggestion not to use screenshots. Your screenshot is better than most, but they're generally quite disliked here, as they're not easy to read. When it's relevant to include output, just paste it as text and enclose it within StataList's [CODE] delimiters (the # button) to make sure a nonproportional, readable font is used to display it.
            Last edited by Mike Lacy; 10 Feb 2022, 12:45.

            Comment


            • #7
              Hi again.

              So I tried

              1. import delimited G:\mydata\mydata.csv, varnames(1)

              This came with the following error message:
op. sys. refused to provide memory

Stata's data storage memory manager has already allocated 8,960 MB and it just attempted to allocate another 32 MB. The operating system said no. Perhaps you are running another memory-consuming task and the command will work later when the task completes. Perhaps you are on a multiuser system that is especially busy and the command will work later when activity quiets down. Perhaps a system administrator has put a limit on what you can allocate. An error occurred while writing data.

              ________

2. I then tried importing via the File menu (see above):

File > Import > Import delimited text data > Open: Selected the File >

I realised that the data within the columns is full of string variables (see picture).

              _______

I suspect that because they are all string variables, Stata runs out of memory. Is there a solution for this?
              Attached Files

              Comment


              • #8
                Update:

I have now tried importing parts of the data, which worked.

Code:
import delimited G:\mydata.csv, rowrange(1:11) colrange(1:8) varnames(1)

This worked. However, the dataset is massive; am I going to have to import each section one at a time (as Dr Lacy mentioned in nos. 4 and 5)?

I have tried importing up to rowrange(1:20000), but I am not sure how many rows there are in the dataset. As mentioned, I cannot import it into Excel.

                Issues:

1. At some point I will reach Stata's capacity and the system will crash.
2. I will keep increasing my row ranges, but I don't know at what row number the data finishes, so I could miss out on data.

There must be an alternative to this?
                Last edited by Martin Imelda Borg; 14 Feb 2022, 08:45.

                Comment


                • #9
                  It seems your dataset uses the string "NULL" to denote missing values. And if the variable is otherwise numeric, it is possible that by being forced to be stored as string it is taking more space than is needed.

                  I would consider trying to use the filefilter command to translate "NULL" into "" in your dataset. That will then be considered missing and if the remaining values are numeric, the variable will be imported as numeric rather than string.
                  Code:
filefilter G:\mydata\mydata.csv G:\mydata\mydata_new.csv, from("NULL") to("")
Here is an example that shows a str4 (4-byte) string variable reduced to a single-byte numeric variable.
                  Code:
                  . type example.csv, lines(3)
                  v1,v2,v3
                  1,NULL,3
                  1,2,3
                  
                  . import delimited example.csv, varnames(1) clear
                  (encoding automatically selected: ISO-8859-2)
                  (3 vars, 2 obs)
                  
                  . describe *
                  
                  Variable      Storage   Display    Value
                      name         type    format    label      Variable label
                  ------------------------------------------------------------------------------------------------
                  v1              byte    %8.0g                 
                  v2              str4    %9s                   
                  v3              byte    %8.0g                 
                  
                  . 
                  . filefilter example.csv example_new.csv, from(NULL) to ("")
                  
                  . type example_new.csv, lines(3)
                  v1,v2,v3
                  1,,3
                  1,2,3
                  
                  . import delimited example_new.csv, varnames(1) clear
                  (encoding automatically selected: ISO-8859-2)
                  (3 vars, 2 obs)
                  
                  . describe *
                  
                  Variable      Storage   Display    Value
                      name         type    format    label      Variable label
                  ------------------------------------------------------------------------------------------------
                  v1              byte    %8.0g                 
                  v2              byte    %8.0g                 
                  v3              byte    %8.0g                 
                  
                  .

                  Comment


                  • #10
                    Good idea from William Lisowski, which appeared just as I was about to post. Here are some additional thoughts that may also be useful.

                    (What I'll suggest may not work because I don't know the results of all the mini-experiments I suggested above.)

                    The short story is that you might try to import the file in pieces and append them.

                    The -hexdump- says there are 1,233,349 lines ("rows"), so that's not a mystery. I'd guess that you will be able to import whole lines (i.e., all columns) for a reasonable number of rows. (Don't know if you tested this.) If that works, I would try a loop to import the file in pieces and save each one to a separate file. Presuming that goes OK, and assuming that you don't need *all* the variables in each line of the file, you should be able to read in each file, keep just the variables of interest, and -append- it to make a complete file of reasonable size. ("Reasonable" depends on the memory on the machine on which you're working. If you do need all the variables, there might not be a good solution.)

Code:
cd "YourFolder"
local nrow 10000 // not too big, I hope
local maxlines 1235000 // somewhat bigger than what -hexdump- says
// Import from the big file, `nrow' rows at a time, and save each piece
forval start = 1(`nrow')`maxlines' {
  local stop = `start' + `nrow' - 1
  // Assume tab-delimited; could be wrong.
  import delimited using "G:\mydata.csv", ///
     clear varnames(1) delimiters("\t") rowrange(`start':`stop')
  save "part`start'.dta"
}
//  Inspect, say, part1, part10001, ....
//
// Later, when you know which variables are of interest:
clear
local nrow 10000
local maxlines 1235000
save "allfiles.dta", emptyok  // blank file to accumulate the parts
forval start = 1(`nrow')`maxlines' {
   use "part`start'.dta", clear
   keep var1 var2 var3   // placeholders: -keep- (or -drop-) the variables you need
   append using "allfiles.dta"
   save "allfiles.dta", replace
}
                    All this being said, there might still be various anomalies in the big file, based on some things revealed by -hexdump-, such as character encoding issues.

                    Comment


                    • #11
Dr Lacy, re your experiments (post dated 10 Feb), yes, I have tried each one.

No. 2 - I have already done this; the system crashes.

No. 4, no. 5 - work; as I said, I have tried to import up to 1,000,000 rows, which worked and hasn't crashed yet.

No. 6 - Normal English language.

                      ___

I have now tried Dr Lisowski's advice; however, the system crashed as soon as I attempted the import (step 2):

Code:
filefilter G:\mydata\mydata.csv G:\mydata\mydata_new.csv, from("NULL") to("")
import delimited G:\mydata\mydata_new.csv, varnames(1) clear

I will now try Dr Lacy's method, importing using the code below, but extending the row range toward 1,233,349 to see if this works.

Code:
import delimited G:\mydata.csv, rowrange(1:11) colrange(1:8) varnames(1)

                      Comment


                      • #12
Let me be sure here: am I understanding correctly that you *can* get -import delimited- to import *all* the columns for at least *some* row range? If so, I'd suggest trying the code I suggested at #10, which has *no* column restrictions. My loop avoids the problem of the file simply being too big to store in memory, and is a good way to reveal whether there is a problem that happens at some particular line (row).


The fact that you say the file should contain only "normal English language" (i.e., "ordinary" ASCII characters) conflicts with what -hexdump- reports, so the file might well be bad. The fact that, according to -hexdump-, at least one line has only 1 character also suggests that something is wrong with the file. I suspect that eventually you're going to need to contact whoever supplied the file about possible problems. Or maybe there was a download problem? - Mike
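If you want to hunt for that short line yourself, here is a rough sketch using Stata's -file- commands; it will be slow on a file this size, and the path is the one you posted earlier:
Code:
tempname fh
local i = 0
file open `fh' using "G:\mydata\mydata.csv", read text
file read `fh' line
while r(eof) == 0 {
    local ++i
    if strlen(`"`macval(line)'"') < 2 {   // flag suspiciously short lines
        display "very short line at row `i'"
    }
    file read `fh' line
}
file close `fh'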



                        Comment


                        • #13
For anyone who refers to this thread in the future, I got around this by asking the remote platform's administrators to increase the memory available for importing the data. The good thing is that, with this thread, I managed to find out via hexdump that I didn't have a problem with my data, as well as the exact number of records in the data.
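For anyone replicating this, a quick sanity check once the import succeeds is to compare the imported observation count with -hexdump-'s line count; the figure below assumes one header line and no trailing blank line:
Code:
import delimited "G:\mydata\mydata.csv", varnames(1) clear
count
assert r(N) == 1233348   // 1,233,349 lines from -hexdump-, minus the header row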

                          Comment


                          • #14
                            Originally posted by Mike Lacy View Post
When two robust programs that can read csv files choke, yes, there's almost certainly a problem with the data. Two reasonable guesses would be that there are problems with the end-of-line markers, or with the length of the lines. I'd suggest you start a diagnosis with Stata's -hexdump- command, as follows, and post the output here if you don't know how to interpret or use it:
                            Code:
                             hexdump "YourBadFile.csv", analyze results
I just discovered hexdump here, and it blows my mind! Thank you!

                            Comment
