
  • Import delimited - not all observations imported

    I have a .txt file with 18 variables and 1,368,958 observations/lines with piping as a delimiter (|,|). Using

    import delimited "file.txt", delimiter("|,", asstring) bindquote(strict) maxquotedrows(unlimited)

    I can only load 648,496 observations - size 901.11M, memory 993M. I double-checked the .txt file and there is nothing unusual at line 648,496.

    Any advice?

    Thank you!

  • #2
    What happens if you import the second part using the additional option -rowrange(648497)-? I'm not aware of any limit around that size, but this lets you see whether the rest of the file can be imported. In the worst case, you import the file in two parts and combine them afterwards.

    • #3
      Are you certain you don't have an unmatched quote somewhere? Check the last row of data after importing into Stata (-list in `=_N'-). One of the variables may contain the remaining ~700000 observations within itself as a strL due to an unmatched quote, since you specified strict in the bindquote option and unlimited in the maxquotedrows option.

      • #4
        Using

        import delimited "file.txt", delimiter("|,", asstring) bindquote(strict) maxquotedrows(unlimited) rowrange(648497)

        drops it to 351,005 observations

        • #5
          Originally posted by Ali Atia
          Are you certain you don't have an unmatched quote somewhere? Check the last row of data after importing into Stata (-list in `=_N'-). One of the variables may contain the remaining ~700000 observations within itself as a strL due to an unmatched quote, since you specified strict in the bindquote option and unlimited in the maxquotedrows option.
          My last two observations in the Data Viewer (648,495 and 648,496) are the last two observations in my .txt file (lines 1,368,957 and 1,368,958).

          The first observation in each is identical.
          Last edited by Tom Groll; 20 Jul 2021, 21:01.

          • #6
            My guess is that there is a stray unmatched quote somewhere between the first and last observation, and a stray extra quote on some later line that unintentionally closes it. See this example text file:

            Code:
            var1|,var2|,var3
            1|,2|,"a"
            1|,2|,"b"
            1|,2|,c"
            1|,2|,"c"
            1|,2|,"c"
            1|,2|,"c"
            1|,2|,"c"
            1|,2|,"c""
            1|,2|,"c"
            1|,2|,"d"
            Note data rows 3 and 8 (lines 4 and 9 of the file, counting the header). When imported using your code, this is the result:

            Code:
            . list
            
                 +---------------------------------------------------------------------+
                 | var1   var2                                                    var3 |
                 |---------------------------------------------------------------------|
              1. |    1      2                                                       a |
              2. |    1      2                                                       b |
              3. |    1      2   c" 1|,2|,"c" 1|,2|,"c" 1|,2|,"c" 1|,2|,"c" 1|,2|,"c"" |
              4. |    1      2                                                       c |
              5. |    1      2                                                       d |
                 +---------------------------------------------------------------------+
            One approach to solve this is to search, inside the imported Stata dataset, for a disproportionately large single observation. You can use a function like strlen() to generate a string length variable, and sort by it to identify long strings.
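
            A minimal sketch of that search, assuming the imported data are already in memory. The variable name maxlen is made up for illustration; it holds the longest string value in each observation, so an observation that swallowed many rows should sort to the top:

            ```stata
            * Sketch only: flag observations with an unusually long string value.
            * maxlen is a hypothetical helper variable, not part of the original data.
            ds, has(type string)
            local strvars `r(varlist)'
            generate long maxlen = 0
            foreach v of local strvars {
                * keep the longest string length seen so far in each observation
                replace maxlen = max(maxlen, strlen(`v'))
            }
            gsort -maxlen
            * list only the lengths, so a huge strL is not printed to the Results window
            list maxlen in 1/5
            ```

            If one observation's maxlen is orders of magnitude larger than the rest, that is most likely where the unmatched quote sits.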

            An alternative starting point is to try decreasing the number of rows in the maxquotedrows option. That should help confirm whether my diagnosis of the issue is correct.
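
            As a rough sketch of that check, here is the original command with a finite cap in place of unlimited. The value 20 is just an example (as far as I recall, it is also the documented default):

            ```stata
            * Sketch only: re-import with a small cap on rows allowed inside one quoted string.
            import delimited "file.txt", delimiter("|,", asstring) bindquote(strict) maxquotedrows(20) clear
            * compare this count with the 648,496 observations from the unlimited import
            count
            ```

            If the observation count changes noticeably compared with the unlimited import, that points to quote binding as the reason rows went missing.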
            Last edited by Ali Atia; 20 Jul 2021, 21:18.

            • #7
              Got it - there were various stray " characters in the .txt file! After replacing them, I got all the data.
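
              For anyone who would rather not edit the raw file, a possible alternative (not what I did, so treat it as an untested sketch) is to turn off quote binding with bindquote(nobind) and strip the quote characters afterwards. This assumes no field legitimately contains the delimiter or a line break inside quotes:

              ```stata
              * Sketch only: import without treating " as a binding character.
              import delimited "file.txt", delimiter("|,", asstring) bindquote(nobind) clear
              * remove the literal double quotes (char(34)) left in the string variables
              ds, has(type string)
              foreach v of varlist `r(varlist)' {
                  replace `v' = subinstr(`v', char(34), "", .)
              }
              count
              ```

              Variables that should be numeric but picked up quote characters will come in as strings and may need -destring- afterwards.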

              Thank you!
