Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing delimited csv with special characters including double quotes

    I am trying to import a large comma-delimited csv (~1M rows). Variables are bound by double quotes because some of the variables contain commas. Additionally, some of the rows contain the two-character combination ", and occasionally this is the last sequence in the variable before the final binding double quote, so that two adjacent double-quotes appear.

    I am struggling to find a way to get Stata to ignore these " combinations without removing the binding quotes and so also inadvertently interpreting interior commas as delimiters. It may be possible in another language to do a replace-all of " in the raw csvs, but as I am not very familiar with e.g. python, I am hoping there is a complete solution in Stata.

    Here is an example mytest1.csv with six rows, including some of the problem rows:

    Code:
    "Kuzmin, S","UKR","0.0","Rakhmanin, Y","Lugansk","8","2019-07-21","14129876","Kuzmin, S","1741","20","14126630","Rakhmanin, Y","2053","0.00","-2.80","b","0","r"
    "Kuzmin, S","UKR","0.0","Medvedsky, V","Lugansk","9","2019-07-21","14129876","Kuzmin, S","1741","20","14138395","Medvedsky, V","1817","0.00","-8.00","b","0","r"
    "Vysochin, S","UKR","1.0","Drobot, S","\"Cup Independence - 2019 - \"A\" \"Open\"","1","2019-08-23","14103516","Vysochin, S","2493","10","14131129","Drobot, S","2093","1.00","0.80","b","0","r"
    "Piesik, P","POL","1.0","Kaluzny, K","Turniej Szachowy \"Ferie Zimowe 2009' - grupa A - o Puchar Burmistrza Malborka","3","2009-02-03","1136194","Piesik, P","2197","15","1147285","Kaluzny, K","1847","1.00","1.65","w","0","r"
    "Gopal, , K.n.","IND","1.0","Karthik, P","Namuduru,","1","2009-01-27","5001447","Gopal, , K.n.","2204","15","5089719","Karthik, P","1854","1.00","1.65","w","0","r"
    "Blackman, J","BAR","1.0","Wilson, A",\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
    I have played around with the quote options for import delimited, but with no success. For example, leaving the default options generates a dataset with problems in rows 3 and 4; here are the relevant variables:
    Code:
    import delimited "mytest1",  clear
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str12 v4 strL v5 str34 v6 str36 v7
    "Rakhmanin, Y" "Lugansk"                                                                                                              "8"                                    "2019-07-21"                            
    "Medvedsky, V" "Lugansk"                                                                                                              "9"                                    "2019-07-21"                            
    "Drobot, S"    `"\"Cup Independence - 2019 - \"A\" \"Open\","1","2019-08-23","14103516","Vysochin"'                                   `" S","2493","10","14131129","Drobot"' `" S","2093","1.00","0.80","b","0","r""'
    "Kaluzny, K"   `"Turniej Szachowy \"Ferie Zimowe 2009' - grupa A - o Puchar Burmistrza Malborka","3","2009-02-03","1136194","Piesik"' `" P","2197","15","1147285","Kaluzny"' `" K","1847","1.00","1.65","w","0","r""'
    "Karthik, P"   "Namuduru,"                                                                                                            "1"                                    "2009-01-27"                            
    "Wilson, A"    "\N"                                                                                                                   "\N"                                   "\N"                                    
    end
    Specifying bindquotes(nobind) fixes this issue:
    Code:
    import delimited "mytest1", bindquotes(nobind)  clear
    ...but causes Stata to interpret the extra commas in row 5 as delimiters, generating:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 v1 str3 v2 str6 v3 str5 v4 str10 v5 str8 v6 str80 v7
    `""Kuzmin"'   `" S""' `""UKR""'  `""0.0""' `""Rakhmanin"' `" Y""'      `""Lugansk""'                                                                       
    `""Kuzmin"'   `" S""' `""UKR""'  `""0.0""' `""Medvedsky"' `" V""'      `""Lugansk""'                                                                       
    `""Vysochin"' `" S""' `""UKR""'  `""1.0""' `""Drobot"'    `" S""'      `""\"Cup Independence - 2019 - \"A\" \"Open\""'                                     
    `""Piesik"'   `" P""' `""POL""'  `""1.0""' `""Kaluzny"'   `" K""'      `""Turniej Szachowy \"Ferie Zimowe 2009' - grupa A - o Puchar Burmistrza Malborka""'
    `""Gopal"'    " "     `" K.n.""' `""IND""' `""1.0""'      `""Karthik"' `" P""'                                                                             
    `""Blackman"' `" J""' `""BAR""'  `""1.0""' `""Wilson"'    `" A""'      "\N"                                                                                
    end
    I also tried importing without stripping the quotes:
    Code:
    import delimited "mytest1", stripquotes(nobind)  clear
    ...but this didn't help:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str15 v1 str5(v2 v3) str14 v4 strL v5 str34 v6
    `""Kuzmin, S""'     `""UKR""' `""0.0""' `""Rakhmanin, Y""' `""Lugansk""'                                                                                                           `""8""'                               
    `""Kuzmin, S""'     `""UKR""' `""0.0""' `""Medvedsky, V""' `""Lugansk""'                                                                                                           `""9""'                               
    `""Vysochin, S""'   `""UKR""' `""1.0""' `""Drobot, S""'    `""\"Cup Independence - 2019 - \"A\" \"Open\"","1","2019-08-23","14103516","Vysochin"'                                  `" S","2493","10","14131129","Drobot"'
    `""Piesik, P""'     `""POL""' `""1.0""' `""Kaluzny, K""'   `""Turniej Szachowy \"Ferie Zimowe 2009' - grupa A - o Puchar Burmistrza Malborka","3","2009-02-03","1136194","Piesik"' `" P","2197","15","1147285","Kaluzny"'
    `""Gopal, , K.n.""' `""IND""' `""1.0""' `""Karthik, P""'   `""Namuduru,""'                                                                                                         `""1""'                               
    `""Blackman, J""'   `""BAR""' `""1.0""' `""Wilson, A""'    "\N"                                                                                                                    "\N"                                  
    end

  • #2
    How about open the raw file first in a text editor and replace the double double quotes with single ones and then try to import in Stata. Or try to open it in excel and save as xlsx (if the number of rows are not much more than 1M) and import with import excel?

    Comment


    • #3
      Originally posted by Maria Boutchkova View Post
      How about open the raw file first in a text editor and replace the double double quotes with single ones and then try to import in Stata. Or try to open it in excel and save as xlsx (if the number of rows are not much more than 1M) and import with import excel?
      Unfortunately, this is being done at scale, with 130 csv files in total, each with 250K - 1.5M rows.

      Comment


      • #4
        Ah I see, this can be a pain. Sometimes the window interface can help (blasphemy I know), but try to import using File - Import - Text data and play with the options in the window and see if this might work.

        Comment


        • #5
          We are thinking the same! I tried this too, but unfortunately, the window interface for previewing has always immediately crashed Stata on OS X for me, regardless of the version of Stata, the version of OS X or the physical Mac machine. Seems to be an unfortunate problem for us few Mac users. So instead, I manually tried all combinations of bindquote and stripquote, but these didn't work. Perhaps there are other relevant delimited options that I should try?

          Comment


          • #6
            Not sure why an initial attempt at importing using MS Excel can't at least be tried with one file. Could it be possible to go back to the source and export with a different delimiter, such as something that wouldn't appear in the data fields?

            Comment


            • #7
              Yes, as Leonardo suggests, and you could loop in Python from inside Stata (starting with version 16) and re-export all files with a different delimiter. You don't need to be a python whiz - the internet is full of solutions.

              Comment


              • #8
                You don't need to be a Python whiz - Stata can handle this directly. What we need to do, if I understand the data correctly, is replace the sequence
                Code:
                \"
                with something that does not appear elsewhere in the data. That could be as simple as the exclamation point I use in the example below, or a longer string like DOUBLE-QUOTE. The file nonstandard.csv contains the example data from post #1.
                Code:
                filefilter nonstandard.csv fixed.csv, from(\BS\Q) to(!) replace
                import delimited fixed.csv
                list v5, clean
                replace v5 = ustrregexra(v5,`"!"',`"""')
                list v5, clean
                Code:
                . filefilter nonstandard.csv fixed.csv, from(\BS\Q) to(!) replace
                (file fixed.csv was replaced)
                
                . import delimited fixed.csv
                (encoding automatically selected: ISO-8859-9)
                (19 vars, 6 obs)
                
                . list v5, clean
                
                                                                                                  v5  
                  1.                                                                         Lugansk  
                  2.                                                                         Lugansk  
                  3.                                           !Cup Independence - 2019 - !A! !Open!  
                  4.   Turniej Szachowy !Ferie Zimowe 2009' - grupa A - o Puchar Burmistrza Malborka  
                  5.                                                                       Namuduru,  
                  6.                                                                              \N  
                
                . replace v5 = ustrregexra(v5,`"!"',`"""')
                (2 real changes made)
                
                . list v5, clean
                
                                                                                                  v5  
                  1.                                                                         Lugansk  
                  2.                                                                         Lugansk  
                  3.                                           "Cup Independence - 2019 - "A" "Open"  
                  4.   Turniej Szachowy "Ferie Zimowe 2009' - grupa A - o Puchar Burmistrza Malborka  
                  5.                                                                       Namuduru,  
                  6.                                                                              \N
                Last edited by William Lisowski; 11 Aug 2021, 07:59.

                Comment


                • #9
                  Let me add a piece of hard-earned experience.

                  The data in post #1 is an example of why CSV format should be avoided for data with even moderately complex character strings. The doublequotes surround each field to shield commas within fields. But the fields also have doublequotes, so those have to be preceded with a backslash to distinguish them from the doublequotes that surround fields, and that is not an understood standard.

                  If you have a choice, select tab-delimited output. Your data is extraordinarily unlikely to contain tab characters, so commas and quotes have no special meaning, and the fields are unambiguously separated by the tab characters. If you want a file where there are no "invisible" characters like tabs, then choose something like the "pipe" character "|" as the delimiter - that's often a good bet to not appear in your data - and then be sure that the process exporting the data doesn't snatch defeat out of the jaws of victory by choosing to apply quotes and the like.

                  I know you may not have a choice when you receive data, but when you do, or when you are preparing data for others, it can be helpful to keep this in mind.

                  Comment


                  • #10
                    Another addition in response to post #5 above, regarding the import delimited dialog box on Stata.

                    As I suggested in one of the topics linked in that post, I am a macOS user too, and I was able to trace the problem on my system to a window-switching app I run on my Mac. Before developing and testing my solution in post #8, I quit the app, and I had no problem with the import delimited dialog while I was developing post #8. I don't name the app here, because I have tried more than one such app, and they all suffer the problem.

                    See https://www.statalist.org/forums/for...ting-csv-files for more details on the problem in Stata for Mac.

                    Comment


                    • #11
                      Many thanks William & others.

                      Comment

                      Working...
                      X