Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Issues with Stata export to CSV

    Hi all,

    I'm having issues with exporting from Stata to a CSV file. Values from one column are showing up in another, cells that shouldn't be empty are showing up empty. I've never seen this kind of issue but suspect it's because I have lots of long strings in my data, including strings that have spaces in between different words. How do I do this export without these issues? I can send a snippet of my data if required but a lot of the long strings are contained in PII variables, so I wanted to see if I could solve the issue without doing that. I also apologize if there are threads on this already - I mostly saw threads on errors importing data, not exporting.

    Thank you so much!

  • #2
    Have you used the quote option when using export delimited? Also, have you checked to see if there are any double quotes in the string data you currently have? If so, you may want to replace them with single quotes first.

    Comment


    • #3
      It would be helpful if you told us exactly the command you use to export your data so we don't spend time explaining something that it turns out you are already aware of. Also, exactly what you are doing to read the CSV file that displays the problems you describe?

      Let me add that if you are using Excel to read the CSV file, you might be better off creating an Excel workbook directly using export Excel.

      Comment


      • #4
        Hi all, thanks for your reply! I used export delimited initially without quote and without specifying any specific delimiting character (which I think makes comma the default delimiter), and then I did it with the option quote upon Daniel Shin 's suggestion. That definitely seemed to fix some issues, but not all. I then dropped some variables that I thought might be problematic because they contain things like addresses with many different kinds of characters in it, including commas. This fixed the problem for the most part, but not all the way through. Still not sure why for some observations values from different cells are getting smushed together into one cell, and don't know if there's a more rigorous and reproducible approach to this problem than manually dropping potentially problematic variables.

        To answer William Lisowski's questions:
        (1) I used Python to load in the CSV file. Also tried re-loading it into Stata and could see all the issues I mentioned in both.
        (2) I am converting from DTA straight into CSV. I tried convert into Excel files, but the datasets are too heavy and softwares like Python are struggling to use them - hence looking to convert into CSV only.

        Thank you so much again!
        Last edited by Mansi Jain; 06 May 2020, 04:10.

        Comment


        • #5
          What is your method of checking the exported csv file? Are you reimporting it into Stata or using Python to import to check? With the quotes option and replacing double quotes with single quotes in the source strings, i.e. substr(var,char(34),char(39),.), the exported CSV file should not have any issues when imported with comma as delimiter and double quotes as string binding parameter, as commas in the strings will not be used to parse. It will be helpful to see an example observation (PII removed of course) to identify the issue.

          Comment


          • #6
            Please post the result from running:
            Code:
            update q
            about
            which export
            and a ​​​test:
            Code:
            clear
            set obs 1
            gen v1 = "A"
            gen v2 = char(34) + 10 * ","
            gen v3 = "z" + char(34)
            
            tempfile commas
            export delimited using "`commas'"
            datasignature
            local ds_org  `r(datasignature)'
            type "`commas'"
            import delimited using "`commas'" , clear varnames(1)
            datasignature
            assert "`r(datasignature)'" == "`ds_org'"
            in addition to report the exact code used to export to csv.

            use -chartab- (ssc install chartab) to list characters in variables you suspect is problematic.
            Last edited by Bjarte Aagnes; 06 May 2020, 07:41.

            Comment


            • #7
              adding to #6: invisible control characters may be problematic, adding a example stripping off control characters.
              Code:
              ********************************************************************************
              version 14.2
              
              clear all
              set obs 1
              
              #delim ;
              gen strL v1 = 2 * char(34) + 3 * "," 
               + ustrunescape("\u000A") /* LF invisible control character */ 
               + ustrunescape("\u2029") /* add some spaces ... */
               + ustrunescape("\u0020")  + ustrunescape("\u180E")   
               + ustrunescape("\u2008")  + ustrunescape("\u3000")      
               + 3 * char(34) + ","
              ;
              #delim cr
              compress
              
              gen v2 =  v1 + " " + v1 
              gen v3 = ","
              gen strL v4 = 10^5 * v2
              expand 5
              
              chartab v4 /* ssc install chartab */
              
              foreach v of varlist * {
                  
               capture confirm string var `v'
               
               if ( _rc == 0 ) {
                   
                   * strip of Control characters
                   replace `v' = ustrregexra(`v', "\p{C}", "", . ) 
               }
               
              }
              
              compress 
              datasignature
              local ds_org  `r(datasignature)'
               
              tempfile commas
              export delimited using "`commas'"
              import delimited using "`commas'" , clear varnames(1) encoding(UTF-8) 
              compress
              datasignature
              assert "`r(datasignature)'" == "`ds_org'" 
              chartab v4
              
              ********************************************************************************
              version 16.1
              
              tempfile dta
               
              python :
              
              import pandas as pd
              
              df = pd.read_csv(r'`commas'' , sep=',' , encoding='utf-8')  
              
              df.to_stata(r'`dta'' , version=118 , write_index=0) 
              
              end 
              
              use "`dta'" , clear
              compress
              datasignature
              assert "`r(datasignature)'" == "`ds_org'" 
              chartab v4
              ********************************************************************************

              Comment

              Working...
              X