Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Export delimited CSV / Empty rows [BUG?]

    Dear STATALIST,
    I have a problem which I can't find reason for and I couldn't find any reference to that in previous discussions.
    I have a STATA 14.1 SE version


    I have a big dataset of 593,000 rows and 18 cols.
    If I simply save the dataset it has no empty rows. [save "temp.dta"]
    If I export in csv format I have the issue described below. [export delimited "temp.csv"]

    From 351,646 to 366,451 I will have some entry behaving like in the attached table.

    Is it possible that this is a STATA bug in the export function?

    Regards,
    Valerio Giuffrida
    Click image for larger version

Name:	Capture.JPG
Views:	1
Size:	153.2 KB
ID:	1328411

    Last edited by Valerio Giuffrida; 26 Feb 2016, 07:29.

  • #2
    No I don't think it's a bug.
    (Without seeing your file) My best guess is that you have ascii codes in the string variables in these rows that create line breaks.

    These line break ascii codes are hidden characters. There are two common types: carriage returns which are ascii code 13 (commonly \r) and linefeeds which are ascii value 10 (commonly \n). [The combination of these two is a CRLF and marked as \r\n]

    You want to remove these from within your strings. To do so, you can run something like:
    Code:
    **remove the problematic ascii codes from all string vars
    ds, has(type string)
    foreach x in `r(varlist)' {
        replace `x' = subinstr(`x', `"`=char(10)'"', "", .)
        }

    or you can explicitly list the variables you think are problematic. In order to find these problematic ascii codes, you can use the SSC package -charlist- . Additionally, you can help diagnose other issues with your file by using -hexdump- (useful if my suspicions are wrong and the issue is not related to the ascii codes 10 or 13).

    Here's some example do-file code where I (1) recreate your issue in the system dataset "auto.dta" that is (2) exported to Excel/csv. (3) Then I use the code above to remove the problem ascii codes .


    Code:
    sysuse auto, clear
    export delimited using temp1.csv, replace
    replace make = make + `" `=char(10)'"' if rep78==3
    l make rep78 in 25/50 //examine the problem
    export delimited using temp2.csv, replace
    
    
    **examine
    hexdump temp2.csv , analyze res
    cap which charlist //from SSC
    if _rc ssc install charlist, replace
    charlist make
    local x `"`r(ascii)'"'
    local y =  `=regexm(`"`x'"', "(10|13)")'
    local y =  `=regexs(1)'   
    di `"`y' code exists in your data"'  //shows the problematic ascii codes in 'y'
    
    
    **remove the problematic ascii codes from all string vars
    ds, has(type string)
    foreach x in `r(varlist)' {
        replace `x' = subinstr(`x', `"`=char(10)'"', "", .)
        }
    l make rep78 in 25/50 //FIXED
    Last edited by eric_a_booth; 26 Feb 2016, 09:00.
    Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

    Comment


    • #3
      Thank you Eric for your quick reply,
      I will run this code on my files soon and I'll let you know if that is the case.
      I'm not sharing the complete file because of its dimension, but I can imagine the possibility you pointed out is realistic. If it's not the case I will share a portion of the database.

      Regards,
      Valerio

      Comment


      • #4
        Thank you very much for this discussion. Thank you eric. Your following code worked in my case for exporting stata data to csv files:
        ds, has(type string)
        foreach x in `r(varlist)' {
        replace `x' = subinstr(`x', `"`=char(10)'"', "", .)
        }
        Even if there is nothing to replace, do we lose anything if we type the above code anytime before exporting the stata data as a csv file? I actually have to deal with large datasets and will not like to check everytime whether the exported file is proper.

        Thank You!

        Comment

        Working...
        X