Issues with Stata export to CSV

Mansi Jain

Join Date: Apr 2019

Posts: 44
#1

Issues with Stata export to CSV

05 May 2020, 12:25

Hi all,

I'm having issues with exporting from Stata to a CSV file. Values from one column are showing up in another, cells that shouldn't be empty are showing up empty. I've never seen this kind of issue but suspect it's because I have lots of long strings in my data, including strings that have spaces in between different words. How do I do this export without these issues? I can send a snippet of my data if required but a lot of the long strings are contained in PII variables, so I wanted to see if I could solve the issue without doing that. I also apologize if there are threads on this already - I mostly saw threads on errors importing data, not exporting.

Thank you so much!
Tags: None
Daniel Shin

Join Date: Mar 2020

Posts: 146
#2

05 May 2020, 13:00

Have you used the quote option when using export delimited? Also, have you checked to see if there are any double quotes in the string data you currently have? If so, you may want to replace them with single quotes first.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

05 May 2020, 13:02

It would be helpful if you told us exactly the command you use to export your data so we don't spend time explaining something that it turns out you are already aware of. Also, exactly what you are doing to read the CSV file that displays the problems you describe?

Let me add that if you are using Excel to read the CSV file, you might be better off creating an Excel workbook directly using export Excel.
1 like
Comment
Mansi Jain

Join Date: Apr 2019

Posts: 44
#4

06 May 2020, 04:01

Hi all, thanks for your reply! I used export delimited initially without quote and without specifying any specific delimiting character (which I think makes comma the default delimiter), and then I did it with the option quote upon Daniel Shin 's suggestion. That definitely seemed to fix some issues, but not all. I then dropped some variables that I thought might be problematic because they contain things like addresses with many different kinds of characters in it, including commas. This fixed the problem for the most part, but not all the way through. Still not sure why for some observations values from different cells are getting smushed together into one cell, and don't know if there's a more rigorous and reproducible approach to this problem than manually dropping potentially problematic variables.

To answer William Lisowski's questions:
(1) I used Python to load in the CSV file. Also tried re-loading it into Stata and could see all the issues I mentioned in both.
(2) I am converting from DTA straight into CSV. I tried convert into Excel files, but the datasets are too heavy and softwares like Python are struggling to use them - hence looking to convert into CSV only.

Thank you so much again!

Last edited by Mansi Jain; 06 May 2020, 04:10.
Comment
Daniel Shin

Join Date: Mar 2020

Posts: 146
#5

06 May 2020, 06:48

What is your method of checking the exported csv file? Are you reimporting it into Stata or using Python to import to check? With the quotes option and replacing double quotes with single quotes in the source strings, i.e. substr(var,char(34),char(39),.), the exported CSV file should not have any issues when imported with comma as delimiter and double quotes as string binding parameter, as commas in the strings will not be used to parse. It will be helpful to see an example observation (PII removed of course) to identify the issue.
Comment
Bjarte Aagnes

Join Date: Apr 2014

Posts: 785
#6

06 May 2020, 07:30

Please post the result from running:

Code:

update q about which export

and a test:

Code:

clear set obs 1 gen v1 = "A" gen v2 = char(34) + 10 * "," gen v3 = "z" + char(34) tempfile commas export delimited using "`commas'" datasignature local ds_org `r(datasignature)' type "`commas'" import delimited using "`commas'" , clear varnames(1) datasignature assert "`r(datasignature)'" == "`ds_org'"

in addition to report the exact code used to export to csv.

use -chartab- (ssc install chartab) to list characters in variables you suspect is problematic.

Last edited by Bjarte Aagnes; 06 May 2020, 07:41.
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785

06 May 2020, 12:38

adding to #6: invisible control characters may be problematic, adding a example stripping off control characters.

Code:

********************************************************************************
version 14.2

clear all
set obs 1

#delim ;
gen strL v1 = 2 * char(34) + 3 * "," 
 + ustrunescape("\u000A") /* LF invisible control character */ 
 + ustrunescape("\u2029") /* add some spaces ... */
 + ustrunescape("\u0020")  + ustrunescape("\u180E")   
 + ustrunescape("\u2008")  + ustrunescape("\u3000")      
 + 3 * char(34) + ","
;
#delim cr
compress

gen v2 =  v1 + " " + v1 
gen v3 = ","
gen strL v4 = 10^5 * v2
expand 5

chartab v4 /* ssc install chartab */

foreach v of varlist * {
    
 capture confirm string var `v'
 
 if ( _rc == 0 ) {
     
     * strip of Control characters
     replace `v' = ustrregexra(`v', "\p{C}", "", . ) 
 }
 
}

compress 
datasignature
local ds_org  `r(datasignature)'
 
tempfile commas
export delimited using "`commas'"
import delimited using "`commas'" , clear varnames(1) encoding(UTF-8) 
compress
datasignature
assert "`r(datasignature)'" == "`ds_org'" 
chartab v4

********************************************************************************
version 16.1

tempfile dta
 
python :

import pandas as pd

df = pd.read_csv(r'`commas'' , sep=',' , encoding='utf-8')  

df.to_stata(r'`dta'' , version=118 , write_index=0) 

end 

use "`dta'" , clear
compress
datasignature
assert "`r(datasignature)'" == "`ds_org'" 
chartab v4
********************************************************************************

Announcement

Issues with Stata export to CSV

Comment

Comment

Comment

Comment

Comment

Comment