Double encoding file while importing csv

Brian Cepparulo

Join Date: Jun 2022

Posts: 4
#1


09 Oct 2023, 11:21

Dear all,

I am facing the following problem. The university provided me with a Windows machine specifically to work with some data. Previously I worked with the same data and wrote my Stata code on a Mac (macOS). Unfortunately, the same code and files that worked on my Mac do not work on the new machine.

In short, my code imports CSV files, runs some cleaning and merging, and then saves the files in a different location as .dta files, all inside a loop. This has always worked on my Mac. Moreover, the files are the original CSV files obtained from the provider and have only been transferred from the Mac to an external hard disk without any manipulation. If I import them individually, the problem does not arise.

Now, instead, Stata appears to encode and save each file twice, the second time using the wrong encoding, so that the resulting file cannot be read. Let me show you what I mean.

Here is a reduced form of the code that I wrote to trace the problem:

Code:

clear
foreach j of num 16/21 {
    local yr = 2000 + `j'
    local files : dir "$uni/`yr'" file "*.csv"
    foreach file of local files {
        cd "$uni/`yr'"
        import delimited "`file'", clear
        cd "$dbs/Universe/txs"
        local new : subinstr local file ".csv" "_.dta", all
        save "`new'", replace
        clear
    }
}

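When I run the code I get the following output. I have obscured (###) some names for privacy. The unexpected and problematic lines are those produced by the second import of each file: the windows-1252 encoding message, the note about binary zeros, and the saved files whose names start with "._".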

D:\#####\csv\History\Universe\2016
(encoding automatically selected: ISO-8859-1)
(25 vars, 780,167 obs)
D:\#####\Stata\dtas\Universe\txs
(file clientoutputmain_2016_01_04_2016_01_10_.dta not found)
file clientoutputmain_2016_01_04_2016_01_10_.dta saved
D:\######\csv\History\Universe\2016
(encoding automatically selected: windows-1252)
Note: 3,903 binary zeros were ignored in the source file. The first instance
occurred on line 3. Binary zeros are not valid in text data. Inspect
your data carefully.
(2 vars, 3 obs)
D:\#####\Stata\dtas\Universe\txs
(file ._clientoutputmain_2016_01_04_2016_01_10_.dta not found)
file ._clientoutputmain_2016_01_04_2016_01_10_.dta saved
D:\######\csv\History\Universe\2016
(encoding automatically selected: ISO-8859-1)
(25 vars, 882,454 obs)
D:\#####\Stata\dtas\Universe\txs
(file clientoutputmain_2016_01_11_2016_01_17_.dta not found)
file clientoutputmain_2016_01_11_2016_01_17_.dta saved
D:\#####\csv\History\Universe\2016
(encoding automatically selected: windows-1252)
Note: 3,903 binary zeros were ignored in the source file. The first instance
occurred on line 3. Binary zeros are not valid in text data. Inspect
your data carefully.
(2 vars, 3 obs)
D:\#####\Stata\dtas\Universe\txs
(file ._clientoutputmain_2016_01_11_2016_01_17_.dta not found)
file ._clientoutputmain_2016_01_11_2016_01_17_.dta saved

As you can see, the program encodes the file a second time as windows-1252 and then saves the file under the same name but with a prefix of "._". This problem persists even if I specify the encoding when importing the CSV.

I have absolutely no idea what is going on, and I could not find any resources online.

Any help is highly appreciated,

Regards,
Brian
Clyde Schechter

Join Date: Apr 2014

Posts: 30103
#2

09 Oct 2023, 12:10

Well, it is not actually making duplicate files. Notice that the first filename and the second one are not exactly the same:

file clientoutputmain_2016_01_04_2016_01_10_.dta saved
file ._clientoutputmain_2016_01_04_2016_01_10_.dta saved

Notice the initial ._ that is not present in the original filename.

Moreover, notice the message:

Note: 3,903 binary zeros were ignored in the source file. The first instance
occurred on line 3. Binary zeros are not valid in text data. Inspect
your data carefully.

This tells us that the ._clientoutputmain... file is not actually a .csv file, even though, apparently, it has a .csv filename extension.

I am not myself a Mac user, but I work with many, and I have noticed that when directories of files are transferred from a Mac to a Windows box, they are often accompanied by some system files that Mac uses in its directory system. Normally those files are not displayed in the Finder window (nor in Windows Explorer, unless you modify your settings to make them show up). I think these are what you are picking up here. To see if I have this right, open Windows Explorer to the directory you are working with. Go to the File tab and select "Change Folder and Search Options." Now select the View tab, and click "Show hidden files, folders, and drives." OK your selection and you will see these files there. For Windows, these files serve no purpose (and as you can see, they can get in the way). So delete them. I think everything will be fine then.

Added: I am running Windows 10. If you have Windows 11, the procedure for showing hidden files may be different; I don't know.
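
If deleting the hidden files is not convenient, the loop itself can be guarded against them. The following is only a minimal sketch, assuming that every unwanted companion file has a name beginning with "._" (as in the log above) and that the globals $uni and $dbs are defined as in the original code. It also uses full paths instead of cd, which is a stylistic choice and not part of the original code.

```stata
clear
foreach j of numlist 16/21 {
    local yr = 2000 + `j'
    local files : dir "$uni/`yr'" files "*.csv"
    foreach file of local files {
        // Assumption: companion files copied over from the Mac start with "._"
        if substr("`file'", 1, 2) == "._" {
            continue    // skip this file and move on to the next one
        }
        import delimited "$uni/`yr'/`file'", clear
        local new : subinstr local file ".csv" "_.dta", all
        save "$dbs/Universe/txs/`new'", replace
    }
}
```

With this check in place, only the genuine .csv files should be imported and saved, so the windows-1252 imports with the binary-zeros note should no longer appear in the log.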
Brian Cepparulo

Join Date: Jun 2022

Posts: 4
#3

11 Oct 2023, 04:38

Dear Clyde,

Thank you very much for your quick reply.

You are absolutely right: the "system files" generated by the transfer between systems were indeed the problem. The solution was also easy, as you pointed out. I should have realized that myself, as the system files did include a reference to a location on the Mac system.

Regards,