
  • Double encoding file while importing csv

    Dear all,

    I am facing the following problem. The university provided me with a Windows machine specifically to work with some data. Previously I worked with the same data and wrote my Stata code on a Mac (macOS). Unfortunately, the same code and files that worked on my Mac do not work on the new machine.

    In short, my code imports csv files, runs some cleaning/merging, and then saves the files in a different location as dta, all in a loop. This has always worked on my Mac. Moreover, the files are the original csv files obtained from the provider and have only been transferred from the Mac to an external hard disk, without any manipulation. If I import them individually, the problem does not arise.

    Now, instead, Stata appears to import and save each file twice, the second time using the wrong encoding, so that the resulting file cannot be read. Let me show you what I mean.

    Here is a reduced form of the code that I wrote to trace the problem:

    Code:
    clear
    * Loop over the years 2016-2021
    foreach j of numlist 16/21 {

        local yr = 2000 + `j'

        * All csv files in that year's folder
        local files : dir "$uni/`yr'" files "*.csv"

        foreach file of local files {

            cd "$uni/`yr'"
            import delimited "`file'", clear

            * Save as .dta under the same name plus a trailing underscore
            cd "$dbs/Universe/txs"
            local new : subinstr local file ".csv" "_.dta", all
            save "`new'", replace

            clear
        }
    }

    When I run the code I get the following output. I have obscured some names (###) for privacy. The unexpected and problematic lines are the second block for each file: the windows-1252 encoding, the note about binary zeros, and the save with the "._" prefix:

    D:\#####\csv\History\Universe\2016
    (encoding automatically selected: ISO-8859-1)
    (25 vars, 780,167 obs)
    D:\#####\Stata\dtas\Universe\txs
    (file clientoutputmain_2016_01_04_2016_01_10_.dta not found)
    file clientoutputmain_2016_01_04_2016_01_10_.dta saved
    D:\######\csv\History\Universe\2016
    (encoding automatically selected: windows-1252)
    Note: 3,903 binary zeros were ignored in the source file. The first instance
    occurred on line 3. Binary zeros are not valid in text data. Inspect
    your data carefully.
    (2 vars, 3 obs)
    D:\#####\Stata\dtas\Universe\txs
    (file ._clientoutputmain_2016_01_04_2016_01_10_.dta not found)
    file ._clientoutputmain_2016_01_04_2016_01_10_.dta saved

    D:\######\csv\History\Universe\2016
    (encoding automatically selected: ISO-8859-1)
    (25 vars, 882,454 obs)
    D:\#####\Stata\dtas\Universe\txs
    (file clientoutputmain_2016_01_11_2016_01_17_.dta not found)
    file clientoutputmain_2016_01_11_2016_01_17_.dta saved
    D:\#####\csv\History\Universe\2016
    (encoding automatically selected: windows-1252)
    Note: 3,903 binary zeros were ignored in the source file. The first instance
    occurred on line 3. Binary zeros are not valid in text data. Inspect
    your data carefully.
    (2 vars, 3 obs)
    D:\#####\Stata\dtas\Universe\txs
    (file ._clientoutputmain_2016_01_11_2016_01_17_.dta not found)
    file ._clientoutputmain_2016_01_11_2016_01_17_.dta saved



    As you can see, the program imports the file a second time as windows-1252 and then saves it under the same name but with the prefix "._". The problem persists even if I specify the encoding when importing the csv.

    I have absolutely no idea what is going on, and I could not find any resource online.

    Any help is highly appreciated,

    Regards,
    Brian

  • #2
    Well, it is not actually making duplicate files. Notice that the first filename and the second one are not exactly the same:
    file clientoutputmain_2016_01_04_2016_01_10_.dta saved
    file ._clientoutputmain_2016_01_04_2016_01_10_.dta saved
    Notice the initial ._ that is not present in the original filename.

    Moreover notice the error message:
    Note: 3,903 binary zeros were ignored in the source file. The first instance
    occurred on line 3. Binary zeros are not valid in text data. Inspect
    your data carefully.
    This tells us that the ._clientoutputmain... file is not actually a .csv file, even though, apparently, it has a .csv filename extension.
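
    A quick way to check which names the loop is actually picking up is to list the files that match the pattern and begin with "._". This is only a sketch: it assumes the global $uni from post #1 is defined and uses the 2016 folder shown in the log; the local names hidden and f are made up for illustration.

    ```stata
    * List files in one year's folder whose names start with "._" and end in .csv.
    * $uni is the global from post #1; 2016 is the folder shown in the log.
    local hidden : dir "$uni/2016" files "._*.csv"
    foreach f of local hidden {
        display "`f'"
    }
    ```

    If this prints one "._" name for every real csv file, the second import in each pair is reading those extra files rather than re-reading the original.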

    I am not a Mac user myself, but I work with many, and I have noticed that when directories of files are transferred from a Mac to a Windows box, they are often accompanied by some system files that the Mac uses in its directory system. Normally those files are not displayed in the Finder window (nor in Windows Explorer, unless you modify your settings to make them show up). I think these are what you are picking up here. To see if I have this right, open Windows Explorer to the directory you are working with. Go to the File tab and select "Change Folder and Search Options." Now select the View tab, and click "Show hidden files, folders, and drives." OK your selection and you will see these files there. For Windows, these files serve no purpose (and, as you can see, they can get in the way), so delete them. I think everything will be fine then.

    Added: I am running Windows 10. If you have Windows 11, the procedure for showing hidden files may be different; I don't know.
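
    Files with a "._" prefix are, as far as I know, the AppleDouble companion files that macOS writes when it copies to a disk that cannot store its extended attributes. If deleting them is not convenient (for example, because the disk is re-copied from the Mac regularly), the loop can skip them instead. A minimal sketch, assuming the same $uni and $dbs globals as in post #1; building full paths instead of calling cd is a change from the original code, not something the poster wrote.

    ```stata
    foreach j of numlist 16/21 {
        local yr = 2000 + `j'
        local files : dir "$uni/`yr'" files "*.csv"

        foreach file of local files {
            * Skip macOS companion files such as "._name.csv"
            if substr("`file'", 1, 2) == "._" {
                continue
            }

            * Full paths avoid changing the working directory twice per file
            import delimited "$uni/`yr'/`file'", clear
            local new : subinstr local file ".csv" "_.dta", all
            save "$dbs/Universe/txs/`new'", replace
        }
    }
    ```

    The check on the first two characters of the name is the only functional change; everything else mirrors the original loop.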



    • #3
      Dear Clyde,

      Thank you very much for your quick reply.

      You are absolutely right: the "system files" generated by the transfer between systems were indeed the problem. The solution was also easy, as you pointed out. I should have realized that myself, as the system files did include a reference to a location on the Mac system.

      Regards,
