  • Unicode - analyze - does not translate all files of a folder

    Dear Statalisters,

    I am using Stata/MP2 14 for Windows. I have a folder with more than 40,000 text files.
    Because those files come in different encodings, I want to use Stata's new -unicode translate- feature to convert them to UTF-8.
    While doing so, I ran into a problem.

    Here is an example:

    Code:
    local initial_dir `c(pwd)'
    
    **** create a folder
    
    capture mkdir folder
    
    cd folder
    
    **** create and store 22,222 datasets
    
    qui {
    forvalues f=1/22222 {
        noisily di "`f' of 22222"
        clear
        set obs 1
        gen str string=`"test string"'
        save file`f', replace
        }
    }
    
    **** unicode analysis
    
    clear
    set more on     
    unicode analyze *       
    
    clear all
    cd "`initial_dir'"
    The code creates a folder with 22,222 files and then tells Stata to -unicode analyze- each file.
    However, Stata seems to analyze only 10,000 of those files. Here is the result:

    Code:
      File summary (before starting):
        10000  file(s) specified
        10000  file(s) to be examined ...
    So 12,222 files are not analyzed, and the same happens with -unicode translate-. 10,000 seems to be a general limit.

    Can anybody reproduce this result, or does anyone have a solution? Please let me know, I would be very grateful.

    Many thanks

    Ali

  • #2
    unicode translate uses the Mata function dir() to get a list of files. Unfortunately, dir() has a hard-coded limit and returns at most 10,000 results. We will look into whether this limit can be increased without undesirable side effects. Meanwhile, the workaround is to split the files into sub-folders so that each sub-folder contains fewer than 10,000 files.
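
    For the example in #1, the sub-folder workaround could look roughly like the sketch below. It is not an official recipe. It assumes the current directory is the folder created in #1, that the files are named file1.dta through file22222.dta as in that example, and that a batch size of 5,000 and sub-folder names part1, part2, ... are acceptable (both are arbitrary choices made here for illustration). Because the file names are known in advance, the loop does not need a directory listing at all.

    ```stata
    * Sketch only: assumes files file1.dta ... file22222.dta from #1
    * sit in the current directory. Names part1, part2, ... and the
    * batch size of 5000 are illustrative assumptions.
    local total 22222
    local batch 5000

    * Move each file into a sub-folder holding at most `batch' files.
    forvalues f = 1/`total' {
        local s = ceil(`f'/`batch')
        capture mkdir part`s'
        copy "file`f'.dta" "part`s'/file`f'.dta", replace
        erase "file`f'.dta"
    }

    * Analyze one sub-folder at a time, staying under the 10,000 limit.
    local nparts = ceil(`total'/`batch')
    clear
    forvalues s = 1/`nparts' {
        cd part`s'
        unicode analyze *
        cd ..
    }
    ```

    With real data whose file names are not known in advance, the files would first have to be distributed across sub-folders by some other means, for example with the operating system's own tools.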


    • #3
      Hua Peng (StataCorp), do you know whether this limit has changed in the most recent release?


      • #4
        wbuchanan, please see #30 of this thread, which shows that the limit is still there. filelist (from SSC) is written in Mata. Why do you think I put so much effort into finding a workaround for someone who created 20 directories with 100,000 files in each (see #17)?
        Last edited by Robert Picard; 08 May 2018, 07:35.
