Unicode - analyze - does not translate all files of a folder

Alexander Koplenig

Join Date: Jul 2014

Posts: 39
#1


10 Sep 2015, 07:36

Dear Statalisters,

I am using Stata/MP2 14 for Windows. I have a folder with more than 40,000 text files.
Since those files come in different encodings, I want to use Stata's new -unicode translate- feature to convert them to UTF-8.
In doing so, I stumbled across a problem.

Here is an example:

Code:

local initial_dir `c(pwd)'

**** create a folder
capture mkdir folder
cd folder

**** create and store 22,222 datasets
qui {
    forvalues f=1/22222 {
        noisily di "`f' of 22222"
        clear
        set obs 1
        gen str string=`"test string"'
        save file`f', replace
    }
}

**** unicode analysis
clear
set more on
unicode analyze *

clear all
cd `initial_dir'

The code generates a folder with 22,222 files and then tells Stata to -unicode analyze- each file.
However, it seems that Stata only analyzes 10,000 of those files. Here is the result:

Code:

File summary (before starting):
    10000 file(s) specified
    10000 file(s) to be examined ...

So 12,222 files are not analyzed. The same happens with -unicode translate-, so 10,000 seems to be a general limit.

Can anybody reproduce this result, or does anyone have a solution? Please let me know, I would be very grateful.

Many thanks

Ali
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#2

10 Sep 2015, 08:04

unicode translate uses the Mata function dir() to get a list of files. Unfortunately, dir() has a hard-coded limit of returning at most 10,000 results. We will look into whether this limit can be increased without undesirable side effects. Meanwhile, the workaround is to use sub-folders so that each sub-folder holds fewer than 10,000 files.
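
For readers who want to automate the sub-folder workaround described above, here is a minimal sketch in Python rather than Stata, since the point is to avoid relying on a capped directory listing. The function name split_into_subfolders, the "part" prefix, and the default of 9,000 files per sub-folder are illustrative assumptions, not anything Stata prescribes. The only requirement taken from this thread is that each sub-folder stay below 10,000 files.

```python
import shutil
from pathlib import Path


def split_into_subfolders(folder, max_per_folder=9000, prefix="part"):
    """Move the files directly inside `folder` into numbered sub-folders.

    Hypothetical helper: each sub-folder (part001, part002, ...) receives at
    most `max_per_folder` files. Returns the list of sub-folder paths used.
    """
    root = Path(folder)
    # Sort so the assignment of files to sub-folders is reproducible.
    files = sorted(p for p in root.iterdir() if p.is_file())
    subfolders = []
    for start in range(0, len(files), max_per_folder):
        sub = root / f"{prefix}{start // max_per_folder + 1:03d}"
        sub.mkdir(exist_ok=True)
        # Move one chunk of at most `max_per_folder` files.
        for path in files[start:start + max_per_folder]:
            shutil.move(str(path), str(sub / path.name))
        subfolders.append(sub)
    return subfolders
```

After running, for example, split_into_subfolders("folder"), one would cd into each sub-folder in Stata and run unicode analyze * (or unicode translate *) there. Moving the files rather than copying them keeps disk usage flat, but it does change the folder layout, so try it on a copy of the data first.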
wbuchanan

Join Date: Mar 2014

Posts: 1362
#3

08 May 2018, 06:21

Hua Peng (StataCorp), do you know whether this limit has changed in the most recent release?
Robert Picard

Join Date: Mar 2014

Posts: 1536
#4

08 May 2018, 07:32

wbuchanan, please see #30 of this thread, which shows that the limit is still there. filelist (from SSC) is written in Mata. Why do you think I put so much effort into finding a workaround for someone who created 20 directories with 100,000 files in each (see #17)?

Last edited by Robert Picard; 08 May 2018, 07:35.