
  • replacing values during import

    Dear All,

    given a dataset in tab-delimited format, is there any way to tell insheet/import delimited (or any other import command) to replace certain words in columns during import, before the column types are determined?

    Specifically, the tab file looks like this:

    Code:
    x    y    z
    1    alpha    
    2    alpha    NaN
    NaN    alpha    9
    3    beta    9
        banana    9
    2    banana    9


    I would like to import it specifying that the value NaN should be treated as .a (extended missing) during import, permitting the column to be imported as numeric when the column types are determined. Clearly NaN may occur in string literals, such as in "banana" in the example. I would like to preserve the distinction between the empty value and the NaN value. My desired result in Stata is:

    Code:
    x    y    z
    1    alpha    .
    2    alpha    .a
    .a    alpha    9
    3    beta    9
    .    banana    9
    2    banana    9


    where x and z are imported as numeric and y as a string variable. There are a couple of other (known) keywords denoting problems detected by the data-producing system that I would like to filter out that way. Conceptually it is similar to the ignore() option of the destring command, but for the whole file rather than one variable.

    I understand it is possible to import the column as string, replace the particular value, and then recast or destring. That route I can follow myself, but I am looking for a solution that uses a standard command.
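
    For concreteness, a minimal sketch of that fallback route for a single, known column (the file and variable names are just placeholders, and it assumes destring accepts ".a" as an extended missing code):

    Code:
    import delimited using "datafile.txt", clear stringcols(_all) varnames(1)
    replace z = ".a" if z == "NaN"    // flag NaN as extended missing
    destring z, replace               // becomes numeric only if every remaining value parses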

    If it matters, I have thousands of these files with unknown structure (which columns are string or numeric), so eyeballing them is not possible.

    If you have any advice, please share it before I go with custom code.

    Thank you, Sergiy Radyakin

  • #2
    Sergiy-- What about doing some kind of search/replace/save using fileread/filewrite on each file? I did some work with that several years ago, and it's quite fast, even with regexr() and the like. I'm not sure if this fits your specs, but here's what I'm thinking about:

    Code:
    set obs 1  // could of course be done for many file names
    gen fname = "whatever.txt"
    local tab = char(9)
    gen strL f = fileread(fname) // up to about 2e8 bytes, I think
    replace f = subinstr(f, "`tab'NaN", "`tab'.a", .)
    tempfile temp
    gen b = filewrite("`temp'", f)  // write the edited text back out
    clear
    import delimited using `temp' ...



    • #3
      It seems to me this problem may be amenable to (perhaps repeated) use of the filefilter command.
      Code:
      . type gnxl.txt
      x       y       z
      1       alpha   .
      2       alpha   NaN
      NaN     alpha   9
      3       beta    9
              banana  9
      2       banana  9
      
      . filefilter gnxl.txt gnxl2.txt, from("NaN") to(".a") replace
      (file gnxl2.txt was replaced)
      
      . type "gnxl2.txt"
      x       y       z
      1       alpha   .
      2       alpha   .a
      .a      alpha   9
      3       beta    9
              banana  9
      2       banana  9
      
      . import delimited gnxl2.txt, clear 
      (3 vars, 6 obs)
      
      . describe
      
      Contains data
        obs:             6                          
       vars:             3                          
       size:            48                          
      ------------------------------------------------------------------------------------------------
                    storage   display    value
      variable name   type    format     label      variable label
      ------------------------------------------------------------------------------------------------
      x               byte    %8.0g                 
      y               str6    %9s                   
      z               byte    %8.0g                 
      ------------------------------------------------------------------------------------------------
      Sorted by: 
           Note: Dataset has changed since last saved.
      
      . list, clean
      
              x        y    z  
        1.    1    alpha    .  
        2.    2    alpha   .a  
        3.   .a    alpha    9  
        4.    3     beta    9  
        5.    .   banana    9  
        6.    2   banana    9  
      
      .



      • #4

        Bill and I are thinking along the same lines here. I have a stylistic fondness for fileread/filewrite, and per an experiment I just tried (below), the execution times of the two approaches are the same. One can also use regular expressions with the string produced by fileread(), which can be helpful.

        Code:
        clear
        local reps 100
        // Simulate a file
        set obs 1
        set seed 4846
        gen strL s = ""
        forval i = 1/200000 {
          if (mod(`i',50000) == 0) di "char # `i'; " _continue
          qui replace s = s + char(ceil(runiform() * 255))
        }
        di""
        tempfile in out
        gen b = filewrite("`in'", s)
        ls "`in'"
        //
        di "fileread/write"
        timer clear 1
        timer on 1
        forval i = 1/`reps' {
          clear
          qui set obs 1
          qui gen strL s = fileread("`in'")
          quiet replace s = subinstr(s, "x", "y", .)
          gen b = filewrite("`out'", s, 1)
        }
        timer off 1
        //
        di "filefilter"
        timer clear 2
        timer on 2
        forval i = 1/`reps'  {
          quiet filefilter "`in'" "`out'", from("x") to("y") replace
        }
        timer off 2
        timer list
        //



        • #5
          Dear Mike, William,

          thank you for suggesting the filefilter approach.

          I ruled it out early on, since I only want this replacement done in numeric columns (that is, in columns that would conform to the numeric requirements once such a replacement is made).

          One of the substitutions I want to make is "Infinity" ---> .i
          (if you care, these are common in data output from the .Net Framework: https://docs.microsoft.com/en-us/dotnet/api/system.globalization.numberformatinfo.positiveinfinitysymbol?view=netframework-4.7.2)

          But I don't want to accidentally replace the car model Infinity in the make column (e.g. if it gets misspelled in the data as Infinity; the correct spelling of the brand is Infiniti, with an i, which is not the common spelling).

          Both suggested code fragments are ignorant of the column types and would decide on a cell-by-cell basis, which, unfortunately, will not deliver the desired result (even though it would work for the example dataset I posted in the first message).

          To refine, imagine this input data:
          Code:
          x    y    z    make
          1    alpha        Elantra
          2    alpha    NaN    Corolla
          NaN    alpha    9    Infinity
          3    beta    9    Mustang
              banana    9    
          2    banana    9    Magnum
          1    gamma    Infinity    Deville
          I'd like the Infinity in the last column to be retained as-is since that column will not become numeric after a substitution, due to other strings shown there.

          Thank you, Sergiy



          • #6
            Mike Lacy - The code in post #2 is slick; I will keep that approach in mind. The filefilter approach requires multiple passes of the file to achieve multiple replacements; your approach allows multiple replacements in a single pass.

            However, the subinstr() in post #2 does not catch the NaN in variable x, because no tab precedes it. Adding a second subinstr() with the tabs moved to follow the NaN does the trick, though that won't work on a file with a single variable. I think a suitably complicated regular expression replacement would handle the general case.

            On the other hand, my filefilter in post #3 would replace baNaNa with ba.aa.
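
            For example, a whole-cell regular-expression replacement along these lines (a sketch only, untested; the file names are placeholders and it needs Stata 14 or later for ustrregexra()) would catch NaN in any column, including the first, while leaving substrings such as baNaNa untouched:

            Code:
            clear
            set obs 1
            gen strL f = fileread("whatever.txt")
            * replace NaN only when bounded by start/end of file, tabs, or line breaks
            replace f = ustrregexra(f, "(^|[\t\r\n])NaN(?=[\t\r\n]|$)", "$1.a")
            gen b = filewrite("whatever2.txt", f, 1)

            It still decides cell by cell rather than column by column, so on its own it does not address the concern raised in post #5.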

            Sergiy Radyakin - You probably should take Mike's and my suggestions as jumping-off points rather than a complete solution.



            • #7
              Post #5 crossed with my post #6.

              Reading the documentation for import delimited, it is clear there is no option to do what was requested in post #1.

              Post #1 was oversimplified: it did not include the requirement that the substitution apply only to those columns that turn out to be numeric once the substitutions are applied, and which cannot be known to be numeric in advance of making the substitutions. From the Statalist FAQ linked at the top of the page,

              10. How should I write questions?

              Do write carefully; be precise and include all relevant detail.
              So the solution is to import everything as string and then, for each variable: make a copy; make the substitutions in the copy; destring the copy, creating a third variable; if that third variable is numeric, use it, otherwise use the original; delete the two unneeded variables, rename as needed, and repeat on the next variable.
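
              A minimal sketch of that procedure (the file name and the particular substitutions are placeholders, and it assumes destring maps ".a"-style codes to extended missing values):

              Code:
              import delimited using "datafile.txt", clear stringcols(_all) varnames(1)
              foreach v of varlist * {
                  gen __copy = `v'
                  replace __copy = ".a" if __copy == "NaN"
                  replace __copy = ".i" if __copy == "Infinity"
                  capture destring __copy, generate(__num)   // __num is created only if everything parses
                  capture confirm numeric variable __num
                  if !_rc {                                   // substitutions made the column numeric
                      drop `v' __copy
                      rename __num `v'
                  }
                  else {                                      // otherwise keep the original string column
                      capture drop __num
                      drop __copy
                  }
              }
              Empty cells should come through as ordinary missing (.) while NaN becomes .a, preserving the distinction asked for in post #1.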
              Last edited by William Lisowski; 02 Apr 2019, 20:21.



              • #8
                Dear William,

                thank you for the advice.

                Perhaps I will add this to the wishlist for future versions of Stata, as it seems to be a fairly common, useful, and simple-to-implement addition.

                Regards, Sergiy



                • #9
                  Dear Sergiy,

                  My understanding of your problem is that you have a lot of text files; they contain a mixture of string and numeric columns but the numeric columns contain strings such as "NaN" or "Infinity". You want Stata to import the numeric columns to begin with rather than going through a series of -replace- and/or -destring-, setting "NaN" etc. to missings.

                  My suggestion is that you consider using the -rowrange()- option of -import delimited- to read a small portion of the data (very quickly), identify the variable indices that are numeric and then pass these to the -numericcols()- option of -import delimited- for the whole file. This should be fast. The code below illustrates the principle:

                  Code:
                  
                  clear  all
                  version 14.2
                  set obs 100
                  
                  //make some data
                  forv i = 1/10{
                      g var`i'_num = string(_n)
                      replace var`i'_num = "NaN" in 95/99
                      replace var`i'_num = "Infinity" in 100
                      g var`i'_str = word("`c(alpha)'", `i')
                  }
                  g rand_sort = runiform()
                  sort rand_sort
                  drop rand_sort
                  export delimited "testimport.txt", replace
                  
                  //fast import to identify indices of numeric vars using -rowrange()- to read top of data
                  import delimited using "testimport.txt", ///
                      clear numericc(_all) asfloat varn(1) case(lower) rowrange(1:10)
                  
                  //generate a local macro of variables that contain numbers
                  foreach var of varlist *{
                      cap assert `var' == .
                      if _rc loc numericvars `numericvars' `var'
                  }
                  di "`numericvars'"
                  
                  //get positions using mata's st_varindex()
                  mata: st_local("numlist", invtokens(strofreal((st_varindex(tokens("`numericvars'"))))))
                  di "`numlist'"
                  
                  //import whole file specifying numeric columns
                  import delimited using "testimport.txt", ///
                      clear numericc(`numlist') asfloat varn(1) case(lower)
                  
                  //verify result
                  foreach var of varlist *{
                      if regexm("`var'", "num") confirm numeric variable `var'
                      else confirm string variable `var'
                  }
                  describe
                  erase "testimport.txt"
                  There are a couple of limitations:
                  • this approach doesn't allow for your desire to use extended missings. Strings are simply set to missing.
                  • If a numeric variable only contains numbers and "NaN"s, then Stata doesn't convert them to missings but to a special value "1.#QNAN". I don't know how to reference or change these values. I wonder if they are a bug related to the underlying C. When tabulated, they are not recognised as the same value but appear separately. Does anyone know about this?
                  As I say, this is likely not a suitable solution, but I thought it worth mentioning. I'm also curious about the "1.#QNAN" values.

                  Regards,
                  Raza

