
  • replacing values during import

    Dear All,

    given a dataset in tab-delimited format, is there any way to tell insheet/import delimited (or any other import command) to replace certain words in columns during import, before the column types are determined?

    Specifically, the tab file looks like this:

    Code:
    x    y    z
    1    alpha    
    2    alpha    NaN
    NaN    alpha    9
    3    beta    9
        banana    9
    2    banana    9


    I would like to import it specifying that the value NaN should be treated as .a (extended missing) during import, permitting the column to be imported as numeric when the column types are determined. Clearly NaN may occur in string literals, such as in "banana" in the example. I would like to preserve the distinction between the empty value and the NaN value. My desired result in Stata is:

    Code:
    x    y    z
    1    alpha    .
    2    alpha    .a
    .a    alpha    9
    3    beta    9
    .    banana    9
    2    banana    9


    where x and z are imported as numeric and y as a string variable. There are a couple of other (known) keywords denoting problems detected by the data-producing system that I would like to filter out that way. Conceptually it is similar to the ignore() option of the destring command, but for the whole file rather than one variable.

    I understand it is possible to import the column as string, replace the particular value, and then recast or destring. That route I can follow myself, but I am looking for a solution that uses a standard command.
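
    For concreteness, a minimal sketch of that fallback route for a single, known column (the file and variable names are just placeholders, and it assumes destring accepts ".a" as an extended missing code):

    Code:
    import delimited using "datafile.txt", clear stringcols(_all) varnames(1)
    replace z = ".a" if z == "NaN"    // flag NaN as extended missing
    destring z, replace               // becomes numeric only if every remaining value parses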

    If it matters, I have thousands of these files with unknown structure (which columns are string or numeric), so eyeballing them is not possible.

    If you have any advice, please share it before I go with custom code.

    Thank you, Sergiy Radyakin

  • #2
    Sergiy-- What about doing some kind of search/replace/save using fileread/filewrite on each file? I did some work with that several years ago, and it's quite fast, even with regexr() and the like. I'm not sure if this fits your specs, but here's what I'm thinking about:

    Code:
    set obs 1  // could of course be done for many file names
    gen fname = "whatever.txt"
    local tab = char(9)
    gen strL f = fileread(fname) // up to about 2e8 bytes, I think
    replace f = subinstr(f, "`tab'NaN", "`tab'.a", .)
    tempfile temp
    gen b = filewrite("`temp'", f)  // write the edited text back out
    clear
    import delimited using `temp' ...



    • #3
      It seems to me this problem may be amenable to (perhaps repeated) use of the filefilter command.
      Code:
      . type gnxl.txt
      x       y       z
      1       alpha   .
      2       alpha   NaN
      NaN     alpha   9
      3       beta    9
              banana  9
      2       banana  9
      
      . filefilter gnxl.txt gnxl2.txt, from("NaN") to(".a") replace
      (file gnxl2.txt was replaced)
      
      . type "gnxl2.txt"
      x       y       z
      1       alpha   .
      2       alpha   .a
      .a      alpha   9
      3       beta    9
              banana  9
      2       banana  9
      
      . import delimited gnxl2.txt, clear 
      (3 vars, 6 obs)
      
      . describe
      
      Contains data
        obs:             6                          
       vars:             3                          
       size:            48                          
      ------------------------------------------------------------------------------------------------
                    storage   display    value
      variable name   type    format     label      variable label
      ------------------------------------------------------------------------------------------------
      x               byte    %8.0g                 
      y               str6    %9s                   
      z               byte    %8.0g                 
      ------------------------------------------------------------------------------------------------
      Sorted by: 
           Note: Dataset has changed since last saved.
      
      . list, clean
      
              x        y    z  
        1.    1    alpha    .  
        2.    2    alpha   .a  
        3.   .a    alpha    9  
        4.    3     beta    9  
        5.    .   banana    9  
        6.    2   banana    9  
      
      .



      • #4

        Bill and I are thinking along the same lines here. I have a stylistic fondness for fileread/filewrite, and per an experiment I just tried (below), the execution times of the two approaches are the same. One can also use regular expressions with the string produced by fileread(), which can be helpful.

        Code:
        clear
        local reps 100
        // Simulate a file
        set obs 1
        set seed 4846
        gen strL s = ""
        forval i = 1/200000 {
          if (mod(`i',50000) == 0) di "char # `i'; " _continue
          qui replace s = s + char(ceil(runiform() * 255))
        }
        di""
        tempfile in out
        gen b = filewrite("`in'", s)
        ls "`in'"
        //
        di "fileread/write"
        timer clear 1
        timer on 1
        forval i = 1/`reps' {
          clear
          qui set obs 1
          qui gen strL s = fileread("`in'")
          quiet replace s = subinstr(s, "x", "y", .)
          gen b = filewrite("`out'", s, 1)
        }
        timer off 1
        //
        di "filefilter"
        timer clear 2
        timer on 2
        forval i = 1/`reps'  {
          quiet filefilter "`in'" "`out'", from("x") to("y") replace
        }
        timer off 2
        timer list
        //



        • #5
          Dear Mike, William,

          thank you for suggesting the filefilter approach.

          I ruled it out early on, since I only want this replacement done in numeric columns (that is, in columns that would conform to the numeric requirements once such a replacement is made).

          One of the substitutions I want to make is "Infinity" ---> .i
          (if you care, these are common in data output from the .Net Framework: https://docs.microsoft.com/en-us/dotnet/api/system.globalization.numberformatinfo.positiveinfinitysymbol?view=netframework-4.7.2)

          But I don't want to accidentally replace the car model Infinity in the make column (e.g. if it gets misspelled in the data as Infinity; the correct spelling of the brand is Infiniti, with an i, which is not the common spelling).

          Both suggested code fragments are ignorant of the column types and would decide on a cell-by-cell basis, which, unfortunately, will not deliver the desired result (even though it would work for the example dataset I posted in the first message).

          To refine, imagine this input data:
          Code:
          x    y    z    make
          1    alpha        Elantra
          2    alpha    NaN    Corolla
          NaN    alpha    9    Infinity
          3    beta    9    Mustang
              banana    9    
          2    banana    9    Magnum
          1    gamma    Infinity    Deville
          I'd like the Infinity in the last column to be retained as-is since that column will not become numeric after a substitution, due to other strings shown there.

          Thank you, Sergiy



          • #6
            Mike Lacy - The code in post #2 is slick; I will keep that approach in mind. The filefilter approach requires multiple passes of the file to achieve multiple replacements; your approach allows multiple replacements in a single pass.

            However, the subinstr() in post #2 does not catch the NaN in variable x, because no tab precedes it. Adding a second subinstr() with the tabs moved to follow the NaN does the trick, though that won't work on a file with a single variable. I think a suitably complicated regular expression replacement would handle the general case.

            On the other hand, my filefilter in post #3 would replace baNaNa with ba.aa.
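
            For example, a whole-cell regular-expression replacement along these lines (a sketch only, untested; the file names are placeholders and it needs Stata 14 or later for ustrregexra()) would catch NaN in any column, including the first, while leaving substrings such as baNaNa untouched:

            Code:
            clear
            set obs 1
            gen strL f = fileread("whatever.txt")
            * replace NaN only when bounded by start/end of file, tabs, or line breaks
            replace f = ustrregexra(f, "(^|[\t\r\n])NaN(?=[\t\r\n]|$)", "$1.a")
            gen b = filewrite("whatever2.txt", f, 1)

            It still decides cell by cell rather than column by column, so on its own it does not address the concern raised in post #5.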

            Sergiy Radyakin - You probably should take Mike's and my suggestions as jumping-off points rather than a complete solution.



            • #7
              Post #5 crossed with my post #6.

              Reading the documentation for import delimited, it is clear there is no option to do what was requested in post #1.

              Post #1 was oversimplified: it did not include the requirement that the substitution apply only to those columns that turn out to be numeric once the substitutions are applied, and which cannot be known to be numeric in advance of making the substitutions. From the Statalist FAQ linked at the top of the page,

              10. How should I write questions?

              Do write carefully; be precise and include all relevant detail.
              So the solution is to import everything as string and then, for each variable: make a copy; make the substitutions in the copy; destring the copy, creating a third variable; if that third variable is numeric, use it, otherwise use the original; delete the two unneeded variables, rename as needed, and repeat on the next variable.
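
              A minimal sketch of that procedure (the file name and the particular substitutions are placeholders, and it assumes destring maps ".a"-style codes to extended missing values):

              Code:
              import delimited using "datafile.txt", clear stringcols(_all) varnames(1)
              foreach v of varlist * {
                  gen __copy = `v'
                  replace __copy = ".a" if __copy == "NaN"
                  replace __copy = ".i" if __copy == "Infinity"
                  capture destring __copy, generate(__num)   // __num is created only if everything parses
                  capture confirm numeric variable __num
                  if !_rc {                                   // substitutions made the column numeric
                      drop `v' __copy
                      rename __num `v'
                  }
                  else {                                      // otherwise keep the original string column
                      capture drop __num
                      drop __copy
                  }
              }
              Empty cells should come through as ordinary missing (.) while NaN becomes .a, preserving the distinction asked for in post #1.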
              Last edited by William Lisowski; 02 Apr 2019, 20:21.



              • #8
                Dear William,

                thank you for the advice.

                Perhaps I will add this to the wishlist for future versions of Stata, as it seems to be a fairly common, useful, and simple-to-implement addition.

                Regards, Sergiy



                • #9
                  Dear Sergiy,

                  My understanding of your problem is that you have a lot of text files; they contain a mixture of string and numeric columns but the numeric columns contain strings such as "NaN" or "Infinity". You want Stata to import the numeric columns to begin with rather than going through a series of -replace- and/or -destring-, setting "NaN" etc. to missings.

                  My suggestion is that you consider using the -rowrange()- option of -import delimited- to read a small portion of the data (very quickly), identify the variable indices that are numeric and then pass these to the -numericcols()- option of -import delimited- for the whole file. This should be fast. The code below illustrates the principle:

                  Code:
                  
                  clear  all
                  version 14.2
                  set obs 100
                  
                  //make some data
                  forv i = 1/10{
                      g var`i'_num = string(_n)
                      replace var`i'_num = "NaN" in 95/99
                      replace var`i'_num = "Infinity" in 100
                      g var`i'_str = word("`c(alpha)'", `i')
                  }
                  g rand_sort = runiform()
                  sort rand_sort
                  drop rand_sort
                  export delimited "testimport.txt", replace
                  
                  //fast import to identify indices of numeric vars using -rowrange()- to read top of data
                  import delimited using "testimport.txt", ///
                      clear numericc(_all) asfloat varn(1) case(lower) rowrange(1:10)
                  
                  //generate a local macro of variables that contain numbers
                  foreach var of varlist *{
                      cap assert `var' == .
                      if _rc loc numericvars `numericvars' `var'
                  }
                  di "`numericvars'"
                  
                  //get positions using mata's st_varindex()
                  mata: st_local("numlist", invtokens(strofreal((st_varindex(tokens("`numericvars'"))))))
                  di "`numlist'"
                  
                  //import whole file specifying numeric columns
                  import delimited using "testimport.txt", ///
                      clear numericc(`numlist') asfloat varn(1) case(lower)
                  
                  //verify result
                  foreach var of varlist *{
                      if regexm("`var'", "num") confirm numeric variable `var'
                      else confirm string variable `var'
                  }
                  describe
                  erase "testimport.txt"
                  There are a couple of limitations:
                  • this approach doesn't allow for your desire to use extended missings. Strings are simply set to missing.
                  • If a numeric variable only contains numbers and "NaN"s, then Stata doesn't convert them to missings but to a special value "1.#QNAN". I don't know how to reference or change these values. I wonder if they are a bug related to the underlying C. When tabulated, they are not recognised as the same value but appear separately. Does anyone know about this?
                  As I say, this is likely not a suitable solution, but I thought it worth mentioning. I'm also curious about the "1.#QNAN" values.

                  Regards,
                  Raza

