Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Precision in egen

    We all know we should avoid comparing two floating point numbers for equality. Lots of discussions can be found in the Statalist archive and various FAQs, eg here. However in some cases the comparison seems to be beyond the control of the user. And here I am looking for a proper workaround.

    Consider the following statement:
    Code:
    egen wq=cut(wage), group(10)
    It cuts the variable wage into 10 equal-frequency (and hence unequal width) groups. The newly created variable wq is non-missing wherever the original variable wage is not missing. Which is what I expect.

    If instead I wanted to create 10 equal-width groups (potentially unequally filled), I could write:
    Code:
    egen wc = cut(wage), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)')
    The result of which, however, is not as expected. Specifically the values on the edges of the distribution are left missing when their original values were not missing.

    I believe this is not an intended behavior, since inclusion of the endpoints happens only sometimes, for some values and not the other.

    I can see two explanations of why this can be happening:
    1. Egen receives it's arguments as macro substitution, so precision is lost during conversion of the value from a scalar r(min) to its textual representation.
    2. Egen internally compares two floating point numbers in a way it shouldn't.
    Since there appears no way to specify the argument indirectly (by providing a name of the scalar, which egen should take) this looks like a problem.
    And I am more interested in finding a nice workaround. Here is an example code:

    Code:
    sysuse nlsw88, clear
    sort wage, stable
    keep wage
    keep if inrange(wage,2,32)
    summarize wage
    egen wc = cut(wage), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)')
    egen wq = cut(wage), group(10)
    list if missing(wc), sepby(wage)
    Changing e.g. the right margin of the interval from 32 to 12 you would notice the problem.

    So far it seems I need to do two additional cleanups:
    Code:
    summarize wc
    replace wc=r(min) if missing(wc) & !missing(wage) & abs(wage-r(min))<0.000001
    replace wc=r(max) if missing(wc) & !missing(wage)
    Note the omission of the epsilon-comparison in the second statement is intentional - the implied value (r(max)) is smaller than the unclassified value - by the size of the interval, which is way larger than epsilon.

    I would appreciate if someone can identify a better way to call egen to get the intended result.

    Thank you, Sergiy Radyakin.

  • #2
    Usually I despise, loathe and abominate these information-losing categorisation functions, not least because you need to look at the code to be sure of boundary decisions, but that's not the question(*). It's correct that (slight editing here)

    egen receives its arguments by macro substitution, so precision is lost during conversion of the value from a scalar r(min) to its textual representation
    although that itself doesn't rule out other problems. Insisting on scalar(r(min)) as input might help. You might need to go

    Code:
    `=scalar(r(min))'
    I assume that someone in power is insisting on this kind of calculation.

    (*) In some quarters it seems to be believed that what quantile bin a value is in is somehow more useful or informative than the value itself.
    Last edited by Nick Cox; 16 Jul 2014, 10:44.

    Comment


    • #3
      Sergiy,

      Here's a half-baked idea that you might be able to finish off:

      Code:
      summarize wage
      numlist "`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)'"
      Now r(numlist) contains the list of cuts and you can edit the first one to be lower by epsilon and edit the upper one to be higher by epsilon. I can't think off the top of my head how to do that, but perhaps you can.

      Incidentally, one wouldn't think that adding epsilon to the upper limit is necessary since anything higher than the penultimate cut should be fine. However, there is some suspicious code in _gcut.ado (which is called by egen):

      Code:
      if "`at'" != "" {
              local n : word count `at'
              local top : word `n' of `at'
              qui replace `x' = . if `x' >= `top'
      }
      Thus, it is intentionally setting to missing any values that are greater than or equal to the upper endpoint of your interval. Setting to missing all values that are greater than the upper limit makes sense, but I would say that the "=" is a bug. Perhaps there is an explanation for this buried somewhere in the documentation or in Stata's internal specifications.

      Regards,
      Joe

      Comment


      • #4
        Dear Nick, Joe, thank you for your responses.

        Nick, I tried using scalar() as you suggested, but there were no noticeable difference in results. This is what I typed:
        Code:
        egen wc = cut(wage), at(`=scalar(r(min))'(`=(scalar(r(max))-scalar(r(min)))/10')`=scalar(r(max))')
        Stata still responded with "5 missing values generated".

        Joe, I created a copy of the _gcut.ado edited it for equality sign in the statement you suspected and re-run, also without any effect.

        So far adding the two replace statements that I posted seems the easiest workaround, but I will consider doing the classification myself without the use of egen alltogether to stay in full control or as a custom extension to egen.

        Thank you, Sergiy Radyakin

        Comment


        • #5
          I suspect you may need double precision to do justice to the scalar idea.

          Comment


          • #6
            The problem is due to _gcut's use of numlist in the syntax statement:

            Code:
                syntax newvarname(gen) =/exp [if] [in] /*
                    */ [, AT(numlist asc) Group(real 0) ICodes Label BY(string)]
            Here's a simple example that shows the issue

            Code:
            program define test_numlist
                version 7, missing
                syntax , AT(numlist asc)
            
                dis "`0'"
                dis "at    = `at'"
                dis "c(pi) = `c(pi)'"
                local first : word 1 of `at'
                dis "diff  = " c(pi) - `first'
            end
            which results in

            Code:
            . test_numlist, at(`c(pi)'(3)10)
            , at(3.141592653589793(3)10)
            at    = 3.14159265359 6.14159265359 9.14159265359
            c(pi) = 3.141592653589793
            diff  = -2.069e-13
            In other words, numlist expands the list but the resulting string elements do not reflect the full accuracy of double precision.

            Comment


            • #7
              Sergiy,

              A couple of additional thoughts:

              One possible, albeit drastic, way to solve your problem is to modify _gcut.ado. To solve the lower endpoint problem you can change this line:
              Code:
              qui replace `varlist' = `cutp'  if `x' >= `cutp' & `touse'
              to:

              Code:
              qui replace `varlist' = `cutp'  if float(`x') >= float(`cutp') & `touse'
              The upper endpoint problem is more problematic. You can change the line (referrred to in my previous message):
              Code:
               qui replace `x' = . if `x' >= `top'
              to:

              Code:
                
               qui replace `x' = . if float(`x') > float(`top')
              This solves the missing value problem but results in an extra category containing only the maximum. This is the way egen cut is supposed to work, i.e., each number in the at(numlist) is the left-hand side of an interval. So, if the maximum is listed in the numlist, the maximum will be in its own category. There are probably ways to fix this, e.g., by adding an epsilon to the maximum or by using the numlist command (as in my previous message) and chopping off the last number.

              Hope this helps.

              Regards,
              Joe

              Comment


              • #8
                For those who have mentioned double precision, it's worth keeping in mind that Sergiy doesn't *need* double precision for what he is doing (nor do most people). Rather, the problem arises because Stata performs all calculations in double precision (see quote below from the data types help file). Thus, when comparing a variable with a constant (as in this case), both the variable and the constant need to be double precision or both need to be float. If there is a mix there will be precision problems. For whatever reason, the use of a numlist results in about 12 digits of accuracy, compared to 7 for float and 16 for double, hence the precision problems. As noted in the help file and suggested in my previous post, one solution is to use the float() function to ensure that both sides of the equal sign are the same precision.

                Note, by the way, that making wage a double does not solve the problem because _gcut creates a new variable, using a simple gen statement. (It appears that there was the intention to provide a type for the new variable, but I can't figure out how it is supposed to work.) Thus, the variable in question is converted by default to float and any precision obtained by making the original variable a double is lost. This can be fixed by modifying _gcut to make sure that the new variable is created as a double.

                ...Stata, however, performs all
                calculations in double precision. If you were to store 0.1 in a float called x and
                then ask, say, "list if x==.1", there would be nothing in the list. The .1 that you
                just typed was converted to double, with 16 digits of accuracy
                (.100000000000000014...), and that number is never equal to 0.1 stored with float
                accuracy.

                One solution is to type "list if x==float(.1)". The float() function rounds its
                argument to float accuracy; see [D] functions. The other alternative would be store
                your data as double, but this is probably a waste of memory....

                Comment


                • #9
                  Thank you Joe,

                  This solves the missing value problem but results in an extra category containing only the maximum.
                  this would be not compatible with why I am cutting the variable into bins in the first place.

                  I am fine with changing the _gcut.ado but only for investigation purposes. Even if we find a bug in it now, it will not be patched in e.g. Stata 10. The other negative side is that then I might forget that I have a patched version and have problems later on working on a different project.

                  @Robert: I agree 100%, what's the workaround? (with egen)

                  The verbal formulation of the task would be something like: assign to each observation of x where x is nonmissing a numeric code 1 to 10 depending on which interval it falls to from the set of 10 equal sized intervals covering [x_min,x_max] including the endpoints.

                  I have already wrote the code that delivers this functionality directly, without using egen and I don't think it can be accommodated by egen with a simple set of options.

                  Thank you, Sergiy

                  Comment


                  • #10
                    Did anyone mention xtile? I prefer being in charge of what happens at bin bounds, as discussed in http://www.stata-journal.com/article...article=pr0054

                    Naturally, egen, cut() has wider scope than quantile-based binning.

                    Comment


                    • #11
                      No solution within egen without changing the syntax of _gcut.ado and its reliance on numlist to generate the bin cutoffs. This is not a very hard problem so I would simply do as you did and code it. It would go something like:

                      Code:
                      cap program drop cutat
                      program def cutat
                      version 10
                      syntax varname [if] [in] , bins(integer) gen(name)
                      
                      marksample touse
                      
                      local vtype : type `varlist'
                      qui gen `vtype' `gen' = .
                      
                      sum `varlist' if `touse', meanonly
                      tempname cutp inc
                      scalar `inc' = (r(max) - r(min)) / `bins'
                      forvalues i=1/`bins' {
                          scalar `cutp' = r(min) + (`i'-1) * `inc'
                          qui replace `gen' = `cutp' if `varlist' >= `cutp' & `touse'
                      }
                      qui replace `gen' = . if `varlist' > r(max)
                      
                      end
                      
                      sysuse nlsw88, clear
                      cutat wage if inrange(wage,2,32), bins(10) gen(wc2)
                      
                      keep wage wc2
                      keep if inrange(wage,2,32)
                      summarize wage
                      egen wc = cut(wage), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)')
                      egen wq = cut(wage), group(10)
                      
                      assert wc == wc2 if !mi(wc)
                      
                      list if missing(wc), sepby(wage)

                      Comment


                      • #12
                        Originally posted by Joe Canner View Post
                        Note, by the way, that making wage a double does not solve the problem because _gcut creates a new variable, using a simple gen statement. (It appears that there was the intention to provide a type for the new variable, but I can't figure out how it is supposed to work.)
                        Actually, egen, cut() can work with a variable that is a double and will generate a double if requested. Using Sergiy's example, if wage was double, you could type

                        Code:
                        egen double wc = cut(wage), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)')
                        There's is however a bug in _gcut.ado; the following line

                        Code:
                        qui gen `typelist' `x' = `exp'
                        should be replaced with

                        Code:
                        qui gen `typlist' `x' = `exp'
                        From the help syntax page:

                        If there are new variables (you coded newvarname or newvarlist), the macro `typlist' is also defined, containing the storage type of each of the new variables, listed one after the other.
                        The attempt here is to create a copy of the variable being cut (as stored in the `exp' expression) using the type specified for the variable created by egen. Since the macro is incorrectly referenced, the effect is to ignore the explicit typing.

                        I concur with Joe in #3 that the following line is incorrect as well
                        Code:
                        qui replace `x' = . if `x' >= `top'
                        Even if these were corrected, the more fundamental problem of specifying cut points using numlist is not solved as each element of the expanded numeric list lacks the accuracy of double precision. This will cause problems at both ends even when everything is double:

                        Code:
                        cap program drop cutat
                        program def cutat
                        version 10
                        syntax varname [if] [in] , bins(integer) gen(name)
                        
                        marksample touse
                        
                        local vtype : type `varlist'
                        qui gen `vtype' `gen' = .
                        
                        sum `varlist' if `touse', meanonly
                        tempname cutp inc
                        scalar `inc' = (r(max) - r(min)) / `bins'
                        forvalues i=1/`bins' {
                            scalar `cutp' = r(min) + (`i'-1) * `inc'
                            qui replace `gen' = `cutp' if `varlist' >= `cutp' & `touse'
                        }
                        end
                        
                        set seed 12345
                        sysuse nlsw88, clear
                        keep wage
                        gen double wage2 = wage + runiform()
                        bys wage: replace wage2 = wage2[1]
                        sort wage2
                        
                        summarize wage2
                        egen double wc = cut(wage2), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)')
                        
                        cutat wage2, bins(10) gen(wc2)
                        
                        assert float(wc) == float(wc2) if !mi(wc)
                        list if missing(wc), sepby(wage2)

                        Comment

                        Working...
                        X