Precision in egen

Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#1

Precision in egen

16 Jul 2014, 10:23

We all know we should avoid comparing two floating point numbers for equality. Lots of discussions can be found in the Statalist archive and various FAQs, eg here. However in some cases the comparison seems to be beyond the control of the user. And here I am looking for a proper workaround.

Consider the following statement:

Code:

egen wq=cut(wage), group(10)

It cuts the variable wage into 10 equal-frequency (and hence unequal width) groups. The newly created variable wq is non-missing wherever the original variable wage is not missing. Which is what I expect.

If instead I wanted to create 10 equal-width groups (potentially unequally filled), I could write:

Code:

egen wc = cut(wage), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)')

The result of which, however, is not as expected. Specifically the values on the edges of the distribution are left missing when their original values were not missing.

I believe this is not an intended behavior, since inclusion of the endpoints happens only sometimes, for some values and not the other.

I can see two explanations of why this can be happening:
Egen receives it's arguments as macro substitution, so precision is lost during conversion of the value from a scalar r(min) to its textual representation.

Egen internally compares two floating point numbers in a way it shouldn't.

Since there appears no way to specify the argument indirectly (by providing a name of the scalar, which egen should take) this looks like a problem.
And I am more interested in finding a nice workaround. Here is an example code:

Code:

sysuse nlsw88, clear sort wage, stable keep wage keep if inrange(wage,2,32) summarize wage egen wc = cut(wage), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)') egen wq = cut(wage), group(10) list if missing(wc), sepby(wage)

Changing e.g. the right margin of the interval from 32 to 12 you would notice the problem.

So far it seems I need to do two additional cleanups:

Code:

summarize wc replace wc=r(min) if missing(wc) & !missing(wage) & abs(wage-r(min))<0.000001 replace wc=r(max) if missing(wc) & !missing(wage)

Note the omission of the epsilon-comparison in the second statement is intentional - the implied value (r(max)) is smaller than the unclassified value - by the size of the interval, which is way larger than epsilon.

I would appreciate if someone can identify a better way to call egen to get the intended result.

Thank you, Sergiy Radyakin.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35730
#2

16 Jul 2014, 10:42

Usually I despise, loathe and abominate these information-losing categorisation functions, not least because you need to look at the code to be sure of boundary decisions, but that's not the question(*). It's correct that (slight editing here)

egen receives its arguments by macro substitution, so precision is lost during conversion of the value from a scalar r(min) to its textual representation

although that itself doesn't rule out other problems. Insisting on scalar(r(min)) as input might help. You might need to go

Code:

`=scalar(r(min))'

I assume that someone in power is insisting on this kind of calculation.

(*) In some quarters it seems to be believed that what quantile bin a value is in is somehow more useful or informative than the value itself.

Last edited by Nick Cox; 16 Jul 2014, 10:44.
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#3

16 Jul 2014, 11:54

Sergiy,

Here's a half-baked idea that you might be able to finish off:

Code:

summarize wage numlist "`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)'"

Now r(numlist) contains the list of cuts and you can edit the first one to be lower by epsilon and edit the upper one to be higher by epsilon. I can't think off the top of my head how to do that, but perhaps you can.

Incidentally, one wouldn't think that adding epsilon to the upper limit is necessary since anything higher than the penultimate cut should be fine. However, there is some suspicious code in _gcut.ado (which is called by egen):

Code:

if "`at'" != "" { local n : word count `at' local top : word `n' of `at' qui replace `x' = . if `x' >= `top' }

Thus, it is intentionally setting to missing any values that are greater than or equal to the upper endpoint of your interval. Setting to missing all values that are greater than the upper limit makes sense, but I would say that the "=" is a bug. Perhaps there is an explanation for this buried somewhere in the documentation or in Stata's internal specifications.

Regards,
Joe
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#4

16 Jul 2014, 13:03

Dear Nick, Joe, thank you for your responses.

Nick, I tried using scalar() as you suggested, but there were no noticeable difference in results. This is what I typed:

Code:

egen wc = cut(wage), at(`=scalar(r(min))'(`=(scalar(r(max))-scalar(r(min)))/10')`=scalar(r(max))')

Stata still responded with "5 missing values generated".

Joe, I created a copy of the _gcut.ado edited it for equality sign in the statement you suspected and re-run, also without any effect.

So far adding the two replace statements that I posted seems the easiest workaround, but I will consider doing the classification myself without the use of egen alltogether to stay in full control or as a custom extension to egen.

Thank you, Sergiy Radyakin
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35730
#5

16 Jul 2014, 13:06

I suspect you may need double precision to do justice to the scalar idea.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

16 Jul 2014, 13:27

The problem is due to _gcut's use of numlist in the syntax statement:

Code:

    syntax newvarname(gen) =/exp [if] [in] /*
        */ [, AT(numlist asc) Group(real 0) ICodes Label BY(string)]

Here's a simple example that shows the issue

Code:

program define test_numlist
    version 7, missing
    syntax , AT(numlist asc)

    dis "`0'"
    dis "at    = `at'"
    dis "c(pi) = `c(pi)'"
    local first : word 1 of `at'
    dis "diff  = " c(pi) - `first'
end

which results in

Code:

. test_numlist, at(`c(pi)'(3)10)
, at(3.141592653589793(3)10)
at    = 3.14159265359 6.14159265359 9.14159265359
c(pi) = 3.141592653589793
diff  = -2.069e-13

In other words, numlist expands the list but the resulting string elements do not reflect the full accuracy of double precision.

Comment

Joe Canner

Join Date: Mar 2014

Posts: 580
#7

16 Jul 2014, 13:30

Sergiy,

A couple of additional thoughts:

One possible, albeit drastic, way to solve your problem is to modify _gcut.ado. To solve the lower endpoint problem you can change this line:

Code:

qui replace `varlist' = `cutp' if `x' >= `cutp' & `touse'

to:

Code:

qui replace `varlist' = `cutp' if float(`x') >= float(`cutp') & `touse'

The upper endpoint problem is more problematic. You can change the line (referrred to in my previous message):

Code:

qui replace `x' = . if `x' >= `top'

to:

Code:

qui replace `x' = . if float(`x') > float(`top')

This solves the missing value problem but results in an extra category containing only the maximum. This is the way egen cut is supposed to work, i.e., each number in the at(numlist) is the left-hand side of an interval. So, if the maximum is listed in the numlist, the maximum will be in its own category. There are probably ways to fix this, e.g., by adding an epsilon to the maximum or by using the numlist command (as in my previous message) and chopping off the last number.

Hope this helps.

Regards,
Joe
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#8

16 Jul 2014, 14:40

For those who have mentioned double precision, it's worth keeping in mind that Sergiy doesn't *need* double precision for what he is doing (nor do most people). Rather, the problem arises because Stata performs all calculations in double precision (see quote below from the data types help file). Thus, when comparing a variable with a constant (as in this case), both the variable and the constant need to be double precision or both need to be float. If there is a mix there will be precision problems. For whatever reason, the use of a numlist results in about 12 digits of accuracy, compared to 7 for float and 16 for double, hence the precision problems. As noted in the help file and suggested in my previous post, one solution is to use the float() function to ensure that both sides of the equal sign are the same precision.

Note, by the way, that making wage a double does not solve the problem because _gcut creates a new variable, using a simple gen statement. (It appears that there was the intention to provide a type for the new variable, but I can't figure out how it is supposed to work.) Thus, the variable in question is converted by default to float and any precision obtained by making the original variable a double is lost. This can be fixed by modifying _gcut to make sure that the new variable is created as a double.

...Stata, however, performs all
calculations in double precision. If you were to store 0.1 in a float called x and
then ask, say, "list if x==.1", there would be nothing in the list. The .1 that you
just typed was converted to double, with 16 digits of accuracy
(.100000000000000014...), and that number is never equal to 0.1 stored with float
accuracy.

One solution is to type "list if x==float(.1)". The float() function rounds its
argument to float accuracy; see [D] functions. The other alternative would be store
your data as double, but this is probably a waste of memory....
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#9

16 Jul 2014, 16:03

Thank you Joe,

This solves the missing value problem but results in an extra category containing only the maximum.

this would be not compatible with why I am cutting the variable into bins in the first place.

I am fine with changing the _gcut.ado but only for investigation purposes. Even if we find a bug in it now, it will not be patched in e.g. Stata 10. The other negative side is that then I might forget that I have a patched version and have problems later on working on a different project.

@Robert: I agree 100%, what's the workaround? (with egen)

The verbal formulation of the task would be something like: assign to each observation of x where x is nonmissing a numeric code 1 to 10 depending on which interval it falls to from the set of 10 equal sized intervals covering [x_min,x_max] including the endpoints.

I have already wrote the code that delivers this functionality directly, without using egen and I don't think it can be accommodated by egen with a simple set of options.

Thank you, Sergiy
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35730
#10

16 Jul 2014, 16:09

Did anyone mention xtile? I prefer being in charge of what happens at bin bounds, as discussed in http://www.stata-journal.com/article...article=pr0054

Naturally, egen, cut() has wider scope than quantile-based binning.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#11

16 Jul 2014, 17:09

No solution within egen without changing the syntax of _gcut.ado and its reliance on numlist to generate the bin cutoffs. This is not a very hard problem so I would simply do as you did and code it. It would go something like:

Code:

cap program drop cutat
program def cutat
version 10
syntax varname [if] [in] , bins(integer) gen(name)

marksample touse

local vtype : type `varlist'
qui gen `vtype' `gen' = .

sum `varlist' if `touse', meanonly
tempname cutp inc
scalar `inc' = (r(max) - r(min)) / `bins'
forvalues i=1/`bins' {
    scalar `cutp' = r(min) + (`i'-1) * `inc'
    qui replace `gen' = `cutp' if `varlist' >= `cutp' & `touse'
}
qui replace `gen' = . if `varlist' > r(max)

end

sysuse nlsw88, clear
cutat wage if inrange(wage,2,32), bins(10) gen(wc2)

keep wage wc2
keep if inrange(wage,2,32)
summarize wage
egen wc = cut(wage), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)')
egen wq = cut(wage), group(10)

assert wc == wc2 if !mi(wc)

list if missing(wc), sepby(wage)

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#12

17 Jul 2014, 13:44

Originally posted by Joe Canner View Post

Note, by the way, that making wage a double does not solve the problem because _gcut creates a new variable, using a simple gen statement. (It appears that there was the intention to provide a type for the new variable, but I can't figure out how it is supposed to work.)

Actually, egen, cut() can work with a variable that is a double and will generate a double if requested. Using Sergiy's example, if wage was double, you could type

Code:

egen double wc = cut(wage), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)')

There's is however a bug in _gcut.ado; the following line

Code:

qui gen `typelist' `x' = `exp'

should be replaced with

Code:

qui gen `typlist' `x' = `exp'

From the help syntax page:

If there are new variables (you coded newvarname or newvarlist), the macro `typlist' is also defined, containing the storage type of each of the new variables, listed one after the other.

The attempt here is to create a copy of the variable being cut (as stored in the `exp' expression) using the type specified for the variable created by egen. Since the macro is incorrectly referenced, the effect is to ignore the explicit typing.

I concur with Joe in #3 that the following line is incorrect as well

Code:

qui replace `x' = . if `x' >= `top'

Even if these were corrected, the more fundamental problem of specifying cut points using numlist is not solved as each element of the expanded numeric list lacks the accuracy of double precision. This will cause problems at both ends even when everything is double:

Code:

cap program drop cutat program def cutat version 10 syntax varname [if] [in] , bins(integer) gen(name) marksample touse local vtype : type `varlist' qui gen `vtype' `gen' = . sum `varlist' if `touse', meanonly tempname cutp inc scalar `inc' = (r(max) - r(min)) / `bins' forvalues i=1/`bins' { scalar `cutp' = r(min) + (`i'-1) * `inc' qui replace `gen' = `cutp' if `varlist' >= `cutp' & `touse' } end set seed 12345 sysuse nlsw88, clear keep wage gen double wage2 = wage + runiform() bys wage: replace wage2 = wage2[1] sort wage2 summarize wage2 egen double wc = cut(wage2), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)') cutat wage2, bins(10) gen(wc2) assert float(wc) == float(wc2) if !mi(wc) list if missing(wc), sepby(wage2)
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment