We all know we should avoid comparing two floating-point numbers for equality. Many discussions of this can be found in the Statalist archive and in various FAQs (e.g., here). However, in some cases the comparison seems to be beyond the user's control, and for those cases I am looking for a proper workaround.
Consider the following statement:
Code:
egen wq=cut(wage), group(10)
It cuts the variable wage into 10 equal-frequency (and hence unequal-width) groups. The newly created variable wq is non-missing wherever the original variable wage is non-missing, which is what I expect.
If instead I wanted to create 10 equal-width groups (potentially unequally filled), I could write:
Code:
egen wc = cut(wage), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)')
The result, however, is not as expected. Specifically, values at the edges of the distribution are left missing in wc even though the original values of wage are not missing.
I believe this is not intended behavior, since the endpoints are included only sometimes, for some values and not for others.
I can see two explanations of why this may be happening:
- Egen receives its arguments through macro substitution, so precision is lost when the value is converted from the scalar r(min) to its textual representation.
- Egen internally compares two floating-point numbers in a way it shouldn't.
However, I am more interested in finding a nice workaround. Here is some example code:
Code:
sysuse nlsw88, clear
sort wage, stable
keep wage
keep if inrange(wage,2,32)
* summarize leaves the endpoints in r(min) and r(max)
summarize wage
* 10 equal-width groups between the observed minimum and maximum
egen wc = cut(wage), at(`r(min)'(`=(`r(max)'-`r(min)')/10')`r(max)')
* 10 equal-frequency groups, for comparison
egen wq = cut(wage), group(10)
* observations that were left unclassified
list if missing(wc), sepby(wage)
If you change, e.g., the right margin of the interval from 32 to 12, you will notice the problem.
So far it seems I need to do two additional cleanups:
Code:
summarize wc
replace wc=r(min) if missing(wc) & !missing(wage) & abs(wage-r(min))<0.000001
replace wc=r(max) if missing(wc) & !missing(wage)
Note that the omission of the epsilon comparison in the second statement is intentional: the implied value (r(max)) is smaller than the unclassified value by the size of the interval, which is far larger than epsilon.
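For comparison, here is a minimal sketch of one alternative that does not pass the cutpoints through at() as text. It first checks whether the macro-expanded r(min) converts back to the same stored value (the first explanation above), and then computes the equal-width groups arithmetically from double-precision scalars. The scalar names wmin and wwidth and the variables bin and wc2 are hypothetical, introduced only for illustration, and the assignment at interior boundaries is not guaranteed to match cut() observation for observation.

```stata
* Sketch only: wmin, wwidth, bin and wc2 are hypothetical names.
sysuse nlsw88, clear
keep wage
keep if inrange(wage, 2, 32)

summarize wage, meanonly

* Diagnostic: compare the stored r(min) with its macro-expanded text.
* The last line displays 1 if the text converts back to the same value
* and 0 if digits were lost in the conversion.
display %21.0g r(min)
display "`r(min)'"
display (r(min) == `r(min)')

* Hold the endpoints as double-precision scalars rather than as text.
summarize wage, meanonly
scalar wmin   = r(min)
scalar wwidth = (r(max) - r(min)) / 10

* Interval index 0..9. min(., 9) puts the maximum into the last interval.
* The if qualifier matters: min() ignores missing arguments, so a missing
* wage would otherwise be assigned to interval 9.
generate byte bin = min(floor((wage - scalar(wmin)) / scalar(wwidth)), 9) if !missing(wage)

* Left-hand end of each interval, analogous to the values stored in wc.
generate double wc2 = scalar(wmin) + bin * scalar(wwidth)

* Count observations where wage is present but wc2 is missing.
count if missing(wc2) & !missing(wage)
```

The design choice here is to keep both endpoints inside the data-driven range by construction: the minimum maps to interval 0 because wage minus wmin is exactly zero for it, and the maximum is clamped into interval 9, so no epsilon comparison is needed.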
I would appreciate it if someone could identify a better way to call egen to get the intended result.
Thank you, Sergiy Radyakin.