Numerical precision issues with Stata's pctile and altdef in IC and SE

Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#1

Numerical precision issues with Stata's pctile and altdef in IC and SE

18 Nov 2017, 09:02

(This is cross-posted here: https://stackoverflow.com/questions/...f-in-ic-and-se)

Stata's `altdef` formula in `pctile` gives the wrong result for certain certain numbers in IC and SE (this will also affect `xtile` one the bug with `altdef` there is fixed).

Code:

clear set obs 89750 gen double x = 7.2439548890446011 pctile fp = x, nq(500) altdef pctile double dp = x, nq(500) altdef assert (x[1] == fp) | mi(fp) assert (x[1] == dp) | mi(dp)

The above assertions should be true, or at least the second one, but both fail. (Note that in Stata/MP, the second assertion goes through; at least that was the case for me in testing). We can see that

Code:

. levelsof fp 7.243954658508301 . levelsof dp 7.2439548890446 7.243954889044601 7.243954889044602

This happens because `altdef` takes an average. The formula is:

Code:

scalar perc = 100 * 148 / 500 scalar ith = (_N + 1) * perc / 100 scalar i = floor(ith) scalar h = ith - i scalar q = (1 - h) * x[i] + h * x[i + 1] assert x[i] == x[i - 1] assert q == dp[148] assert q == x[i]

The first two assertions succeeded but the third fails. Stata's `pctile` fails to recognize that `x[i]` is equal to `x[i - 1]`.

(Note: Naturally my actual use case involved a variable that had different values, but one of them was `7.2439548890446011` and that caused the problem.)

double - Numerical precision issues with Stata's pctile and altdef in IC and SE - Stack Overflow

https://stackoverflow.com

Stata's altdef formula in pctile gives the wrong result for certain certain numbers in IC and SE (this will also affect xtile one the bug with altdef there is fixed). clear set obs 89750 gen doubl...
Tags: altdef, pctile, percentile, stata
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

18 Nov 2017, 14:51

I suspect the problem could be corrected by changing (the equivalent in Stata's code of)

Code:

scalar q = (1 - h) * x[i] + h * x[i + 1]

to (the equivalent of)

Code:

scalar q = x[i] + h * (x[i + 1] - x[i])

But realistically, this is not an error, it's just a particularly noticeable instance of the imprecision caused by truncated representation of hexidecimal fractions. The rewritten version captures the noticeable problem when x[i+1]==x[i], that is all.

Last edited by William Lisowski; 18 Nov 2017, 14:57.
1 like
Comment
Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#3

18 Nov 2017, 17:58

I mean, if it gives the wrong answer it is an error, woudn't you say? Specially when rewriting the formula would give the correct answer (as you point out).

The issue is not how to fix it manually. The issue is if I want to use the built-in pctile, _pctile, or xtile (with xtile in particular, it gives the wrong category, which is how I noticed this in the first place). I think there must be a way to fix it since I don't get the issue in MP, just IC and SE.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

18 Nov 2017, 18:24

While there are people from StataCorp who follow this forum regularly, I think this is probably something you should call directly to the attention of technical support. Send them a link to this thread, and also information about your setup(s) (basically the output from the -about- command).

I am intrigued that you get this result in IC and SE but not MP. It is hard for me to understand how that could happen. Are these all on the same machine?
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

18 Nov 2017, 18:33

Consider the following.

Code:

. scalar x1 = 1.1

. scalar x2 = 4.4

. scalar z1 = 3.3

. scalar z2 = x1 + (2/3)*(x2-x1)

. display x1 " " %21x x1
1.1 +1.199999999999aX+000

. display x2 " " %21x x2
4.4 +1.199999999999aX+002

. display z1 " " %21x z1
3.3 +1.a666666666666X+001

. display z2 " " %21x z2
3.3 +1.a666666666667X+001

. assert z1==z2
assertion is false

Note that this uses the rewritten code, and yet for this calculation fails in the same way that the original calculation fails.

The problem is not an error, it is an unavoidable consequence of imprecise representation of nonterminating fractional values. Consider now

Code:

. scalar x1 = 111/100

. scalar x2 = x1-1

. scalar x3 = 11/100

. display x1 " " %21x x1
1.11 +1.1c28f5c28f5c3X+000

. display x2 " " %21x x2
.11 +1.c28f5c28f5c30X-004

. display x3 " " %21x x3
.11 +1.c28f5c28f5c29X-004

. assert x2==x3
assertion is false

For further discussion, see the output of Stata's help precision command, and for even more detail, this linked Stata Blog post.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#6

18 Nov 2017, 18:36

But if these are all being done on the same machine, I don't see why it should make any difference which flavor of Stata is being used: that suggests to me that there is something else going on beyond a routine precision issue.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

18 Nov 2017, 18:59

I'm not sure that we know all three were done on the same machine, or same hardware architecture, but I too am moderately curious if the results would differ between Stata flavors on the same machine.

However, that has no bearing on the general principle that comparisons of two fractional values arrived at through different calculations being fraught with the potential for differences of truly trivial magnitudes, and those differences are not indicative of a software "error".

Let me note that the original post claims, in the first of the two assert commands, that effectively float(7.2439548890446011) should equal 7.2439548890446011, despite the lesser precision of the float data type, and the demo below shows that the first assert command never came close to being satisfied.

Code:

. clear . set obs 89750 number of observations (_N) was 0, now 89,750 . gen double x = 7.2439548890446011 . . pctile fp = x, nq(500) altdef . pctile double dp = x, nq(500) altdef . . display fp " " %21x fp 7.2439547 +1.cf9cf40000000X+002 . display dp " " %21x dp 7.2439549 +1.cf9cf4f78955fX+002
Comment

Mauricio Caceres

Join Date: Sep 2015
Posts: 130

18 Nov 2017, 19:33

The SE and MP discrepancy happens on the same machine, yes. The IC result, however, is only for my laptop, since I don't have another version of Stata locally.

It is an error because this is not about computing arbitrarily precise operations with fractions. It's about percentiles. And the important property of a percentile is not that it's arbitrarily precise. That's not actually the problem. The important property is that a portion of the data is below that number. Obviously with a finite number of values percentiles will always be approximate, but there is no reason not to be more precise if it is possible. Consider this:

Code:

clear
set obs 8975
set seed 1729
gen double x = rnormal() * 5 + 10
expand 10
pctile p = x, nq(500) altdef
xtile tile = x, cutpoints(p)

preserve
    contract tile
    tab _freq
restore

This gives

Code:

  Frequency |      Freq.     Percent        Cum.
------------+-----------------------------------
        170 |        137       27.40       27.40
        180 |        251       50.20       77.60
        190 |        112       22.40      100.00
------------+-----------------------------------
      Total |        500      100.00

But we can do better.

Code:

mata
nq = 500
N  = `=_N'
X  = st_data(., "x")
N  = rows(X)
_sort(X, 1)

quantiles = ((1::(nq - 1)) * (N + 1) / nq)
qpos_i    = floor(quantiles)
qpos_i1   = qpos_i :+ 1
qdiff     = (quantiles - qpos_i)

zeros = selectindex(qpos_i :== 0)
if ( rows(zeros) > 0 ) {
    qpos_i[zeros] = J(rows(zeros), 1, 1)
}

Ns = selectindex(qpos_i1 :== (N + 1))
if ( rows(Ns) > 0 ) {
    qpos_i1[Ns] = J(rows(Ns), 1, N)
}

X_i  = X[qpos_i]
X_i1 = X[qpos_i1]
P    = X_i :+ qdiff :* (X_i1 :- X_i)

st_addvar("double", "pmata")
st_store((1::(nq - 1)), "pmata", P)
end

xtile tmata = x, cutpoints(pmata)
preserve
    contract tmata
    tab _freq
restore

This gives

Code:

  Frequency |      Freq.     Percent        Cum.
------------+-----------------------------------
        170 |         25        5.00        5.00
        180 |        475       95.00      100.00
------------+-----------------------------------
      Total |        500      100.00

A marked improvement, wouldn't you agree?

Comment

Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#9

18 Nov 2017, 19:37

Originally posted by William Lisowski View Post

Let me note that the original post claims, in the first of the two assert commands, that effectively float(7.2439548890446011) should equal 7.2439548890446011, despite the lesser precision of the float data type, and the demo below shows that the first assert command never came close to being satisfied.

Clearly float(X) need not equal double(X). My intention was merely expository and I did not meant to make that claim implicitly or explicitly. Apologies for the confusion. See my post above for a better example of why this is a problem (I think).
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

19 Nov 2017, 08:24

The SE and MP discrepancy happens on the same machine

An interesting experiment would be to add

Code:

set processors 1

to your test on MP. In one of my earliest Statalist topics I discovered performance issues in xtlogit when compared to melogit on my SE system. Stata Technical Services duplicated the problem on their MP system, but only by forcing Stata to run the non-MP code behind xtlogit using the above technique. I suspect that there may be a similar fork in the code base for the compiled (that is, non-ado) commands at the heart of these commands.

If indeed running MP on a "single processor" system recreates the problem, you've got a good reproducible example worth sharing with Technical Services, as Clyde suggested. Perhaps the implementation of the multiprocessor enhancements introduced code improvments that were not carried back to the (original?) code that is executed in single-processor mode.

Consider also this variant on your example from post #8, which shows that even Stata SE can do better on your sample data by using xtile directly than by using pctile to create cutoffs and the using xtile to apply those cutoffs to the data. Interesting and not immediately explicable to me, even after a short glance at the xtile source code.

Code:

clear set obs 8975 set seed 1729 gen double x = rnormal() * 5 + 10 expand 10 xtile tile = x, nq(500) altdef contract tile tab _freq

Code:

Frequency | Freq. Percent Cum. ------------+----------------------------------- 170 | 25 5.00 5.00 180 | 475 95.00 100.00 ------------+----------------------------------- Total | 500 100.00
Comment

Mauricio Caceres

Join Date: Sep 2015
Posts: 130

#11

19 Nov 2017, 08:56

I just realized that in my post above, the results are comparing a double and a float. The correct comparison is:

Code:

clear
set obs 8975
set seed 1729
gen double x = rnormal() * 5 + 10
expand 10
pctile double p = x, nq(500) altdef
xtile tile = x, cutpoints(p)

preserve
    contract tile
    tab _freq
restore

which gives

Code:

  Frequency |      Freq.     Percent        Cum.
------------+-----------------------------------
        170 |         33        6.60        6.60
        180 |        459       91.80       98.40
        190 |          8        1.60      100.00
------------+-----------------------------------
      Total |        500      100.00

Which is better but still off. I will also point out that without `altdef`, you get a similar distribution with xtile vs using the corrected formula.

Code:

pctile double p2 = x, nq(500)
xtile tile2 = x, cutpoints(p2)

preserve
    contract tile2
    tab _freq
restore

This also gives

Code:

  Frequency |      Freq.     Percent        Cum.
------------+-----------------------------------
        170 |         25        5.00        5.00
        180 |        475       95.00      100.00
------------+-----------------------------------
      Total |        500      100.00

But, as expected, tile2 and tmata are different. (I also tested this in the machine where I have Stata SE and MP; the issue is, again, not present in MP once I correctly create "p" as a double).

Comment

Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#12

19 Nov 2017, 09:01

Originally posted by William Lisowski View Post

Consider also this variant on your example from post #8, which shows that even Stata SE can do better on your sample data by using xtile directly than by using pctile to create cutoffs and the using xtile to apply those cutoffs to the data. Interesting and not immediately explicable to me, even after a short glance at the xtile source code.

This may not correct. Stata's built-in xtile ignores "altdef", which is why I did not use it in my example. I have reported this bug here: https://www.statalist.org/forums/for...-option-altdef

So this is comparing xtile without altdef, which is better than xtile with altdef (once it is fixed). Also,

Originally posted by William Lisowski View Post

An interesting experiment would be to add

Code:

set processors 1

I did do this, but the results were the same.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#13

20 Nov 2017, 06:07

I would suggest that the problem could be avoided by not working at the limits of Stata's numerical precision, but instead converting the values you are working with from double to float. That is essentially the point made in the discussion of "false precision" made in the Stata Blog post I linked to at #5 above.
Comment
Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#14

20 Nov 2017, 07:03

I'll keep that in mind. I am prone to working with doubles, so stuff like this comes up... As I note a few posts back, pctile wasn't actually the issue, but stuff like that eventually trickles down to xtile and if conditions and assert statements. I don't mind pctile being 1e-5 off, or whatever it actually is, but I mind xtile giving different inconsistent results with identical inputs.

My original posts did not reflect this, so I think we had some back and forth about different things because I was unclear why this was actually a problem. Apologies for that.
Comment

Announcement