Error with function that uses -rdiscrete()-

Mike Lacy

Join Date: Apr 2014
Posts: 2416

Error with function that uses -rdiscrete()-

23 Feb 2021, 11:04

I have a small program to generate multinomial random variables in Stata, given an existing Stata variable name and a probability vector. It's a wrapper for a Mata function that uses Mata's -rdiscrete()-. The problem I'm having is that -rdiscrete()- occasionally will not execute with certain probability vectors.

-rdiscrete()- understandably requires a probability vector that sums *very* closely to 1.0, a problem my function adapts to by checking for this problem and always making a tiny adjustment to the first element of the probability vector. After many years of using this function without incident, I encountered a problem while trying to generate a uniform multinomial variable with 7 categories.

The surprising thing is that -rdiscrete()- worked with a 7-element probability vector that was a *little* bit off from summing to 1.0, but *failed* when I gave it an input vector for 7 categories that was *closer* to summing to 1.0! This stumps me, though my guess goes to some kind of precision issue. Do any of you have an idea of what the problem might be?

Code for the Mata function, including some debugging echos and some relevant examples appear below.

Code:

cap mata mata drop multi()
mata:
void multi(string varname, string okname, real scalar N, real colvector prob)
// Generate multinomial random variable in Stata.
{
   "Input prob vector:"  
   prob
   // If sum of prob. vector differs from 1.0 by too much, it's an error.
   diff = sum(prob) - 1.0
   if (abs(diff) > 1e-6) {
     st_local(okname, "0")
     exit
   }
   // User-given prob. vectors often differ a little from 1.0 due to precision issues.
   // Avoid this by *always* trivially adjusting the first element in the vector.
   "diff before adjustment: " + strofreal(diff, "%20.5g")
   prob[1] = prob[1] - diff
   diff = sum(prob) - 1.0
   "diff after adjustment: " + strofreal(diff, "%20.5g")
   x = rdiscrete(N, 1, prob)
   st_store(., varname,  x)      
   st_local(okname, "1")
}
end
//
// Examples, good and bad.
clear
local N = 100
set obs `N'
//
gen byte y3 = .
// Works fine.
mata: multi("y3", "OKflag", `N', ///
   (.33333333\.33333333\.33333333) )
tab y3   
//
// Works fine: prob. vector with values reasonably close to 1/7
gen byte y7 = .
mata: multi("y7", "OKflag", `N', ///
   (.14285714\.14285714\.14285714\.14285714\ ///
    .14285714\.14285714\.14285714) )
tab y7    
//    
// Values *closer* to 1/7 fail.    
gen byte y7closer = .
mata: multi("y7closer", "OKflag", `N', ///
(.1428571428571\.1428571428571\.1428571428571\.1428571428571\  ///
.1428571428571\.1428571428571\.1428571428571) )

Tags: None

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2402

23 Feb 2021, 19:49

Hmm....I thought at first that if you were to rewrite each addition, subtraction and sum in terms of quadsum() then that might fix the precision error. However, that failed in a similar way. As a workaround, I reframed the function to use runiform() instead.

Code:

clear *
cls

set seed 523

set matastrict on

cap mata mata drop multi()
mata:
  void multi(string varname, string okname, real scalar N, real colvector prob) {
    real colvector  cumprob,  u,   x
    real scalar     diff,     i
    
    "Input prob vector:"
    prob
    
    // If sum of prob. vector differs from 1.0 by too much, it's an error.
    diff = sum(prob) - 1.0
    if (abs(diff) > 1e-6) {
      st_local(okname, "0")
      exit()
    }

    cumprob = J(length(prob), 1, .)
    for (i = 1; i < length(prob); i++) {
      cumprob[i] = sum(prob[1..i])
    }
    cumprob[length(prob)] = 1

    u = runiform(N, 1)
    
    x = J(N, 1, 1)
    for (i = 1; i < length(prob); i++) {
      x = x :+ (u :> cumprob[i])
    }
    
    st_store(., varname, x)
    st_local(okname, "1")
  }
end

local Nobs 1000
set obs `Nobs'

gen byte y1 = .
mata: multi("y1", "okflag", `Nobs', (.1, .2, .3, .4)')
tab y1

gen byte y2 = .
mata: multi("y2", "okflag", `Nobs', (.1, .1, .1, .7)')
tab y2

gen byte y3 = .
mata: multi("y3", "okflag", `Nobs', J(7, 1, 1/7))
tab y3

gen byte y4 = .
mata: multi("y4", "okflag", `Nobs', J(9, 1, 1/9))
tab y4

Comment

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#3

24 Feb 2021, 07:33

Thanks to Leonardo Guizzetti for joining in on my question. Yes, this kind of solution, cumulating the prob. vector and using a uniform random number, is a good and standard one, which I've also coded in Stata without recourse to Mata, and in fact the relatively obscure built-in -irecode- command is helpful for this purpose.

However, what I'm finding here almost looks like a bug in -rdiscrete()-, and I'm curious to try to diagnose it.
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#4

24 Feb 2021, 08:38

Thanks for pointing out -irecode-, I hadn't noticed that before. I'm sure you were already aware of the approach, but thought I'd offer a workaround. This might be a bug, but I can't be sure. Adding more or less digits in your final example seem to work, but this middle ground you have found does not. Consider also the repeating fraction (in binary and base-10) of 1/9. Inputting 0.111111111111 (diff before adjustment: -9.9987e-13), but more or less digits works as expected.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#5

24 Feb 2021, 09:33

Hmm, that's interesting and strange.

I did confirm, I think, that the problem with -rdiscrete()- here is *not* due to the probabilities not summing to 1.0. Here's a simpler minimal example than what I had above that shows this, along with the error messages that occur:

Code:

mata: prob = (.1428571428571\.1428571428571\.1428571428571\.1428571428571\ /// .1428571428571\.1428571428571\.1428571428571) diff = sum(prob)-1.0 "diff = "; diff rdiscrete(100, 1, prob) end

This results in the following:

Code:

diff = -3.00204e-13 : rdiscrete(100, 1, prob) sum of the probabilities must be 1 rdiscrete(): 3300 argument out of range <istmt>: - function returned error (0 lines skipped)

I go in and tweak the probability vector and retry

Code:

mata: prob[1] = prob[1] - diff diff = sum(prob) -1.0 "diff = "; diff rdiscrete(100, 1, prob) end

which gives the following output and error:
:

Code:

diff = 0 : rdiscrete(100, 1, prob) rdiscrete(): 3498 Stata returned error <istmt>: - function returned error (0 lines skipped) --------------------------------------------------------------------------------------------------------------------------------------------------------- r(3498);

The gloss for r(3498) is quite generic: "An error specific to this function arose. The text of the message should describe the problem."

This becomes murkier and murkier, at least to me <grin>.

I'm now thinking to report this a possible bug. Does that seem reasonable?

I should mention that I'm using Stata 15.1. I presume but don't know that this problem persists in version 16.
Comment
Sebastian Kripfganz

Join Date: May 2014

Posts: 2593
#6

24 Feb 2021, 10:16

I can confirm the behavior in Stata 16.1. The last error indeed looks very much like a bug. As a minimum, the function should return an informative error message if it does not like the input.

https://www.kripfganz.de/stata/
1 like
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#7

24 Feb 2021, 10:54

I've sent this issue to Tech Support, and will report back when I hear something.
Comment
Joerg Luedicke (StataCorp)

StataCorp Employee

Join Date: Apr 2014

Posts: 116
#8

25 Feb 2021, 09:51

Hi all,

This is a bug and we are currently looking into a fix for this that we will release in a future update. Thank you Mike Lacy for the find, and we apologize for the inconvenience!

Best,
Joerg
Comment

Announcement

Error with function that uses -rdiscrete()-

Comment

Comment

Comment

Comment

Comment

Comment

Comment