Generalized beta distribution in probabilistic question

Riccardo Rasoni

Join Date: Dec 2024

Posts: 18
#1

Generalized beta distribution in probabilistic question

13 May 2025, 10:06

Hi everyone,

I'm currently trying to fit a generalized beta distribution for every individuals that I have in a survey in order to take interquartile range of the responses as proxy for uncertainty levels. The formulation derives from a famous paper of Engelberg and Manski (2009) that is called "Comparing the Point Predictions and Subjective Probability Distributions of Professional Forecasters", widely used in the surveys of the fed and the ecb.
I was wondering for a way to derive the distribution's percentiles directly from STATA, that is not something that I found up to now (maybe I'm wrong).

In particular, the distribution is fitted on the outcomes that individuals give to a probabilistic question where they can attach different probability (summed up to 100) from 0 to 100 to different bins of inflation expectations, in my case I have:
Increase by 8% or more

Increase by 4% to less than 8%

Increase by 2% to less than 4%

Increase by 0% to less than 2%

Decrease by more than 0% to less than 2%

Decrease by 2% to less than 4%

Decrease by 4% to less than 8%

Decrease by 8% or more

the estimate of the distribution is the following

\[
\min_{\substack{a > 1,\, b > 1,\\ l > L,\, r > r}} \sum_{i=1}^{8} \left[ B(x_i; a, b, l, r) - F(x_i) \right]^2
\]

The distribution parameters (a,b,l,r)(a, b, l, r)(a,b,l,r) are estimated by minimizing the squared distance between the cumulative beta distribution and the respondent’s empirical distribution over the 8 bins.
a,b>1 are shape parameters.

l,r are location parameters, constrained by lower (L) and upper (R) bounds if the outer bins are selected (i.e., bins 1 or 8).

F(x_i) represents the cumulative empirical distribution based on the respondent’s probability allocations.

Depending on the bins selected:
If neither bin 1 nor bin 8 is used → only a and b are estimated; l and r are fixed.

If one of bin 1 or 8 is used → estimate a,b and the corresponding location parameter.

If both outer bins are used → all four parameters (a,b,l,r) are estimated.

Initialization and bounds:
Initial parameter guesses: a=b=2, l=−12, r=12

Constraints: L=−38, R=+38

Sorry for the length of the post,
if you have any idea on how to proceed, or if you know whether someone already implemented this methodology on STATA, I would be very glad.

Thanks,
best regards,

Riccardo

Last edited by Riccardo Rasoni; 13 May 2025, 10:07. Reason: beta distribution
Tags: None
Josh Zweig

Join Date: Nov 2024

Posts: 18
#2

13 May 2025, 10:32

Dear Riccardo,

Stata does not natively support estimation of the full four-parameter generalized beta distribution, but it is possible to implement the procedure manually. The key steps involve (1) constructing the empirical cumulative distribution for each respondent based on their assigned probabilities over the eight bins; (2) defining the generalized beta cumulative distribution function using Stata’s ibeta() function, which computes the regularized incomplete beta function; and (3) estimating the parameters (a, b, l, r) by minimizing the squared error between the empirical CDF and the generalized beta CDF. This can be done using Stata’s ml command or Mata's optimization routines. You can fix or estimate l and r depending on whether the outer bins are used, as described in your message.

Here is a basic example of how you can define the generalized beta CDF as a user-defined program in Stata:

Code:

program define gbeta_cdf, rclass // Arguments: x = evaluation point, a = shape1, b = shape2, l = lower bound, r = upper bound args x a b l r tempvar z cdf gen double `z' = (`x' - `l') / (`r' - `l') gen double `cdf' = ibeta(`a', `b', `z') return scalar cdf = `cdf' end

Once parameters are estimated, extracting percentiles such as the interquartile range is straightforward using the inverse CDF of the generalized beta. In Stata, this is available through the invibeta() function. For example, the 25th percentile can be computed as l + (r − l) * invibeta(a, b, 0.25), and similarly for the 75th percentile. The difference between these gives you the interquartile range, which serves as your uncertainty proxy.

This approach will require looping over individuals or using statsby to apply the estimation per respondent. If you are interested, I can provide a template .do or .ml file that sets up the estimation for one respondent, which you could then scale to your full sample.

Best regards,
Josh from Estima
2 likes
Comment
Riccardo Rasoni

Join Date: Dec 2024

Posts: 18
#3

13 May 2025, 10:40

Originally posted by Josh Zweig View Post

Dear Riccardo,

Stata does not natively support estimation of the full four-parameter generalized beta distribution, but it is possible to implement the procedure manually. The key steps involve (1) constructing the empirical cumulative distribution for each respondent based on their assigned probabilities over the eight bins; (2) defining the generalized beta cumulative distribution function using Stata’s ibeta() function, which computes the regularized incomplete beta function; and (3) estimating the parameters (a, b, l, r) by minimizing the squared error between the empirical CDF and the generalized beta CDF. This can be done using Stata’s ml command or Mata's optimization routines. You can fix or estimate l and r depending on whether the outer bins are used, as described in your message.

Here is a basic example of how you can define the generalized beta CDF as a user-defined program in Stata:

Code:

program define gbeta_cdf, rclass // Arguments: x = evaluation point, a = shape1, b = shape2, l = lower bound, r = upper bound args x a b l r tempvar z cdf gen double `z' = (`x' - `l') / (`r' - `l') gen double `cdf' = ibeta(`a', `b', `z') return scalar cdf = `cdf' end

Once parameters are estimated, extracting percentiles such as the interquartile range is straightforward using the inverse CDF of the generalized beta. In Stata, this is available through the invibeta() function. For example, the 25th percentile can be computed as l + (r − l) * invibeta(a, b, 0.25), and similarly for the 75th percentile. The difference between these gives you the interquartile range, which serves as your uncertainty proxy.

This approach will require looping over individuals or using statsby to apply the estimation per respondent. If you are interested, I can provide a template .do or .ml file that sets up the estimation for one respondent, which you could then scale to your full sample.

Best regards,
Josh from Estima

Dear Josh,

thank you very much for your answer, it was very helpful. Of course if you can provide the template that you mentioned I would be very glad
thanks

Riccardo
Comment

Josh Zweig

Join Date: Nov 2024
Posts: 18

13 May 2025, 10:50

You're very welcome, Riccardo. I'm glad it was helpful. Below is a template using Stata's ml command to estimate the generalized beta distribution for a single individual’s forecast. This assumes you've already processed the data into cumulative probabilities F_i and bin midpoints x_i.

Step 1: Input Data for One Respondent

Assume you have 8 observations for one individual, with variables:

x = bin midpoints
F = cumulative empirical distribution (based on assigned probabilities)

Code:

* Example data structure (1 respondent)
clear
input byte bin float(x F)
1 -10 0.10
2  -6 0.20
3  -2 0.35
4   1 0.55
5   3 0.75
6   5 0.90
7   7 0.95
8  10 1.00
end

*Step 2: Define the Generalized Beta CDF Evaluation
program define beta_obj
    version 19.0
    args lnf a b l r

    tempvar z model_F
    gen double `z' = (x - `l') / (`r' - `l')
    gen double `model_F' = ibeta(`a', `b', `z')

    quietly replace `lnf' = -((F - `model_F')^2)
end

*Step 3: Estimate Parameters Using ml

ml model lf beta_obj (a b l r)
ml init a 2 b 2 l -12 r 12
ml constrain 1 a > 1
ml constrain 2 b > 1
ml maximize

*Step 4: Calculate Percentiles After Estimation
ml model lf beta_obj (a b l r)
ml init a 2 b 2 l -12 r 12
ml constrain 1 a > 1
ml constrain 2 b > 1
ml maximize

Best regards,
Josh Zweig - Estima

Comment

Riccardo Rasoni

Join Date: Dec 2024

Posts: 18
#5

13 May 2025, 10:54

Originally posted by Josh Zweig View Post

You're very welcome, Riccardo. I'm glad it was helpful. Below is a template using Stata's ml command to estimate the generalized beta distribution for a single individual’s forecast. This assumes you've already processed the data into cumulative probabilities F_i and bin midpoints x_i.

Step 1: Input Data for One Respondent

Assume you have 8 observations for one individual, with variables:
x = bin midpoints

F = cumulative empirical distribution (based on assigned probabilities)

Code:

* Example data structure (1 respondent) clear input byte bin float(x F) 1 -10 0.10 2 -6 0.20 3 -2 0.35 4 1 0.55 5 3 0.75 6 5 0.90 7 7 0.95 8 10 1.00 end *Step 2: Define the Generalized Beta CDF Evaluation program define beta_obj version 19.0 args lnf a b l r tempvar z model_F gen double `z' = (x - `l') / (`r' - `l') gen double `model_F' = ibeta(`a', `b', `z') quietly replace `lnf' = -((F - `model_F')^2) end *Step 3: Estimate Parameters Using ml ml model lf beta_obj (a b l r) ml init a 2 b 2 l -12 r 12 ml constrain 1 a > 1 ml constrain 2 b > 1 ml maximize *Step 4: Calculate Percentiles After Estimation ml model lf beta_obj (a b l r) ml init a 2 b 2 l -12 r 12 ml constrain 1 a > 1 ml constrain 2 b > 1 ml maximize

Best regards,
Josh Zweig - Estima

Wow, impressive work!
Thank you very much Josh, I really appreciate what you did.
I immediately go to test it, thanks!
Best regards,

Riccardo
Comment

Announcement

Generalized beta distribution in probabilistic question

Comment

Comment

Comment

Comment