
  • Using ologit predicted probabilities to estimate categorical variable

    Hi all,

I am relatively new to Stata and was looking for some direction on how to use the postestimation tools for ologit. Specifically, I am trying to take the estimates of an ordered logit regression fit in dataset #1 and predict the outcomes for that categorical variable in dataset #2.

So far, I have been able to use the 'estimates store' and 'estimates restore' commands, along with 'predict', to get the predicted probabilities of each outcome for one categorical variable (which has six possible outcomes). Now, I am hoping to predict a particular value (e.g., 1-6) for that categorical variable using those probabilities. Is there some sort of random number function that would allow me to assign particular outcomes according to those probabilities? Or is there something that I am missing in the ologit postestimation documentation?

    See some example code below:

Code:
    use "dataset1.dta"  
    ologit cat_var predictor1 predictor2 predictor3 predictor4 predictor5
    estimates store m1  
    use "dataset2.dta", clear  
    estimates restore m1  
    predict cat_var1 cat_var2 cat_var3 cat_var4 cat_var5 cat_var6

  • #2
    Thomas: I think what you wind up with is the six estimated probabilities for each of the categories, right? So your prediction should be the category with the largest estimated probability. Naturally, there will be mistakes -- often lots of them. I would do it in a cumbersome way using a bunch of if statements; I'm sure you can come up with something more clever.

    Code:
gen double cat_var_max = max(cat_var1,cat_var2,cat_var3,cat_var4,cat_var5,cat_var6)
    gen cat_var_pred = 1 if cat_var1 >= cat_var_max
    replace cat_var_pred = 2 if cat_var2 >= cat_var_max
    replace cat_var_pred = 3 if cat_var3 >= cat_var_max
    ...
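The "pick the category with the largest probability" rule is the same in any language; here is a minimal sketch in Python/NumPy with made-up probability rows (all numbers are hypothetical, not from the thread's data):

```python
import numpy as np

# Hypothetical predicted probabilities, one row per observation,
# standing in for cat_var1..cat_var6 from -predict-.
probs = np.array([
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.15, 0.40, 0.20, 0.10, 0.05],
])

# Modal prediction: 1-based index of the largest probability per row.
# Note: on exact ties, argmax keeps the LOWEST index, whereas the
# -replace- cascade in #2 lets the highest category win.
cat_pred = probs.argmax(axis=1) + 1
print(cat_pred.tolist())  # [1, 3]
```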

    • #3
I don't know if this is more clever, but a slightly more Stata-ish way to implement Jeff Wooldridge's approach is:
      Code:
      gen `c(obs_t)' obs_no = _n
      reshape long cat_var, i(obs_no) j(category)
      by obs_no (cat_var category), sort: gen most_likely_category = category[_N]
      reshape wide
      By the way, as with the code in #2, in the unlikely event of a tie, the highest numbered category is chosen. On the other hand, the code suggested in #2 is somewhat more transparent.

That said, reading #1, I am left with the impression that this may not be what Thomas O'Rourke wants. The description given suggests that he is interested in fitting the -ologit- model in one data set and then simulating the application of its predictive model in a different data set. In that case, the use of some random sampling would be appropriate. If this is what he wants, I recommend he post back with more details (e.g. how many replications are wanted, other statistics that might be simulated besides category chosen, whether there are also observed category outcomes against which the simulation should be validated) and I will try to show an approach.
      Last edited by Clyde Schechter; 05 Jun 2023, 15:46.

      • #4
        Thank you both! Clyde Schechter, you are exactly right. Ideally, I was hoping to predict a categorical value for every observation in dataset #2 using some sort of random sampling informed by the probabilities generated by 'predict.' Currently, I have those predicted probabilities stored in six separate variables, as Jeff Wooldridge suggests.

Is there some sort of random sampling technique that lets users draw a value for each observation using its predicted probabilities?

        If I were to go with the approach suggested by Jeff, nearly every single observation in dataset #2 would take the value of 1, because it is far more likely for any given observation to take the value 1 than the value 6. Apologies for lack of clarity on that point!

        • #5
          Here is a demonstration of the approach using the built-in auto data set, with rep78 as the ordered outcome variable in a toy -ologit- regression. Adapt to your actual data.

Note: In this instance, the categories of rep78 are themselves consecutive integers starting from 1. For that reason, the c_order variable is actually not needed, and the variable category itself could be used where the code uses c_order. But if your categories are not the consecutive integers 1, 2, ..., N, the c_order machinery is necessary. As you showed no example data, I wrote the code on the assumption that the more complicated, but more general, approach is required.

          Code:
          clear*
          
          sysuse auto, clear
          
          ologit rep78 price mpg i.foreign
          predict cat_var*
          
          //    CODE TO SELECT RANDOM CATEGORY REFLECTING THE MODELED PROBABILITIES
          gen `c(obs_t)' obs_no = _n
          reshape long cat_var, i(obs_no) j(category)
          
          set seed 1234 // OR YOUR FAVORITE NUMBER SEED
          by obs_no (category), sort: gen cum_prob = sum(cat_var)
          by obs_no (category): gen c_order = _n
          by obs_no (category): gen double selector = runiform() if _n == 1
          by obs_no (category): egen selected_order = ///
              min(cond(cum_prob >= selector[1], c_order, .))
          by obs_no (category): gen selected_category = category[selected_order]
          
          drop selector c_order cum_prob
          reshape wide
          This just selects a random category respecting the predicted probabilities once. Evidently if you need to do this multiple times, you can wrap this code inside a loop with some minor editing.
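Stripped of the -reshape- mechanics, the sampling step above is ordinary inverse-CDF sampling: cumulate the probabilities, draw one uniform number per observation, and take the first category whose cumulative probability reaches the draw. A minimal Python/NumPy sketch of that idea, with hypothetical probabilities:

```python
import numpy as np

rng = np.random.default_rng(1234)  # any seed works

# Hypothetical predicted probabilities, one row per observation.
probs = np.array([
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.15, 0.40, 0.20, 0.10, 0.05],
])

cum = probs.cumsum(axis=1)
cum[:, -1] = 1.0                          # guard against roundoff error
u = rng.random(size=(probs.shape[0], 1))  # one draw per observation

# First column where the cumulative probability reaches the draw; this
# mirrors -egen selected_order = min(cond(cum_prob >= selector[1], ...))-.
selected = (cum >= u).argmax(axis=1) + 1  # 1-based category
print(selected.tolist())
```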

          • #6
What this brings to light for me is that Stata's otherwise wide collection of random variable functions lacks one for a multinomial, something I've occasionally needed. Mata does have -rdiscrete()-, but as near as I can tell, it's not convenient for the current situation. -irecode()- applied to a cumulative distribution might help here, but again I don't think it's convenient for a variable list rather than a fixed probability vector.

            Here's a program based on what I did before, although the current situation is complicated by the probabilities being different for every observation. I don't completely understand how Clyde's code works -- my fault -- but I'd presume the code I suggest below does something similar to his, but without a -reshape- or a -sort-.

            Code:
            // Pedestrian code without error checking follows. Comments welcome.
            cap prog drop rmulti
            prog rmulti
            // Given a probability vector contained in a variable list, create
            // a new variable with values drawn from a multinomial distribution
            syntax varlist, GENerate(name)
            // Cumulate the probability variables.
            local k = wordcount("`varlist'")
            tempname F1  
            local first = word("`varlist'",  1)
            qui gen `F1' = `first'
            forval i = 2/`=`k'-1' {
               local current = word("`varlist'", `i')
               local previous = `i' - 1
               tempname F`i'
               qui gen `F`i'' = `F`previous'' + `current'
            }
            tempname F`k'
            qui gen `F`k'' = 1.0 // avoids roundoff error
            //
            // Generate a random variable value.
            qui gen `generate' = .
            tempname rand
            gen `rand' = runiform()
            forval i = 1/`k' {
               qui replace `generate' = `i' ///
                  if missing(`generate') & (`rand' <= `F`i'')
            }
            end
            //
            // Demonstration code.
            sysuse auto, clear
            ologit rep78 price mpg i.foreign
            predict prob*
            //
            set seed 9475
            // One repetition, for illustration.
            rmulti prob*, gen(y)
            list prob* y in 1/10
            drop y
            //
            // Test by repeatedly generating multinomial values for each original observation's prob. vector.
            gen id = _n
            expand 10000
            rmulti prob*, gen(y)
            // Compare predicted value distribution to the prob. vector for a few original observations.
            egen just1 = tag(id)
            foreach id in 10 20 30 40 50 {
               di _newline "checking id = `id'"
               list prob* if just1 & (id == `id')
               tab y if (id == `id')
            }

            • #7
              The core of both Mike Lacy's approach and mine is found in
              Code:
              forval i = 1/`k' {
                 qui replace `generate' = `i' ///
                    if missing(`generate') & (`rand' <= `F`i'')
              }
              in his code, and in
              Code:
              by obs_no (category): egen selected_order = ///
                  min(cond(cum_prob >= selector[1], c_order, .))
              by obs_no (category): gen selected_category = category[selected_order]
              in mine. They do the same thing; his with wide data and mine with long.
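One way to convince yourself the two snippets agree: for any probability vector and uniform draw, "first i with rand <= F_i" (the wide loop) and "smallest category with cum_prob >= draw" (the long-form egen) select the same index. A quick Python check under that framing (the function names are mine, not from the thread):

```python
import numpy as np

def pick_loop(p, u):
    """Wide-form rule from #6: first category i with u <= F_i."""
    F = 0.0
    for i, p_i in enumerate(p, start=1):
        F += p_i
        if u <= F:
            return i
    return len(p)  # roundoff guard, like hard-coding F_k = 1.0

def pick_cum(p, u):
    """Long-form rule from #5: smallest category with cum_prob >= u."""
    cum = np.cumsum(p)
    cum[-1] = 1.0  # same roundoff guard
    return int(np.argmax(cum >= u)) + 1

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(6))  # a random 6-category probability vector
    u = rng.random()
    assert pick_loop(p, u) == pick_cum(p, u)
print("the wide-loop and cumulative rules agreed on every trial")
```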

              • #8
For whatever it's worth, I figured out that the relatively obscure -irecode()- function *can* be conveniently used here. That function recodes a variable into the values 0, 1, 2, ..., k-1 using a set of k cutpoints, which avoids the need for the user to write any loop over the cumulative probabilities to create random values. So, given a list of (say) 5 cumulated probability variables, say F1, F2, ..., F5, one could do this to generate a multinomial variable:
                Code:
                gen rand  = runiform()
                gen y = irecode(rand, F1, F2, F3, F4, 1.0)
                Using 1.0 for the top value avoids problems stemming from roundoff error.

                There appears to be a savings in machine time of about 50% for using -irecode()-, as opposed to the loop I showed. In some simulation contexts, that might be useful.
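The analogue of -irecode()- in, say, NumPy is np.searchsorted, which likewise maps each draw straight to the 0-based interval of a sorted cutpoint vector without an explicit loop over categories. A small self-contained sketch with hypothetical cutpoints:

```python
import numpy as np

rng = np.random.default_rng(9475)

# Hypothetical cumulative probabilities F1..F4, topped off with a hard
# 1.0, mirroring -irecode(rand, F1, F2, F3, F4, 1.0)-.
cutpoints = np.array([0.30, 0.55, 0.75, 0.90, 1.00])

u = rng.random(100_000)

# searchsorted returns, for each draw, the index of the first cutpoint
# that is >= the draw, i.e. the 0-based category -- no loop needed.
y = np.searchsorted(cutpoints, u, side="left")

# Empirical shares should sit near the implied cell probabilities
# (0.30, 0.25, 0.20, 0.15, 0.10).
print(np.bincount(y, minlength=5) / len(u))
```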
