
  • Using ologit predicted probabilities to estimate categorical variable

    Hi all,

I am relatively new to Stata and was looking for some direction on how to use the postestimation tools for ologit. Specifically, I am trying to take the estimates of an ordered logit regression fit in dataset #1 and predict the outcomes for that categorical variable in dataset #2.

So far, I have been able to use the 'estimates store' and 'estimates restore' commands, along with 'predict', to get the predicted probabilities of each outcome for one categorical variable (which has six possible outcomes). Now, I am hoping to predict a particular value (e.g., 1-6) for that categorical variable using those probabilities. Is there some sort of random number function that would allow me to assign particular outcomes according to those probabilities? Or is there something that I am missing in the ologit postestimation documentation?

    See some example code below:

Code:
    use "dataset1.dta"  
    ologit cat_var predictor1 predictor2 predictor3 predictor4 predictor5
    estimates store m1  
    use "dataset2.dta", clear  
    estimates restore m1  
    predict cat_var1 cat_var2 cat_var3 cat_var4 cat_var5 cat_var6

  • #2
    Thomas: I think what you wind up with is the six estimated probabilities for each of the categories, right? So your prediction should be the category with the largest estimated probability. Naturally, there will be mistakes -- often lots of them. I would do it in a cumbersome way using a bunch of if statements; I'm sure you can come up with something more clever.

    Code:
gen double cat_var_max = max(cat_var1,cat_var2,cat_var3,cat_var4,cat_var5,cat_var6)
    gen cat_var_pred = 1 if cat_var1 >= cat_var_max
    replace cat_var_pred = 2 if cat_var2 >= cat_var_max
    replace cat_var_pred = 3 if cat_var3 >= cat_var_max
    ...
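The "pick the category with the largest probability" rule is the same in any language; here is a minimal sketch in Python/NumPy with made-up probability rows (all numbers are hypothetical, not from the thread's data):

```python
import numpy as np

# Hypothetical predicted probabilities, one row per observation,
# standing in for cat_var1..cat_var6 from -predict-.
probs = np.array([
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.15, 0.40, 0.20, 0.10, 0.05],
])

# Modal prediction: 1-based index of the largest probability per row.
# Note: on exact ties, argmax keeps the LOWEST index, whereas the
# -replace- cascade in #2 lets the highest category win.
cat_pred = probs.argmax(axis=1) + 1
print(cat_pred.tolist())  # [1, 3]
```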

    • #3
I don't know if this is more clever, but a slightly more Stata-ish way to implement Jeff Wooldridge's approach is:
      Code:
      gen `c(obs_t)' obs_no = _n
      reshape long cat_var, i(obs_no) j(category)
      by obs_no (cat_var category), sort: gen most_likely_category = category[_N]
      reshape wide
      By the way, as with the code in #2, in the unlikely event of a tie, the highest numbered category is chosen. On the other hand, the code suggested in #2 is somewhat more transparent.

That said, reading #1, I am left with the impression that this may not be what Thomas O'Rourke wants. The description given suggests that he is interested in fitting the -ologit- model in one data set and then simulating the application of its predictive model in a different data set. In that case, the use of some random sampling would be appropriate. If this is what he wants, I recommend he post back with more details (e.g. how many replications are wanted, other statistics that might be simulated besides category chosen, whether there are also observed category outcomes against which the simulation should be validated) and I will try to show an approach.
      Last edited by Clyde Schechter; 05 Jun 2023, 15:46.

      • #4
        Thank you both! Clyde Schechter, you are exactly right. Ideally, I was hoping to predict a categorical value for every observation in dataset #2 using some sort of random sampling informed by the probabilities generated by 'predict.' Currently, I have those predicted probabilities stored in six separate variables, as Jeff Wooldridge suggests.

Is there some sort of random sampling technique that lets users draw a value for each observation using its predicted probabilities?

        If I were to go with the approach suggested by Jeff, nearly every single observation in dataset #2 would take the value of 1, because it is far more likely for any given observation to take the value 1 than the value 6. Apologies for lack of clarity on that point!

        • #5
          Here is a demonstration of the approach using the built-in auto data set, with rep78 as the ordered outcome variable in a toy -ologit- regression. Adapt to your actual data.

Note: In this instance, the categories of rep78 are themselves consecutive integers starting from 1. For that reason, the c_order variable is actually not needed, and the variable category itself could be used where the code uses c_order. But if your categories are not the consecutive integers 1, 2, ..., N, the c_order machinery is necessary. As you showed no example data, I wrote the code on the assumption that the more complicated, but more general, approach is required.

          Code:
          clear*
          
          sysuse auto, clear
          
          ologit rep78 price mpg i.foreign
          predict cat_var*
          
          //    CODE TO SELECT RANDOM CATEGORY REFLECTING THE MODELED PROBABILITIES
          gen `c(obs_t)' obs_no = _n
          reshape long cat_var, i(obs_no) j(category)
          
          set seed 1234 // OR YOUR FAVORITE NUMBER SEED
          by obs_no (category), sort: gen cum_prob = sum(cat_var)
          by obs_no (category): gen c_order = _n
          by obs_no (category): gen double selector = runiform() if _n == 1
          by obs_no (category): egen selected_order = ///
              min(cond(cum_prob >= selector[1], c_order, .))
          by obs_no (category): gen selected_category = category[selected_order]
          
          drop selector c_order cum_prob
          reshape wide
          This just selects a random category respecting the predicted probabilities once. Evidently if you need to do this multiple times, you can wrap this code inside a loop with some minor editing.
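Stripped of the -reshape- mechanics, the sampling step above is ordinary inverse-CDF sampling: cumulate the probabilities, draw one uniform number per observation, and take the first category whose cumulative probability reaches the draw. A minimal Python/NumPy sketch of that idea, with hypothetical probabilities:

```python
import numpy as np

rng = np.random.default_rng(1234)  # any seed works

# Hypothetical predicted probabilities, one row per observation.
probs = np.array([
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.15, 0.40, 0.20, 0.10, 0.05],
])

cum = probs.cumsum(axis=1)
cum[:, -1] = 1.0                          # guard against roundoff error
u = rng.random(size=(probs.shape[0], 1))  # one draw per observation

# First column where the cumulative probability reaches the draw; this
# mirrors -egen selected_order = min(cond(cum_prob >= selector[1], ...))-.
selected = (cum >= u).argmax(axis=1) + 1  # 1-based category
print(selected.tolist())
```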

          • #6
What this brings to light for me is that Stata's otherwise wide collection of random variable functions lacks one for a multinomial, something I've occasionally needed. Mata does have -rdiscrete()-, but as near as I can tell, it's not convenient for the current situation. -irecode()- applied to a cumulative distribution might help here, but again I don't think it's convenient for a variable list rather than a fixed probability vector.

            Here's a program based on what I did before, although the current situation is complicated by the probabilities being different for every observation. I don't completely understand how Clyde's code works -- my fault -- but I'd presume the code I suggest below does something similar to his, but without a -reshape- or a -sort-.

            Code:
            // Pedestrian code without error checking follows. Comments welcome.
            cap prog drop rmulti
            prog rmulti
            // Given a probability vector contained in a variable list, create
            // a new variable with values drawn from a multinomial distribution
            syntax varlist, GENerate(name)
            // Cumulate the probability variables.
            local k = wordcount("`varlist'")
            tempname F1  
            local first = word("`varlist'",  1)
            qui gen `F1' = `first'
            forval i = 2/`=`k'-1' {
               local current = word("`varlist'", `i')
               local previous = `i' - 1
               tempname F`i'
               qui gen `F`i'' = `F`previous'' + `current'
            }
            tempname F`k'
            qui gen `F`k'' = 1.0 // avoids roundoff error
            //
            // Generate a random variable value.
            qui gen `generate' = .
            tempname rand
            gen `rand' = runiform()
            forval i = 1/`k' {
               qui replace `generate' = `i' ///
                  if missing(`generate') & (`rand' <= `F`i'')
            }
            end
            //
            // Demonstration code.
            sysuse auto, clear
            ologit rep78 price mpg i.foreign
            predict prob*
            //
            set seed 9475
            // One repetition, for illustration.
            rmulti prob*, gen(y)
            list prob* y in 1/10
            drop y
            //
            // Test by repeatedly generating multinomial values for each original observation's prob. vector.
            gen id = _n
            expand 10000
            rmulti prob*, gen(y)
            // Compare predicted value distribution to the prob. vector for a few original observations.
            egen just1 = tag(id)
            foreach id in 10 20 30 40 50 {
               di _newline "checking id = `id'"
               list prob* if just1 & (id == `id')
               tab y if (id == `id')
            }

            • #7
              The core of both Mike Lacy's approach and mine is found in
              Code:
              forval i = 1/`k' {
                 qui replace `generate' = `i' ///
                    if missing(`generate') & (`rand' <= `F`i'')
              }
              in his code, and in
              Code:
              by obs_no (category): egen selected_order = ///
                  min(cond(cum_prob >= selector[1], c_order, .))
              by obs_no (category): gen selected_category = category[selected_order]
              in mine. They do the same thing; his with wide data and mine with long.
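One way to convince yourself the two snippets agree: for any probability vector and uniform draw, "first i with rand <= F_i" (the wide loop) and "smallest category with cum_prob >= draw" (the long-form egen) select the same index. A quick Python check under that framing (the function names are mine, not from the thread):

```python
import numpy as np

def pick_loop(p, u):
    """Wide-form rule from #6: first category i with u <= F_i."""
    F = 0.0
    for i, p_i in enumerate(p, start=1):
        F += p_i
        if u <= F:
            return i
    return len(p)  # roundoff guard, like hard-coding F_k = 1.0

def pick_cum(p, u):
    """Long-form rule from #5: smallest category with cum_prob >= u."""
    cum = np.cumsum(p)
    cum[-1] = 1.0  # same roundoff guard
    return int(np.argmax(cum >= u)) + 1

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(6))  # a random 6-category probability vector
    u = rng.random()
    assert pick_loop(p, u) == pick_cum(p, u)
print("the wide-loop and cumulative rules agreed on every trial")
```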

              • #8
For whatever it's worth, I figured out that the relatively obscure -irecode()- function *can* be conveniently used here. That function recodes a variable into the values 0, 1, 2, ..., k-1 using a set of k cutpoints, which avoids the need for the user to write any loop over the cumulative probabilities to create random values. So, given a list of (say) 5 cumulated probability variables, say F1, F2, ..., F5, one could do this to generate a multinomial variable:
                Code:
                gen rand  = runiform()
                gen y = irecode(rand, F1, F2, F3, F4, 1.0)
                Using 1.0 for the top value avoids problems stemming from roundoff error.

                There appears to be a savings in machine time of about 50% for using -irecode()-, as opposed to the loop I showed. In some simulation contexts, that might be useful.
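The analogue of -irecode()- in, say, NumPy is np.searchsorted, which likewise maps each draw straight to the 0-based interval of a sorted cutpoint vector without an explicit loop over categories. A small self-contained sketch with hypothetical cutpoints:

```python
import numpy as np

rng = np.random.default_rng(9475)

# Hypothetical cumulative probabilities F1..F4, topped off with a hard
# 1.0, mirroring -irecode(rand, F1, F2, F3, F4, 1.0)-.
cutpoints = np.array([0.30, 0.55, 0.75, 0.90, 1.00])

u = rng.random(100_000)

# searchsorted returns, for each draw, the index of the first cutpoint
# that is >= the draw, i.e. the 0-based category -- no loop needed.
y = np.searchsorted(cutpoints, u, side="left")

# Empirical shares should sit near the implied cell probabilities
# (0.30, 0.25, 0.20, 0.15, 0.10).
print(np.bincount(y, minlength=5) / len(u))
```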
