  • Bayesian Posteriors (Direct Computation) Help

    Hello all,

    I am currently working on a project, and some internal comments suggested that I directly control for theoretical posteriors as covariates in my analysis. I have the theoretical model completely worked out, but I have reached my knowledge limit in trying to produce such variables using direct computational methods in Stata. So, now I need help! Do not worry, I have made a few attempts myself, without success.

    Below is a snippet of the relevant variables needed to produce the Bayesian posteriors that I need.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str24 label_participant float p byte corrected_signal float(p_score_10 p_score_11 p_score_12 p_score_13 p_score_14 p_score_15 bayes_score_10 bayes_score_11 bayes_score_12 bayes_score_13 bayes_score_14 bayes_score_15)
    "558a035bfdf99b2d75651378" .6  8   0   0  0   0   0   0 . . . . . .
    "558de4bffdf99b0cad9683ba" .9  5   0   0  0   0   0   0 . . . . . .
    "5591827dfdf99b4fccbdfb21" .6  9   0   0  0   0   0   0 . . . . . .
    "559385f5fdf99b78e49c58e9" .9 15 .15  .2 .2 .11 .06 .04 . . . . . .
    "559ef308fdf99b2616ff2a4f" .6 19   0 .05 .1 .45 .35 .05 . . . . . .
    "55a43278fdf99b02ff6caf89" .6 13 .15  .2 .2  .2 .08   0 . . . . . .
    "55a55bd7fdf99b790558627c" .9 10 .05  .1 .7  .1 .05   0 . . . . . .
    "55b237e6fdf99b19ea79d2f7" .9  .   0   0  0   0   0   0 . . . . . .
    "55b42b7cfdf99b5244217cf4" .6 11   0   0  0   0   0 .05 . . . . . .
    "55b47001fdf99b615b12d66c" .6  .   0   0  0   0  .1  .2 . . . . . .
    "55bd22d0fdf99b374ec0711a" .6  7   0   0 .2 .35  .3  .1 . . . . . .
    end
    Please note that the variables p_score* and bayes_score* index from 0 to 20 (21 states of the world) in the full data set and I am using the ones in the dataex output purely for reference purposes.

    Here is a quick breakdown of what each variable stands for:

    1. label_participant : id for each subject in my experiment
    2. p : probability participant's 'corrected_signal' equals their true score on a quiz
    3. corrected_signal : the signal a participant observed after their priors but before their posteriors
    4. p_score_`k' : participant's prior probability that their score is `k', for k = 0, ..., 20
    5. bayes_score_`k' : the theoretical posterior probability of score `k' if the participant updates as a pure (exact) Bayesian

    Now, there are two formulas from the theoretical model needed to produce bayes_score_`k' for each `k' (beware pseudo Stata code incoming).

    1. The posterior belief a Bayesian holds that their true score equals the signal they received, i.e. k = corrected_signal (writing s = corrected_signal below):

    bayes_score_`k' = (p * p_score_`s') / ((p * p_score_`s') + sum_{i = -5, i != 0}^{i = 5} ((1-p)/10) * p_score_`s + i')

    2. The posterior belief a Bayesian holds that their true score equals any score other than the signal they received, i.e. k != corrected_signal:

    bayes_score_`k' = (((1-p)/10) * p_score_`k') / ((p * p_score_`s') + sum_{i = -5, i != 0}^{i = 5} ((1-p)/10) * p_score_`s + i')

    (The denominator is the same in both cases; the sum runs over the ten noise offsets i in {-5, ..., -1, 1, ..., 5}, i.e. it excludes i = 0.)


    I would like to produce some sort of nested for loop with if conditions that can produce all of this in one go, roughly along the lines of the untested sketch below. Is this possible? Any help would be greatly appreciated!
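
    To make the intent concrete, here is roughly the kind of nested loop I have in mind (an untested sketch; it assumes the full varlists p_score_0-p_score_20 and bayes_score_0-bayes_score_20 exist and simply skips noise terms that would fall outside 0-20):

    Code:
    * untested sketch: direct wide-form computation of the two formulas above
    gen double denom = .
    forvalues s = 0/20 {
         * term for the signal being the true score
         replace denom = p * p_score_`s' if corrected_signal == `s'
         * noise terms within +/- 5 of the signal (offsets outside 0-20 are skipped)
         forvalues i = -5/5 {
              local k = `s' + `i'
              if `i' != 0 & `k' >= 0 & `k' <= 20 {
                   replace denom = denom + ((1 - p)/10) * p_score_`k' if corrected_signal == `s'
              }
         }
    }
    forvalues k = 0/20 {
         replace bayes_score_`k' = p * p_score_`k' / denom if corrected_signal == `k'
         replace bayes_score_`k' = ((1 - p)/10) * p_score_`k' / denom ///
              if corrected_signal != `k' & abs(corrected_signal - `k') <= 5
         replace bayes_score_`k' = 0 if abs(corrected_signal - `k') > 5 & !missing(corrected_signal)
    }
    drop denom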

    Thanks in advance!

    Nicholas

  • #2
    I am trying to fit this into my own conceptualization of Bayesian posterior probability where we have a prior probability, and a likelihood function for the observation, and those get put together to calculate the posterior. And I'm getting confused by your data and explanation. You state there are 21 possible states, but you only show prior probabilities for 6 states, and they are {10, 11, 12, 13, 14, 15}. What's that about?

    I don't see what the likelihood function is supposed to be. I don't know what you mean when you say that p is the probability that the "corrected_signal" equals the true score. That sounds more like a posterior probability than a likelihood. It also seems that you want to somehow allocate 1-p evenly across other potential values (hence the terms involving (1-p)/10). But why divide by 10? If there are 21 states, shouldn't we be dividing by 20? Or if there are actually only 6 states in all, shouldn't we be dividing by 5?

    Finally, I thought you intended the corrected_signal variable to represent the observed data for the participant. And while the values shown are all between 0 and 20, several of them are not from the set of states you actually show, {10, 11, 12, 13, 14, 15}.

    So I think I am seriously misunderstanding your data and their relationship to the model constructs. Can you clear this up for me?



    • #3
      Thanks Clyde for getting back to me!

      So I stated below the data table the following:

      Please note that the variables p_score* and bayes_score* index from 0 to 20 (21 states of the world) in the full data set and I am using the ones in the dataex output purely for reference purposes.
      So you are correct that there are a total of 21 states in the actual data set; hence there would be a varlist of p_score_0-p_score_20 and, similarly, bayes_score_0-bayes_score_20. I only included a subset because that is all dataex can output.

      So the 'corrected_signal' variable is the piece of information the participant observes in my study. It is fundamentally a noisy signal: the probability that it equals their actual quiz score depends on the treatment, which in my case is either p=0.60 (60% chance the signal is their actual score on the quiz) or p=0.90 (90% chance the signal they observe is their actual score on the quiz). To put it differently, 'p' represents the probability that the subjects observed the true state of the world. This implies that with probability 1-p they get a signal that adds an integer, drawn uniformly from the set [-5, 5], to their true score, so the noisy signal is always centered on their true score. The (1-p)/10 is simply the probability of each of the noisy states occurring (0 is excluded from the set because they observe the true score with probability p). The 10 is just the number of noisy states in the set {-5, -4, -3, -2, -1, 1, 2, 3, 4, 5}.
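
      To give a concrete (purely illustrative) example: with p = 0.9, a participant whose true score is 13 sees the signal 13 with probability 0.9 and sees each of 8-12 and 14-18 with probability (1-0.9)/10 = 0.01 (setting aside, for the moment, what happens near the boundaries 0 and 20).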

      Does this make sense now? Sorry for the confusion.
      Last edited by Nicholas Lacourse; 03 Dec 2021, 18:42.



      • #4
        OK, I think I get it now. I could not work with the example data because it had corrected_signal values that were outside the support of the prior distribution, and the priors also did not sum to 1. Those phenomena were, I suppose, artifacts of reporting only selected variables from the prior distribution due to the limits of -dataex-. So I used the general layout of your data and made new "data" that resolved these problems. Respecting -dataex-'s limits, I have limited the universe of outcomes to [0, 5] and the support of the likelihood to [observed-2, observed+2]. But these choices are incorporated into the code through the definitions of the two local macros window and n_states. So all you have to do is change 2 to 5 and 6 to 21 and the code should work.

        The key to doing this effectively is to go to long layout--nearly everything in Stata is easier that way. Note, also, that no explicit loops are required this way.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str24 label_participant float p byte corrected_signal float(p_score_0 p_score_1 p_score_2 p_score_3 p_score_4 p_score_5 bayes_score_0 bayes_score_1 bayes_score_2 bayes_score_3 bayes_score_4 bayes_score_5)
        "558a035bfdf99b2d75651378" .6 0 .0019073712  .2506222 .11307929   .4180601   .2163311   0 . . . . . .
        "558de4bffdf99b0cad9683ba" .9 1   .28322607 .06206453  .1649248   .2559488  .23383577   0 . . . . . .
        "5591827dfdf99b4fccbdfb21" .6 0   .21618482 .21131876  .1248001  .10128398   .3464124   0 . . . . . .
        "559385f5fdf99b78e49c58e9" .9 0    .2437925  .3037116 .28340217 .025255695  .14383812 .04 . . . . . .
        "559ef308fdf99b2616ff2a4f" .6 3    .1893963  .2277302 .25043082   .2132074  .11923525 .05 . . . . . .
        "55a43278fdf99b02ff6caf89" .6 1    .4436076 .02054846  .3049865  .05050541    .180352   0 . . . . . .
        "55a55bd7fdf99b790558627c" .9 0   .05489553 .14040484 .15219286   .3931172  .25938964   0 . . . . . .
        "55b237e6fdf99b19ea79d2f7" .9 0   .17740396  .1681224 .27308652   .3271333   .0542538   0 . . . . . .
        "55b42b7cfdf99b5244217cf4" .6 1   .16815548 .27505726 .28558803  .24215887 .029040346 .05 . . . . . .
        "55b47001fdf99b615b12d66c" .6 5    .3199304 .02592731  .3214624   .1406498     .19203  .2 . . . . . .
        "55bd22d0fdf99b374ec0711a" .6 1   .05487858 .06129556  .5106032   .0960367  .27718604  .1 . . . . . .
        end
        
        
        //  GO TO LONG DATA LAYOUT
        reshape long p_score_ bayes_score_, i(label_participant) j(k)
        
        //  CONSTRUCT LIKELIHOOD FUNCTION
        local window 2  // IN THE REAL PROBLEM IT WILL BE 5
        local n_states 6 // IN THE REAL PROBLEM IT WILL BE 21
        
        //  IT IS CENTERED AT k = corrected_signal, AND SUPPORT EXTENDS +/- `window' AROUND
        //  THERE WITH 1-p UNIFORMLY DISTRIBUTED OVER k != corrected_signal
        gen likelihood = p if k == corrected_signal
        replace likelihood = (1-p)/(min(corrected_signal+`window', `n_states'-1) ///
            -max(corrected_signal-`window', 0)) if k != corrected_signal ///
            & abs(k-corrected_signal) <= `window'
        replace likelihood = 0 if abs(k-corrected_signal) > `window'
        
        //  CALCULATE THE POSTERIOR DISTRIBUTION
        gen priorXlikelihood = likelihood*p_score_
        by label_participant (k), sort: egen denominator = total(priorXlikelihood)
        replace bayes_score_ = priorXlikelihood/denominator
        Consider strongly leaving your data in this long layout. It is likely that whatever else you hope to do with this data set will also be easier that way. If, however, you need to go back to the original wide arrangement, all you need to do is follow this code with:
        Code:
        drop likelihood priorXlikelihood denominator
        reshape wide
        Added: Note also that the notion of spreading the support for the likelihood symmetrically around the corrected_signal value within a window of -5 to +5 is not workable at the edges. The above code therefore divides (1-p) by the number of values that lie within the range of possible values for the true state and fall within -2 (corresponding to -5 in your data) and +2 of the corrected signal, which will be smaller when corrected_signal is near the boundaries.
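
        As a quick illustration of that edge adjustment (a small sketch reusing the same two locals; in the real problem change 2 to 5 and 6 to 21), this displays the divisor applied to (1-p) at each possible signal value:
        Code:
        local window 2
        local n_states 6
        local top = `n_states' - 1
        forvalues s = 0/`top' {
             display "signal = `s' -> (1-p) divided by " min(`s'+`window', `top') - max(`s'-`window', 0)
        }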
        Last edited by Clyde Schechter; 03 Dec 2021, 19:20.



        • #5
          Yes, certainly limitations of dataex! In the actual data set the probabilities do sum to 1 for each subject! I will have to look at this in the morning to try and understand it, but I did not realize you could avoid for loops by simply reshaping to long! I am going to have to keep this in mind for future projects. Thanks again Clyde!

          Have a nice night!

          Nicholas



          • #6
            I did have a follow-up question, Clyde.

            I would like to produce some moment-based and non-moment-based metrics from the distribution, such as:

            Code:
            egen bayes_mode_weight = max(bayes_score_), by(label_participant) // highest probability weight assigned to any state k
            egen bayes_mean = total(k*bayes_score_), by(label_participant) // mean
            forvalues i=2/4 {
                 egen moment`i' = total((k-bayes_mean)^`i'*bayes_score_), by(label_participant)
            }
            gen bayes_sd = sqrt(moment2) // standard deviation (uncertainty)
            gen bayes_skew = moment3*moment2^(-3/2) // skewness
            gen bayes_kurt = moment4*moment2^(-2) // kurtosis
            gen bayes_kurt_excess = bayes_kurt - 3 // excess from expected normal
            reshape wide
            drop moment*
            What if I want the bayes_mode (the value of k itself)? How can I produce this by label_participant? I tried something like the following:

            Code:
            gen bayes_mode = k if bayes_score_ == bayes_mode_weight
            This does in fact produce the state of the world with the highest probability weight, but it does so only for the row k that satisfies the condition, rather than filling that value across all observations within a label_participant group (I agree the code is acting exactly as it should). I just want to be able to reshape back to wide, and I do not want to do a half-done job by failing to enforce some form of by() grouping, with or without an egen command.



            • #7
              Code:
              by label_participant (k), sort: egen bayes_mode = max(cond(bayes_score_ == bayes_mode_weight, k, .))
              Now, be cautioned that no matter how you attempt to calculate it, modes can be tricky because they need not be unique. You could have a distribution in which two state values share a common posterior probability that happens to be the highest in the posterior distribution. So you need some way to break ties like that if they occur. The code above returns the tied state with the highest state number. Other choices are possible: you could choose the one with the lowest state number, or you could choose the mean of all the state numbers that share the highest posterior probability--but that number might not be an integer, and even if it is, it might not be one of the states with the highest posterior probability. (For example, states 3 and 5 might share the highest posterior probability, with 4 having a lower posterior probability.)
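
              For instance, a minimal sketch (assuming the long layout and the bayes_mode_weight variable created in #6) that instead breaks ties toward the lowest state number and flags participants with tied modes:
              Code:
              * break ties toward the LOWEST state number, and count tied states per participant
              by label_participant (k), sort: egen bayes_mode_low = min(cond(bayes_score_ == bayes_mode_weight, k, .))
              by label_participant (k): egen n_tied_modes = total(bayes_score_ == bayes_mode_weight)
              * n_tied_modes > 1 marks participants whose posterior mode is not unique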

              I won't go so far as to say that you shouldn't look into posterior modes at all, but be very cognizant of the fact that modes (of anything) are treacherous.



              • #8
                I ended up reshaping to wide and did the following, which worked:

                Code:
                gen bayes_mode = .
                forvalues i=0/20 {
                     replace bayes_mode = `i' if bayes_score_`i' == bayes_mode_weight
                }
                where bayes_mode_weight was computed above in post #6 as:

                Code:
                egen bayes_mode_weight = max(bayes_score_), by(label_participant)
                So yes, I do agree that modes are in fact tricky; however, all of them in my data set are unique by design in how I asked the questions (at least this is true for the participants' actual posteriors, though not necessarily for the Bayesian ones, which are what I was trying to parse out). So I will compare your code to the less complex version I wrote and see if there are any hiccups.
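
                To double-check that, here is a rough sketch (using the variable names above) that counts how many states are tied at the modal weight in the wide layout:

                Code:
                * count how many states are tied at the modal weight for each participant (wide layout)
                gen n_tied = 0
                forvalues i = 0/20 {
                     replace n_tied = n_tied + (bayes_score_`i' == bayes_mode_weight)
                }
                list label_participant if n_tied > 1  // any rows listed have a non-unique Bayesian mode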
                Last edited by Nicholas Lacourse; 04 Dec 2021, 17:33.

