ANOVA for comparing mean PROBABILITIES?

Helena Torres Alvaro

Join Date: Dec 2023

Posts: 6
#1


03 Mar 2024, 08:21

Dear all,

I have created a model that predicts the probability to eat meat (DV) of 75 individuals when they are in different situational settings (IVs: meal type (breakfast, lunch, dinner); location (home, work, restaurant, etc.), type of company (family, friends, colleagues, etc.), and some other independent variables). I have > 10,000 observations.

Now I want to compare the MEAN probability to eat meat of these 75 individuals when they are at breakfast vs. lunch vs. dinner; and compare this mean probability when they are with family vs. with friends vs. with colleagues, etc. etc.

I have thought about doing a (one-way) ANOVA. But there are two particularities due to which I am not sure if an ANOVA is the right analysis to perform:

1- The means I want to compare are PROBABILITIES

2 - The individuals are the same in all the groups (i.e., situational settings). So it is not about a comparison between different groups, but comparing the same people under different conditions.

I would very much appreciate it if someone could tell me whether I can still use an ANOVA (taking these two points into account). And if not, which other type of analysis could I use to compare this 'mean probability to eat meat under different situational settings'?

I thank you in advance,
& look forward to hearing from you!

Best,

Helena
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#2

03 Mar 2024, 11:51

The fact that your outcome is a probability is problematic but tolerable. Using ANOVA fits a linear probability model, where the effects of the situation variables are additive rather than multiplicative, as would happen with a logistic or Poisson model. Consequently, it is possible that the model will predict probabilities outside the 0 to 1 range for some situations. This may or may not be acceptable in the context of your research goals. There is also the problem that the residual distribution is nearly guaranteed to be heteroskedastic when using a dichotomous outcome. This last problem can be overcome by using robust standard errors, or ignored if you are only interested in estimating expected probabilities and not doing inference.
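A minimal sketch of the linear probability approach described above, assuming a dichotomous `eat_meat` outcome and categorical `meal_type`, `location`, and `company` variables (the names used in the -xtlogit- example later in this post). The new variable `p_lpm` is a hypothetical name. This addresses only the heteroskedasticity, not the within-person correlation.

```stata
* Linear probability model with heteroskedasticity-robust standard errors
regress eat_meat i.meal_type i.location i.company, vce(robust)

* Fitted values from a linear model are not constrained to [0, 1]
predict double p_lpm, xb

* Count fitted probabilities outside the 0-1 range
* (missing values are excluded explicitly, since Stata treats missing as very large)
count if (p_lpm < 0 | p_lpm > 1) & !missing(p_lpm)
```

If the count is zero for the situations of interest, the out-of-range concern is moot for this data set.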

However, you have a more serious problem that cannot be so readily overcome. Because the same people are observed multiple times in different situations, your observations are not independent. You need an analytic approach that accounts for the within-person correlation of outcomes. The most closely related procedure you might use would be repeated-measures ANOVA. But if you are willing to leave ANOVA behind, I would do this with a different model that overcomes all of these problems. I would use a two-level logistic model such as

Code:

xtset individual
xtlogit eat_meat i.meal_type i.location i.company, fe // OR MAYBE re INSTEAD

As a logistic model, it does not predict probabilities outside the 0-1 range. And the -xtlogit- command is designed to account for intra-individual correlation.

I do have a question that you must ask yourself first, however. When you say, for example, that you are interested in the effects on meat eating of location, which of the following do you mean:
1. You want to compare the probabilities of meat eating in the same person when that person eats at home, vs at work, vs at a restaurant.

2. You want to compare the probabilities of meat eating among all people eating at home with that of all people eating at work and all people eating in restaurants.

These are different questions and the answers may be different. If you are interested in how these situational factors affect the meat eating behavior of the same person in different situations, then use the -fe- option to -xtlogit-. But if you are interested in the second question, -xtlogit, fe- will give you incorrect results if the answers to the two different questions differ. In that case, using the -re- option instead of -fe- would be needed.
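As a sketch of how the two choices might be fit and compared side by side, using the same variable names as the code above (the stored-estimate names `m_fe` and `m_re` are arbitrary):

```stata
xtset individual

* Within-person question: conditional fixed-effects logit
xtlogit eat_meat i.meal_type i.location i.company, fe
estimates store m_fe

* Across-people question: random-effects logit
xtlogit eat_meat i.meal_type i.location i.company, re
estimates store m_re

* Coefficients and standard errors from both models in one table
estimates table m_fe m_re, b(%9.3f) se(%9.3f)
```

Note that the -fe- estimator drops individuals whose outcome never varies (always or never eating meat), so the two models may be fit on different numbers of people.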

One more thing about -fe-. If your situational factors include unchanging attributes of individuals, e.g. their sex, race/ethnicity, perhaps occupation or socioeconomic class, the effects of such unchanging factors cannot be estimated with an -fe- model. So if you are interested in the first question but also want to estimate the effects of these unchanging factors, then your best bet is the Mundlak correlated random effects model. You can fit the Mundlak model using the -xthybrid- command, available from SSC.
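For readers who prefer built-in commands, the Mundlak idea can also be sketched by hand: add the person-level mean of each time-varying predictor to a random-effects model. The example below is an illustration, not the -xthybrid- syntax. It assumes a hypothetical 0/1 indicator `restaurant` (meal eaten at a restaurant) and a hypothetical unchanging 0/1 attribute `female`.

```stata
xtset individual

* Person-level mean of the time-varying predictor (hypothetical variable names)
bysort individual: egen double mean_restaurant = mean(restaurant)

* Random-effects logit including the person mean (Mundlak device).
* The coefficient on restaurant then reflects within-person variation,
* while the unchanging attribute female remains estimable.
xtlogit eat_meat restaurant mean_restaurant i.female, re
```

With several multi-category predictors, each category indicator would need its own person mean, and the means should be computed over the same observations used in estimation.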
Helena Torres Alvaro

Join Date: Dec 2023

Posts: 6
#3

03 Mar 2024, 13:30

Dear Clyde,

Thank you very much for your quick reply. The fact is that I initially had an (unbalanced) panel dataset of these 75 individuals, each having 5-21 observations collected over three weeks. I already used a random-effects logit model to predict the probability to eat meat of these individuals (I used RE and not FE because I was also interested in measuring the effect of unchanging factors). Using that regression, I built a model that places these 75 individuals in a multitude of different situations and predicts each individual's probability to eat meat under each of these situations. After running the model, I have > 10,000 observations, and now I would like to compare the probability to eat meat of ALL sampled individuals under different situational settings (i.e., the OPTION 2 you mentioned in your reply); that is why I was talking about MEAN probability.

That is, now (having >10,000 observations -probability to eat meat- of the 75 individuals under different situational settings), I want to compare the MEAN probability to eat meat of the sample when they are at breakfast vs. lunch vs. dinner; the MEAN probability to eat meat of the sample when they are at home vs. work vs. restaurant, etc. etc. (other categories of situational settings as well).

Having clarified that, which analysis would you recommend?

Does the repeated-measures ANOVA you mention still make sense?
Or would another type of analysis be more suitable? (I'm ok with NOT using ANOVA, but using 'xtlogit' again does not make much sense, I think...)

Looking forward to hearing from you!
Thanks again!

Best,

Helena
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#4

03 Mar 2024, 13:44

If I understand you correctly, you initially fit a random effects logistic regression model to a directly observed data set, and then used the results of that model to create a simulated data set of more than 10,000 simulated observations of these 75 people in a large number of situations. Now you want to estimate the mean meat-eating probabilities at each level of several (perhaps all) of your predictors from your simulated data set. For this purpose, I would use the -mean- command. For example, -mean eat_meat, over(location)-, and so on. See -help mean- for more details.

Note: While -mean- will always produce a probability estimate between 0 and 1 in this data set, the confidence intervals are calculated using normal theory (as they would be with ANOVA). Consequently, those confidence intervals might extend outside the 0-1 range. If you prefer to avoid this complication, the -proportion- command will produce the same mean estimates but offers different options for calculating confidence intervals that may be more suitable. See -help proportion- for details.
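A short sketch of the two commands side by side, assuming `eat_meat` is a simulated 0/1 outcome and `location` is a categorical variable. The Wilson interval is just one of the alternatives offered by the `citype()` option of -proportion-; which one suits the analysis is a judgment call.

```stata
* Mean of the 0/1 outcome at each location, with normal-theory confidence intervals
mean eat_meat, over(location)

* Proportion of each eat_meat category at each location;
* the rows for eat_meat = 1 match the means above,
* and Wilson intervals stay within the 0-1 range
proportion eat_meat, over(location) citype(wilson)
```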
Helena Torres Alvaro

Join Date: Dec 2023

Posts: 6
#5

04 Mar 2024, 01:08

Dear Clyde,

Yes, you understood right. But then one question remains: after I have computed the "mean eat_meat, over(location)", and so on, which test can I use to COMPARE these MEANS? (see if they are statistically different from each other).

E.g., compare the mean probability to eat meat of the sample when they are at home vs. at restaurant vs. at work?

Kind regards,

Helena
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#6

04 Mar 2024, 14:15

Use the -test- (or, better still since it also provides an estimate of the difference and a confidence interval, -lincom-) command.
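A sketch of how this might look after -mean-, assuming `location` is coded 1 = home and 2 = restaurant (hypothetical codes). The coefficient names shown follow the naming used by recent Stata versions (16 and later, as far as I know); running -mean- with the `coeflegend` option displays the exact names to use in your installation.

```stata
* Show the names under which the estimated means are stored
mean eat_meat, over(location) coeflegend

* Refit to display the usual output, then contrast two locations
mean eat_meat, over(location)

* Estimated difference (restaurant minus home) with a confidence interval
lincom _b[c.eat_meat@2.location] - _b[c.eat_meat@1.location]

* Wald test that the two means are equal
test _b[c.eat_meat@1.location] = _b[c.eat_meat@2.location]
```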

But, based on my understanding that this is all simulated data from a model, it makes no sense to do testing on these differences. Your sample size of more than 10,000 observations is arbitrary. You could have simulated 75,000,000 observations, or 75, or any other number, and whatever p-values you get in this data will simply be artifacts of the sample size you chose.
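The point can be illustrated with a small simulation. This is a sketch with made-up numbers: a fixed true difference in meat-eating probability (0.50 vs 0.52) between two hypothetical settings, tested at three arbitrary simulated sample sizes. Note that it clears the data in memory.

```stata
* WARNING: clears the data currently in memory
set seed 12345

foreach n of numlist 75 7500 750000 {
    quietly {
        clear
        set obs `n'
        * Alternate observations between two settings (hypothetical indicator)
        generate byte restaurant = mod(_n, 2)
        * Simulate a 0/1 outcome with true probabilities 0.52 and 0.50
        generate byte eat_meat = runiform() < cond(restaurant, 0.52, 0.50)
        regress eat_meat i.restaurant
        test 1.restaurant
    }
    display "N = `n'   difference = " %6.3f _b[1.restaurant] "   p = " %6.4f r(p)
}
```

The underlying difference is the same in every run; only the number of simulated observations changes, and the p-value moves with it.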

Last edited by Clyde Schechter; 04 Mar 2024, 14:19.
Helena Torres Alvaro

Join Date: Dec 2023

Posts: 6
#7

07 Mar 2024, 02:49

All right!
Thank you very much for your help!

Yours sincerely,

Helena