Sample size for multinomial outcome

Galenda Nagudi

Join Date: May 2019

Posts: 20
#1

14 Jul 2025, 13:22

Hello all,

We intend to carry out a study in which the outcome of interest is how women's cervicovaginal secretion (CVS) categories change over time in response to multiple exposures such as contraceptive use. The outcome has 3 categories, and a previous study found the following proportions of women across the 3 categories: 0.42, 0.46 and 0.12. The CVS categories are not ordered; however, they reflect the women's susceptibility to infection, i.e., enhancers (the nature of the secretion promotes infection), inhibitors (the nature of the secretion protects against infection or decreases the chance of infection) and neutral (no effect).

The hypothesis is that women's CVS categories will change over time depending on exposure variables such as contraceptive use and sexual activity. Women will be followed for up to a year, with CVS categories determined every month for the first 3 months and quarterly for the remainder of the time. In other words, a woman could have a CVS category of neutral at baseline, then inhibitor at month 2 and then enhancer at month 6, depending on the exposure variables.

Any assistance on how to come up with a sample size would be greatly appreciated.
Joseph Coveney

Join Date: Apr 2014

Posts: 4449
#2

14 Jul 2025, 16:04

Originally posted by Galenda Nagudi:

"The CVS categories are not ordered; however, they reflect the women's susceptibility to infection, i.e., enhancers (the nature of the secretion promotes infection), inhibitors (the nature of the secretion protects against infection or decreases the chance of infection) and neutral (no effect)."

Those sure seem ordered to me.

"Any assistance on how to come up with a sample size would be greatly appreciated."

Because you're fitting a generalized linear model to the longitudinal data, you'll probably be better off estimating sample size by simulation. There are several resources on how to approach this in Stata, for example, an FAQ here and its corresponding Stata Journal article here, and a blog series beginning here.

But first you're going to need to specify exactly what (quantitatively) you're trying to detect, which you haven't mentioned. You'll also need an estimate of the longitudinal correlation.
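To make the simulation idea concrete, here is a minimal sketch, not a recommended design. Everything numeric in it is a placeholder: the program name simordinal, the 150 women, the log odds ratio of 0.5 for a single time-invariant binary exposure, and the random-intercept SD of 1 are all assumptions for illustration. It also assumes that the published proportions 0.42, 0.46 and 0.12 map onto ordered categories 1, 2 and 3 in that order, and that a random-intercept ordered logit (meologit) is the analysis model.

```stata
capture program drop simordinal
program define simordinal, rclass
    version 15
    syntax [, n(integer 150) b(real 0.5) sdu(real 1)]

    * Cutpoints giving 0.42/0.46/0.12 for an unexposed woman with u = 0
    * (assumed mapping of the published proportions to categories 1, 2, 3)
    local cut1 = logit(0.42)
    local cut2 = logit(0.42 + 0.46)

    drop _all
    set obs `n'
    generate long id = _n
    generate byte exposed = runiform() < 0.5     // hypothetical binary exposure
    generate double u = rnormal(0, `sdu')        // woman-level random intercept

    * Seven visits: months 0, 1, 2, 3, 6, 9 and 12
    expand 7
    bysort id: generate byte visit = _n

    * Latent-variable form of the ordered logit model
    generate double ystar = `b'*exposed + u + logit(runiform())
    generate byte y = 1 + (ystar > `cut1') + (ystar > `cut2')

    * Fit the analysis model; record the Wald p-value for the exposure
    capture meologit y exposed i.visit || id:
    if _rc == 0 {
        return scalar p = 2*normal(-abs(_b[exposed]/_se[exposed]))
        return scalar converged = e(converged)
    }
    else {
        return scalar p = .
        return scalar converged = 0
    }
end

simulate p = r(p) converged = r(converged), reps(200) seed(12345): ///
    simordinal, n(150) b(0.5) sdu(1)

* Estimated power: share of converged replications with p < 0.05
generate byte reject = p < 0.05 if converged == 1
summarize reject
```

The mean of reject is the estimated power for that one combination of inputs. In practice you would rerun it over a grid of sample sizes, effect sizes and random-intercept SDs, and use more replications than the 200 shown here.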
Galenda Nagudi

Join Date: May 2019

Posts: 20
#3

16 Jul 2025, 13:34

Thank you so much for the response Joseph.

"Those sure seem ordered to me".

Our reason for treating the outcome as unordered is that a woman's CVS category could change from inhibitor to enhancer without ever being neutral. In other words, the CVS categories and their changes over time don't follow any order. If we consider them to be ordered, wouldn't that mean that a woman who is initially in the inhibitor category would have to move to neutral before going into the enhancer category?

"But first you're going to need to specify exactly what (quantitatively) you're trying to detect, which you haven't mentioned. You'll also need an estimate of the longitudinal correlation."

All the previous studies have done cross-sectional analyses. How would you advise we go about establishing the correlation value?

We would like to characterize changes in CVS categories and determine factors associated with the CVS changes over time.
One option we are considering is the xtmlogit model with fixed effects, which is more flexible because it makes no distributional assumptions. The other option is generalised estimating equations (xtgee), which take into account the correlation of observations. Please share your thoughts on these options.

Thank you for the resources on using simulations for sample size. I will look through them.

Last edited by Galenda Nagudi; 16 Jul 2025, 13:38.
Clyde Schechter

Join Date: Apr 2014

Posts: 30169
#4

16 Jul 2025, 15:03

"Our reason for treating the outcome as unordered is that a woman's CVS category could change from inhibitor to enhancer without ever being neutral. In other words, the CVS categories and their changes over time don't follow any order. If we consider them to be ordered, wouldn't that mean that a woman who is initially in the inhibitor category would have to move to neutral before going into the enhancer category?"

No, it is perfectly possible (and probably very common) for longitudinal follow-up of an ordinal variable to exhibit transitions from one category to another that skip intervening categories. For one thing, you are not continuously observing the outcome, so the intermediate categories may have occurred at a time when you just weren't measuring it.

But, beyond that, the definition of an ordered set of categories is that they can be lined up in a way that defines lower to higher degrees of intensity, such that any two of the categories can be compared and one is higher than or equal to the other, and the ordering relationship is reflexive, anti-symmetric, and transitive (i.e., from a mathematical standpoint there is a total ordering relationship on the categories). This has no implications at all for how transitions between categories occur.

Let's take a very simple and clear example. Suppose we have a survey question, "How many times per month do you eat in a restaurant?" and the response options are: A) 0; B) 1-3; C) 4-10; D) 11 or more. These are clearly ordered categories. Suppose that in ordinary circumstances I eat in restaurants 5 times per month, and that next month I become unemployed. I would probably respond by eliminating all unnecessary expenses, including substituting homemade food for restaurant meals. This means I would abruptly drop from category C to category A without ever passing through category B.
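The same point can be seen in a small made-up Stata example. The data below are invented purely for illustration, and the coding 1 = inhibitor, 2 = neutral, 3 = enhancer (ordered by susceptibility to infection) is an assumption based on the description in #1. Woman 1 moves from inhibitor straight to enhancer between consecutive visits, and woman 3 moves from enhancer straight to inhibitor, yet the variable is still ordinal.

```stata
* Invented data: 3 women, 3 consecutive visits each
clear
input byte id byte visit byte cvs
1 1 1
1 2 3
1 3 2
2 1 2
2 2 2
2 3 1
3 1 3
3 2 1
3 3 1
end

* Assumed ordering by susceptibility to infection
label define cvslbl 1 "inhibitor" 2 "neutral" 3 "enhancer"
label values cvs cvslbl

* Declare the panel structure and tabulate visit-to-visit transitions
xtset id visit
xttrans cvs, freq
```

The transition table from xttrans is also a simple descriptive tool for characterizing changes between categories once real data are available.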
Joseph Coveney

Join Date: Apr 2014

Posts: 4449
#5

17 Jul 2025, 04:06

Originally posted by Galenda Nagudi:

"All the previous studies have done cross-sectional analyses. How would you advise we go about establishing the correlation value?"

Relax the search criteria to include a broader range of literature related to the outcome. Even case studies can give some idea. In the unlikely event that you can glean absolutely no information from the literature for guidance, you can simulate across a range of correlation coefficients and choose a worst-case assumption.
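If the data-generating model in the simulation is a random-intercept ordered logit (an assumption; other correlation structures are possible), the intraclass correlation on the latent scale is sd^2 / (sd^2 + pi^2/3), where sd is the random-intercept standard deviation. A grid of candidate correlations can therefore be converted into the SDs to feed into the simulation. The grid values below are arbitrary.

```stata
* Latent-scale ICC for a random-intercept logit model:
*     icc = sd^2 / (sd^2 + pi^2/3)
* Solve for sd over an arbitrary grid of candidate ICC values
foreach icc in 0.1 0.2 0.3 0.5 {
    local sd = sqrt(`icc' * (c(pi)^2 / 3) / (1 - `icc'))
    display "ICC = " %4.2f `icc' "   random-intercept SD = " %5.3f `sd'
}
```

Running the power simulation at each of these SDs shows how sensitive the required sample size is to the assumed correlation, which is what makes a worst-case choice defensible.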

"We would like to characterize changes in CVS categories and determine factors associated with the CVS changes over time."

It sounds more like an exploratory study. To the extent that the sample size is affected by intraclass correlation, you can refine the estimate as you gather data and discuss revising enrollment numbers with your funding agency and ethics committee.

"One option we are considering is the xtmlogit model with fixed effects, which is more flexible because it makes no distributional assumptions. The other option is generalised estimating equations (xtgee), which take into account the correlation of observations. Please share your thoughts on these options."

I have only very limited experience with xtmlogit and that with the random effects estimator. It seemed a little finicky, and I'm guessing that the constraints that ordered-categorical analogues impose help with convergence. There is no provision for either unordered-categorical or ordered-categorical longitudinal regression models with xtgee as far as I am aware. The closest you would get is to use the vce(cluster <varname>) option with mlogit or ologit.
Galenda Nagudi

Join Date: May 2019

Posts: 20
#6

17 Jul 2025, 07:22

Originally posted by Clyde Schechter:

"No, it is perfectly possible (and probably very common) for longitudinal follow-up of an ordinal variable to exhibit transitions from one category to another that skip intervening categories. For one thing, you are not continuously observing the outcome, so the intermediate categories may have occurred at a time when you just weren't measuring it. But, beyond that, the definition of an ordered set of categories is that they can be lined up in a way that defines lower to higher degrees of intensity, such that any two of the categories can be compared and one is higher than or equal to the other, and the ordering relationship is reflexive, anti-symmetric, and transitive (i.e., from a mathematical standpoint there is a total ordering relationship on the categories). This has no implications at all for how transitions between categories occur. Let's take a very simple and clear example. Suppose we have a survey question, 'How many times per month do you eat in a restaurant?' and the response options are: A) 0; B) 1-3; C) 4-10; D) 11 or more. These are clearly ordered categories. Suppose that in ordinary circumstances I eat in restaurants 5 times per month, and that next month I become unemployed. I would probably respond by eliminating all unnecessary expenses, including substituting homemade food for restaurant meals. This means I would abruptly drop from category C to category A without ever passing through category B."

Thank you so much for this clarity Clyde. Our outcome variable is certainly ordered.
Galenda Nagudi

Join Date: May 2019

Posts: 20
#7

17 Jul 2025, 07:37

Originally posted by Joseph Coveney:

"Relax the search criteria to include a broader range of literature related to the outcome. Even case studies can give some idea. In the unlikely event that you can glean absolutely no information from the literature for guidance, you can simulate across a range of correlation coefficients and choose a worst-case assumption."

"It sounds more like an exploratory study. To the extent that the sample size is affected by intraclass correlation, you can refine the estimate as you gather data and discuss revising enrollment numbers with your funding agency and ethics committee."

"I have only very limited experience with xtmlogit and that with the random effects estimator. It seemed a little finicky, and I'm guessing that the constraints that ordered-categorical analogues impose help with convergence. There is no provision for either unordered-categorical or ordered-categorical longitudinal regression models with xtgee as far as I am aware. The closest you would get is to use the vce(cluster <varname>) option with mlogit or ologit."

Thank you for your insight Joseph.

In your post #2 you mentioned using generalised linear modelling for the analysis. Would I be right to specify the family as negative binomial and the link as logit? I have looked through the Stata glm manual, and none of the examples use longitudinal data with an ordered or unordered multinomial outcome. Please share any resources I could refer to.

Also, when using ologit or mlogit with vce(cluster <varname>), would this take into account that the data are longitudinal and that the outcome varies over time?
Joseph Coveney

Join Date: Apr 2014

Posts: 4449
#8

17 Jul 2025, 15:21

Originally posted by Galenda Nagudi:

"In your post #2 you mentioned using generalised linear modelling for the analysis. Would I be right to specify the family as negative binomial and the link as logit?"

No, by "you're fitting a generalized linear model to the longitudinal data" I was alluding to meologit, meoprobit and meglm, family(ordinal).

"Also, when using ologit or mlogit with vce(cluster <varname>), would this take into account that the data are longitudinal and that the outcome varies over time?"

Yes, you would use study participant ID as the clustering variable and include observation interval among the predictors.
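As a rough sketch of how the commands mentioned in this post might be specified, the block below first builds a toy dataset only so that the commands run. All variable names (cvs, contraceptive, visit, id) and all numbers are hypothetical, and the outcome is assumed to be coded 1/2/3 in order. Entering visit as a categorical predictor is one way to include the observation interval; it is not the only one.

```stata
* Toy data only so the commands run; names and numbers are hypothetical
clear
set seed 2025
set obs 100
generate long id = _n
generate double u = rnormal()                    // participant-level effect
expand 7
bysort id: generate byte visit = _n              // months 0, 1, 2, 3, 6, 9, 12
generate byte contraceptive = runiform() < 0.4   // time-varying exposure
generate double ystar = 0.5*contraceptive + u + logit(runiform())
generate byte cvs = 1 + (ystar > logit(0.42)) + (ystar > logit(0.88))

* Mixed-effects ordered logit with a random intercept per participant
meologit cvs i.contraceptive i.visit || id:

* The same model specified through meglm
meglm cvs i.contraceptive i.visit || id:, family(ordinal) link(logit)

* Alternative: ordinary ordered logit with cluster-robust standard errors
ologit cvs i.contraceptive i.visit, vce(cluster id)
```

The two approaches answer slightly different questions: the mixed models give participant-specific coefficients conditional on the random intercept, whereas ologit with vce(cluster id) leaves the coefficients as in a pooled model and only adjusts the standard errors for within-participant correlation. For an unordered outcome, mlogit takes the same vce(cluster id) option.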
Galenda Nagudi

Join Date: May 2019

Posts: 20
#9

29 Jul 2025, 06:32

Thank you so much for the clarity Joseph.