Testing if distribution is similar between two groups

Chris James

Join Date: Mar 2015

Posts: 30
#1


09 Mar 2015, 15:57

I have a variable `young` that equals 1 if a participant is less than 25 years old. I also have each participant's favorite ice cream flavor (everyone chooses exactly one of 25 flavors). I would like to test whether the distribution of favorite flavors differs by age group, using the `young` variable. A sample of the data is below.

id young flavor
1 1 1
2 1 1
3 0 5
4 1 11
5 0 7

I have been running a `ttest` for each flavor; however, a 25x2 chi2 test seems more appropriate, but I do not know how to do this in Stata.
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#2

09 Mar 2015, 16:19

In principle, assuming each participant appears only once in the data set, it's

Code:

tab flavor young, chi2

In reality, you may find you get a lot of zeroes or other small numbers in the table cells. In that case the chi square approximation may break down. If that happens you should consider combining some of the seldom-selected flavors into an "Other" group. Or you could replace -chi2- by -exact- in the code shown to get a Fisher exact test instead. (Fisher exact tests can be slow and chew up lots of memory--sometimes they exceed memory limits or your patience for a result.)
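
As a rough illustration of the two fallbacks just described, the sketch below lumps seldom-chosen flavors into an "Other" category and also shows the exact-test variant. The cutoff of 20 responses, the code 99, and the names `n_flavor` and `flavor2` are arbitrary assumptions for illustration, not something from the original data.

```stata
* Sketch only: the cutoff (20) and the "Other" code (99) are assumptions.
* Count how many participants chose each flavor.
bysort flavor: gen n_flavor = _N if !missing(flavor)

* Recode rarely chosen flavors to a single "Other" category.
gen flavor2 = cond(n_flavor < 20, 99, flavor) if !missing(flavor)
label define flavor2 99 "Other"
label values flavor2 flavor2

* Pearson chi-square on the collapsed table; -expected- displays the
* expected cell counts so small cells are easy to spot.
tabulate flavor2 young, chi2 expected

* Alternative: Fisher's exact test on the original table (may be slow).
tabulate flavor young, exact
```

Looking at the expected counts first can help decide whether collapsing is needed at all.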
Chris James

Join Date: Mar 2015

Posts: 30
#3

09 Mar 2015, 16:23

That is helpful. Out of curiosity, if we do not assume each participant appears only once, how does that change things? Also, is the null that they are different? So does Pr = 0.11 mean these distributions are the same by age group?
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#4

09 Mar 2015, 16:33

If participants appear more than once, then you have a repeated-measures design and the ordinary Pearson chi square test is inapplicable because it assumes independent observations.

The null hypothesis is that the distribution is the same in both age groups. If you got a p-value of 0.11 you would not, conventionally, reject that null hypothesis. So you could say that there is not sufficient evidence to infer that they are different. Whether that means they are substantially the same depends on your statistical power (which, in turn, depends on the sample size, the rate in one group, and how much of a difference you would consider substantial).
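
For reference, the hypothesis and test statistic behind the Pearson chi-square test can be written out as follows. The notation ($O_{jk}$ for observed counts, $E_{jk}$ for expected counts, $r_j$ and $c_k$ for row and column totals, $n$ for the total sample size) is introduced here for illustration, and the degrees of freedom assume all 25 flavors appear in the table.

```latex
H_0:\; P(\text{flavor}=j \mid \text{young}=1) = P(\text{flavor}=j \mid \text{young}=0)
\quad \text{for all } j = 1,\dots,25

X^2 = \sum_{j=1}^{25}\sum_{k=0}^{1} \frac{(O_{jk}-E_{jk})^2}{E_{jk}},
\qquad E_{jk} = \frac{r_j\, c_k}{n},
\qquad \text{df} = (25-1)(2-1) = 24
```

A p-value of 0.11 means that, if $H_0$ were true, a statistic at least this large would arise about 11% of the time; it is not the probability that the distributions are the same.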
Chris James

Join Date: Mar 2015

Posts: 30
#5

09 Mar 2015, 16:40

How would I treat this distribution comparison if I find that the survey captured the same person multiple times?

In these data I have 1812 participants with `young == 1` and 270 with `young == 0`. Is there any adjustment for unequal numbers? If not, I would assume this is a large enough sample to state that the distribution of ice cream tastes is similar across age groups.
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#6

09 Mar 2015, 16:53

If there is a small number of people who were captured more than once, I would probably just exclude all but their first response and then stick with the simple chi square test, to avoid complicating things. If this happened commonly, then you have to use a model that accommodates repeated measures. Stata has the -xt- and -me- suites of commands for this kind of nested design. The problem is that your outcome variable is a discrete variable with 25 levels. None of the -xt- or -me- commands support this kind of outcome variable. What I would probably do in that case is stand the problem on its head, using the age group as the dependent variable and the chosen ice cream flavor as the predictor in a logistic model. So something like

Code:

xtset id
xtlogit young i.flavor

and I would base my statistical inference on the overall model chi square test.
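
For the simpler route mentioned at the start of this post (keeping only each person's first response when repeats are rare), a minimal sketch might look like the following. It assumes `id` identifies a person and that the current row order reflects the order of responses; `obs_order` is a hypothetical helper variable.

```stata
* Sketch only: assumes -id- identifies a person and that rows are
* currently in response order (an assumption about the data).
duplicates report id                     // how many people appear more than once?

gen long obs_order = _n                  // remember the original row order
bysort id (obs_order): keep if _n == 1   // keep each person's first response
drop obs_order

* With one row per person, the simple test applies again.
tabulate flavor young, chi2
```

If a survey date or timestamp is available, sorting on that instead of the row order would be a safer way to define "first".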

With regard to statistical power, 1,812 is a fairly large sample for most purposes; 270 might or might not be. And again, you can't assess statistical power just from the sample size. It also depends on the actual outcome distributions and how different they would need to be for you to consider them different. Again, the fact that you have a 25-level categorical outcome makes this complicated. None of the built-in power calculation methods in Stata would handle this, as far as I know.
