Multiple-select questions terminology

Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#1

30 Aug 2024, 14:59

Dear All,

I would like to ask a question related more to terminology than Stata, but I hope it is all right to benefit from the collective wisdom of the Stata users.

The question arises in the context of analyzing data collected through multiple-select categorical questions.

The following web page (cited here not because of any particular reputation or significance, but because it is relevant and uses simple terms and illustrations) mentions two approaches, which it calls:
- percent of respondents;
- percent of answers.
https://resources.pollfish.com/pollf...ion-questions/

1. My first question here is whether this terminology is standard/established/intuitive, or whether there are more common/historical/etc names for the same indicators?

2. My second question is whether there is any other indicator that the researcher may derive from a multiple-select categorical question, if yes, please suggest.

3. My main (third) question: I agree that both can make sense and be useful in SOME situations, but I want to question whether both are ALWAYS valid?

Consider the following first example:
the company is selling widgets and the customers can buy [otherwise identical] widgets that are painted to the color of choice of the customer, and each customer can buy no more than one widget of a specific color. Then having the results of the "What colors are your widgets?" multiple-select question we will be able to answer BOTH what is the percentage of the respondents that have a widget of a specific color (for example, if we want to decide which color is the one that will resonate most in an ad) and which percentage of the widgets is painted which color (for example if we wanted to trim the offering of colors, reduce the number of paint buckets).

But consider the following second example:
The household is asked what appliances they have in their possession: say, refrigerator, microwave, radio, etc. I have a feeling that it is incorrect to add all appliances together and say that 10% of all appliances are refrigerators, since they are very different from e.g. radios (in terms of costs, or uses, or possibly utility derived from their use). Furthermore, if the number of devices is not being asked, but just the fact of their presence (as is common with such multiple-selection questions), then I may be drawing invalid inferences from the second indicator (percent of answers), or I should be cautious with the wording around it, as it loses interpretability rather quickly.
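
To make the presence-versus-count concern concrete, here is a minimal sketch with made-up numbers (the data and all variable and scalar names are hypothetical, not from any real survey). It compares the share of "yes" answers that are refrigerators with the share of owned units that are refrigerators:

```stata
clear all
* HYPOTHETICAL data: has_* = item present (0/1), n_* = number of units owned
input has_fridge has_radio n_fridge n_radio
1 1 1 3
1 1 1 2
1 0 1 0
end

* percent of answers: refrigerators as a share of all "yes" ticks
quietly summarize has_fridge, meanonly
scalar ticks_fridge = r(sum)
quietly summarize has_radio, meanonly
scalar ticks_radio = r(sum)
display %6.2f ticks_fridge/(ticks_fridge + ticks_radio)*100   // 60.00

* percent of appliances: refrigerators as a share of all units owned
quietly summarize n_fridge, meanonly
scalar units_fridge = r(sum)
quietly summarize n_radio, meanonly
scalar units_radio = r(sum)
display %6.2f units_fridge/(units_fridge + units_radio)*100   // 37.50
```

In this toy case the two figures differ (60% of answers versus 37.5% of units), so "percent of answers" cannot be read as "percent of appliances" unless each selected item corresponds to exactly one unit.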

Is my intuition correct here? And if so, what are the requirements for the second approach to make sense? And is that something that I can [mechanically] deduce from the question and/or its options? Or does it necessarily require an understanding of the underlying subject? (In layman's terms: does it require knowing and understanding what the widgets specifically are?)

Thank you and have a good weekend, everybody!

PS: regarding my second question above, I would like to somehow account for concentration of the answers. For example, if 5 households mentioned CRIME as the only problem, but 6 mentioned ACCESS TO SCHOOLS and ACCESS TO HOSPITALS (pairwise, selecting these two items at the same time), I would like to be sensitive to that and give the whole point for the only one selected item, and perhaps half a point if two items were selected. Is there a formal name of this approach/algorithm? (see the last column in the matrix displayed below)

Code:

clear all
input crime school hospital
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 1 1
0 1 1
0 1 1
0 1 1
0 1 1
0 1 1
end

program define mselect, rclass
    version 18.0
    syntax varlist // 2 or more dummy variables 1=YES, 0=NO

    foreach v in `varlist' {
        assert inlist(`v',0,1)
    }

    local n=`:word count `varlist''
    display `n'

    matrix M=J(`n',3,.)
    matrix rownames M=`varlist'

    local i=1
    local s=0
    foreach v in `varlist' {
        summarize `v', meanonly
        matrix M[`i',1]=r(mean)*100
        matrix M[`i++',2]=r(sum)
        local s=`s'+r(sum)
    }

    local i=1
    foreach v in `varlist' {
        summarize `v', meanonly
        matrix M[`i',2]=M[`i',2]/`s'*100
        local i=`i'+1
    }

    tempvar tmp
    egen `tmp'=rowtotal(`varlist')
    local i=1
    foreach v in `varlist' {
        tempvar tmpw
        generate `tmpw'=`v'/`tmp'
        summarize `tmpw', meanonly
        matrix M[`i++',3]=r(mean)*100
        drop `tmpw'
    }

    return matrix M=M
end

mselect crime school hospital
return list
matrix list r(M) , format(%6.2f)

Code:

r(M)[3,3]
              c1      c2      c3
   crime   45.45   29.41   45.45
  school   54.55   35.29   27.27
hospital   54.55   35.29   27.27
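
As a hand check of the three columns, the arithmetic can be sketched directly from the example counts (11 households: 5 selecting crime alone, 6 selecting school and hospital together). Nothing here goes beyond the example data:

```stata
* c1, percent of respondents: selections of an item / number of households
display %6.2f 5/11*100          // crime
display %6.2f 6/11*100          // school, hospital

* c2, percent of answers: selections of an item / all selections (5 + 6 + 6 = 17)
display %6.2f 5/17*100          // crime
display %6.2f 6/17*100          // school, hospital

* c3, fractional count: each household contributes 1/(number of items it selected)
display %6.2f (5*1)/11*100      // crime: 5 households, a full point each
display %6.2f (6*0.5)/11*100    // school, hospital: 6 households, half a point each
```

These reproduce the values in r(M). Note that c2 and c3 each sum to 100 across items (up to rounding), while c1 does not, because a household can select more than one item.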
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#2

30 Aug 2024, 17:42

"My second question is whether there is any other indicator that the researcher may derive from a multiple-select categorical question, if yes, please suggest."

One analytic approach I heard of some years ago was to treat the data as a vector of binary outcomes for each possible object, and then treat that binary vector as defining a multinomial response. That is, given (e.g.) 3 items that could be chosen the possible response categories would be: yes to 0 items; yes to item 1 only, yes to item 2 only, ..., yes to items 1 and 2, ..., yes to items 1 and 2 and 3. Then, each of the 2^(number of items) response patterns is treated as one possible response. Obviously, if there are 10 possible appliances, this is not a very helpful approach <grin>
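
For illustration only, here is a minimal Stata sketch of the response-pattern idea, reusing the three example variables from #1. The variable name pattern is made up here, and the sketch is not taken from the Agresti article:

```stata
clear all
* example data from #1, entered as two distinct rows with frequencies
input crime school hospital n
1 0 0 5
0 1 1 6
end
expand n      // replicate each row n times: 11 households in total
drop n

* one category per distinct combination of answers observed in the data
egen pattern = group(crime school hospital), label
tabulate pattern
```

Only the combinations that actually occur in the data become categories (two here, out of 2^3 = 8 possible). With covariates, which this example does not have, pattern could in principle serve as the outcome of a multinomial model such as mlogit.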

I think I read about this in:

Agresti, A., 1997. A model for repeated measurements of a multivariate binary response. Journal of the American Statistical Association, 92(437), pp.315-321.

(There's not a huge citation trail issuing from that article, which surprises me a bit. Perhaps it was something else by Agresti. I have a vague recollection that there was some discussion of the problem of large sets of response patterns, but I might be wrong.)

My sense is that the most useful approach will depend on the substantive context, but that there might be some basic ideas in that literature that could inform whatever approach is best. I hope this helps, but I fear it might not <grin>.
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

01 Sep 2024, 11:42

"Obviously, if there are 10 possible appliances, this is not a very helpful approach"

Obviously. And the source I am working with offers up to 200 selections from a catalogue of some 15,000.

But this hints at another result that I could possibly calculate: the correlation matrix between the individual choices.
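
A minimal sketch of such a correlation matrix, assuming the small example data from #1 rather than the real catalogue:

```stata
clear all
* example data from #1, entered as two distinct rows with frequencies
input crime school hospital n
1 0 0 5
0 1 1 6
end
expand n      // replicate each row n times: 11 households in total
drop n

* Pearson correlations of 0/1 indicators (the phi coefficient for each pair)
correlate crime school hospital
```

In this toy data school and hospital are always selected together, so their correlation is 1, and each has a correlation of -1 with crime. Real data would of course be less extreme.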

Still, my main interest is the third question above. Are the two described approaches always valid, and are the difficulties I describe a problem of the user doing the interpretation? Or is the problem rather with the metric itself, so that it shouldn't even be applied in some cases?

Thank you.
daniel klein

Join Date: Mar 2014

Posts: 3842
#4

02 Sep 2024, 01:37

I find your question(s) here uncharacteristically vague.

Originally posted by Sergiy Radyakin

I want to question whether both are ALWAYS valid?

Could you elaborate on your understanding of the term "valid" in this context, and perhaps provide a more rigorous formal definition of the term?

Originally posted by Sergiy Radyakin

Consider the following first example:
[...] and each customer can buy no more than one widget of a specific color.

This restriction does not fit well with the idea of multiple selects, does it?

Originally posted by Sergiy Radyakin

PS: regarding my second question above, I would like to somehow account for concentration of the answers. For example, if 5 households mentioned CRIME as the only problem, but 6 mentioned ACCESS TO SCHOOLS and ACCESS TO HOSPITALS (pairwise, selecting these two items at the same time), I would like to be sensitive to that and give the whole point for the only one selected item, and perhaps half a point if two items were selected. Is there a formal name of this approach/algorithm? (see the last column in the matrix displayed below)

Such an approach might be okay if you communicated very clearly what you did. More generally, I would question why a single answer should carry more weight than each of multiple answers. What exactly is the assumption here?

Overall: What exactly is the (research) question you want to answer?