  • Multiple-select questions terminology


    Dear All,

    I would like to ask a question related more to terminology than Stata, but I hope it is all right to benefit from the collective wisdom of the Stata users.

    The question concerns the analysis of data collected through multiple-select categorical questions.

    The following web page (cited here not because of any particular reputation or significance, but because of its relevance, simple terms, and illustration) mentions two approaches, which it calls:
    - percent of respondents;
    - percent of answers.
    https://resources.pollfish.com/pollf...ion-questions/

    1. My first question here is whether this terminology is standard/established/intuitive, or whether there are more common/historical/etc names for the same indicators?

    2. My second question is whether there is any other indicator that the researcher may derive from a multiple-select categorical question, if yes, please suggest.

    3. My main (third) question is this: although I agree that both can make sense and be useful in SOME situations, I want to question whether both are ALWAYS valid?

    Consider the following first example:
    The company is selling widgets, and the customers can buy [otherwise identical] widgets that are painted in the color of the customer's choice, and each customer can buy no more than one widget of a specific color. Then, with the results of the multiple-select question "What colors are your widgets?", we can answer BOTH what percentage of the respondents have a widget of a specific color (for example, if we want to decide which color will resonate most in an ad) and what percentage of the widgets are painted each color (for example, if we want to trim the offering of colors and reduce the number of paint buckets).

    But consider the following second example:
    The household is asked what appliances they have in their possession: say, refrigerator, microwave, radio, etc. I have a feeling that it is incorrect to add all appliances together and say that 10% of all appliances are refrigerators, since they are very different from, e.g., radios (in terms of costs, or uses, or possibly the utility derived from their use). Furthermore, if the number of devices is not asked, but just the fact of their presence (as is common with such multiple-selection questions), then I could be drawing invalid inferences from the second indicator (percent of answers), or I should at least be cautious with the wording around it, as it loses interpretability rather quickly.

    Is my intuition correct here? And if so, what are the requirements for the second approach to make sense? And is that something that I can [mechanically] deduce from the question and/or its options? Or does it necessarily require an understanding of the underlying subject (in layman's terms: knowing and understanding what the widgets specifically are)?

    Thank you and have a good weekend, everybody!

    PS: regarding my second question above, I would like to somehow account for concentration of the answers. For example, if 5 households mentioned CRIME as the only problem, but 6 mentioned ACCESS TO SCHOOLS and ACCESS TO HOSPITALS (pairwise, selecting these two items at the same time), I would like to be sensitive to that and give the whole point for the only one selected item, and perhaps half a point if two items were selected. Is there a formal name of this approach/algorithm? (see the last column in the matrix displayed below)



    Code:
    clear all
    input crime school hospital
    1 0 0
    1 0 0
    1 0 0
    1 0 0
    1 0 0
    0 1 1
    0 1 1
    0 1 1
    0 1 1
    0 1 1
    0 1 1
    end
    
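    program define mselect, rclass
        version 18.0
        syntax varlist(min=2 numeric) // 2 or more dummy variables 1=YES, 0=NO
        
        // every variable must be strictly 0/1 (missing values fail this check)
        foreach v in `varlist' {
            assert inlist(`v',0,1)
        }
        
        local n=`:word count `varlist''
        display `n'
        // M columns: 1 = percent of respondents, 2 = percent of answers,
        //            3 = fractional credit (1/k per item when k items are selected)
        matrix M=J(`n',3,.)
        matrix rownames M=`varlist'

        // pass 1: percent of respondents; raw counts go into column 2 for now
        local i=1
        local s=0
        foreach v in `varlist' {
            summarize `v', meanonly
            matrix M[`i',1]=r(mean)*100
            matrix M[`i++',2]=r(sum)
            local s=`s'+r(sum)        
        }
    
        // pass 2: rescale the counts in column 2 by the total number of selections
        forvalues i=1/`n' {
            matrix M[`i',2]=M[`i',2]/`s'*100
        }
    
        // pass 3: each respondent's single point is split equally across selected items
        // (a respondent with no selections gives 0/0 = missing and drops out of the mean)
        tempvar tmp
        egen `tmp'=rowtotal(`varlist')
        local i=1
        foreach v in `varlist' {
            tempvar tmpw
            generate `tmpw'=`v'/`tmp'
            summarize `tmpw', meanonly
            matrix M[`i++',3]=r(mean)*100
            drop `tmpw'
        }
    
        return matrix M=M
    end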
    
    mselect crime school hospital
    return list
    matrix list r(M) , format(%6.2f)
    Code:
    r(M)[3,3]
                 c1     c2     c3
       crime  45.45  29.41  45.45
      school  54.55  35.29  27.27
    hospital  54.55  35.29  27.27
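
    As a hand check on the matrix above, the three columns can be reproduced with plain arithmetic. This is only a sketch that takes the counts from the example data as given: 11 households, 5 selecting crime alone and 6 selecting both school and hospital, so 17 selections in total.

    ```stata
    * Column 1, percent of respondents: households selecting the item / all households
    display %6.2f 5/11*100          // crime
    display %6.2f 6/11*100          // school (hospital is the same)

    * Column 2, percent of answers: selections of the item / all selections (5 + 6 + 6 = 17)
    display %6.2f 5/17*100          // crime
    display %6.2f 6/17*100          // school (hospital is the same)

    * Column 3, fractional credit: each household splits one point across its selections
    display %6.2f (5*1)/11*100      // crime: 5 households give a full point each
    display %6.2f (6*0.5)/11*100    // school: 6 households give half a point each
    ```

    The third column sums to 100 across items here (up to rounding) because every household selected at least one item. A household with no selections would give 0/0, which Stata evaluates to missing, so summarize would drop it from the third column while it still counts in the denominator of the first.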

  • #2
    "My second question is whether there is any other indicator that the researcher may derive from a multiple-select categorical question, if yes, please suggest."

    One analytic approach I heard of some years ago was to treat the data as a vector of binary outcomes for each possible object, and then treat that binary vector as defining a multinomial response. That is, given (e.g.) 3 items that could be chosen the possible response categories would be: yes to 0 items; yes to item 1 only, yes to item 2 only, ..., yes to items 1 and 2, ..., yes to items 1 and 2 and 3. Then, each of the 2^(number of items) response patterns is treated as one possible response. Obviously, if there are 10 possible appliances, this is not a very helpful approach <grin>

    I think I read about this in:

    Agresti, A., 1997. A model for repeated measurements of a multivariate binary response. Journal of the American Statistical Association, 92(437), pp.315-321.

    (There's not a huge citation trail issuing from that article, which surprises me a bit. Perhaps it was something else by Agresti. I have a vague recollection that there was some discussion of the problem of large sets of response patterns, but I might be wrong.)

    My sense is that the most useful approach will depend on the substantive context, but that there might be some basic ideas in that literature that could inform whatever approach is best. I hope this helps, but I fear it might not <grin>.
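
    As a rough illustration of the response-pattern idea described above (only the tabulation of patterns, not the model in the Agresti paper), something like the following could be used. The data are the 11 example households from the first post, rebuilt here from frequencies; the variable names n and pattern are made up for this sketch.

    ```stata
    clear
    input crime school hospital n
    1 0 0 5
    0 1 1 6
    end
    expand n        // one row per household
    drop n

    * One string per household, e.g. "100" = crime only, "011" = school and hospital
    egen pattern = concat(crime school hospital)

    * Each distinct string is one category of the multinomial response
    tabulate pattern
    ```

    With 3 items there are at most 2^3 = 8 patterns, and only 2 of them occur in this example. With many items the number of possible patterns grows too fast for a table like this to be readable, which is the limitation noted above.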






    • #3
      Obviously, if there are 10 possible appliances, this is not a very helpful approach
      Obviously. And the source I am working with offers up to 200 selections from a catalogue of some 15,000.

      But this hints at another result that I could possibly calculate: the correlation matrix between the individual choices.
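
      A minimal sketch of that calculation, assuming the same 11 example households as in the first post (the names n and C are made up here):

      ```stata
      clear
      input crime school hospital n
      1 0 0 5
      0 1 1 6
      end
      expand n        // one row per household
      drop n

      * Pearson correlations between the 0/1 indicators (phi coefficients)
      correlate crime school hospital

      * Co-occurrence counts: X'X for 0/1 variables has the item counts on the
      * diagonal and the number of households selecting both items off the diagonal
      matrix accum C = crime school hospital, noconstant
      matrix list C
      ```

      In this example school and hospital are always selected together, so their correlation is 1, and crime is never selected with either of them, so its correlation with each is -1. For a catalogue of some 15,000 items the full matrix would be very large, so restricting it to a subset of items is probably necessary.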

      Still, my main interest is the 3rd question above. Are the two described approaches always valid, with the difficulties I describe being a problem of the user doing the interpretation? Or is the problem rather with the metric itself, so that it should not even be applied in some cases?

      Thank you.



      • #4
        I find your question(s) here uncharacteristically vague.

        Originally posted by Sergiy Radyakin
        I want to question whether both are ALWAYS valid?
        Could you elaborate on your understanding of the term "valid" in this context, and perhaps provide a more rigorous formal definition of the term?


        Originally posted by Sergiy Radyakin
        Consider the following first example:
        [...] and each customer can buy no more than one widget of a specific color.
        This restriction does not fit well with the idea of multiple selects, does it?


        Originally posted by Sergiy Radyakin
        PS: regarding my second question above, I would like to somehow account for concentration of the answers. For example, if 5 households mentioned CRIME as the only problem, but 6 mentioned ACCESS TO SCHOOLS and ACCESS TO HOSPITALS (pairwise, selecting these two items at the same time), I would like to be sensitive to that and give the whole point for the only one selected item, and perhaps half a point if two items were selected. Is there a formal name of this approach/algorithm? (see the last column in the matrix displayed below)
        Such an approach might be okay if you communicated very clearly what you did. More generally, I would question why a single answer should carry more weight than each of multiple answers. What exactly is the assumption here?

        Overall: What exactly is the (research) question you want to answer?

