  • Weighted average

    Hi,

    I am working with O*NET's task rating data (a sample is attached). Different people rated the frequency of each task differently. The report gives the percentage of these ratings in 7 different categories of task frequency (1=yearly, 7= hourly). I need one number for the frequency of each task, instead of 7 of them.
    I thought the best way is to make a weighted average for each task. But I am not sure if it's possible and if it's the best approach. The data includes standard error and confidence interval for each category. I think it's not possible to just average them.

    Any help would be much appreciated.

  • #2
    For the benefit of other readers, here is the basic data from your example, presented using CODE delimiters as requested in the Statalist FAQ linked to at the top of this page.
    Code:
    . list tasktid category percentage, sepby(tasktid)
    
         +-------------------------------+
         | tasktid   category   percen~e |
         |-------------------------------|
      1. |    8823          1       4.34 |
      2. |    8823          2       9.16 |
      3. |    8823          3      11.04 |
      4. |    8823          4      16.19 |
      5. |    8823          5      46.67 |
      6. |    8823          6       7.33 |
      7. |    8823          7       5.26 |
         |-------------------------------|
      8. |    8824          1       1.59 |
      9. |    8824          2      11.14 |
     10. |    8824          3      27.41 |
     11. |    8824          4      15.58 |
     12. |    8824          5      25.27 |
     13. |    8824          6      14.21 |
     14. |    8824          7       4.81 |
         +-------------------------------+
    I really don't see any way of summarizing these as a single number for each task in a meaningful way. How does one reconcile the 4.34% of the respondents who describe tasktid 8823 as yearly with the 5.26% who describe it as hourly? What does it mean to average "1 time a year" and "8,760 times a year"?

    To answer that question requires having a better idea of what use you would put this number to, once you obtained it. And perhaps some idea of how many different tasktid's you are concerned with (assuming the two you presented were just .



    • #3
      Thank you very much, William, for your response. Here is how I look at it: let's say teaching includes different tasks, one of which is grading (id = 8823). Teachers across the US rated how often they do this task, i.e., grading. 4.34% said they do it yearly or less (category 1), 9.16% said they do it more than once a year (category 2), and so on. My question is how often, on average, teachers engage in the task of grading. I thought I could multiply the percentage for each of the 7 rows by the corresponding category, add them, and divide the sum by 100 to get the weighted average for the task. This way, for example, for grading I get 4.34 as the weighted average of the frequency, which is between 4 and 5. Here is the definition of the categories:

      1 Yearly or less
      2 More than yearly
      3 More than monthly
      4 More than weekly
      5 Daily
      6 Several times daily
      7 Hourly or more

      So, I can conclude that teachers on average grade papers more than weekly but less than daily. Do you see a logical error in this?



      • #4
        Wikipedia has an illuminating discussion of the meaning of the word average, see it at https://en.wikipedia.org/wiki/Average .

        In this instance, I would not use the arithmetic mean, as you suggest, but rather the median, as my measure of central tendency. I would write that half the teachers grade papers several times a week or more, and half grade papers several times a week or less. And in the second population, the same thing holds: half the teachers perform that task several times a week or more, and half several times a week or less. Same conclusion you reached, but a different and I think more defensible path, and expressed in a way that makes it clear that we're looking at the median. Or for that matter, I might say that the median frequency of grading papers is several times a week, but less than daily, if I thought my audience understood the concept of the median.



        • #5
          For your information weighted means are employed in situations like this where we have Likert scales with frequencies as weights. You can find more information here: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean
          I hope there is someone else in this forum to help me with my specific question.



          • #6
            William gave good advice. Further, the scale you are using is not like the usual kind of opinion scale such as strongly agree 1 2 3 neutral 4 5 strongly disagree where people are presented with a scale that can be construed as equally spaced in the absence of further information (and even that is controversial and rejected by many researchers). The categories are self-evidently not equally spaced, but at best just weakly ordered (I can't see that the distinction between 1 and 2 and between 6 and 7 is clear either).

            An extra option with this kind of data is correspondence analysis, which produces scores that are themselves calculated from the pattern of frequencies.



            • #7
              I hope there is someone else in this forum to help me with my specific question.
              It is not quite clear what your specific question is.

              I thought the best way is to make a weighted average for each task. But I am not sure if it's possible and if it's the best approach.
              It is possible. Apparently neither I nor Nick agree that it's the best approach.

              I thought I could multiply the percentage for each of the 7 rows by the corresponding category, add them, and divide the sum by 100 to get the weighted average for the task.
              Yes, that would give you a weighted average of the category numbers. Each of your percentages is just the frequency for that category, divided by the total frequency for all categories, and then multiplied by 100. The actual total frequency doesn't matter; it cancels out in the mathematics. Just go through the mathematics symbolically, converting your percentages to frequencies using, say, N to represent the total frequency, so that 4.34% becomes 0.0434*N, and you'll see that this is correct.
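The cancellation can be written out in one line. The symbols are introduced here only for illustration: c_k is the category code, f_k the (unreported) frequency in category k, N the sum of the f_k, and p_k = 100 f_k / N the reported percentage.

```latex
\[
\frac{1}{100}\sum_{k=1}^{7} c_k\,p_k
  = \frac{1}{100}\sum_{k=1}^{7} c_k\,\frac{100\,f_k}{N}
  = \frac{\sum_{k=1}^{7} c_k\,f_k}{\sum_{k=1}^{7} f_k}
\]
```

The right-hand side is the ordinary frequency-weighted mean of the category codes, so N never needs to be known.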

              If your question is, how do I do this in Stata, then you can easily code exactly the procedure you described, but consider this simpler alternative.
              Code:
              . destring category, generate(catn)
              category has all characters numeric; catn generated as byte
              
              . collapse catn [aweight=percentage], by(tasktid)
              
              . list
              
                   +---------------------+
                   | tasktid        catn |
                   |---------------------|
                1. |    8823   4.3473349 |
                2. |    8824   4.1365862 |
                   +---------------------+
              
              .
              From that I conclude that the average (arithmetic mean) category assigned to task 8823 is about 4.35.
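As a cross-check outside Stata, the same arithmetic can be sketched in Python. This is only an illustration of the calculation, not a replacement for the collapse command; the function name is made up for the example. Dividing by the sum of the percentages (99.99 and 100.01 here, because of rounding) rather than by exactly 100 appears to be what reproduces the collapse results above.

```python
# Percentages for categories 1..7, copied from the listing in #2.
percentages = {
    8823: [4.34, 9.16, 11.04, 16.19, 46.67, 7.33, 5.26],
    8824: [1.59, 11.14, 27.41, 15.58, 25.27, 14.21, 4.81],
}


def weighted_mean_category(weights):
    """Mean of the category codes 1..len(weights), weighted by `weights`."""
    categories = range(1, len(weights) + 1)
    total = sum(weights)  # not exactly 100 because the percentages are rounded
    return sum(c * w for c, w in zip(categories, weights)) / total


for task, pcts in percentages.items():
    print(task, round(weighted_mean_category(pcts), 4))
# prints:
# 8823 4.3473
# 8824 4.1366
```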

              So, I can conclude that teachers on average grade papers more than weekly but less than daily. Do you see a logical error in this?
              Yes, I do see a logical error, as I discussed and Nick elaborated on. I would add, again, that you have given us no idea of what use you intend to put the resulting "average" to. Also, Likert scales seem to bear no relation to your data, if I correctly understand https://en.wikipedia.org/wiki/Likert_scale.
              Last edited by William Lisowski; 04 Jul 2015, 21:57.



              • #8
                I'd draw attention also to iquantile (SSC). As the detailed argument in its help file is of some relevance here, I will quote at length.

                Code:
                Interpolated quantiles
                
                
                Syntax
                
                        iquantile varlist [if] [in] [weight] [ , by(byvarlist) format(format)
                                  p(numlist) list_options ]
                
                    fweights and aweights are allowed.
                
                
                Description
                
                    iquantile calculates and displays quantiles estimated by linear
                    interpolation in the mid-distribution function. The user may specify one
                    or more numeric variables, one or more grouping variables and one or more
                    quantiles.
                
                
                Remarks 
                
                    By quantiles here are meant those summaries defined by the fact that some
                    percent of a batch of values is fewer.  Thus the median (50%) and the
                    quartiles (25% and 75%) are examples. Most commands in Stata that
                    calculate such summaries select particular sample values or at most
                    average two sample values. That is often sufficient for the purpose
                    intended. iquantile offers an alternative, which is perhaps most useful
                    when the number of distinct values is small. For example, although the
                    variable in question may be measured coarsely, say on an integer scale,
                    and many ties may be observed, it may be hoped or imagined that a property
                    on a continuous scale lies beneath. Note that iquantile performs no white
                    magic, just elementary linear interpolation.
                
                    The cumulative probability is here defined as
                
                        SUM counts of values below + (1/2) count of this value
                        ------------------------------------------------------.
                                       SUM counts of all values
                                   
                    With terminology from Tukey (1977, 496-497), this could be called a `split
                    fraction below'. It is also a `ridit' as defined by Bross (1958):  see
                    also Fleiss et al. (2003, 198-205) or Flora (1988).  Yet again, it is also
                    the mid-distribution function of Parzen (1993, 3295) and the grade
                    function of Haberman (1996, 240-241). Parzen's term appears best for the
                    purposes of this command. The numerator is a `split count'. Using this
                    numerator, rather than
                
                        SUM counts of values below 
                
                    or
                
                        SUM counts of values below + count of this value, 
                        
                    treats distributions symmetrically. For applications to plotting ordinal
                    categorical data, see Cox (2004).
                
                    The technique used in iquantile is illustrated by a worked example using
                    Mata calculator-style. We first enter the data as values and frequencies:
                
                        : y = 2, 3, 4, 5
                
                        : f = 2, 9, 8, 8
                
                    Then we can work out the cumulative frequencies:
                
                        : runningsum(f)
                                1    2    3    4
                            +---------------------+
                          1 |   2   11   19   27  |
                            +---------------------+
                
                    Subtract half the frequencies and get the cumulative proportions,
                    symmetrically considered, i.e. the mid-distribution function:
                
                        : runningsum(f) :- f/2
                                 1     2     3     4
                            +-------------------------+
                          1 |    1   6.5    15    23  |
                            +-------------------------+
                
                        : (runningsum(f) :- f/2) / 27
                                         1             2             3             4
                            +---------------------------------------------------------+
                          1 |   .037037037   .2407407407   .5555555556   .8518518519  |
                            +---------------------------------------------------------+
                
                        : cup = (runningsum(f) :- f/2) / 27
                
                    To get the median, we need to interpolate between the 2nd and 3rd values
                    of y.
                
                        : y[2] + (y[3] - y[2]) * (0.5 - cup[2]) / (cup[3] - cup[2])
                          3.823529412
                
                    iquantile uses list to show results.
                
                    iquantile issues a warning if any quantile was calculated by
                    extrapolation, i.e. it lies in one or other tail of the distribution
                    beyond the observed mid-distribution function. Such results should be
                    treated with extreme caution.
                
                    If the data consist of a single distinct value, then exactly that value is
                    always returned as a quantile.
                
                    iquantile uses Mata for its innermost calculations.  Thus Stata 9 up is
                    required.
                
                
                Options 
                
                    by() specifies that calculations are to be carried out separately for the
                        distinct groups defined by byvarlist. The variable(s) in byvarlist may
                        be numeric or string.
                
                    format() specifies a numeric format to be used to display the quantiles.
                        This option has no lasting effect.
                
                    p() specifies a numlist of integers between 1 and 99 to indicate the p%
                        quantiles. If p() is not specified, it defaults to 50, i.e. the 50%
                        point or median is calculated.  p(25(25)75) specifies the median and
                        quartiles.
                
                    list_options are options of list other than noobs and subvarname. They may
                        be specified to tune the display of quantiles.
                
                
                Examples
                
                    . iquantile mpg
                    . iquantile mpg, p(25 50 70)
                    . iquantile mpg, p(25 50 70) format(%2.1f)
                    . iquantile mpg, p(25 50 70) format(%2.1f) by(rep78)
                    . iquantile mpg weight price
                
                
                Saved results 
                
                    Saved results are best explained by example. After iquantile mpg, two
                    results are saved, r(mpg_50_1) and r(mpg_50_1_epolate).  The elements of
                    the name for both are first, the variable name (if necessary, abbreviated
                    to 16 characters); second, the percent defining the quantile; third, the
                    number of the group in question in the observations processed (here, the
                    first of one). The extra flag epolate indicates whether extrapolation was
                    needed (1 for true, 0 for false).
                
                
                Author 
                
                    Nicholas J. Cox, Durham University, UK
                    [email protected]
                
                
                Acknowledgments 
                
                    This command grew out of a thread on Statalist started by Taggert J.
                        Brooks. See http://www.stata.com/statalist/archive/2009-01/msg00652.html
                
                
                References
                
                    Bross, I. D. J. 1958. How to use ridit analysis. Biometrics 14: 38-58.
                
                    Cox, N. J. 2004. Speaking Stata: Graphing categorical and compositional
                        data. Stata Journal 4(2): 190-215.  See Section 5.
                        http://www.stata-journal.com/sjpdf.html?articlenum=gr0004
                
                    Fleiss, J. L., B. Levin, and M. C. Paik. 2003.  Statistical Methods for
                        Rates and Proportions.  Hoboken, NJ: Wiley.
                
                    Flora, J. D. 1988. Ridit analysis. In Encyclopedia of Statistical
                        Sciences, ed. S. Kotz and N. L. Johnson, (8) 136-139.  New York:
                        Wiley.
                
                    Haberman, S. J. 1996.  Advanced Statistics Volume I: Description of
                        Populations.  New York: Springer.
                
                    Parzen, E. 1993. Change PP plot and continuous sample quantile function.
                        Communications in Statistics -Theory and Methods 22: 3287-3304.
                
                    Tukey, J. W. 1977. Exploratory Data Analysis.  Reading, MA:
                        Addison-Wesley.
                
                
                Also see
                
                    help for summarize, centile, pctile, tabstat, hdquantile (if installed)
                Last edited by Nick Cox; 05 Jul 2015, 03:47.
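The interpolation described in the help file can be sketched in a few lines of Python. This is a minimal reading of the worked example above, not the iquantile source: the function name is hypothetical, it assumes sorted values with positive weights, and it does not attempt the extrapolation case that iquantile warns about.

```python
def interpolated_quantile(values, freqs, p=0.5):
    """Quantile by linear interpolation in the mid-distribution function.

    `values` must be sorted ascending; `freqs` are the matching positive
    counts or weights.
    """
    total = sum(freqs)
    below = 0.0
    mid = []  # (count below + half of this count) / total, for each value
    for f in freqs:
        mid.append((below + f / 2) / total)
        below += f
    # Find the neighbouring pair whose mid-distribution values bracket p.
    for i in range(len(values) - 1):
        if mid[i] <= p <= mid[i + 1]:
            frac = (p - mid[i]) / (mid[i + 1] - mid[i])
            return values[i] + (values[i + 1] - values[i]) * frac
    raise ValueError("p lies outside the observed mid-distribution function")


# Worked example from the help file: y = 2, 3, 4, 5 and f = 2, 9, 8, 8.
print(interpolated_quantile([2, 3, 4, 5], [2, 9, 8, 8]))

# The two tasks from #2, with the percentages used as weights.
cats = [1, 2, 3, 4, 5, 6, 7]
print(interpolated_quantile(cats, [4.34, 9.16, 11.04, 16.19, 46.67, 7.33, 5.26]))
print(interpolated_quantile(cats, [1.59, 11.14, 27.41, 15.58, 25.27, 14.21, 4.81]))
```

Under these assumptions the first call agrees with the 3.823529412 shown in the help file, and the interpolated medians for tasks 8823 and 8824 come out at roughly 4.55 and 4.10, both between categories 4 and 5.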



                • #9
                  Monica,

                  Depending on your data and your understanding of them you could map your ordinal values to interval ones. Here is an example you could tweak:
                  Code:
                  #d ;
                  recode oldvar (1=  1  "Yearly or less")
                                (2=  5  "More than yearly")
                                (3= 18  "More than monthly")
                                (4=  6  "More than weekly")
                                (5= 365 "Daily")
                                (6=1095 "Several times daily")
                                (7=8760 "Hourly or more"), gen(newvar);
                  #d cr
                  Best,
                  Alan
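To see what such a mapping does to the mean, here is a small Python sketch using the times-per-year values exactly as posted in the recode example. Those values are illustrative placeholders meant to be tweaked, and the function name is made up for this example.

```python
# Times-per-year values copied from the recode example above; they are
# illustrative placeholders, not established conversions.
times_per_year = {1: 1, 2: 5, 3: 18, 4: 6, 5: 365, 6: 1095, 7: 8760}

# Percentages for task 8823, categories 1..7, from the listing in #2.
pct_8823 = [4.34, 9.16, 11.04, 16.19, 46.67, 7.33, 5.26]


def mean_on_interval_scale(pcts, mapping):
    """Weighted mean of the mapped values, and the top category's share of it."""
    contributions = [mapping[c] * w for c, w in enumerate(pcts, start=1)]
    total = sum(contributions)
    mean = total / sum(pcts)  # normalize by the sum of the percentages
    top_share = contributions[-1] / total
    return mean, top_share


mean, top_share = mean_on_interval_scale(pct_8823, times_per_year)
print(round(mean), round(top_share, 2))
```

With these particular values the mean is roughly 715 times a year, and about 64% of it comes from the 5.26% of respondents in the "Hourly or more" category. The result is therefore very sensitive to the value chosen for the top category.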



                  • #10
                    Thank you very much everyone for your help. I really appreciate it.
