Weighted average

Monica Muller

Join Date: Jul 2014

Posts: 226
#1


03 Jul 2015, 21:25

Hi,

I am working with O*NET's task rating data (a sample is attached). Different people rated the frequency of each task differently. The report gives the percentage of these ratings in 7 different categories of task frequency (1=yearly, 7= hourly). I need one number for the frequency of each task, instead of 7 of them.
I thought the best way is to make a weighted average for each task. But I am not sure if it's possible and if it's the best approach. The data includes standard error and confidence interval for each category. I think it's not possible to just average them.

Any help would be much appreciated.
Attached Files

statalist.dta (2.5 KB, 1 view)
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

04 Jul 2015, 08:57

For the benefit of other readers, here is the basic data from your example, presented using CODE delimiters as requested in the Statalist FAQ linked to at the top of this page.

Code:

. list tasktid category percentage, sepby(tasktid)

     +-------------------------------+
     | tasktid   category   percen~e |
     |-------------------------------|
  1. |    8823          1       4.34 |
  2. |    8823          2       9.16 |
  3. |    8823          3      11.04 |
  4. |    8823          4      16.19 |
  5. |    8823          5      46.67 |
  6. |    8823          6       7.33 |
  7. |    8823          7       5.26 |
     |-------------------------------|
  8. |    8824          1       1.59 |
  9. |    8824          2      11.14 |
 10. |    8824          3      27.41 |
 11. |    8824          4      15.58 |
 12. |    8824          5      25.27 |
 13. |    8824          6      14.21 |
 14. |    8824          7       4.81 |
     +-------------------------------+

I really don't see any way of summarizing these as a single number for each task in a meaningful way. How does one reconcile the 4.34% of the respondents who describe tasktid 8823 as yearly with the 5.26% who describe it as hourly? What does it mean to average "1 time a year" and "8,760 times a year"?

To answer that question requires having a better idea of what use you would put this number to, once you obtained it. And perhaps some idea of how many different tasktid's you are concerned with (assuming the two you presented were just .
Monica Muller

Join Date: Jul 2014

Posts: 226
#3

04 Jul 2015, 12:41

Thank you very much, William, for your response. Here is how I look at it. Let's say teaching includes different tasks, one of which is grading (id = 8823). Teachers across the US rated how often they do this task. 4.34% said they do it yearly or less (category 1), 9.16% said they do it more than yearly (category 2), and so on. My question is how often, on average, teachers engage in the task of grading. I thought I could multiply the percentage for each of the 7 rows by the corresponding category, add them, and divide the sum by 100 to get the weighted average for the task. For grading, for example, this gives about 4.35 as the weighted average of the frequency, which is between 4 and 5. Here is the definition of the categories:

1 Yearly or less
2 More than yearly
3 More than monthly
4 More than weekly
5 Daily
6 Several times daily
7 Hourly or more

So, I can conclude that teachers on average grade papers more than weekly but less than daily. Do you see a logical error in this?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

04 Jul 2015, 14:39

Wikipedia has an illuminating discussion of the meaning of the word average, see it at https://en.wikipedia.org/wiki/Average .

In this instance, I would not use the arithmetic mean, as you suggest, but rather the median, as my measure of central tendency. I would write that half the teachers grade papers several times a week or more, and half grade papers several times a week or less. And in the second population, the same thing holds: half the teachers perform that task several times a week or more, and half several times a week or less. Same conclusion you reached, but a different and I think more defensible path, and expressed in a way that makes it clear that we're looking at the median. Or for that matter, I might say that the median frequency of grading papers is several times a week, but less than daily, if I thought my audience understood the concept of the median.
Monica Muller

Join Date: Jul 2014

Posts: 226
#5

04 Jul 2015, 16:27

For your information, weighted means are employed in situations like this, where we have Likert scales with frequencies as weights. You can find more information here: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean
I hope there is someone else in this forum to help me with my specific question.
Nick Cox

Join Date: Mar 2014

Posts: 35694
#6

04 Jul 2015, 17:36

William gave good advice. Further, the scale you are using is not like the usual kind of opinion scale such as strongly agree 1 2 3 neutral 4 5 strongly disagree where people are presented with a scale that can be construed as equally spaced in the absence of further information (and even that is controversial and rejected by many researchers). The categories are self-evidently not equally spaced, but at best just weakly ordered (I can't see that the distinction between 1 and 2 and between 6 and 7 is clear either).

An extra option with this kind of data is correspondence analysis, which produces scores that are themselves calculated from the pattern of frequencies.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

04 Jul 2015, 21:46

"I hope there is someone else in this forum to help me with my specific question."

It is not quite clear what your specific question is.

"I thought the best way is to make a weighted average for each task. But I am not sure if it's possible and if it's the best approach."

It is possible. Apparently neither I nor Nick agree that it's the best approach.

"I thought I could multiply the percentage for each of the 7 rows by the corresponding category, add them, and divide the sum by 100 to get the weighted average for the task."

Yes, that would give you a weighted average of the category numbers. Each of your percentages is just the frequency for that category, divided by the total frequency for all categories, and then multiplied by 100. The actual total frequency doesn't matter; it cancels out in the mathematics. Just go through the mathematics symbolically, converting your percentages to frequencies using, say, N to represent the total frequency, so that 4.34% becomes 0.0434*N, and you'll see that this is correct.
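Written out symbolically, the cancellation looks like this. The symbols n_k (the number of respondents choosing category k) and p_k (the reported percentage for category k) are notation introduced here for illustration only:

```latex
\[
p_k = 100\,\frac{n_k}{N}, \qquad N = \sum_{k=1}^{7} n_k
\]
\[
\bar{k}
  \;=\; \frac{\sum_{k=1}^{7} k\,n_k}{N}
  \;=\; \frac{\sum_{k=1}^{7} k\,\frac{p_k}{100}\,N}{N}
  \;=\; \frac{1}{100}\sum_{k=1}^{7} k\,p_k
\]
```

For tasktid 8823 the sum of k times p_k is 434.69, so the mean category is about 4.35. Because the rounded percentages add up to 99.99 rather than 100, dividing by their actual sum gives 4.3473, which is the figure collapse reports.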

If your question is, how do I do this in Stata, then you can easily code exactly the procedure you described, but consider this simpler alternative.

Code:

. destring category, generate(catn)
category has all characters numeric; catn generated as byte

. collapse catn [aweight=percentage], by(tasktid)

. list

     +---------------------+
     | tasktid        catn |
     |---------------------|
  1. |    8823   4.3473349 |
  2. |    8824   4.1365862 |
     +---------------------+

From that I conclude that the average (arithmetic mean) category assigned to task 8823 is about 4.35.
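For comparison, the multiply, add, and divide procedure described in #3 can also be coded literally. This is a minimal sketch, assuming the data from #2 with the category already numeric; the variable names product, sumprod, sumpct, and wmean are made up for illustration:

```stata
clear
input long tasktid byte catn double percentage
8823 1  4.34
8823 2  9.16
8823 3 11.04
8823 4 16.19
8823 5 46.67
8823 6  7.33
8823 7  5.26
8824 1  1.59
8824 2 11.14
8824 3 27.41
8824 4 15.58
8824 5 25.27
8824 6 14.21
8824 7  4.81
end

* category code times its percentage, row by row
generate double product = catn * percentage

* per-task sums of the products and of the percentages
bysort tasktid: egen double sumprod = total(product)
bysort tasktid: egen double sumpct  = total(percentage)

* divide by the actual sum of the percentages rather than by 100,
* because the posted percentages are rounded (they sum to 99.99 and 100.01)
generate double wmean = sumprod / sumpct

* keep one line per task and show the result
bysort tasktid: keep if _n == 1
list tasktid wmean, noobs
```

On the posted data this reproduces the collapse figures (about 4.347 and 4.137). Dividing by exactly 100 instead changes them only in the fourth decimal place.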

"So, I can conclude that teachers on average grade papers more than weekly but less than daily. Do you see a logical error in this?"

Yes, I do see a logical error, as I discussed and Nick elaborated on. I would add, again, that you have given us no idea of what use you intend to put the resulting "average" to. Also, Likert scales seem to bear no relation to your data, if I correctly understand https://en.wikipedia.org/wiki/Likert_scale.

Last edited by William Lisowski; 04 Jul 2015, 21:57.

Nick Cox

Join Date: Mar 2014
Posts: 35694

05 Jul 2015, 03:44

I'd draw attention also to iquantile (SSC). As the detailed argument in its help file is of some relevance here, I will quote at length.

Code:

Interpolated quantiles


Syntax

        iquantile varlist [if] [in] [weight] [ , by(byvarlist) format(format)
                  p(numlist) list_options ]

    fweights and aweights are allowed.


Description

    iquantile calculates and displays quantiles estimated by linear
    interpolation in the mid-distribution function. The user may specify one
    or more numeric variables, one or more grouping variables and one or more
    quantiles.


Remarks 

    By quantiles here are meant those summaries defined by the fact that some
    percent of a batch of values is fewer.  Thus the median (50%) and the
    quartiles (25% and 75%) are examples. Most commands in Stata that
    calculate such summaries select particular sample values or at most
    average two sample values. That is often sufficient for the purpose
    intended. iquantile offers an alternative, which is perhaps most useful
    when the number of distinct values is small. For example, although the
    variable in question may be measured coarsely, say on an integer scale,
    and many ties may be observed, it may be hoped or imagined that a property
    on a continuous scale lies beneath. Note that iquantile performs no white
    magic, just elementary linear interpolation.

    The cumulative probability is here defined as

        SUM counts of values below + (1/2) count of this value
        ------------------------------------------------------.
                       SUM counts of all values
                   
    With terminology from Tukey (1977, 496-497), this could be called a `split
    fraction below'. It is also a `ridit' as defined by Bross (1958):  see
    also Fleiss et al. (2003, 198-205) or Flora (1988).  Yet again, it is also
    the mid-distribution function of Parzen (1993, 3295) and the grade
    function of Haberman (1996, 240-241). Parzen's term appears best for the
    purposes of this command. The numerator is a `split count'. Using this
    numerator, rather than

        SUM counts of values below 

    or

        SUM counts of values below + count of this value, 
        
    treats distributions symmetrically. For applications to plotting ordinal
    categorical data, see Cox (2004).

    The technique used in iquantile is illustrated by a worked example using
    Mata calculator-style. We first enter the data as values and frequencies:

        : y = 2, 3, 4, 5

        : f = 2, 9, 8, 8

    Then we can work out the cumulative frequencies:

        : runningsum(f)
                1    2    3    4
            +---------------------+
          1 |   2   11   19   27  |
            +---------------------+

    Subtract half the frequencies and get the cumulative proportions,
    symmetrically considered, i.e. the mid-distribution function:

        : runningsum(f) :- f/2
                 1     2     3     4
            +-------------------------+
          1 |    1   6.5    15    23  |
            +-------------------------+

        : (runningsum(f) :- f/2) / 27
                         1             2             3             4
            +---------------------------------------------------------+
          1 |   .037037037   .2407407407   .5555555556   .8518518519  |
            +---------------------------------------------------------+

        : cup = (runningsum(f) :- f/2) / 27

    To get the median, we need to interpolate between the 2nd and 3rd values
    of y.

        : y[2] + (y[3] - y[2]) * (0.5 - cup[2]) / (cup[3] - cup[2])
          3.823529412

    iquantile uses list to show results.

    iquantile issues a warning if any quantile was calculated by
    extrapolation, i.e. it lies in one or other tail of the distribution
    beyond the observed mid-distribution function. Such results should be
    treated with extreme caution.

    If the data consist of a single distinct value, then exactly that value is
    always returned as a quantile.

    iquantile uses Mata for its innermost calculations.  Thus Stata 9 up is
    required.


Options 

    by() specifies that calculations are to be carried out separately for the
        distinct groups defined by byvarlist. The variable(s) in byvarlist may
        be numeric or string.

    format() specifies a numeric format to be used to display the quantiles.
        This option has no lasting effect.

    p() specifies a numlist of integers between 1 and 99 to indicate the p%
        quantiles. If p() is not specified, it defaults to 50, i.e. the 50%
        point or median is calculated.  p(25(25)75) specifies the median and
        quartiles.

    list_options are options of list other than noobs and subvarname. They may
        be specified to tune the display of quantiles.


Examples

    . iquantile mpg
    . iquantile mpg, p(25 50 70)
    . iquantile mpg, p(25 50 70) format(%2.1f)
    . iquantile mpg, p(25 50 70) format(%2.1f) by(rep78)
    . iquantile mpg weight price


Saved results 

    Saved results are best explained by example. After iquantile mpg, two
    results are saved, r(mpg_50_1) and r(mpg_50_1_epolate).  The elements of
    the name for both are first, the variable name (if necessary, abbreviated
    to 16 characters); second, the percent defining the quantile; third, the
    number of the group in question in the observations processed (here, the
    first of one). The extra flag epolate indicates whether extrapolation was
    needed (1 for true, 0 for false).


Author 

    Nicholas J. Cox, Durham University, UK


Acknowledgments 

    This command grew out of a thread on Statalist started by Taggert J.
        Brooks. See http://www.stata.com/statalist/archive/2009-01/msg00652.html


References

    Bross, I. D. J. 1958. How to use ridit analysis. Biometrics 14: 38-58.

    Cox, N. J. 2004. Speaking Stata: Graphing categorical and compositional
        data. Stata Journal 4(2): 190-215.  See Section 5.
        http://www.stata-journal.com/sjpdf.html?articlenum=gr0004

    Fleiss, J. L., B. Levin, and M. C. Paik. 2003.  Statistical Methods for
        Rates and Proportions.  Hoboken, NJ: Wiley.

    Flora, J. D. 1988. Ridit analysis. In Encyclopedia of Statistical
        Sciences, ed. S. Kotz and N. L. Johnson, (8) 136-139.  New York:
        Wiley.

    Haberman, S. J. 1996.  Advanced Statistics Volume I: Description of
        Populations.  New York: Springer.

    Parzen, E. 1993. Change PP plot and continuous sample quantile function.
        Communications in Statistics -Theory and Methods 22: 3287-3304.

    Tukey, J. W. 1977. Exploratory Data Analysis.  Reading, MA:
    Addison-Wesley.


Also see

    help for summarize, centile, pctile, tabstat, hdquantile (if installed)

Last edited by Nick Cox; 05 Jul 2015, 03:47.
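To connect the worked example in the help file to the data in this thread, the same Mata steps can be applied to the percentages posted for tasktid 8823 in #2. This is only a sketch of the interpolation, not output from iquantile itself; the name med is introduced here for illustration:

```stata
mata:
// category codes and the percentages posted for tasktid 8823 in #2
y = (1, 2, 3, 4, 5, 6, 7)
f = (4.34, 9.16, 11.04, 16.19, 46.67, 7.33, 5.26)

// mid-distribution function: share below each category plus half the share in it
cup = (runningsum(f) :- f/2) / sum(f)
cup

// cup passes 0.5 between categories 4 and 5, so interpolate there
med = y[4] + (y[5] - y[4]) * (0.5 - cup[4]) / (cup[5] - cup[4])
med
end
```

This gives an interpolated median of about 4.55 for 8823; the same steps with the 8824 percentages give about 4.10. Both lie between categories 4 and 5. With iquantile installed from SSC, something like iquantile catn [aweight=percentage], by(tasktid) on the uncollapsed data should handle both tasks at once, going by the syntax quoted above, though that call has not been run here.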


Alan Neustadtl

Join Date: Mar 2014
Posts: 107

06 Jul 2015, 10:42

Monica,

Depending on your data and your understanding of them you could map your ordinal values to interval ones. Here is an example you could tweak:

Code:

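* Right-hand values are rough times per year and are only a starting point.
* Note that 6 for category 4 falls below 18 for category 3; choose values
* that increase with the category.
#d ;
recode oldvar (1=  1  "Yearly or less")
              (2=  5  "More than yearly")
              (3= 18  "More than monthly")
              (4=  6  "More than weekly")
              (5= 365 "Daily")
              (6=1095 "Several times daily")
              (7=8760 "Hourly or more"), gen(newvar);
#d cr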

Best,
Alan
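A mapping of this kind can then be combined with the posted percentages as weights. Here is a minimal sketch, assuming the data from #2 with a numeric category variable. The times-per-year values and the names timesperyear, mean_times, and median_times are placeholders made up for illustration; they are not taken from O*NET or from the recode above:

```stata
clear
input long tasktid byte catn double percentage
8823 1  4.34
8823 2  9.16
8823 3 11.04
8823 4 16.19
8823 5 46.67
8823 6  7.33
8823 7  5.26
8824 1  1.59
8824 2 11.14
8824 3 27.41
8824 4 15.58
8824 5 25.27
8824 6 14.21
8824 7  4.81
end

* hypothetical times-per-year values, for illustration only
recode catn (1=1) (2=6) (3=26) (4=104) (5=365) (6=1095) (7=8760), generate(timesperyear)

* weighted mean and weighted median of times per year, one row per task
collapse (mean) mean_times = timesperyear (median) median_times = timesperyear [aweight=percentage], by(tasktid)
list, noobs
```

With placeholder values like these, the mean comes out well above the median for both tasks because the few "Hourly or more" responses carry so much weight. That is the concern raised in #2 and #4, so the result depends heavily on the values chosen.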


Monica Muller

Join Date: Jul 2014

Posts: 226
#10

07 Jul 2015, 11:51

Thank you very much everyone for your help. I really appreciate it.
