Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Coefficient of variance in very large dataset

    Dear Statalist,

    I want to calculate the coefficient of variance in a large dataset (>60 million observations). My dataset includes individuals' performance in a given team and firm. For each team, I would like to calculate the coefficient of variance (or an alternative dispersion measure) with respect to performance. The teams can have different sizes (2 or more members). In the following example, "team_performance_disparity" is the variable of interest.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str2 ind_ID str1(team_ID firm_ID) byte ind_performance double team_performance_disparity
    "#1" "a" "x" 9  .1767767
    "#2" "a" "x" 7  .1767767
    "#3" "b" "x" 0 .91879348
    "#4" "b" "x" 2 .91879348
    "#5" "b" "x" 4 .91879348
    "#6" "b" "x" 7 .91879348
    "#1" "a" "y" 9 1.1313708
    "#2" "a" "y" 1 1.1313708
    "#3" "b" "y" 1 .95839372
    "#4" "b" "y" 2 .95839372
    "#5" "b" "y" 3 .95839372
    "#6" "b" "y" 9 .95839372
    end

    Theoretically, the following command would do the trick

    Code:
    by firm_ID team_ID: cv2 ind_performance
    except that the command is to slow for large datasets and the values only appear in the output window and not as new column in the dataset.

    If there is loop to generate the coefficient of variance and store the results in a separate column, it would be extremely helpful.

    Thanks,

    Marvin

  • #2
    I just found an (admittedly simple) solution, which I wanted to share in case someone else has a similar problem.

    Code:
    egen SD = sd(ind_performance), by(firm_ID team_ID)
    egen mean = mean(ind_performance), by(firm_ID team_ID)
    gen coef_var = SD/mean

    Comment


    • #3
      Explicit looping is going to be slow, even if you use something faster than cv2. Do this instead:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str2 ind_ID str1(team_ID firm_ID) byte ind_performance
      "#1" "a" "x" 9
      "#2" "a" "x" 7
      "#3" "b" "x" 0
      "#4" "b" "x" 2
      "#5" "b" "x" 4
      "#6" "b" "x" 7
      "#1" "a" "y" 9
      "#2" "a" "y" 1
      "#3" "b" "y" 1
      "#4" "b" "y" 2
      "#5" "b" "y" 3
      "#6" "b" "y" 9
      end
      
      capture program drop one_team
      program define one_team
          summ ind_performance
          gen coeff_of_variation = r(sd)/r(mean)
          exit
      end
      
      runby one_team, by(firm_ID team_ID) status
      To use this code you must install runby.ado. It is written by Robert Picard and me, and is available from SSC. In effect it loops, but it uses a different mechanism that is much faster. In large data sets the difference is quite noticeable. Also, it gives you periodic updates on its progress, so you will always know how much of the task is done and have an estimate of the remaining time.

      By the way, I saw that in your original example you got double-precision for your result, but that seems like overkill to me. If you really need double precision, go ahead and add that to the -gen- command in program one_team.

      Added: Crossed with #2. What I propose here should still be noticeably faster than that.

      Comment


      • #4
        Dear Clyde,

        Thank you for your helpful solution. It works perfectly.

        Best,

        Marvin

        Comment

        Working...
        X