Coefficient of variance in very large dataset

Marvin Hanisch

Join Date: Oct 2017

Posts: 50
#1

Coefficient of variance in very large dataset

31 Jul 2019, 13:07

Dear Statalist,

I want to calculate the coefficient of variance in a large dataset (>60 million observations). My dataset includes individuals' performance in a given team and firm. For each team, I would like to calculate the coefficient of variance (or an alternative dispersion measure) with respect to performance. The teams can have different sizes (2 or more members). In the following example, "team_performance_disparity" is the variable of interest.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str2 ind_ID str1(team_ID firm_ID) byte ind_performance double team_performance_disparity "#1" "a" "x" 9 .1767767 "#2" "a" "x" 7 .1767767 "#3" "b" "x" 0 .91879348 "#4" "b" "x" 2 .91879348 "#5" "b" "x" 4 .91879348 "#6" "b" "x" 7 .91879348 "#1" "a" "y" 9 1.1313708 "#2" "a" "y" 1 1.1313708 "#3" "b" "y" 1 .95839372 "#4" "b" "y" 2 .95839372 "#5" "b" "y" 3 .95839372 "#6" "b" "y" 9 .95839372 end

Theoretically, the following command would do the trick

Code:

by firm_ID team_ID: cv2 ind_performance

except that the command is to slow for large datasets and the values only appear in the output window and not as new column in the dataset.

If there is loop to generate the coefficient of variance and store the results in a separate column, it would be extremely helpful.

Thanks,

Marvin
Tags: None
Marvin Hanisch

Join Date: Oct 2017

Posts: 50
#2

31 Jul 2019, 13:21

I just found an (admittedly simple) solution, which I wanted to share in case someone else has a similar problem.

Code:

egen SD = sd(ind_performance), by(firm_ID team_ID) egen mean = mean(ind_performance), by(firm_ID team_ID) gen coef_var = SD/mean
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30083
#3

31 Jul 2019, 13:26

Explicit looping is going to be slow, even if you use something faster than cv2. Do this instead:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str2 ind_ID str1(team_ID firm_ID) byte ind_performance "#1" "a" "x" 9 "#2" "a" "x" 7 "#3" "b" "x" 0 "#4" "b" "x" 2 "#5" "b" "x" 4 "#6" "b" "x" 7 "#1" "a" "y" 9 "#2" "a" "y" 1 "#3" "b" "y" 1 "#4" "b" "y" 2 "#5" "b" "y" 3 "#6" "b" "y" 9 end capture program drop one_team program define one_team summ ind_performance gen coeff_of_variation = r(sd)/r(mean) exit end runby one_team, by(firm_ID team_ID) status

To use this code you must install runby.ado. It is written by Robert Picard and me, and is available from SSC. In effect it loops, but it uses a different mechanism that is much faster. In large data sets the difference is quite noticeable. Also, it gives you periodic updates on its progress, so you will always know how much of the task is done and have an estimate of the remaining time.

By the way, I saw that in your original example you got double-precision for your result, but that seems like overkill to me. If you really need double precision, go ahead and add that to the -gen- command in program one_team.

Added: Crossed with #2. What I propose here should still be noticeably faster than that.
1 like
Comment
Marvin Hanisch

Join Date: Oct 2017

Posts: 50
#4

31 Jul 2019, 13:37

Dear Clyde,

Thank you for your helpful solution. It works perfectly.

Best,

Marvin
Comment

Announcement

Coefficient of variance in very large dataset

Comment

Comment

Comment