What is the most efficient way to compute a prorated sum of variables?

Bruce Weaver

Join Date: May 2014

Posts: 1132
#1

17 May 2017, 08:27

Some scales & questionnaires have sub-scales where a sub-scale score = the sum of several items. Computing such sub-scale scores is easy if there are no missing items: one can use egen with rowtotal(varlist). E.g.,

Code:

egen score = rowtotal(v1-v5)

But if some items are missing, one may wish to compute a prorated sum, provided that some minimum number of variables have valid values. For example, suppose I want to compute the prorated sum of v1-v5, but only if at least 3 of the variables have valid values.

Code:

input v1-v5
1 1 1 1 1
. 1 1 1 1
. . 1 1 1
. . . 1 1
. . . . 1
. . . . .
end

In SPSS, I would do something like this:

Code:

COMPUTE ProSum = MEAN.3(v1 to v5)*5.

Or more generally, to make it work better for long variable lists where I may not want to count the number of variables:

Code:

COMPUTE ProSum = MEAN.3(v1 to v5)*(NVALID(v1 to v5)+NMISS(v1 to v5)).

That 3 in MEAN.3 is the minimum number of valid values required to compute a mean.

Now don't laugh (too hard), but so far the best I've come up with in Stata is this:

Code:

egen RMean = rowmean(v1-v5)      // row mean
egen Rmiss = rowmiss(v1-v5)      // # of missing values
egen Rvalid = rownonmiss(v1-v5)  // # of valid values
generate prosum = RMean*(Rmiss+Rvalid) if Rvalid >= 3
drop RMean - Rvalid

There must be a more efficient way to do this, but so far my searches have been fruitless. Any tips appreciated.

Cheers,
Bruce (still a relative Stata newbie)

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Nick Cox

Join Date: Mar 2014

Posts: 35697
#2

17 May 2017, 09:15

Not laughing. Using egen is as efficient in programmer time as anything I know, except that if you want the sum, use rowtotal().

If you were doing this repeatedly in a program, you would cut out the interpretative overhead by writing your own loops.

Code:

gen sum = 0
gen nOK = 0

forval j = 1/5 {
     replace sum = sum + cond(missing(v`j'), 0, v`j')
     replace nOK = nOK + !missing(v`j')
}

replace sum = . if nOK < 3

That's less code. (Look inside egen and the egen functions called if you don't believe it.)

But even an experienced programmer would be pushed to save time by doing that unless the dataset was enormous and their typing was really, really fast.

Conversely, that's what I would put in a program to be used repeatedly where efficiency is a virtue. (Or reach for Mata.)
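
For readers curious about the Mata route, here is a minimal sketch of the same calculation, including the scaling up to a prorated sum. It assumes the items are v1-v5 and the minimum of 3 valid values from #1; the variable name prosum_mata is made up for illustration.

```stata
* Example data from #1
clear
input v1-v5
1 1 1 1 1
. 1 1 1 1
. . 1 1 1
. . . 1 1
. . . . 1
. . . . .
end

mata:
// Copy the items into a matrix, one row per observation
X = st_data(., ("v1", "v2", "v3", "v4", "v5"))
// Number of valid (nonmissing) values in each row
nOK = rownonmissing(X)
// rowsum() treats missing values as 0, so this is the sum of the valid items,
// scaled up by (# of items) / (# of valid items)
prosum = rowsum(X) :* cols(X) :/ nOK
// Division by 0 yields missing in Mata, so rows with fewer than
// 3 valid values end up missing
prosum = prosum :/ (nOK :>= 3)
// Put the result back into the dataset as a new variable
st_store(., st_addvar("double", "prosum_mata"), prosum)
end

list
```

With the example data, the first three observations should get 5 and the last three should be missing.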
Bruce Weaver

Join Date: May 2014

Posts: 1132
#3

17 May 2017, 13:01

Thanks Nick. It's good to know that my code was not (too) laughable, but a bit disappointing that there is not a solution that is more comparable to the SPSS COMPUTE command I posted in #1. The main stumbling block I ran into was that egen can (apparently) take only one function. If it could take more than one function, I could do something like this:

Code:

egen prosum = rowmean(v1-v5)*(rowmiss(v1-v5)+rownonmiss(v1-v5)) if rownonmiss(v1-v5) >=3

I wonder if there is any possibility of extending egen to take more than one function in a future version of Stata. Perhaps that has been discussed in one of the wish list threads.

Cheers,
Bruce

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Nick Cox

Join Date: Mar 2014

Posts: 35697
#4

17 May 2017, 14:00

My own wild guess is that egen will remain as is indefinitely. StataCorp are unlikely to tinker with it ad hoc. It's more likely that we'd see some deeper change that would make it a historical oddity. I am not leaking there because I really have no idea about their precise plans, but that's the kind of change that would appeal to them.

It's interesting that people who know Stata often ask in other forums questions like "What's the equivalent in R of egen?" but there is no deep egen concept to be borrowed. It's just a convenient toolbox that can be cloned insofar as it comes with many example programs.

If you're satisfied by a one-liner, then all that is needed is a clone of _growtotal.ado with an extra option. In fact there are more user-written egen functions than official egen functions, and many were written in just this way, because somebody wanted a variant on or extension of an official function.
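
A rough sketch of what such a clone might look like follows. The function name rowprosum() and the option minvalid() are invented here for illustration and are not part of official Stata. The parsing lines follow the pattern commonly seen in egen row functions, but treat the whole thing as an untested starting point. It would be saved as _growprosum.ado somewhere on the ado-path.

```stata
* Hypothetical egen function: prorated row sum with a minimum number of valid values
program define _growprosum
    version 9
    gettoken type 0 : 0     // storage type passed on by egen
    gettoken g    0 : 0     // name of the new variable
    gettoken eqs  0 : 0     // the equals sign

    syntax varlist(min=1 numeric) [if] [in] [, MINvalid(integer 1) BY(string)]
    if `"`by'"' != "" {
        display as error "rowprosum() may not be combined with by"
        exit 198
    }

    tempvar touse nOK
    mark `touse' `if' `in'
    local nvars : word count `varlist'

    quietly {
        generate `type' `g' = 0 if `touse'
        generate int `nOK' = 0 if `touse'
        foreach v of local varlist {
            replace `g' = `g' + cond(missing(`v'), 0, `v') if `touse'
            replace `nOK' = `nOK' + !missing(`v') if `touse'
        }
        * Scale the sum of valid items up to the full number of items;
        * a row with no valid items divides by 0 and so becomes missing
        replace `g' = `g' * `nvars' / `nOK' if `touse'
        * Enforce the minimum number of valid values
        replace `g' = . if `nOK' < `minvalid' & `touse'
    }
end
```

Assuming the file is installed, the call would then be a one-liner such as egen prosum = rowprosum(v1-v5), minvalid(3).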
Bruce Weaver

Join Date: May 2014

Posts: 1132
#5

17 May 2017, 16:08

By the way, Nick's code in #2 would have to be tweaked a bit to get the prorated sum.

Code:

generate sum = 0
generate N = 0   // Added to Nick's code
generate nOK = 0

forval j = 1/5 {
     replace sum = sum + cond(missing(v`j'), 0, v`j')
     replace N = N+1   // Added to Nick's code
     replace nOK = nOK + !missing(v`j')
}

generate prosum = sum*N/nOK   // Added to Nick's code -- compute prorated sum
replace prosum = . if nOK < 3

And before anyone asks, yes, I do realize that this approach is equivalent to using mean substitution for missing data, and that mean substitution is a lousy way to deal with missing data. But for better or worse, this is what is recommended in scoring manuals for various instruments.
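
The equivalence can be checked directly. The sketch below uses made-up item values (not data from this thread) and compares the prorated sum with the sum obtained after replacing each missing item by the respondent's own mean of the valid items. The variable names rmean, nOK, and imputed_sum are illustrative.

```stata
clear
input v1-v5
2 4 . 3 .
1 2 3 4 .
1 . . . 5
end

egen double rmean = rowmean(v1-v5)   // mean of the valid items in each row
egen nOK = rownonmiss(v1-v5)         // # of valid values

* Prorated sum, requiring at least 3 valid values
generate double prosum = rmean * 5 if nOK >= 3

* Sum after substituting the row mean for each missing item
generate double imputed_sum = 0
forvalues j = 1/5 {
    replace imputed_sum = imputed_sum + cond(missing(v`j'), rmean, v`j')
}
replace imputed_sum = . if nOK < 3

* The two should agree wherever the prorated sum is defined
assert reldif(prosum, imputed_sum) < 1e-12 if nOK >= 3
list
```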

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#6

18 May 2017, 00:02

Bruce is right in stating that the prorated sum is one of the most commonly recommended methods for dealing with missing data in questionnaires.
However, in my experience, when I compare the statistics of the observed data with those obtained after prorating, it is sometimes hard to believe that they refer to the same sample. In those instances, after checking that the underlying missingness mechanism allowed it, -mi- gave me more reasonable results.
There are obviously some challenging situations, such as missing values on items concerning intimacy (patients', in my case), which are usually missing not at random (and require sensitivity analyses to deal with).

Kind regards,
Carlo
(Stata 19.0)

Nick Cox

Join Date: Mar 2014
Posts: 35697

18 May 2017, 06:01

Bruce is right. I missed the important detail of scaling up the sum. But you don't need an extra variable (holding a constant in every observation) to do that. In this case you know the answer is 5, but more generally just use the loop counter.

Code:

generate sum = 0
generate nOK = 0

forval j = 1/5 {
     replace sum = sum + cond(missing(v`j'), 0, v`j')
     replace nOK = nOK + !missing(v`j')
     local nvars = `j'
}

generate prosum = sum * `nvars'/nOK
replace prosum = . if nOK < 3

If the variables weren't conveniently numbered 1 up, you'd need something more general, say

Code:

generate sum = 0
generate nOK = 0

* local varlist defined previously
local nvars = 0
foreach v of local varlist {
     replace sum = sum + cond(missing(`v'), 0, `v')
     replace nOK = nOK + !missing(`v')
     local ++nvars
}

generate prosum = sum * `nvars'/nOK
replace prosum = . if nOK < 3
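
As a self-contained illustration of the general version, the sketch below uses the example data from #1. It defines the local with unab, gets the number of variables from a word count rather than a counter, and cross-checks the result against the egen approach. The names items, rmean, nvalid, and prosum2 are chosen here for illustration.

```stata
clear
input v1-v5
1 1 1 1 1
. 1 1 1 1
. . 1 1 1
. . . 1 1
. . . . 1
. . . . .
end

* Expand the varlist and count its variables
unab items : v1-v5
local nvars : word count `items'

generate sum = 0
generate nOK = 0

foreach v of local items {
     replace sum = sum + cond(missing(`v'), 0, `v')
     replace nOK = nOK + !missing(`v')
}

generate prosum = sum * `nvars'/nOK if nOK >= 3

* Cross-check against the egen approach from #1
egen rmean = rowmean(`items')
egen nvalid = rownonmiss(`items')
generate prosum2 = rmean * `nvars' if nvalid >= 3
assert prosum == prosum2
```

Run as a single do-file (so that the local macros persist), the assert should pass silently: both versions give 5 for the first three observations and missing for the rest.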

Last edited by Nick Cox; 18 May 2017, 06:18.
