very different results on geometric mean (stata vs excel)

Yanyu Li started a topic very different results on geometric mean (stata vs excel)

13 May 2018, 09:43
very different results on geometric mean (stata vs excel)
I use the same dataset to calculate geometric mean

STATA:

Code:

ssc install egenmore egen gmean_m1=gmean(m1_vol1)

Plus, if I just give the code #gmean(m1_vol1)# it doesn't work

Excel:
first: calculate r*=1+r
second: geomean(r*)-1

I tried to use function: exp(average(ln(r))), it doesn't work

result from STATA is 0.01978; from Excel is 0.007001

BTW, the mean calculated from STATA is 0.0072096; from Excel is 0.007751

It looks that results from Excel are more reliable, does it?

I am really grateful for any help.

Best wishes,

Yanyu
Tags: None
Nick Cox replied

26 Nov 2018, 02:44
Rebecca Water I see nothing in your data or your code that identifies individuals, so I am unclear what you are asking.
Leave a comment:

Rebecca Water replied

25 Nov 2018, 18:54

Originally posted by Nick Cox View Post

For geometric mean growth rate over a period of time, I can't see that this behaviour is what you should want. I think you need to work from last and first values directly.

Thank you very much, that just saved me from a lot of trouble.

I conclude that when there are no negative values of the variable of interest ("numbers" below) I can still use -gmean- by sticking to multipliers of each year instead of rates. If, instead, I use the first and last value to calculate the average growth rate over a period and for one id those two values happen to be missing, information about this id would be lost.

Code:

. list numbers multiplier rateinstead

     +--------------------------------+
     | numbers   multip~r   rateins~d |
     |--------------------------------|
  1. |     100          .           . |
  2. |     122       1.22         .22 |
  3. |     503   4.122951    3.122951 |
  4. |      50   .0994036   -.9005964 |
  5. |      52       1.04         .04 |
     |--------------------------------|
  6. |      89   1.711538    .7115384 |
  7. |     300   3.370786    2.370786 |
     +--------------------------------+

. means multiplier rateinstead

    Variable |    Type             Obs        Mean       [95% Conf. Interval]
-------------+---------------------------------------------------------------
  multiplier | Arithmetic            6    1.927447        .3303814   3.524512
             |  Geometric            6    1.200937        .2951487   4.886518
             |   Harmonic            6    .4627954               .          .
-------------+---------------------------------------------------------------
 rateinstead | Arithmetic            6    .9274465       -.6696186   2.524512
             |  Geometric            5    .5410379        .0580626   5.041494
             |   Harmonic            5    .1577642               .          .
-------------+---------------------------------------------------------------

. egen meanmultiplier = gmean(multiplier)

. di "Average growth rate", meanmultiplier - 1
Average growth rate .20093691

Leave a comment:

Nick Cox replied

25 Nov 2018, 05:34

Rebecca Water

Your question is already answered in the text you quote. The gmean() function for egen in egenmore (SSC) ignores zeros and negatives in the original data. That wasn't by design; it's a side-effect of using the Stata function sum() to cumulate logged values. In Stata log() creates missing as a result if fed zero or negative values, but sum() ignores missings.

More generally, to resolve uncertainty about what code does, you can (no claims that this list is exhaustive)

1. Inspect the code whenever, as here, it is accessible. Then think through what the code will do. (Not always easy or trivial with unfamiliar code, and usually not with lengthy code.)

2. Try out the code on simple examples where you know the answer.

3. If only as a matter of curiosity see what related commands or functions do.

4. Apply the same algorithm in Mata.

Here are some results underlining what happens.

Code:

. clear

. set obs 2
number of observations (_N) was 0, now 2

. gen whatever = real(word("2 -1", _n))

. l

     +----------+
     | whatever |
     |----------|
  1. |        2 |
  2. |       -1 |
     +----------+

. egen gmean = gmean(whatever)

. list

     +------------------+
     | whatever   gmean |
     |------------------|
  1. |        2       2 |
  2. |       -1       2 |
     +------------------+

. means whatever

    Variable |    Type             Obs        Mean       [95% Conf. Interval]
-------------+---------------------------------------------------------------
    whatever | Arithmetic            2          .5       -18.55931   19.55931 
             |  Geometric            1           2               .          . 
             |   Harmonic            1           2               .          . 
-----------------------------------------------------------------------------

. mata :
------------------------------------------------- mata (type end to exit) ------------
: whatever = (2 \ -1)

: runningsum(log(whatever))
                 1
    +---------------+
  1 |  .6931471806  |
  2 |  .6931471806  |
    +---------------+

: exp(runningsum(log(whatever)))
       1
    +-----+
  1 |  2  |
  2 |  2  |
    +-----+

: end
--------------------------------------------------------------------------------------

For geometric mean growth rate over a period of time, I can't see that this behaviour is what you should want. I think you need to work from last and first values directly.

Leave a comment:

Rebecca Water replied

25 Nov 2018, 00:21
Originally posted by Nick Cox View Post

I think I based the code on Stata's egen function for means,in which ignoring missings is the preferred behaviour. Here ignoring missings for logarithms catches zeros and negatives in the original data, as well as missings in the original data.

When using -gmean- for growth rates with very large negative values, will these negative values be ignored?
Excuse me if I missed the answer in the previous posts that I have all read...
Leave a comment:
Yanyu Li replied

13 May 2018, 13:03
Dear Nick,

Yes, the bottom two lines of code works very well. I get the same results with Jean-Claude's programming.

I am soooo happy learning a lot from you amazing guys!

Have a lovely day!
Leave a comment:
Nick Cox replied

13 May 2018, 12:40
Yanyu: Thanks for the thanks. It all seems to boil down to your needing to use two lines of code, not one, if you use egen, gmean().
Leave a comment:
Yanyu Li replied

13 May 2018, 12:04
I am really appreciate for fantastic help!!!

Dear Jean-Claude,

WOW, Many Thanks! I learnt a lot. writing a program, amazing!

Dear Nick,

many thanks for the clear explanation!

Best wishes,

Yanyu
Leave a comment:
Nick Cox replied

13 May 2018, 11:49
Clyde: Thanks. It's easier for me to speak ill of my programs than to hear someone else do it. Failure to check for valid input is a failure too. I think I based the code on Stata's egen function for means,in which ignoring missings is the preferred behaviour. Here ignoring missings for logarithms catches zeros and negatives in the original data, as well as missings in the original data.
Leave a comment:
Clyde Schechter replied

13 May 2018, 11:39
If I can reconstruct my thinking in 1999 when writing that, I would have been surprised that anyone thought there was point to calculating geometric means for non-positive numbers. Nevertheless from 2018 I have to think it's a bug not to warn of zeros or negatives.

From the pedantry corner: I believe that a "bug" is when a program is given valid input (inputs that are within the domain of the function it purports to evaluate) but produces an incorrect answer. But what you describe is a failure to warn that invalid input (outside the domain of the function) has been provided; your program produces an arguably misleading output. From what I was taught, that would be considered a "design flaw," and a slight one at that. Failing gracefully in the presence of invalid input is highly desirable and is ideal program design, but missing that ideal doesn't descend down to the level of a bug.
Leave a comment:
Nick Cox replied

13 May 2018, 10:57
The geometric mean is undefined if any value is negative and otherwise is 0 if any value is zero.

The code you're using is _ggmean.ado from SSC.

Code:

ssc type _ggmean.ado *! NJC 1.0.0 9 December 1999 program define _ggmean version 6 syntax newvarname =/exp [if] [in] [, BY(varlist)] tempvar touse quietly { gen byte `touse' = 1 `if' `in' sort `touse' `by' by `touse' `by': gen `typlist' `varlist' = /* */ sum(log(`exp')) / sum((log(`exp'))!=.) if `touse'==1 by `touse' `by': replace `varlist' = exp(`varlist'[_N]) } end

That program ignores zero and negative values altogether as a side-effect of sum() ignoring missings in the numerator.

If I can reconstruct my thinking in 1999 when writing that, I would have been surprised that anyone thought there was point to calculating geometric means for non-positive numbers. Nevertheless from 2018 I have to think it's a bug not to warn of zeros or negatives. I didn't know anything in 1999 about working with returns.

Either way, gmean() from egenmore doesn't do what you want and it certainly doesn't add 1 and then subtract it from the result.

What you want in Mat a terms appears to be

Code:

exp(mean(ln(x :+ 1))) :- 1

and in Stata terms it could be

Code:

egen double dgmean = gmean(m1_vol1 + 1) replace dgmean = dgmean - 1

See also programs written by people who work with your kind of data (I don't) such as asrol (SSC).
1 like
Leave a comment:
Jean-Claude Arbaut replied

13 May 2018, 10:48
First, a few words about the concept, on an example. Take the annual evolution of population in an area, you may have an increase around 1% (say, +2%, +1.5% -1% in three successive years).
There are at least three ways to represent this evolution:

A) the percentages: 2, 1.5, -1
B) the "multipliers": 1.02, 1.015, 0.99 (the population is multiplied by these numbers)
C) the rates, which are the multipliers minus 1 (that's similar to percentages, just take care of where the decimal point lies!): 0.02, 0.015, -0.01.

For instance, if the population grows from 12345 to 13117, the multiplier is 13117/12345, the rate is (13117-12345)/12345, and the percentage is (13117-12345)/12345*100 (respectively 1.0625, 0.0625 and 6.25).

Now, imagine you want the mean annual growth of the population. You need a geometric mean. But not the geometric mean of an arbitrary variable! A and C won't do: you must take the geometric mean of something that is "multiplicative", hence B.

Here is a sample program, computing the geometric mean in case B and C:

Code:

prog def gmean, rclass syntax varname(numeric) [if] [in], [Rates] tempvar x if "`rates'"=="" { gen `x'=log(`varlist') `if' `in' } else { gen `x'=log(1+`varlist') `if' `in' } qui sum `x' `if' `in' if "`rates'"=="" { loc gm=exp(r(mean)) } else { loc gm=exp(r(mean))-1 } di "Geometric mean: " `gm' ret loc gmean=`gm' end

Now, let's see what happens on sample data:

Code:

. clear . input mult rate 1.02 0.02 1.015 0.015 0.99 -0.01 end . gmean mult Geometric mean: 1.0082475 . gmean rate, rate Geometric mean: .00824746

I defined variable rate (which is C as described above), and mult (B above). Of course, rate=mult-1.
I added an option in the program to compute the correct result whether you have rates or not (it will just add 1 and remove 1 afterwards from the mean). Likewise, the result is printed as multiplier if you pass multiplier data, and rate if you pass rate data.

I repeat the important point: take the geometric mean of multiplicative data!

Usually it's easy to tell the difference: rates can be negative. That's why you get an error in Excel: you can't compute the logarithm of a negative number.
Also pay attention, if you add 1, you must use raw rates, not percentages (the 0.015 above, not 1.5). When the rates are very small, it's not obvious to tell which is which.

For your 100 data points above, which I guess are rates, you will find

Code:

. gmean m1_vol1, rate Geometric mean: .01103937

With Excel you must add 1 before calling the function, and the result is printed as 1.011039371. Looks ok to me.

Note: you may want the "set type double" option in Stata to always store and compute in double precision. Sometimes that can explain small differences.

Hope this helps

Jean-Claude Arbaut
Last edited by Jean-Claude Arbaut; 13 May 2018, 11:23.
4 likes
Leave a comment:

Yanyu Li replied

13 May 2018, 10:30

Dear Nick,

Many thanks! here is my data.

Code:


	Code:
	* Example generated by -dataex-. To install: ssc install dataex
clear
input double m1_vol1
 -.059533194271073085
   .08553445191568178
  .021669432626490067
 -.010809037641986721
   -.0998044598340742
  -.07782653780580019
  .017336065944151224
  .026002328044206997
  .003092583541879438
 -.006432234980911638
   .10725212131865638
    .0353189629020679
  .006035858484168095
-.0011053353017857653
  -.03017034603351121
   .04885578451687837
  .025542858189385134
  .026433941218777277
 -.011299805704031477
   -.0540169442477355
 .0030381666557641124
  .024412098000454373
  .010113822338255037
 -.033839471341455096
   .10126735472739483
   .03318137086717275
  -.06548020725348643
  -.06751137415398147
 -.048500049235796316
   .08522139169866853
   .03847080897849642
   .02794160971900748
   .07495124962407414
  .029341800577723537
  .023160513835259477
  .023729425208221307
 .0025871469010782374
  .012901746011194001
  .025711920503480062
   .01422348035208259
   .06785475347358405
 -.008697307645728573
   .03032823547661658
  -.01170088753129399
   .07290817688683772
    .0651809627534647
  -.03642036076948865
 -.047067277503153274
  .011124822312898765
  -.04785085920010752
  -.02010337596008443
  .044583960286451334
    .0400367008370984
  -.05756719543068287
  .003935943956883683
 -.005584225687645684
 -.002352864574510501
 -.018652888666318833
  .003730535815841516
   .04116627067157207
   .02502391379383623
   .03761758655251597
 -.006638782033109404
   .04952167355710546
  .015815693085943258
   .00794231655711458
 .0015825712657892175
   .03102583569248785
  .007233160330351668
   .02057881004823313
  .006334229193751861
   .01762419064590381
  .042413642515235155
-.0005240140985816995
 -.015485638003624182
 -.009849205610878513
   .04505612765711201
  .011979957144053907
  .007491518158585872
  .017592748833395552
  .015877285867864407
  .029461675742123203
  .013563858033485834
  .002741680195988412
  .012382292281255541
  .021312806625295506
 -.006561833082930011
  .034734433607650037
 .0018261621600802718
   .07351061587528592
 -.020338597319833625
  .006960856879572719
   .04711927311812575
  .032450018124739854
   .05486837694882987
   .04283149184928581
  .006119366095388868
   .02310848949228952
 -.025658794920336687
 .0022647033227246177
end
------------------ copy up to and including the previous line ------------------

Listed 100 out of 304 observations
Use the count() option to list more

And, here is the STATA results:

Code:

. mean m1_vol1

Mean estimation                   Number of obs   =        303

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
     m1_vol1 |   .0072096   .0022524      .0027772     .011642
--------------------------------------------------------------

. egen gmean=gmean(m1_vol1)
variable gmean already defined
r(110);

. drop gmean

. egen gmean=gmean(m1_vol1)

. di gmean[1]
.01978051

. egen double dgmean=gmean(m1_vol1)

. di dgmean[1]
.01978051

Plus, in Excel, I should use =exp(average(ln(B2:B295))) OR =exp(average(log(B2:B295))), but I always get error.

Many thanks for your big help, Nick!!!

Best wishes,

Yanyu

Leave a comment:

Nick Cox replied

13 May 2018, 10:04

You need to show us the data in your variable if you are asking us to check results.

Meanwhile the geometric mean is the nth root of the product of n positive numbers. So, the geometric mean of 10, 100, 1000 is 100.

Adding 1 to all and then subtracting 1 is not the same calculation.

Here are some sample results. Note that using double with egen gets you more precision.

Code:

. mata : 
------------------------------------------------- mata (type end to exit) -------
: 
: x = (10, 100, 1000)' 

:: exp(mean(ln(x))) 
  100

: : exp(mean(ln(x :+ 1))) :- 1 
  102.6054764

: end 
---------------------------------------------------------------------------------

. clear

. set obs 3
number of observations (_N) was 0, now 3

. gen x = 10^_n

. means x

    Variable |    Type             Obs        Mean       [95% Conf. Interval]
-------------+---------------------------------------------------------------
           x | Arithmetic            3         370       -989.9378   1729.938 
             |  Geometric            3         100        .3279913   30488.62 
             |   Harmonic            3    27.02703               .          . 
-----------------------------------------------------------------------------
Missing values in confidence intervals for harmonic mean indicate 
that confidence interval is undefined for corresponding variables.
Consult Reference Manual for details.

. egen gmean = gmean(x)

. di gmean[1]
100.00001

. egen double dgmean = gmean(x)

. di dgmean[1]
100

For STATA read Stata throughout (FAQ Advice #18).

Announcement