GMM and factor variables

John Mullahy

Join Date: Dec 2016
Posts: 772

GMM and factor variables

14 Mar 2023, 12:31

Does anyone have insights why gmm computation time is much greater when factor variables are used? The question arose in a real application where the additional computation time was problematic but here's a toy (and hopefully replicable) example that demonstrates the issue.

Code:

cap preserve
cap drop _all

sysuse auto

gen trep=rep-1

loc lhs "trep"
loc rhs1 "mpg i.foreign"
loc rhs2 "mpg foreign"
loc max=4

expand 100

timer clear

/* gmm: first residual equation is of form y - f(x,b)                         */
/*      second residual equation is of form y^2 - g(x,b)                      */

timer on 1

qui gmm (1: `lhs'-`max'*exp({xa:`rhs1' _cons})/(exp({xa:})+exp({xb:`rhs1' _cons}))) ///
    (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
    ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
    (exp({xa:})+exp({xb:})+1))), inst(`rhs1') igmm vce(robust) winit(i)
 
gmm
 
timer off 1

timer on 2

qui gmm (1: `lhs'-`max'*exp({xa:`rhs2' _cons})/(exp({xa:})+exp({xb:`rhs2' _cons}))) ///
    (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
    ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
    (exp({xa:})+exp({xb:})+1))), inst(`rhs2') igmm vce(robust) winit(i)
    
gmm
    
timer off 2

timer list

cap restore

Results:

Code:

. cap preserve

. cap drop _all

.
. sysuse auto
(1978 automobile data)

.
. gen trep=rep-1
(5 missing values generated)

.
. loc lhs "trep"

. loc rhs1 "mpg i.foreign"

. loc rhs2 "mpg foreign"

. loc max=4

.
. expand 100
(7,326 observations created)

.
. timer clear

.
. /* gmm: first residual equation is of form y - f(x,b)                         */
. /*      second residual equation is of form y^2 - g(x,b)                      */
.
. timer on 1

.
. qui gmm (1: `lhs'-`max'*exp({xa:`rhs1' _cons})/(exp({xa:})+exp({xb:`rhs1' _cons}))) ///
>     (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
>     ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
>     (exp({xa:})+exp({xb:})+1))), inst(`rhs1') igmm vce(robust) winit(i)

.   
. gmm

GMM estimation

Number of parameters =   6
Number of moments    =   6
Initial weight matrix: Identity                   Number of obs   =      6,900
GMM weight matrix:     Robust

------------------------------------------------------------------------------
             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
xa           |
         mpg |  -.0892447   .0047576   -18.76   0.000    -.0985693     -.07992
             |
     foreign |
    Foreign  |   .6791125   .0533686    12.72   0.000      .574512     .783713
       _cons |   2.747949   .1006754    27.30   0.000     2.550628    2.945269
-------------+----------------------------------------------------------------
xb           |
         mpg |   -.124634   .0059788   -20.85   0.000    -.1363523   -.1129158
             |
     foreign |
    Foreign  |   -.637334   .0660776    -9.65   0.000    -.7668436   -.5078243
       _cons |   3.418192   .1225606    27.89   0.000     3.177977    3.658406
------------------------------------------------------------------------------
Instruments for equation 1: mpg 0b.foreign 1.foreign _cons
Instruments for equation 2: mpg 0b.foreign 1.foreign _cons

.   
. timer off 1

.
. timer on 2

.
. qui gmm (1: `lhs'-`max'*exp({xa:`rhs2' _cons})/(exp({xa:})+exp({xb:`rhs2' _cons}))) ///
>     (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
>     ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
>     (exp({xa:})+exp({xb:})+1))), inst(`rhs2') igmm vce(robust) winit(i)

.     
. gmm

GMM estimation

Number of parameters =   6
Number of moments    =   6
Initial weight matrix: Identity                   Number of obs   =      6,900
GMM weight matrix:     Robust

------------------------------------------------------------------------------
             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
xa           |
         mpg |  -.0892447   .0047576   -18.76   0.000    -.0985693     -.07992
     foreign |   .6791125   .0533686    12.72   0.000      .574512     .783713
       _cons |   2.747949   .1006754    27.30   0.000     2.550628    2.945269
-------------+----------------------------------------------------------------
xb           |
         mpg |   -.124634   .0059788   -20.85   0.000    -.1363523   -.1129158
     foreign |   -.637334   .0660776    -9.65   0.000    -.7668436   -.5078243
       _cons |   3.418192   .1225606    27.89   0.000     3.177977    3.658406
------------------------------------------------------------------------------
Instruments for equation 1: mpg foreign _cons
Instruments for equation 2: mpg foreign _cons

.     
. timer off 2

.
. timer list
   1:     14.28 /        1 =      14.2830
   2:      5.30 /        1 =       5.3010

.
. cap restore    

.

Tags: None

Joro Kolev

Join Date: Aug 2018
Posts: 3050

15 Mar 2023, 09:13

In my experience the factor variable notation generally slows down Stata. I think this is a general observation, and not something specific to -gmm-. I think for every computationally intensive task you are better off in terms of completion time if you manually generate the dummies and bypass factor variable notation.

Here is an example that does not involve -gmm-:

Code:

. clear all

. timer clear

. set obs 100000000
Number of observations (_N) was 0, now 100,000,000.

. gen y = rnormal()

. gen x = uniform()>.3

. timeit 1: reg y x

      Source |       SS           df       MS      Number of obs   = 100000000
-------------+----------------------------------   F(1, 99999998)  =      2.03
       Model |  2.02804723         1  2.02804723   Prob > F        =    0.1545
    Residual |   100021400  99999998  1.00021402   R-squared       =    0.0000
-------------+----------------------------------   Adj R-squared   =    0.0000
       Total |   100021402  99999999  1.00021403   Root MSE        =    1.0001

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |  -.0003108   .0002182    -1.42   0.154    -.0007385     .000117
       _cons |  -.0000313   .0001826    -0.17   0.864    -.0003891    .0003266
------------------------------------------------------------------------------

. timeit 2: reg y i.x

      Source |       SS           df       MS      Number of obs   = 100000000
-------------+----------------------------------   F(1, 99999998)  =      2.03
       Model |  2.02804723         1  2.02804723   Prob > F        =    0.1545
    Residual |   100021400  99999998  1.00021402   R-squared       =    0.0000
-------------+----------------------------------   Adj R-squared   =    0.0000
       Total |   100021402  99999999  1.00021403   Root MSE        =    1.0001

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         1.x |  -.0003108   .0002182    -1.42   0.154    -.0007385     .000117
       _cons |  -.0000313   .0001826    -0.17   0.864    -.0003891    .0003266
------------------------------------------------------------------------------

. timer list
   1:      7.17 /        1 =       7.1750
   2:     18.95 /        1 =      18.9540

.

Comment

John Mullahy

Join Date: Dec 2016

Posts: 772
#3

15 Mar 2023, 09:59

Thanks Joro.I appreciate your taking time to provide a helpful example. It's interesting that the relative time difference in the two examples is similar (~2.5).

In my real example (nonlinear gmm with a large sample and several RHS dummy variables) I want to use factor variables so that margins would compute the average marginal effects properly for the dummy variables. I'll figure out some workaround (or will be patient and wait for the factor-variable specification to converge 😀).

Apart from my specific application I'd be keen to learn why it is that this happens, i.e. what is it about the Stata algorithms that results in such large differences in computation time.

One hypothesis I had was that there was some overhead involved in setting up an algorithm to accommodate any factor variables but that the number of factor variables specified >=1 wouldn't matter with respect to computation time. This seems not to be correct, using a modification of your example:

Code:

. . timer clear . . set obs 5000000 Number of observations (_N) was 0, now 5,000,000. . . gen y = rnormal() . . forval j=1/10 { 2. gen x`j'=uniform()>.5 3. } . . timeit 1: qui reg y x1-x10 . timeit 2: qui reg y i.x1 x2-x10 . timeit 3: qui reg y i.(x1-x10) . . timer list 1: 2.52 / 1 = 2.5230 2: 5.48 / 1 = 5.4820 3: 9.32 / 1 = 9.3210

P.S. Thanks for teaching me about timeit. I was unaware of its existence.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10482
#4

15 Mar 2023, 10:12

I'll figure out some workaround

In case you are not aware, FernandoRios has written a command f_able that allows you to compute marginal effects correctly without having used factor variables. Perhaps it can be helpful.

Code:

ssc install f_able, clear help f_able
2 likes
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2449
#5

15 Mar 2023, 10:17

I'd bring this to the attention of Stata tech support. I can't offhand think of any good reason why this should affect estimation time---but then, I'm no expert. I have previously encountered at least one Stata setting that was not needed or relevant to an estimation, but which grossly affected estimation time. (That was the setting for max matrix size, which had effects even in situations in which no large matrices were being used.) I wonder if there might be some similarly "irrelevant" setting here that could be changed and solve the problem. Regardless, I'd find it interesting to hear what tech support says.
1 like
Comment
John Mullahy

Join Date: Dec 2016

Posts: 772
#6

15 Mar 2023, 10:59

Thanks Andrew and Mike.

Andrew: I was not aware of f_able (or if I was aware of it I forgot about it). I will definitely investigate it.

Mike: I just contacted Stata tech support. Whatever I learn I'll post to this thread.
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2534
#7

15 Mar 2023, 12:22

Hi John
If I may
1. f_able works with any transformation of continuous variables. It doesn't work with dummies.
2. This is why there are differences in marginal effects in your exercises!
when using the following:
reg y x1-x10
margins use the implicit derivative to get the marginal effect. (the straightforward coefficient)

When using dummies:
reg y i.x1 i.x2 ... i.x10
Stata actuallyhas to work harder. It compares the outcome when x1=0 and x1=1 (or all other values it contains). So at least double the time.

On this regard f_able would be much slower, because it relies of numerical derivation.

say
reg y x1 x1sq

It creates Yhat, then adds the epislon to x1,,say x1*, f_able recalculates x1sq , say x1sq* =(x1+e)^2. predicts yhat (yhat*) and gets the derivative
dydx = (yhat* - yhat)/e

Because of that, f_able can be very slow, (the cost of allowing for other function transformations)

F
2 likes
Comment

John Mullahy

Join Date: Dec 2016
Posts: 772

15 Mar 2023, 12:50

Thanks for this clarification, Fernando.

I can appreciate why computations with margins might be more complicated (and thus more time intensive) when factor variables are used. This is (perhaps) evident in an extension of the example in #3.

Code:

.
. timer clear

.
. set seed 2345

.
. set obs 5000000
Number of observations (_N) was 0, now 5,000,000.

.
. gen y = rnormal()

.
. forval j=1/10 {
  2.  gen x`j'=uniform()>.5
  3. }

.
. timeit 1: qui reg y x1-x10

. timeit 2: qui margins, dydx(*) nose

. timeit 3: qui reg y i.x1 x2-x10

. timeit 4: qui margins, dydx(*) nose

. timeit 5: qui reg y i.(x1-x10)

. timeit 6: qui margins, dydx(*) nose

.
. timer list
   1:      2.53 /        1 =       2.5340
   2:     41.52 /        1 =      41.5240
   3:      5.57 /        1 =       5.5730
   4:     41.48 /        1 =      41.4750
   5:     14.05 /        1 =      14.0500
   6:    152.75 /        1 =     152.7480

.

However what's perplexing to me (and I would guess what's perplexing to Joro and Mike) is why estimation of a model in the first place is more time intensive when factor variables are used, before even getting to the margins computations. (Joro's example in #2 is a clear demonstration of this.) I'm hoping Stata tech support can shine some light on this.

Comment

Jeff Wooldridge

Join Date: Apr 2014

Posts: 2291
#9

15 Mar 2023, 13:11

Interesting! This suggests I might greatly speed up some nonlinear diff-in-diffs simulations I’ve done.
1 like
Comment
John Mullahy

Join Date: Dec 2016

Posts: 772
#10

16 Mar 2023, 11:36

Here's the instructive response I received from Stata technical support.

Thank you for providing details on your concern regarding the processing time
associated to using factor variables with estimation and postestimation
commands. We passed this report to our developers and after an internal
evaluation, here is a response prepared to address this query.

You are right to point out that factor variables add additional time to
computation. Stata's factor-variable notation provides convenient
specification for categorical variables and for interactions of variables.
More importantly, it allows the -margins- command to perform marginal analysis
and compute marginal effects correctly in the presence of interactions and in
nonlinear models. However, as it is expected with general purpose
funcionalities, there's a cost to make factor variables compatible with all
the broad variety of estimators that support them. Programming an isolated
code for one or two specific cases could certainly significantly speed up the
computational time for a particular situation, but it would also add other
costs.

That is the short story. Below, is the long story. We present only the
necessary details to understand the costs of factor variable notation. We base
our discussion on the example presented on Statalist with 10 dummy variables.

Every Stata command that works with variables in the dataset must read the
value of that variable from each selected observation. Let r be the cost of
reading a value from a variable at a given observation. For a dataset with N
observations, the overall cost of using each value for the entire dataset is
then:

N * r

This would be the cost of computing a mean. For a covariance or correlation
between k variables the cost is then

N * r * k * (k+1)/2

So a regression of y on 10 indicator variables

. regress y x1-x10

is going to cost

N * r * k * (k+1)/2

where k here is 12, but in general is the number of regressors, plus 1 for the
depvar, and plus 1 for the intercept.

For factor variables, in addition to reading the value of the variable for a
given observation, we must then compare that value to the level for a given
factor variable element. Our x variables are binary 0/1, therefore the factor
elements look like

0.x1 1.x1
0.x2 1.x2
...
0.x10 1.x10

So we need to access each factor variable 2 times, and also compare that given
value with the level. Factor variables automatically reorder themselves to
their first appearance when parsed, so we really only need to read the value
of a factor variable once for each observation, then find the level it
indicates. These level comparisons are less expensive than reading a value, so
let's denote that cost by c0. For a single 0binary factor variable the overall
cost of using each value and level for the entire dataset is then

N * (r + c0*2)

assuming we have to make a comparison with each level of our factor
variable. Stata stops looking when it finds a match, so this cost is a
worst case scenario.

In fact, Stata keeps track of the minimum, maximum, and number of levels
of each factor variable, and when all level values are contiguous in the
range, Stata can find the factor's level to indicate for a given value
without making comparisons with each level, so if the level jumping
costs c1 to perform, the actual cost in our example is

N * (r + c1)

While

c0 < c1

the difference is almost immeasurable for 2-level factor variables.
However, the c0 cost adds up quick when using factor variables with many
non-continuous levels.

Now back to your regression model, with all 10 factor variables the cost
is now

N * (r + c1) * k * (k+1)/2

thus by adding factor variables, the added cost for the matrix
accumulation X'X is

N * c1 * k * (k+1)/2

Stata must invert X'X. For the regression

. regress y x1-x10

an 11x11 matrix is inverted. For the regression

. regress y i.(x1-x10)

a 21x21 matrix is inverted, but there are 10 rows and columns composed
of zeros, for the base level of each factor variable, so the inversion
of this matrix should take the same amount of time as the one
constructed without factor variables. However, here is where -margins-
affects how estimators post their results to -e()-. -margins- performs
checks for estimable functions that depend on a matrix constructed from
X'X where the base level indicators are not zero. Inverting this X'X
matrix, which is 21x21 in our factor variables model, takes more time
than the one that is 11x11.

Now lets talk about -margins-. Comparing the -margins- timings

(1) . regress y x1-x10
. margins, dydx(*) nose

to

(2) . regress y i.(x1-x10)
. margins, dydx(*) nose

In (1) -margins- is computing the marginal effects of each x assuming it is a
continuous variable. There is special logic in -margins- for this case, since
the default prediction is -xb-. The marginal effect of x1 is computed from
the analytical derivative of the linear prediction with respect to x1, and so
on for x2, ..., and x10. So the cost here is a linear prediction for each
"continuous" predictor

N * r * k

-margins- checks for estimable functions for each marginal effect, so the real
cost is

N * r * k * 2

In (2) -margins- is computing the difference between the predictions assuming
each factor variable is 1 versus when it is 0. So the cost here is a linear
prediction for each level of each factor variable predictor

N * r * k * 2

with the extra check for estimable functions, the real cost is

N * r * k * 2 * 2

Note, what we said about the marginal effect comparison is reasonable for 0/1
factor variables and linear predictions without interactions.

Factor-variable notation has a cost. But the additional cost comes with a
benefit when we want to obtain effects using -margins- or other postestimation
tools. A benefit that is more transparent with interactions and nonlinear
models. There's no such thing as a free lunch.

I responded with this message. If I learn more I will share in this thread.

This is a useful lesson in the mechanics of how Stata handles factor variables. I appreciate both the brief and the extended discussion.

I do wonder, however, whether there is a special case that could be treated uniquely where all the factor variables are 0-1 dummies, which I speculate may be the most common situation for many analysts. The major value of factor variables in such cases is for -margins- not for estimation per se.

So is there an imaginable algorithm like this one:

[1] Program scans to determine if all factor variables are 0-1 dummies (yes/no).

[2] If yes, estimation proceeds without needing to do anything special. Treat factor variables as if "regular variables" and estimate accordingly.

[3] Post-estimation -margins- accesses the results from [2]. Then in doing its computations -margins- takes into account that some of the variables specified are factor variables (but all of those factor variables are 0-1 dummies).

Here's my conjecture:

a. Step [1] would take very little time.
b. In step [2] the major reduction in computation time would arise.
c. There would be no meaningful change in computation time for step [3].

I'd be interested to hear your reactions.
3 likes
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10482

#11

16 Mar 2023, 20:41

One thing that I noticed from examining the matrices of estimation commands is that if you specify the level using factor variables (e.g., 1.catvar instead of i.catvar), you do not get the base levels included in the stored matrices. That goes to the point:

Stata must invert X'X. For the regression

. regress y x1-x10

an 11x11 matrix is inverted. For the regression

. regress y i.(x1-x10)

a 21x21 matrix is inverted

So that gave me the idea to replicate your examples specifying the level using factor variable notation. For some reason, there are speed improvements for the example in #1 but not for #3, which is somewhat puzzling.

Code:

cap preserve
cap drop _all

sysuse auto

gen trep=rep-1

loc lhs "trep"
loc rhs1 "mpg i.foreign"
loc rhs2 "mpg foreign"
loc rhs3 "mpg 1.foreign"
loc max=4

expand 100
*ssc install timeit, replace
timer clear

/* gmm: first residual equation is of form y - f(x,b)                         */
/*      second residual equation is of form y^2 - g(x,b)                      */

timeit 1: qui gmm (1: `lhs'-`max'*exp({xa:`rhs1' _cons})/(exp({xa:})+exp({xb:`rhs1' _cons}))) ///
    (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
    ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
    (exp({xa:})+exp({xb:})+1))), inst(`rhs1') igmm vce(robust) winit(i)
 
gmm
 


timeit 2: qui gmm (1: `lhs'-`max'*exp({xa:`rhs2' _cons})/(exp({xa:})+exp({xb:`rhs2' _cons}))) ///
    (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
    ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
    (exp({xa:})+exp({xb:})+1))), inst(`rhs2') igmm vce(robust) winit(i)
    
gmm
    

timeit 3: qui gmm (1: `lhs'-`max'*exp({xa:`rhs3' _cons})/(exp({xa:})+exp({xb:`rhs3' _cons}))) ///
    (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
    ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
    (exp({xa:})+exp({xb:})+1))), inst(`rhs3') igmm vce(robust) winit(i)
    
gmm
    



timer list

cap restore

Res.:

Code:

. timer list
   1:      7.39 /        1 =       7.3950
   2:      1.84 /        1 =       1.8360
   3:      3.02 /        1 =       3.0180

Comment

John Mullahy

Join Date: Dec 2016

Posts: 772
#12

17 Mar 2023, 06:45

Based on Andrew Musau's observation in #11 I estimated a couple additional nonlinear GMM specifications. My timer report is consistent with Andrew's finding. For 0-1 dummy variables 1 and 3 use the

Code:

1.var

notation while 2 and 4 use the

Code:

i.var

notation.

Code:

. timer list 1: 47.73 / 1 = 47.7260 2: 423.28 / 1 = 423.2820 3: 56.88 / 1 = 56.8770 4: 301.85 / 1 = 301.8530

Using the 1.var notation instead of the i.var notation yields identical point estimates but different marginal effects (i.e. -margins- doesn't recognize 1.var as categorical).

I wonder if there's a potential modification of - margins- that would allow an option like

Code:

asfactor(var1 var2)

that tells -margins- to treat binary variables var1 and var2 as factor variables in computing the AMEs even though they may not have been specified as such in the original estimation.
2 likes
Comment
John Mullahy

Join Date: Dec 2016

Posts: 772
#13

17 Mar 2023, 09:41

Addendum: In an email exchange regarding this thread that I just had with Enrique Pinzon from StataCorp, he suggested trying the -quickderivatives- option in my nonlinear -gmm- examples.

I tried this and it sped up computation time enormously even though factor variables were included in the x-vector (e.g. 15s vs 238s for one specification). Moreover the point estimates were identical.

Had I used -quickderivatives- in the first place I probably wouldn't have noticed the factor-variable computation time issue. As such I may have unnecessarily opened a Pandora's Box by initiating this thread (even though I, for one, have learned a lot about factor variables as a consequence).
2 likes
Comment

Announcement

GMM and factor variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment