Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • GMM and factor variables

    Does anyone have insights why gmm computation time is much greater when factor variables are used? The question arose in a real application where the additional computation time was problematic but here's a toy (and hopefully replicable) example that demonstrates the issue.
    Code:
    cap preserve
    cap drop _all
    
    sysuse auto
    
    gen trep=rep-1
    
    loc lhs "trep"
    loc rhs1 "mpg i.foreign"
    loc rhs2 "mpg foreign"
    loc max=4
    
    expand 100
    
    timer clear
    
    /* gmm: first residual equation is of form y - f(x,b)                         */
    /*      second residual equation is of form y^2 - g(x,b)                      */
    
    timer on 1
    
    qui gmm (1: `lhs'-`max'*exp({xa:`rhs1' _cons})/(exp({xa:})+exp({xb:`rhs1' _cons}))) ///
        (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
        ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
        (exp({xa:})+exp({xb:})+1))), inst(`rhs1') igmm vce(robust) winit(i)
     
    gmm
     
    timer off 1
    
    timer on 2
    
    qui gmm (1: `lhs'-`max'*exp({xa:`rhs2' _cons})/(exp({xa:})+exp({xb:`rhs2' _cons}))) ///
        (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
        ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
        (exp({xa:})+exp({xb:})+1))), inst(`rhs2') igmm vce(robust) winit(i)
        
    gmm
        
    timer off 2
    
    timer list
    
    cap restore
    Results:
    Code:
    . cap preserve
    
    . cap drop _all
    
    .
    . sysuse auto
    (1978 automobile data)
    
    .
    . gen trep=rep-1
    (5 missing values generated)
    
    .
    . loc lhs "trep"
    
    . loc rhs1 "mpg i.foreign"
    
    . loc rhs2 "mpg foreign"
    
    . loc max=4
    
    .
    . expand 100
    (7,326 observations created)
    
    .
    . timer clear
    
    .
    . /* gmm: first residual equation is of form y - f(x,b)                         */
    . /*      second residual equation is of form y^2 - g(x,b)                      */
    .
    . timer on 1
    
    .
    . qui gmm (1: `lhs'-`max'*exp({xa:`rhs1' _cons})/(exp({xa:})+exp({xb:`rhs1' _cons}))) ///
    >     (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
    >     ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
    >     (exp({xa:})+exp({xb:})+1))), inst(`rhs1') igmm vce(robust) winit(i)
    
    .   
    . gmm
    
    GMM estimation
    
    Number of parameters =   6
    Number of moments    =   6
    Initial weight matrix: Identity                   Number of obs   =      6,900
    GMM weight matrix:     Robust
    
    ------------------------------------------------------------------------------
                 |               Robust
                 | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    xa           |
             mpg |  -.0892447   .0047576   -18.76   0.000    -.0985693     -.07992
                 |
         foreign |
        Foreign  |   .6791125   .0533686    12.72   0.000      .574512     .783713
           _cons |   2.747949   .1006754    27.30   0.000     2.550628    2.945269
    -------------+----------------------------------------------------------------
    xb           |
             mpg |   -.124634   .0059788   -20.85   0.000    -.1363523   -.1129158
                 |
         foreign |
        Foreign  |   -.637334   .0660776    -9.65   0.000    -.7668436   -.5078243
           _cons |   3.418192   .1225606    27.89   0.000     3.177977    3.658406
    ------------------------------------------------------------------------------
    Instruments for equation 1: mpg 0b.foreign 1.foreign _cons
    Instruments for equation 2: mpg 0b.foreign 1.foreign _cons
    
    .   
    . timer off 1
    
    .
    . timer on 2
    
    .
    . qui gmm (1: `lhs'-`max'*exp({xa:`rhs2' _cons})/(exp({xa:})+exp({xb:`rhs2' _cons}))) ///
    >     (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
    >     ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
    >     (exp({xa:})+exp({xb:})+1))), inst(`rhs2') igmm vce(robust) winit(i)
    
    .     
    . gmm
    
    GMM estimation
    
    Number of parameters =   6
    Number of moments    =   6
    Initial weight matrix: Identity                   Number of obs   =      6,900
    GMM weight matrix:     Robust
    
    ------------------------------------------------------------------------------
                 |               Robust
                 | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    xa           |
             mpg |  -.0892447   .0047576   -18.76   0.000    -.0985693     -.07992
         foreign |   .6791125   .0533686    12.72   0.000      .574512     .783713
           _cons |   2.747949   .1006754    27.30   0.000     2.550628    2.945269
    -------------+----------------------------------------------------------------
    xb           |
             mpg |   -.124634   .0059788   -20.85   0.000    -.1363523   -.1129158
         foreign |   -.637334   .0660776    -9.65   0.000    -.7668436   -.5078243
           _cons |   3.418192   .1225606    27.89   0.000     3.177977    3.658406
    ------------------------------------------------------------------------------
    Instruments for equation 1: mpg foreign _cons
    Instruments for equation 2: mpg foreign _cons
    
    .     
    . timer off 2
    
    .
    . timer list
       1:     14.28 /        1 =      14.2830
       2:      5.30 /        1 =       5.3010
    
    .
    . cap restore    
    
    .

  • #2
    In my experience the factor variable notation generally slows down Stata. I think this is a general observation, and not something specific to -gmm-. I think for every computationally intensive task you are better off in terms of completion time if you manually generate the dummies and bypass factor variable notation.

    Here is an example that does not involve -gmm-:

    Code:
    . clear all
    
    . timer clear
    
    . set obs 100000000
    Number of observations (_N) was 0, now 100,000,000.
    
    . gen y = rnormal()
    
    . gen x = uniform()>.3
    
    . timeit 1: reg y x
    
          Source |       SS           df       MS      Number of obs   = 100000000
    -------------+----------------------------------   F(1, 99999998)  =      2.03
           Model |  2.02804723         1  2.02804723   Prob > F        =    0.1545
        Residual |   100021400  99999998  1.00021402   R-squared       =    0.0000
    -------------+----------------------------------   Adj R-squared   =    0.0000
           Total |   100021402  99999999  1.00021403   Root MSE        =    1.0001
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |  -.0003108   .0002182    -1.42   0.154    -.0007385     .000117
           _cons |  -.0000313   .0001826    -0.17   0.864    -.0003891    .0003266
    ------------------------------------------------------------------------------
    
    . timeit 2: reg y i.x
    
          Source |       SS           df       MS      Number of obs   = 100000000
    -------------+----------------------------------   F(1, 99999998)  =      2.03
           Model |  2.02804723         1  2.02804723   Prob > F        =    0.1545
        Residual |   100021400  99999998  1.00021402   R-squared       =    0.0000
    -------------+----------------------------------   Adj R-squared   =    0.0000
           Total |   100021402  99999999  1.00021403   Root MSE        =    1.0001
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             1.x |  -.0003108   .0002182    -1.42   0.154    -.0007385     .000117
           _cons |  -.0000313   .0001826    -0.17   0.864    -.0003891    .0003266
    ------------------------------------------------------------------------------
    
    . timer list
       1:      7.17 /        1 =       7.1750
       2:     18.95 /        1 =      18.9540
    
    .

    Comment


    • #3
      Thanks Joro.I appreciate your taking time to provide a helpful example. It's interesting that the relative time difference in the two examples is similar (~2.5).

      In my real example (nonlinear gmm with a large sample and several RHS dummy variables) I want to use factor variables so that margins would compute the average marginal effects properly for the dummy variables. I'll figure out some workaround (or will be patient and wait for the factor-variable specification to converge 😀).

      Apart from my specific application I'd be keen to learn why it is that this happens, i.e. what is it about the Stata algorithms that results in such large differences in computation time.

      One hypothesis I had was that there was some overhead involved in setting up an algorithm to accommodate any factor variables but that the number of factor variables specified >=1 wouldn't matter with respect to computation time. This seems not to be correct, using a modification of your example:
      Code:
      .
      . timer clear
      
      .
      . set obs 5000000
      Number of observations (_N) was 0, now 5,000,000.
      
      .
      . gen y = rnormal()
      
      .
      . forval j=1/10 {
        2.  gen x`j'=uniform()>.5
        3. }
      
      .
      . timeit 1: qui reg y x1-x10
      
      . timeit 2: qui reg y i.x1 x2-x10
      
      . timeit 3: qui reg y i.(x1-x10)
      
      .
      . timer list
         1:      2.52 /        1 =       2.5230
         2:      5.48 /        1 =       5.4820
         3:      9.32 /        1 =       9.3210
      P.S. Thanks for teaching me about timeit. I was unaware of its existence.

      Comment


      • #4
        I'll figure out some workaround
        In case you are not aware, FernandoRios has written a command f_able that allows you to compute marginal effects correctly without having used factor variables. Perhaps it can be helpful.

        Code:
        ssc install f_able, clear
        help f_able

        Comment


        • #5
          I'd bring this to the attention of Stata tech support. I can't offhand think of any good reason why this should affect estimation time---but then, I'm no expert. I have previously encountered at least one Stata setting that was not needed or relevant to an estimation, but which grossly affected estimation time. (That was the setting for max matrix size, which had effects even in situations in which no large matrices were being used.) I wonder if there might be some similarly "irrelevant" setting here that could be changed and solve the problem. Regardless, I'd find it interesting to hear what tech support says.

          Comment


          • #6
            Thanks Andrew and Mike.

            Andrew: I was not aware of f_able (or if I was aware of it I forgot about it). I will definitely investigate it.

            Mike: I just contacted Stata tech support. Whatever I learn I'll post to this thread.

            Comment


            • #7
              Hi John
              If I may
              1. f_able works with any transformation of continuous variables. It doesn't work with dummies.
              2. This is why there are differences in marginal effects in your exercises!
              when using the following:
              reg y x1-x10
              margins use the implicit derivative to get the marginal effect. (the straightforward coefficient)

              When using dummies:
              reg y i.x1 i.x2 ... i.x10
              Stata actuallyhas to work harder. It compares the outcome when x1=0 and x1=1 (or all other values it contains). So at least double the time.

              On this regard f_able would be much slower, because it relies of numerical derivation.

              say
              reg y x1 x1sq

              It creates Yhat, then adds the epislon to x1,,say x1*, f_able recalculates x1sq , say x1sq* =(x1+e)^2. predicts yhat (yhat*) and gets the derivative
              dydx = (yhat* - yhat)/e

              Because of that, f_able can be very slow, (the cost of allowing for other function transformations)

              F

              Comment


              • #8
                Thanks for this clarification, Fernando.

                I can appreciate why computations with margins might be more complicated (and thus more time intensive) when factor variables are used. This is (perhaps) evident in an extension of the example in #3.
                Code:
                .
                . timer clear
                
                .
                . set seed 2345
                
                .
                . set obs 5000000
                Number of observations (_N) was 0, now 5,000,000.
                
                .
                . gen y = rnormal()
                
                .
                . forval j=1/10 {
                  2.  gen x`j'=uniform()>.5
                  3. }
                
                .
                . timeit 1: qui reg y x1-x10
                
                . timeit 2: qui margins, dydx(*) nose
                
                . timeit 3: qui reg y i.x1 x2-x10
                
                . timeit 4: qui margins, dydx(*) nose
                
                . timeit 5: qui reg y i.(x1-x10)
                
                . timeit 6: qui margins, dydx(*) nose
                
                .
                . timer list
                   1:      2.53 /        1 =       2.5340
                   2:     41.52 /        1 =      41.5240
                   3:      5.57 /        1 =       5.5730
                   4:     41.48 /        1 =      41.4750
                   5:     14.05 /        1 =      14.0500
                   6:    152.75 /        1 =     152.7480
                
                .
                However what's perplexing to me (and I would guess what's perplexing to Joro and Mike) is why estimation of a model in the first place is more time intensive when factor variables are used, before even getting to the margins computations. (Joro's example in #2 is a clear demonstration of this.) I'm hoping Stata tech support can shine some light on this.

                Comment


                • #9
                  Interesting! This suggests I might greatly speed up some nonlinear diff-in-diffs simulations I’ve done.

                  Comment


                  • #10
                    Here's the instructive response I received from Stata technical support.

                    Thank you for providing details on your concern regarding the processing time
                    associated to using factor variables with estimation and postestimation
                    commands. We passed this report to our developers and after an internal
                    evaluation, here is a response prepared to address this query.

                    You are right to point out that factor variables add additional time to
                    computation. Stata's factor-variable notation provides convenient
                    specification for categorical variables and for interactions of variables.
                    More importantly, it allows the -margins- command to perform marginal analysis
                    and compute marginal effects correctly in the presence of interactions and in
                    nonlinear models. However, as it is expected with general purpose
                    funcionalities, there's a cost to make factor variables compatible with all
                    the broad variety of estimators that support them. Programming an isolated
                    code for one or two specific cases could certainly significantly speed up the
                    computational time for a particular situation, but it would also add other
                    costs.

                    That is the short story. Below, is the long story. We present only the
                    necessary details to understand the costs of factor variable notation. We base
                    our discussion on the example presented on Statalist with 10 dummy variables.

                    Every Stata command that works with variables in the dataset must read the
                    value of that variable from each selected observation. Let r be the cost of
                    reading a value from a variable at a given observation. For a dataset with N
                    observations, the overall cost of using each value for the entire dataset is
                    then:

                    N * r

                    This would be the cost of computing a mean. For a covariance or correlation
                    between k variables the cost is then

                    N * r * k * (k+1)/2


                    So a regression of y on 10 indicator variables

                    . regress y x1-x10

                    is going to cost

                    N * r * k * (k+1)/2

                    where k here is 12, but in general is the number of regressors, plus 1 for the
                    depvar, and plus 1 for the intercept.

                    For factor variables, in addition to reading the value of the variable for a
                    given observation, we must then compare that value to the level for a given
                    factor variable element. Our x variables are binary 0/1, therefore the factor
                    elements look like

                    0.x1 1.x1
                    0.x2 1.x2
                    ...
                    0.x10 1.x10

                    So we need to access each factor variable 2 times, and also compare that given
                    value with the level. Factor variables automatically reorder themselves to
                    their first appearance when parsed, so we really only need to read the value
                    of a factor variable once for each observation, then find the level it
                    indicates. These level comparisons are less expensive than reading a value, so
                    let's denote that cost by c0. For a single 0binary factor variable the overall
                    cost of using each value and level for the entire dataset is then

                    N * (r + c0*2)

                    assuming we have to make a comparison with each level of our factor
                    variable. Stata stops looking when it finds a match, so this cost is a
                    worst case scenario.

                    In fact, Stata keeps track of the minimum, maximum, and number of levels
                    of each factor variable, and when all level values are contiguous in the
                    range, Stata can find the factor's level to indicate for a given value
                    without making comparisons with each level, so if the level jumping
                    costs c1 to perform, the actual cost in our example is

                    N * (r + c1)

                    While

                    c0 < c1

                    the difference is almost immeasurable for 2-level factor variables.
                    However, the c0 cost adds up quick when using factor variables with many
                    non-continuous levels.

                    Now back to your regression model, with all 10 factor variables the cost
                    is now

                    N * (r + c1) * k * (k+1)/2

                    thus by adding factor variables, the added cost for the matrix
                    accumulation X'X is

                    N * c1 * k * (k+1)/2

                    Stata must invert X'X. For the regression

                    . regress y x1-x10

                    an 11x11 matrix is inverted. For the regression

                    . regress y i.(x1-x10)

                    a 21x21 matrix is inverted, but there are 10 rows and columns composed
                    of zeros, for the base level of each factor variable, so the inversion
                    of this matrix should take the same amount of time as the one
                    constructed without factor variables. However, here is where -margins-
                    affects how estimators post their results to -e()-. -margins- performs
                    checks for estimable functions that depend on a matrix constructed from
                    X'X where the base level indicators are not zero. Inverting this X'X
                    matrix, which is 21x21 in our factor variables model, takes more time
                    than the one that is 11x11.

                    Now lets talk about -margins-. Comparing the -margins- timings

                    (1) . regress y x1-x10
                    . margins, dydx(*) nose

                    to

                    (2) . regress y i.(x1-x10)
                    . margins, dydx(*) nose


                    In (1) -margins- is computing the marginal effects of each x assuming it is a
                    continuous variable. There is special logic in -margins- for this case, since
                    the default prediction is -xb-. The marginal effect of x1 is computed from
                    the analytical derivative of the linear prediction with respect to x1, and so
                    on for x2, ..., and x10. So the cost here is a linear prediction for each
                    "continuous" predictor

                    N * r * k

                    -margins- checks for estimable functions for each marginal effect, so the real
                    cost is

                    N * r * k * 2

                    In (2) -margins- is computing the difference between the predictions assuming
                    each factor variable is 1 versus when it is 0. So the cost here is a linear
                    prediction for each level of each factor variable predictor

                    N * r * k * 2

                    with the extra check for estimable functions, the real cost is

                    N * r * k * 2 * 2

                    Note, what we said about the marginal effect comparison is reasonable for 0/1
                    factor variables and linear predictions without interactions.

                    Factor-variable notation has a cost. But the additional cost comes with a
                    benefit when we want to obtain effects using -margins- or other postestimation
                    tools. A benefit that is more transparent with interactions and nonlinear
                    models. There's no such thing as a free lunch.

                    I responded with this message. If I learn more I will share in this thread.
                    This is a useful lesson in the mechanics of how Stata handles factor variables. I appreciate both the brief and the extended discussion.

                    I do wonder, however, whether there is a special case that could be treated uniquely where all the factor variables are 0-1 dummies, which I speculate may be the most common situation for many analysts. The major value of factor variables in such cases is for -margins- not for estimation per se.

                    So is there an imaginable algorithm like this one:

                    [1] Program scans to determine if all factor variables are 0-1 dummies (yes/no).

                    [2] If yes, estimation proceeds without needing to do anything special. Treat factor variables as if "regular variables" and estimate accordingly.

                    [3] Post-estimation -margins- accesses the results from [2]. Then in doing its computations -margins- takes into account that some of the variables specified are factor variables (but all of those factor variables are 0-1 dummies).

                    Here's my conjecture:

                    a. Step [1] would take very little time.
                    b. In step [2] the major reduction in computation time would arise.
                    c. There would be no meaningful change in computation time for step [3].

                    I'd be interested to hear your reactions.

                    Comment


                    • #11
                      One thing that I noticed from examining the matrices of estimation commands is that if you specify the level using factor variables (e.g., 1.catvar instead of i.catvar), you do not get the base levels included in the stored matrices. That goes to the point:

                      Stata must invert X'X. For the regression

                      . regress y x1-x10

                      an 11x11 matrix is inverted. For the regression

                      . regress y i.(x1-x10)

                      a 21x21 matrix is inverted
                      So that gave me the idea to replicate your examples specifying the level using factor variable notation. For some reason, there are speed improvements for the example in #1 but not for #3, which is somewhat puzzling.

                      Code:
                      cap preserve
                      cap drop _all
                      
                      sysuse auto
                      
                      gen trep=rep-1
                      
                      loc lhs "trep"
                      loc rhs1 "mpg i.foreign"
                      loc rhs2 "mpg foreign"
                      loc rhs3 "mpg 1.foreign"
                      loc max=4
                      
                      expand 100
                      *ssc install timeit, replace
                      timer clear
                      
                      /* gmm: first residual equation is of form y - f(x,b)                         */
                      /*      second residual equation is of form y^2 - g(x,b)                      */
                      
                      timeit 1: qui gmm (1: `lhs'-`max'*exp({xa:`rhs1' _cons})/(exp({xa:})+exp({xb:`rhs1' _cons}))) ///
                          (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
                          ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
                          (exp({xa:})+exp({xb:})+1))), inst(`rhs1') igmm vce(robust) winit(i)
                       
                      gmm
                       
                      
                      
                      timeit 2: qui gmm (1: `lhs'-`max'*exp({xa:`rhs2' _cons})/(exp({xa:})+exp({xb:`rhs2' _cons}))) ///
                          (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
                          ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
                          (exp({xa:})+exp({xb:})+1))), inst(`rhs2') igmm vce(robust) winit(i)
                          
                      gmm
                          
                      
                      timeit 3: qui gmm (1: `lhs'-`max'*exp({xa:`rhs3' _cons})/(exp({xa:})+exp({xb:`rhs3' _cons}))) ///
                          (2: `lhs'^2-(`max'^2)*(exp({xa:})/(exp({xa:})+exp({xb:}))^2)*             ///
                          ((exp({xa:})+exp({xb:})+exp({xa:})^2+exp({xa:})*exp({xb:}))/              ///
                          (exp({xa:})+exp({xb:})+1))), inst(`rhs3') igmm vce(robust) winit(i)
                          
                      gmm
                          
                      
                      
                      
                      timer list
                      
                      cap restore

                      Res.:

                      Code:
                      . timer list
                         1:      7.39 /        1 =       7.3950
                         2:      1.84 /        1 =       1.8360
                         3:      3.02 /        1 =       3.0180

                      Comment


                      • #12
                        Based on Andrew Musau's observation in #11 I estimated a couple additional nonlinear GMM specifications. My timer report is consistent with Andrew's finding. For 0-1 dummy variables 1 and 3 use the
                        Code:
                        1.var
                        notation while 2 and 4 use the
                        Code:
                        i.var
                        notation.
                        Code:
                        . timer list
                           1:     47.73 /        1 =      47.7260
                           2:    423.28 /        1 =     423.2820
                           3:     56.88 /        1 =      56.8770
                           4:    301.85 /        1 =     301.8530
                        Using the 1.var notation instead of the i.var notation yields identical point estimates but different marginal effects (i.e. -margins- doesn't recognize 1.var as categorical).

                        I wonder if there's a potential modification of - margins- that would allow an option like
                        Code:
                        asfactor(var1 var2)
                        that tells -margins- to treat binary variables var1 and var2 as factor variables in computing the AMEs even though they may not have been specified as such in the original estimation.

                        Comment


                        • #13
                          Addendum: In an email exchange regarding this thread that I just had with Enrique Pinzon from StataCorp, he suggested trying the -quickderivatives- option in my nonlinear -gmm- examples.

                          I tried this and it sped up computation time enormously even though factor variables were included in the x-vector (e.g. 15s vs 238s for one specification). Moreover the point estimates were identical.

                          Had I used -quickderivatives- in the first place I probably wouldn't have noticed the factor-variable computation time issue. As such I may have unnecessarily opened a Pandora's Box by initiating this thread (even though I, for one, have learned a lot about factor variables as a consequence).

                          Comment

                          Working...
                          X