  • SEM and GSEM: memory usage and mata

    My question concerns sem and gsem in Stata and their Mata memory usage.

    As you have probably suspected, I have run into the old error r(3900) in my estimation. I am aware there is no solution to this problem besides upgrading the machine and OS (probably not possible), so this question is mainly for my own curiosity.

    I am estimating a rather complicated model with sem on a reasonably large dataset (2,424 observations, 121 paths, 19 variables, all observed). With sem the model estimates with little difficulty using QML (ML with the robust option); however, with gsem (including the listwise option), the exact same model produces an error (see below) from Mata saying it is trying to allocate a real matrix, via J(), of dimension [2424,24795], which I estimate at about half a gigabyte of data.

    J(): 3900 unable to allocate real <tmp>[2424,24795]

    Why such a large increase in memory use? Is it simply the integration methods of gsem, and is there any way to suspend them? And which matrix is this J()?
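
    For reference, here is a stripped-down sketch of the two calls as I ran them (the variable names below are placeholders; the real model has 121 paths over the 19 observed variables):

    Code:
    sem  (y1 <- x1 x2) (y2 <- y1 x3), vce(robust)
    gsem (y1 <- x1 x2) (y2 <- y1 x3), listwise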

  • #2
    Short reply

    Robert can use gsem's dnumerical option (added in the 07oct2013 update) to
    decrease the memory requirements at a speed cost.

    Longer reply

    The gsem command allows for multilevel latent variables and outcomes from
    non-Gaussian families, so gsem is not as memory efficient or as computationally
    efficient as sem for the same model. Specifically, sem is able to compute the
    marginal likelihood directly from the sample mean vector and variance matrix,
    whereas gsem computes the marginal likelihood using quadrature on the
    conditional likelihood values, which are calculated at each observation.

    The analytical derivatives for the gradient vector and Hessian matrix
    implemented in gsem are built up from intermediate calculations that involve
    matrices with as many rows as the dataset and a number of columns that depends
    on the model parameters. The largest intermediate matrices are part of the
    Hessian calculation.

    A saturated SEM model with 19 observed variables has

    Code:
    display 19+comb(19+1,2)
    209
    unique model parameters and

    Code:
    display comb(209+1,2)
    21,945
    unique variance and covariance estimates for the fitted model parameters.

    If gsem were to fit the saturated model, the intermediate matrix used to
    compute the gradient would have rows corresponding to the sample size and 209
    columns, while the intermediate matrix used to compute the Hessian would have
    21,945 columns.

    This is curiously less than the 24,795 columns reported by Robert, but I must
    admit that the above calculations do not account for latent variables and
    zero-valued constraints on covariances. Regardless, I believe these columns
    correspond to the intermediate matrix needed to analytically compute the
    Hessian matrix.
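
    As a rough check on the scale involved, a [2424, 24795] matrix of doubles works
    out to about 0.45 gigabytes, consistent with Robert's half-gigabyte estimate:

    Code:
    display 2424*24795*8/1024^3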

    The dnumerical option was added to gsem in the 07oct2013 update to solve this
    problem. dnumerical causes gsem to use numerical methods for computing the
    gradient and Hessian. Using dnumerical will decrease gsem's memory
    requirements but will increase the time it takes gsem to fit the model.
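
    In practice the option is simply appended to the gsem specification; for
    example, a minimal sketch with placeholder paths mirroring Robert's listwise
    setup:

    Code:
    gsem (y1 <- x1 x2) (y2 <- y1 x3), listwise dnumerical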

    • #3
      Thank you very much.

      I wasn't expecting a potential solution; I just wanted to satisfy my curiosity, which you have done most excellently.



      • #4
        For anyone else who visits here, the speed-memory trade-off for dnumerical may not be worth it. I recently estimated an elaborate model that required about 35GB of memory (on a dataset that takes up less than 500MB). The computation time was about 3 hours without dnumerical; with the dnumerical option added, the estimation took longer than 4 days.

        Your mileage may vary, but know that, for more complex models, the trade-off might be upwards of 1 day for every hour of computation.
