  • Predicting out-of-sample BLUPs after mixed

    Hi,

    I am trying to run a k-fold cross-validation of a linear mixed-effects model, fitting the model on k-1 folds and then predicting the outcome in the fold left out of estimation.

    When I use the predict command with the fitted option (that is, including contributions from the random effects), Stata returns missing values for all observations in the left-out fold.
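
    To illustrate, here is a minimal sketch of what I am doing (y, x1, x2, id, and fold are placeholder names for my outcome, covariates, cluster identifier, and fold indicator):

        mixed y x1 x2 if fold != 1 || id:
        predict yhat if fold == 1, fitted
        * yhat comes back missing for every observation in fold 1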

    A technical note in the Stata manual, under mixed postestimation (Postestimation tools for mixed), states:
    Out-of-sample predictions are permitted after mixed, but if these predictions involve BLUPs of random effects, the integrity of the estimation data must be preserved. If the estimation data have changed since the mixed model was fit, predict will be unable to obtain predicted random effects that are appropriate for the fitted model and will give an error. Thus to obtain out-of-sample predictions that contain random-effects terms, be sure that the data for these predictions are in observations that augment the estimation data.

    This suggests that out-of-sample predictions involving BLUPs are possible. But I am not sure I understand what is meant by preserving the integrity of the estimation data, or by making sure that the data for these predictions are in observations that augment the estimation data. Can someone please elaborate?

    Many thanks.
    Last edited by Ayesha SAhmed; 07 Aug 2024, 05:14.

  • #2
    See this thread. All of the observations within the held-out clusters have their own values on the predictors in the model. Accordingly, you can make predictions on held-out data based on those values combined with the parameter estimates from the model. But because these clusters were not in the original model, there is no information about them with which to predict their cluster-level contribution. If, on the other hand, at least one or two observations from those clusters are in the training data, then the model can make predictions about the cluster, because some information about that cluster was included in the original estimation.
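
    As a rough illustration (the variable names here are made up: id is the cluster identifier, fold marks the folds, and splitsample requires a recent version of Stata), splitting observations rather than whole clusters into folds keeps every cluster represented in the training data, and you can check directly whether the held-out fold contains clusters the model has never seen:

        * assign observations (not whole clusters) to folds, so that each cluster
        * appears in the training data and its BLUP can be formed
        set seed 12345
        splitsample, generate(fold) nsplit(5)

        * flag held-out observations whose cluster also appears in the training folds;
        * predict, fitted will be missing wherever this flag is 0
        bysort id: egen in_training = max(fold != 1)
        tab in_training if fold == 1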

    The mixed-effects model borrows information from the population (fixed-effects) estimates, combined with what it knows about each cluster, to provide a compromise prediction. The more information (observations) it has for a given cluster, the less its prediction is pulled toward the population estimate. This is a very nice quality for prediction tasks, and there is some evidence that mixed models are competitive with machine learning algorithms for prediction.
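
    To make that concrete, for a simple random-intercept model (a standard result, written here schematically rather than quoted from the manual) the predicted random intercept for cluster j is roughly

        u_hat_j = [ n_j * tau^2 / (n_j * tau^2 + sigma^2) ] * rbar_j

    where n_j is the number of estimation observations in cluster j, tau^2 is the random-intercept variance, sigma^2 is the residual variance, and rbar_j is the cluster's mean fixed-effects residual. As n_j grows, the bracketed factor approaches 1 and the prediction leans on the cluster itself; when n_j is 0, nothing is left but the population (fixed-effects) prediction.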

    If you want to predict for held-out clusters, one approach would be to create an aggregate dataset with variables corresponding to the cluster means of all variables involved in your model. Then you can split the sample, run OLS (with each cluster contributing one row/observation) on the aggregate variables, and predict the values for the held-out clusters. It is not the same thing, but it is in the same spirit.
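
    A rough sketch of that idea in code (y, x1, x2, and id are placeholders for your variables):

        preserve                                  // collapse replaces the data in memory
        collapse (mean) y x1 x2, by(id)           // one row per cluster, holding cluster means
        set seed 2024
        splitsample, generate(fold) nsplit(5)     // folds defined at the cluster level
        regress y x1 x2 if fold != 1              // OLS on the training clusters
        predict yhat_agg if fold == 1             // predictions for the held-out clusters
        * ...inspect or save the predictions before restoring the original data...
        restore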



    • #3
      Thank you for the clear explanation, the useful thread and the suggestion.
