  • Mixed effects growth model

    hi all,

    This is my first time running this type of model, and I just want to make sure that the code matches the verbal explanation of what I'm trying to do.
    I want to look at the change in the percentage of income saved (measured in dollars) using repeated measures data for older adults. I want to compare three groups based on their father's occupation while growing up, and to estimate different slopes and intercepts for the three groups. For example: group A is subjects whose fathers were blue-collar; group B is subjects whose fathers were white-collar; group C is subjects whose fathers were other types of workers.

    I have repeated measures for 10 annual surveys and subjects reported how much of their income they saved. I want to see if the intercept and trajectory differs for the three groups. This is what I come up with but I am not sure if this gets at the appropriate verbal explanation. Thank you very much for your time. I truly appreciate it. Thank you!

    MODEL 1 looks at the impact of time on savings and estimates intercepts and slopes for each subject
    xtmixed savings time || subjects:, var

    MODEL 2 examines the differences in the intercept and slope for the three groups. Or do I need to add fathers_occupation to the right side of || as well?
    xtmixed savings time##fathers_occupation || subjects:, var

  • #2
    Both models do what you describe, except that neither one estimates any random slopes. Based on your description, I believe you want to add time, not fathers_occupation, to the second level of the model. E.g.:
    Code:
    mixed savings c.time##i.fathers_occupation || subjects: time
    Note: In current Stata, the command -xtmixed- has been renamed to -mixed-. And -var- is now the default, so the option need not be specified. If you are using an older version of Stata, you should say so in your post.

    Also, the # and ## operators assume that the variables they are applied to are to be treated as discrete unless you specify otherwise with c. I'm questioning whether you really want time to be discrete here. In growth models, time is usually treated as a continuous variable. So I added it to the code. If you really intend to treat time as discrete, then you will also need to generate explicit indicator variables for the time periods to include in the random slopes components:

    Code:
    forvalues i = 1/10 {
        gen byte time`i' = `i'.time
    }
    mixed savings i.time##i.fathers_occupation || subjects: time1-time10
    This is doable, though, as I say, it would be rather odd for a growth model.

    • #3
      Thank you very much, Clyde. That is truly helpful. So the difference between c.time and time is that "c." tells Stata that the variable is continuous? How come "c." is specified on the left side of || but not the right side? I am also wondering whether I am interpreting this correctly; unfortunately there is no one to assist me in this endeavor. I truncated time to 1-5 so it's not as long in this example.



      Code:
       mixed savings time##fathers_occupation || subject: time
      
      Performing EM optimization:
      
      Performing gradient-based optimization:
      
      Iteration 0:   log likelihood = -103836.07  
      Iteration 1:   log likelihood = -103835.51  
      Iteration 2:   log likelihood = -103835.51  
      
      Computing standard errors:
      
      Mixed-effects ML regression                     Number of obs     =     38,511
      Group variable: subject                          Number of groups  =      8,390
      
                                                      Obs per group:
                                                                    min =          2
                                                                    avg =        4.6
                                                                    max =          6
      
                                                      Wald chi2(23)     =    2182.66
      Log likelihood = -103835.51                     Prob > chi2       =     0.0000
      
      -----------------------------------------------------------------------------------
                    savings |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      ------------------+----------------------------------------------------------------
                   time |
                     1  |  -.4354436   .0716817    -6.07   0.000    -.5759371     -.29495
                     2  |  -.8518655   .0712723   -11.95   0.000    -.9915565   -.7121744
                     3  |   -1.52412   .0721653   -21.12   0.000    -1.665561   -1.382678
                     4  |  -2.037036   .0736227   -27.67   0.000    -2.181334   -1.892739
                     5  |  -2.454391   .0767252   -31.99   0.000     -2.60477   -2.304013
                        |
           fathers_occupation |
                     1  |  -.1373717   .1427089    -0.96   0.336    -.4170759    .1423326
                     2  |  -.1359516   .1921245    -0.71   0.479    -.5125086    .2406055
                     3  |  -.4465857   .3198196    -1.40   0.163    -1.073421    .1802492
                        |
      time#fathers_occupation |
                   1 1  |  -.1742648   .1371687    -1.27   0.204    -.4431105     .094581
                   1 2  |   .2197551   .1844882     1.19   0.234     -.141835    .5813453
                   1 3  |   .2447533    .304215     0.80   0.421    -.3514971    .8410038
                   2 1  |  -.0230717   .1370744    -0.17   0.866    -.2917327    .2455892
                   2 2  |   .1058541   .1838355     0.58   0.565    -.2544568     .466165
                   2 3  |   .7603165   .3053384     2.49   0.013     .1618642    1.358769
                   3 1  |   .0624327   .1390166     0.45   0.653    -.2100348    .3349003
                   3 2  |   .1227907   .1867357     0.66   0.511    -.2432046     .488786
                   3 3  |   .4926164   .3106028     1.59   0.113     -.116154    1.101387
                   4 1  |   .1280688   .1424986     0.90   0.369    -.1512233    .4073609
                   4 2  |   .0999962   .1915535     0.52   0.602    -.2754418    .4754343
                   4 3  |    .643249   .3201277     2.01   0.044     .0158102    1.270688
                   5 1  |  -.0790317   .1490299    -0.53   0.596    -.3711249    .2130616
                   5 2  |  -.0202167   .2006891    -0.10   0.920    -.4135601    .3731266
                   5 3  |   .5310669   .3367388     1.58   0.115    -.1289291    1.191063
                        |
                  _cons |   24.01227   .0735584   326.44   0.000      23.8681    24.15644
      -----------------------------------------------------------------------------------
      
      ------------------------------------------------------------------------------
        Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
      -----------------------------+------------------------------------------------
      subject: Independent          |
                         var(time) |   .2897808   .0123275      .2665992    .3149782
                        var(_cons) |    12.6173   .2565029      12.12444    13.13018
      -----------------------------+------------------------------------------------
                     var(Residual) |   7.268972   .0651627      7.142371    7.397817
      ------------------------------------------------------------------------------
      LR test vs. linear model: chi2(2) = 21743.17              Prob > chi2 = 0.0000
      
      Note: LR test is conservative and provided only for reference.
      Last edited by Leon Edelman; 24 Mar 2017, 07:00.

      • #4
        Clyde Schechter, just a relevant precaution: because this is survey data, doesn't it need to be 'svy'-set and prefixed? I haven't worked on any survey data in more than a decade, so I apologise for my ignorance if I am mistaken here.
        Roman

        • #5
          How come "c." is specified on the left side of || but not the right side?
          Because the right side of the ||, the random effects part of the model, does not support factor variable notation.

          The model you show in #3, -mixed savings time##fathers_occupation || subject: time-, is incorrectly specified. In the fixed part of the model (left of ||) you have implicitly told Stata to treat time as discrete, and in the top part of the output you got, you can see that it does that: it gives you separate estimates for each of the time levels. But in the random part of the model (right of ||) time is a continuous variable. Notice that in the output you only got a single variance component estimate for time, not one for each level of time.

          You need to decide whether you want to model time as discrete or continuous and then structure your model consistently in both parts. If it's continuous:

          Code:
          mixed savings c.time##fathers_occupation || subject: time
          As I have said earlier, in growth models, time is most commonly treated as continuous, not discrete. But there's no law that says it can't be discrete.

          If you want time to be discrete then it's:
          Code:
          forvalues i = 1/5 {
              gen time`i' = `i'.time
          }
          mixed savings i.time##fathers_occupation || subject: time?
          As for this data being from surveys, if a complex survey design was used for sampling, then, yes, Roman is right, the data needs to be -svyset- and the analysis needs a -svy:- prefix. Of course, to do that, Leon will need information about the survey design so that he can properly designate the pweights, psu's and strata in his -svyset- command. (I had not noticed when I originally read #1 that the data came from surveys.)


          • #6
            Thank you http://www.statalist.org/forums/memb...lyde-schechter

            Code:
            . mixed savings c.time##fathers_occupation || subject: time
            
            Performing EM optimization:
            
            Performing gradient-based optimization:
            
            Iteration 0:   log likelihood =  -116221.8  
            Iteration 1:   log likelihood = -116221.45  
            Iteration 2:   log likelihood = -116221.45  
            
            Computing standard errors:
            
            Mixed-effects ML regression                     Number of obs     =     43,075
            Group variable: subject                          Number of groups  =      8,214
            
                                                            Obs per group:
                                                                          min =          2
                                                                          avg =        5.2
                                                                          max =          7
            
                                                            Wald chi2(7)      =    4522.00
            Log likelihood = -116221.45                     Prob > chi2       =     0.0000
            
            -------------------------------------------------------------------------------
                   savings |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            --------------+----------------------------------------------------------------
                     time |  -.5027777   .0094994   -52.93   0.000    -.5213962   -.4841592
                          |
                   fathers_occupation |
                       1  |  -.0137992   .1143851    -0.12   0.904    -.2379898    .2103915
                       2  |  -.0073569   .1550852    -0.05   0.962    -.3113183    .2966045
                       3  |  -.0270225    .262062    -0.10   0.918    -.5406546    .4866096
                          |
            fathers_occupation#c.time |
                       1  |   -.035111   .0187446    -1.87   0.061    -.0718498    .0016278
                       2  |  -.0134632   .0254358    -0.53   0.597    -.0633165    .0363901
                       3  |   .0179926   .0430237     0.42   0.676    -.0663324    .1023175
                          |
                    _cons |   24.35794   .0581638   418.78   0.000     24.24394    24.47194
            -------------------------------------------------------------------------------
            
            ------------------------------------------------------------------------------
              Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
            -----------------------------+------------------------------------------------
            subject: Independent          |
                               var(time) |   .1670434   .0067929      .1542463    .1809022
                              var(_cons) |   12.45193   .2410235      11.98838     12.9334
            -----------------------------+------------------------------------------------
                           var(Residual) |   7.657221   .0645592      7.531727    7.784806
            ------------------------------------------------------------------------------
            LR test vs. linear model: chi2(2) = 23628.50              Prob > chi2 = 0.0000
            
            Note: LR test is conservative and provided only for reference.
            
            . margins fathers_occupation, dydx(time)
            
            Average marginal effects                        Number of obs     =     43,075
            
            Expression   : Linear prediction, fixed portion, predict()
            dy/dx w.r.t. : time
            
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            time         |
                  fathers_occupation |
                      0  |  -.5027777   .0094994   -52.93   0.000    -.5213962   -.4841592
                      1  |  -.5378887   .0161593   -33.29   0.000    -.5695603   -.5062171
                      2  |  -.5162409   .0235954   -21.88   0.000    -.5624871   -.4699948
                      3  |  -.4847852   .0419619   -11.55   0.000     -.567029   -.4025413
            ------------------------------------------------------------------------------
            
            . margins fathers_occupation, at(time=(0(1)5)) vsquish
            
            Adjusted predictions                            Number of obs     =     43,075
            
            Expression   : Linear prediction, fixed portion, predict()
            1._at        : time            =           0
            2._at        : time            =           1
            3._at        : time            =           2
            4._at        : time            =           3
            5._at        : time            =           4
            6._at        : time            =           5
            
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
              _at#fathers_occupation |
                    1 0  |   24.35794   .0581638   418.78   0.000     24.24394    24.47194
                    1 1  |   24.34414   .0984933   247.17   0.000      24.1511    24.53718
                    1 2  |   24.35058   .1437651   169.38   0.000     24.06881    24.63236
                    1 3  |   24.33092   .2555259    95.22   0.000     23.83009    24.83174
                    2 0  |   23.85516   .0553817   430.74   0.000     23.74661    23.96371
                    2 1  |   23.80625   .0938248   253.73   0.000     23.62236    23.99014
                    2 2  |   23.83434   .1369448   174.04   0.000     23.56593    24.10275
                    2 3  |   23.84613   .2433316    98.00   0.000     23.36921    24.32305
                    3 0  |   23.35238   .0541453   431.29   0.000     23.24626    23.45851
                    3 1  |   23.26836   .0918014   253.46   0.000     23.08843    23.44829
                    3 2  |    23.3181   .1339883   174.03   0.000     23.05549    23.58071
                    3 3  |   23.36134   .2380098    98.15   0.000     22.89485    23.82784
                    4 0  |    22.8496   .0545598   418.80   0.000     22.74267    22.95654
                    4 1  |   22.73047   .0925967   245.48   0.000     22.54899    22.91196
                    4 2  |   22.80186   .1351492   168.72   0.000     22.53697    23.06675
                    4 3  |   22.87656   .2400181    95.31   0.000     22.40613    23.34699
                    5 0  |   22.34683   .0565889   394.90   0.000     22.23591    22.45774
                    5 1  |   22.19258   .0961408   230.83   0.000     22.00415    22.38102
                    5 2  |   22.28562   .1403255   158.81   0.000     22.01058    22.56065
                    5 3  |   22.39177   .2491793    89.86   0.000     21.90339    22.88016
                    6 0  |   21.84405   .0600693   363.65   0.000     21.72632    21.96178
                    6 1  |   21.65469   .1021479   211.99   0.000     21.45449     21.8549
                    6 2  |   21.76938   .1490996   146.01   0.000     21.47715    22.06161
                    6 3  |   21.90699   .2647519    82.75   0.000     21.38809    22.42589
            ------------------------------------------------------------------------------
            So after this command: margins fathers_occupation, dydx(time), does it suggest that
            for fathers_occupation = 0 the slope is -.5027777 and fathers_occupation = 2 has a slope of -.5162409? And, are the differences in the slopes statistically significant?

            Where do I interpret the intercept? I have read resources including from previous postings and links to interpretations and margins command but this is very difficult for me alone. Thank you to everyone in the Stata community for helping me to understand. It is very much appreciated.

            I also wonder what the difference is between the two models below. Again, I just want to estimate different intercepts and slopes for each group and be able to report them.

            MODEL 1:
            Code:
            mixed savings fathers_occupation##c.time || subject: time, var
            MODEL 2:
            Code:
            mixed savings c.time || subject: time fathers_occupation, var
            Alternatively, is there any benefit to running the models separately for different levels of fathers_occupation?
            Last edited by Leon Edelman; 24 Mar 2017, 13:17.

            • #7
              does it suggest that
              for fathers_occupation = 0 the slope is -.5027777 and fathers_occupation = 2 has a slope of -.5162409? And, are the differences in the slopes statistically significant?
              Those two numbers are indeed the average slopes for those occupational groups.

              As for statistical significance of the difference between them, you can do that by re-running the -margins- command with the -post- option specified. That makes the -margins- results overwrite the -mixed- results in e(), enabling you to use the -test- command.
              Code:
              margins fathers_occupation, dydx(time) post
              test 2.fathers_occupation = 0.fathers_occupation
              That said, don't be too quick to do statistical significance testing here. Significance testing is very sensitive to sample size. Your sample is pretty large, around 8,000 groups and over 40,000 observations. For a difference not to be statistically significant, it would have to be extremely small. At sample sizes this large, it is likely that even a difference too tiny to have any practical implications will turn out to be "statistically significant." So first I would ask whether a difference in slope of this magnitude, approximately 0.013, is large enough to matter in the real world. If it is, then you might want to confirm that it is also statistically significant. But if it's not, I would just skip the significance testing.

              Where do I interpret the intercept?
              So your estimate for _cons is 24.35794. This means that in the group with fathers_occupation = 0 they start out, on average, with savings = 24.35794 at time 0. In most models of this nature, this intercept term has no practical importance and is usually just ignored. But if the starting value in each group does have some interest in your context, then this is the starting value for the fathers_occupation = 0 group.
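
              If the per-group starting values are of interest, one way to display them directly (a sketch, assuming the model from #6 is still in memory and that time = 0 is the baseline wave) is:
              Code:
              margins fathers_occupation, at(time=0)
              This should reproduce the first four rows (the "1 0" through "1 3" entries) of the -margins fathers_occupation, at(time=(0(1)5))- table already shown in #6: one adjusted intercept per occupational group.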

              I also wonder what is the difference between the two models below.
              The most important difference is that Model 2 is mis-specified. It is almost always wrong (and in this case it is definitely wrong) to include a variable like fathers_occupation in the random slopes without also listing it in the fixed-effects part of the model. So Model 2 is just invalid and should not be used.

              Is there any benefit to running the models separately for different levels of fathers_occupation?
              Not really. If you are worried about heteroscedasticity of the error distribution across occupational groups, you can overcome that by adding the -residuals(independent, by(fathers_occupation))- option to your -mixed- command. That will give you separate estimation of residuals in each of those occupational groups. Modeling all of the groups together has the advantage of regularizing the estimates by "borrowing strength." Admittedly, in a sample as large as yours, this isn't typically an issue (although it could be if one of the occupational classes has only a small number of observations). Really, the only times I would see an advantage to modeling the occupational groups separately would be:

              1. If I thought that the form of the model was different across groups. That is, for example, if I expected savings to grow linearly for fathers_occupation = x but quadratically for fathers_occupation = y, or something like that.

              OR

              2. If I had difficulty getting the combined model estimates to converge, I might have better luck doing each model separately, and I might also identify that the convergence problem was caused by one particular group (in which case I would redo the estimates for all of the groups except that one.)
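
              The -residuals()- by-group adjustment mentioned above would look something like this (a sketch; the -independent- residual type combined with -by()- asks for a separate residual variance in each occupational group):
              Code:
              mixed savings fathers_occupation##c.time || subject: time, residuals(independent, by(fathers_occupation))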


              • #8
                Thank you http://www.statalist.org/forums/memb...lyde-schechter. I feel I now have a much better understanding of what is happening here! Moving forward, suppose I want to test a mediating/moderating hypothesis as such:

                Model 1: testing hypothesis about whether starting point and change in savings of subject varies with respect to their father's occupation
                Code:
                MODEL 1: mixed savings fathers_occupation##c.time || subject: time, mle covariance(unstructured) variance
                Model 2: adjusts for covariates
                Code:
                MODEL 2: mixed savings fathers_occupation##c.time gender income race ethnicity || subject: time, mle covariance(unstructured) variance
                Model 3: test mediating/moderating hypothesis that spouse occupation changes the relationship between savings and fathers occupation
                Code:
                MODEL 3: mixed savings fathers_occupation##c.time##spouse_occupation gender income race ethnicity || subject: time, mle covariance(unstructured) variance
                Is this the appropriate syntax to do so for Model 3?

                • #9
                  These commands are appropriate for testing the corresponding moderating hypotheses.

                  They do not deal with mediation, which is a separate issue and cannot be handled in a single-equation model as far as I am aware. To deal with mediation, you would need to move this to -sem- and that includes emulating a growth model within -sem-. That can be done, but I have no experience with it.

                  I also don't see how one can sensibly posit a mediating role for these variables here, but this is not my content area, so maybe I'm missing something.

                  • #10
                    Thank you http://www.statalist.org/forums/memb...lyde-schechter for your invaluable comments. I am not too sure whether to address this as a separate question. Now suppose I want to examine birth year differences between two groups (1920-1930 vs 1940-1950). If I include age, is that enough? Or do I need to do more? I have read papers on this, but they do not describe much about how they do it. Basically, I now seek to extract different intercepts and slopes for the birth groups, but it seems odd when considering age.

                    • #11
                      If you want to model different intercepts for the different birth-year groups, you just add a variable i.birth_year_group, which is set to 0 for 1920-1930 and 1 for 1940-1950. If you want different slopes, then in addition to that you will need an interaction term i.birth_year_group##c.time.
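
                      In the notation used earlier in the thread, that would be sketched as (birth_year_group is a hypothetical name for the 0/1 indicator described above):
                      Code:
                      mixed savings i.birth_year_group##c.time || subject: time
                      The i.birth_year_group main effect shifts the intercept; the interaction term shifts the slope on time.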

                      You can also do this treating the variation in slope as a continuous linear function of age. In that case you would add age and c.age##c.time instead of the birth_year_group expressions of the preceding paragraph. But continuous age is different from birth year groups. Both are statistically OK: the question is which one makes more sense from a science perspective. One could argue that the 1920-1930 birth year group endured the Great Depression and World War II, whereas the 1940-1950 birth year group grew up in peace and prosperity, so that there is a truly qualitative difference between them with respect to their savings habits. That makes a lot of sense to me, more sense than thinking that there was a gradual linear secular trend in rates of savings over the birth cohorts from 1920 through 1950. But this is way out of my area of content expertise, and I'm just giving you a naive layperson's perspective about it.
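
                      The continuous-age alternative, as a sketch:
                      Code:
                      mixed savings c.age##c.time || subject: time
                      Here c.age##c.time expands to age, time, and their product, so the slope on time varies linearly with age.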

                      • #12
                        Thank you http://www.statalist.org/forums/memb...lyde-schechter for your supportive and influential comments. I wish you were my professor! You are truly helping me more than I can express. One question I wonder about is how to account for differences in age when modeling different intercepts and slopes for the birth year groups, because (obviously) one group is inherently older than the other. Here the goal is to look at birth group differences. Is including age on the left side of || sufficient to do this? Or is there an alternative modeling strategy that ought to be used to deal with the older and younger age groups?

                        Here is what I come up with but not sure if this takes care of that:

                        Code:
                         
                          MODEL 4: mixed savings i.birth_year_group##c.time age fathers_occupation gender income race ethnicity || subject: time, mle covariance(unstructured) variance

                        • #13
                          I think adjusting for age in the way you show is reasonable.

                          • #14
                            Hi! I might have missed something, but would not time and age be very highly correlated here?

                            Kjell Weyde

                            • #15
                              Kjell Weyde Well, Leon would have to provide the answer to that. But as I understand the context, time here refers to time from the onset of some period of observation. So within each person, it will be highly (100%) correlated with age, but in the data set as a whole, not so much, different people being different ages at time 0. What is more problematic conceptually is that we are now talking about a model in which we have representations of all three of age, period (time) and cohort (birth group). And I think it is only the coarse approximate representation of the latter two, combined with the use of a random effects model, that prevents this model from being unidentifiable. But that is not a very secure position to be in. Even though the model will be identified, I suspect that the estimates of the age, period, and cohort effects will be highly correlated with each other, and none of them estimated with much precision.
