Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • xtgee syntax and postestimation for multiple timepoints study

    Hello, I sincerely apologise in advance for the basic-ness of this question but I am finding it difficult to source answers myself despite reading a number of texts on generalized estimating equations (GEE).

    I need help with the syntax I am using for xtgee with respect to my dataset and hypotheses, and then the correct syntax for postestimation commands particularly checking residuals.

    Stata version 14.2

    Background:
    I am running a randomized clinical trial whereby outcomes are measured at 3 time points (baseline:0, 8 weeks:1, 16 weeks:2). There are two groups (intervention:1, waitlist control:2). The intervention takes place between 0 and 8 weeks, then the 16 week timepoint is follow-up/retention. Randomization was stratified by sex (male:1 vs. female:2) and gross motor function level (gmfcs 1 or 2 vs. gmfcs 3) resulting in 4 strata. The main outcome copmp is a continuous variable with a value between 1 and 10, and generally has a negatively skewed distribution at baseline with a more normal distribution at time 1 and 2 for the intervention group.

    Here is an example of my data in wide format:
    Code:
    input int idno float(copmp0 copms0 copmc0 copmp1 copms1 copmc1 copmp2 copms2 copmc2) byte(cohort group sex gmfcs)
    1030 3.66667       2 8.66667       .       .       .       .       .        . 2 2 1 3
    1727 3.33333 3.66667 8.66667       5 2.33333 8.66667 2.66667 2.66667        7 3 2 1 1
    1969 3.33333 4.66667       4 2.33333 8.33333       2 4.66667 5.33333        5 1 2 1 1
    2010       3 4.33333       7 1.66667       3 8.33333 1.66667 1.33333  9.66667 5 2 2 3
    2199       1 3.66667 8.33333       8       8 7.66667 4.66667 5.33333  7.33333 6 1 2 2
    Correlations look like this:
    Code:
    . correlate copmp0 copmp1 copmp2
    (obs=30)
    
                 |   copmp0   copmp1   copmp2
    -------------+---------------------------
          copmp0 |   1.0000
          copmp1 |   0.0876   1.0000
          copmp2 |   0.1089   0.8455   1.0000
    
    
    . by group, sort : correlate copmp0 copmp1 copmp2
    
    ----------------------------------------------------------------------------------------------------------------------------------------------------
    -> group = 1
    (obs=15)
    
                 |   copmp0   copmp1   copmp2
    -------------+---------------------------
          copmp0 |   1.0000
          copmp1 |  -0.1714   1.0000
          copmp2 |   0.0080   0.4715   1.0000
    
    
    ----------------------------------------------------------------------------------------------------------------------------------------------------
    -> group = 2
    (obs=15)
    
                 |   copmp0   copmp1   copmp2
    -------------+---------------------------
          copmp0 |   1.0000
          copmp1 |   0.2052   1.0000
          copmp2 |   0.1160   0.8459   1.0000
    Hypothesis:
    My hypothesis is that copmp will be significantly greater in the intervention group compared to the waitlist following the intervention (8 weeks:1) and at follow up (16 weeks:2). I specified in the protocol that the stratification factors (sex, gmfcs) will also be covariables although I don't really know if this is strictly necessary (I realise I have to collapse gmfcs into 2 categories if I was to do this).

    Problem:
    The syntax I have been using is a mash of what my colleagues have recommended plus what I have guessed based on my readings. In terms of the specifications, I have been using the identity link and gaussian distribution. Originally my supervisor recommended an exchangeable correlation matrix but from my readings this doesn't seem to suit my dataset.
    Code:
    . xi: xtgee copmp i.group*i.time, i(idno) t(time) family(gaussian) link(identity) corr(exchangeable)
    i.group           _Igroup_1-2         (naturally coded; _Igroup_1 omitted)
    i.time            _Itime_0-2          (naturally coded; _Itime_0 omitted)
    i.group*i.time    _IgroXtim_#_#       (coded as above)
    
    Iteration 1: tolerance = .02625145
    Iteration 2: tolerance = .00015618
    Iteration 3: tolerance = 8.173e-07
    
    GEE population-averaged model                   Number of obs     =        100
    Group variable:                       idno      Number of groups  =         37
    Link:                             identity      Obs per group:
    Family:                           Gaussian                    min =          1
    Correlation:                  exchangeable                    avg =        2.7
                                                                  max =          3
                                                    Wald chi2(5)      =     162.78
    Scale parameter:                  2.904201      Prob > chi2       =     0.0000
    
    -------------------------------------------------------------------------------
            copmp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
        _Igroup_2 |  -.0855261   .5605329    -0.15   0.879     -1.18415    1.013098
         _Itime_1 |   4.708333    .469458    10.03   0.000     3.788213    5.628454
         _Itime_2 |   4.485148   .4977298     9.01   0.000     3.509616     5.46068
    _IgroXtim_2_1 |  -3.663343   .6825294    -5.37   0.000    -5.001076    -2.32561
    _IgroXtim_2_2 |  -3.379046   .7022752    -4.81   0.000     -4.75548   -2.002612
            _cons |   2.791667   .4016771     6.95   0.000     2.004394    3.578939
    -------------------------------------------------------------------------------
    Another issue is after I have done this, I have used predict residuals postestimation command, and the scatterplot of this seems nonsensical to me. I am sure I must be missing a step.

    Questions:
    1. Could you please comment on the choice of correlation matrix, or give me a suggestion about how I might choose the most appropriate one?
    2. Could you comment on my syntax and let me know if there is a more appropriate/simplified code to use to answer my hypotheses?
    3. Could you suggest whether it would be appropriate to include the stratification factors as covariables and how I might decide this?
    4. I need a stepwise process to look at postestimation, particularly examining residuals including the syntax I would use to both create the variables and then graph/plot

    I would be sincerely and genuinely grateful for any help. Happy to provide any extra information as necessary.

    Kind regards, Sarah

  • #2
    Well, I hate to disappoint you, but it looks to me like you are doing things pretty much correctly!

    1. Correlation matrix. This is usually an a priori decision. The exchangeable correlation is most appropriate when the data structure is one of repeated measures on the same sampling units using the same measurement approach on all occasions. If you have doubts about its appropriateness here, one approach might be to rerun the analysis using the unstructured option: this imposes no constraints. And then you run -estat wcorrelation- to see what the estimated correlation matrix turned out to be. If it looks more or less like an exchangeable correlation matrix, then you could go back to the results you got with that. Or perhaps the estimated matrix will look like one of the other options. Or maybe it will look idiosyncratic and you would want to stay with the unstructured version. Now, bear in mind that the use of an unstructured correlation matrix increases the number of degrees of freedom in your model, and sometimes you run into convergence problems. So this approach doesn't always work, but I think it's worth a try. You only have three time periods, so it isn't an exorbitant number of degrees of freedom to add to your model (though it would be comforting if there were more participants in your study.)

    2. Your syntax is appropriate but archaic. -xi:- is a mostly obsolete approach now that Stata has factor-variable notation. See -help fvvarlist-. There are still a few situations where -xi- is needed, but this isn't one of them. And using -xi- does do you some harm, because it prevents you from taking advantage of the -margins- command. -margins- is a godsend to anyone doing interaction models, as you have here. So you would be well advised to revise your syntax to:

    Code:
    xtgee copmp i.group##i.time, i(idno) t(time) family(gaussian) link(identity) corr(exchangeable)
    If you really want to pare down the syntax to a minimum, you can drop the link(identity) and family(gaussian) options as they are the default values for those options. Personally, I like having them there explicitly, and wouldn't make this change, but others might.

    3. Well, you really want to add not just sex and gmfcs but their interaction to reflect that you blocked on a 2 x 2 structure. So you would do that by just sticking i.sex##i.gmfcs on the end of the variable list in the -xtgee- command.

    4. I don't know what you have in mind here. The way to calculate residuals after -xtgee- is:
    Code:
    predict yhat, mu
    gen resid = copmp - yhat
    You don't explain what about your residual plots didn't make sense to you, nor do you show them, nor the code you used to generate them. So it's really impossible to comment on this.

    Added unsolicited advice. To simplify the interpretation of your results for this model, after you switch to factor variable notation, you will want to run the following commands:

    Code:
    // PREDICTED VALUES OF OUTCOME IN EACH
    // COMBINATION OF GROUP AND TIME
    margins group#time
    
    // MARGINAL EFFECT OF TREATMENT GROUP 
    // AT EACH TIME PERIOD
    margins time, dydx(group)
    
    // MARGINAL EFFECT OF PASSAGE OF TIME
    // IN EACH TREATMENT GROUP
    margins group, dydx(time)
    While you could, in principle, calculate all of the -margins- outputs directly from your regression output, the process is tedious and error-prone. Better to let Stata do it for you. The -margins- command is quite useful in many contexts; what is shown here is really the basics of it. I highly recommend Richard Williams' excellent Stata Journal article at http://www.stata-journal.com/sjpdf.h...iclenum=st0260 for a really clear explanation of the meat of the command. Then at some point read the -margins- section of the PDF manuals to see many worked examples and learn about some of the more advanced features.

    Comment


    • #3
      Thank you so much Clyde Schechter! That really answers my question comprehensively. I appreciate the advice and the simplification of my syntax. I am relieved to know that I was on the right track.
      Code:
      . xtgee copmp i.group##i.time i.sex i.gmfcs i.sex##i.gmfcs, i(idno) t(time) family(gaussian) link(
      > identity) corr(exchangeable)
      
      Iteration 1: tolerance = .03195667
      Iteration 2: tolerance = .00041245
      Iteration 3: tolerance = 3.645e-06
      Iteration 4: tolerance = 3.210e-08
      
      GEE population-averaged model                   Number of obs     =        100
      Group variable:                       idno      Number of groups  =         37
      Link:                             identity      Obs per group:
      Family:                           Gaussian                    min =          1
      Correlation:                  exchangeable                    avg =        2.7
                                                                    max =          3
                                                      Wald chi2(10)     =     172.32
      Scale parameter:                  2.703536      Prob > chi2       =     0.0000
      
      ------------------------------------------------------------------------------
             copmp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
           2.group |    -.22586   .5505946    -0.41   0.682    -1.305006    .8532856
                   |
              time |
                1  |   4.708333    .470352    10.01   0.000      3.78646    5.630206
                2  |   4.495374   .4987049     9.01   0.000      3.51793    5.472818
                   |
        group#time |
              2 1  |  -3.639063   .6834331    -5.32   0.000    -4.978568   -2.299559
              2 2  |  -3.364992   .7035151    -4.78   0.000    -4.743856   -1.986128
                   |
             2.sex |  -.3782025   .5339368    -0.71   0.479    -1.424699    .6682944
                   |
             gmfcs |
                2  |  -.6871934   .7174678    -0.96   0.338    -2.093404    .7190176
                3  |   .2604234   .8254954     0.32   0.752    -1.357518    1.878365
                   |
         sex#gmfcs |
              2 2  |   .4871263   .9790806     0.50   0.619    -1.431836    2.406089
              2 3  |  -1.248115   1.085688    -1.15   0.250    -3.376024    .8797943
                   |
             _cons |   3.244931   .5194169     6.25   0.000     2.226893     4.26297
      ------------------------------------------------------------------------------
      With respect to the residuals, thank you. That was the syntax I was after. I have run it and this was the outcome:
      Code:
      . predict yhat, mu
      
      . gen resid = copmp - yhat
      (11 missing values generated)
      
      . scatter resid yhat
      Click image for larger version

Name:	scatter_residuals_xtgee_20170724.png
Views:	1
Size:	28.6 KB
ID:	1403391

      Is this the correct syntax? As I understand GEE's significantly less than simpler linear models, I am confused about the implications of the clustering of the points. How do I interpret this plot?

      PS I will give the -margins- code a try and read up on the outputs. Thanks!

      Kind regards,

      Sarah Reedman

      Comment


      • #4
        ((PS I re-ran this with the correctly collapsed gmfcs variable (into 2 categories) and the plot looked near identical so disregard that I forgot this in the original code))

        Sarah Reedman

        Comment


        • #5
          The syntax is correct. The gap in the plot is just a reflection of the distributions of the values in your variables and the corresponding regression coefficients, so that the predicted value never falls in the 4.5 to 6 range. In particular, notice that the coefficients of time and its interaction with group are rather large relative to the other coefficients. Consequently, the predicted value for any observation is primarily determined by the values of time and group, with the other variables doing some "fine tuning." So it's not surprising to see this kind of separation.

          Comment


          • #6
            Hello, I have finished the trial and most of the preliminary analyses (yay!). One of my variables, copmc (confidence that a goal will be achieved, scored from 1 not confident to 10 very confident) had an interesting result.
            Code:
            . xtgee copmc i.group##i.time i.sex i.gmfcscoll i.sex##i.gmfcscoll, i(idno) t(time) family(gaussian) link(identity) corr(exch
            > angeable)
            
            Iteration 1: tolerance = .11089245
            Iteration 2: tolerance = .00142205
            Iteration 3: tolerance = .00001568
            Iteration 4: tolerance = 1.727e-07
            
            GEE population-averaged model                   Number of obs     =        100
            Group variable:                       idno      Number of groups  =         37
            Link:                             identity      Obs per group:
            Family:                           Gaussian                    min =          1
            Correlation:                  exchangeable                    avg =        2.7
                                                                          max =          3
                                                            Wald chi2(8)      =      15.82
            Scale parameter:                   3.91437      Prob > chi2       =     0.0450
            
            -------------------------------------------------------------------------------
                    copmc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            --------------+----------------------------------------------------------------
                  2.group |  -.6504023   .6539038    -0.99   0.320     -1.93203    .6312255
                          |
                     time |
                       1  |   .4675945   .4017842     1.16   0.245     -.319888    1.255077
                       2  |   .7270913   .4099202     1.77   0.076    -.0763376     1.53052
                          |
               group#time |
                     2 1  |  -1.313004    .607189    -2.16   0.031    -2.503073   -.1229356
                     2 2  |  -1.471249   .5955365    -2.47   0.013    -2.638479   -.3040191
                          |
                    2.sex |  -.0266612   .6390791    -0.04   0.967    -1.279233    1.225911
              2.gmfcscoll |   .7521456   1.130621     0.67   0.506    -1.463831    2.968122
                          |
            sex#gmfcscoll |
                     2 2  |  -1.663367   1.489969    -1.12   0.264    -4.583653    1.256918
                          |
                    _cons |   7.912239   .5881995    13.45   0.000      6.75939    9.065089
            -------------------------------------------------------------------------------
            
            . margins group#time
            
            Predictive margins                              Number of obs     =        100
            Model VCE    : Conventional
            
            Expression   : Linear prediction, predict()
            
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
              group#time |
                    1 0  |   7.841679   .4671701    16.79   0.000     6.926043    8.757316
                    1 1  |   8.309274   .4671701    17.79   0.000     7.393637     9.22491
                    1 2  |    8.56877   .4743299    18.07   0.000     7.639101     9.49844
                    2 0  |   7.191277   .4558023    15.78   0.000     6.297921    8.084633
                    2 1  |   6.345867   .5110252    12.42   0.000     5.344276    7.347458
                    2 2  |   6.447119   .4901965    13.15   0.000     5.486351    7.407886
            ------------------------------------------------------------------------------
            
            . margins time, dydx(group)
            
            Average marginal effects                        Number of obs     =        100
            Model VCE    : Conventional
            
            Expression   : Linear prediction, predict()
            dy/dx w.r.t. : 2.group
            
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            2.group      |
                    time |
                      0  |  -.6504023   .6539038    -0.99   0.320     -1.93203    .6312255
                      1  |  -1.963406   .6937468    -2.83   0.005    -3.323125   -.6036877
                      2  |  -2.121652   .6834608    -3.10   0.002     -3.46121    -.782093
            ------------------------------------------------------------------------------
            Note: dy/dx for factor levels is the discrete change from the base level.
            
            . margins group, dydx(time)
            
            Average marginal effects                        Number of obs     =        100
            Model VCE    : Conventional
            
            Expression   : Linear prediction, predict()
            dy/dx w.r.t. : 1.time 2.time
            
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            1.time       |
                   group |
                      1  |   .4675945   .4017842     1.16   0.245     -.319888    1.255077
                      2  |  -.8454097   .4552449    -1.86   0.063    -1.737673     .046854
            -------------+----------------------------------------------------------------
            2.time       |
                   group |
                      1  |   .7270913   .4099202     1.77   0.076    -.0763376     1.53052
                      2  |   -.744158   .4319045    -1.72   0.085    -1.590675    .1023593
            ------------------------------------------------------------------------------
            Note: dy/dx for factor levels is the discrete change from the base level.
            I was wondering how exactly to interpret the margins command, particularly
            Code:
            margins time, dydx(group)
            with these results
            Code:
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            2.group      |
                    time |
                      0  |  -.6504023   .6539038    -0.99   0.320     -1.93203    .6312255
                      1  |  -1.963406   .6937468    -2.83   0.005    -3.323125   -.6036877
                      2  |  -2.121652   .6834608    -3.10   0.002     -3.46121    -.782093
            ------------------------------------------------------------------------------
            I know that this is the marginal effect of treatment group at each time period, but I am struggling to understand what it means here for this particular analysis. Is it that participants allocated to group 2 (the control group) are 1.963 points lower at time 1 than they were at baseline (time 0)? And does the margin at time 2 (-2.12) refer again to difference from baseline (time 0)?

            Thanks for your help, Sarah

            Comment

            Working...
            X