Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dyadic panel data - Modeling dyadic panel data with (mixed) random-effects model in Stata

    Dear Statalist
    I am working with dyadic panel data where I have multiple waves of observations nested within parent-child dyads (variable "dyad_id"), which are further nested within parents ("parent_id"). My goal is to assess the effect of children’s education ("ch_edu") on parental life satisfaction ("parent_lsat"). Notably, the outcome variable ("parent_lsat") is measured at the higher (parental) level.

    To be more clear, my data are in the following format:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(parent_id dyad_id) byte wave float parent_lsat int parent_age byte(parent_edu parent_gender) float(ch_age ch_edu) byte ch_gender float ch_mstat
     7 19 7  6 69 1 0 46 2 1 .
     7 20 3  3 60 1 0 31 1 0 1
     7 20 4  4 62 1 0 33 1 0 .
     7 20 5  6 64 1 0 35 1 0 0
     7 20 7  6 69 1 0 40 1 0 .
     7 21 3  3 60 1 0 29 2 1 0
     7 21 4  4 62 1 0 31 2 1 .
     7 21 5  6 64 1 0 33 2 1 0
     7 21 7  6 69 1 0 38 2 1 .
     7 22 3  3 60 1 0 28 1 0 1
     7 22 4  4 62 1 0 30 1 0 .
     7 22 5  6 64 1 0 32 1 0 0
     7 22 7  6 69 1 0 37 1 0 .
     8 23 1  3 61 1 1 27 2 0 2
     8 24 1  3 61 1 1 23 1 1 2
     9 25 1  2 56 1 0 27 2 0 2
     9 25 2  8 59 1 0 30 2 0 1
     9 26 1  2 56 1 0 23 1 1 2
     9 26 2  8 59 1 0 26 2 1 1
    11 27 3  6 81 0 1 60 0 1 0
    11 29 4 23 83 0 1 58 1 0 0
    11 30 3  6 81 0 1 55 1 0 2
    11 30 4 23 83 0 1 57 1 0 .
    11 30 5 38 85 0 1 59 1 0 .
    11 31 4 23 83 0 1 56 0 0 0
    11 32 3  6 81 0 1 46 0 0 2
    11 32 4 23 83 0 1 48 0 0 .
    11 32 5 38 85 0 1 50 1 0 .
    12 33 3 17 87 0 0 60 0 1 0
    12 35 4 34 89 0 0 58 1 0 0
    12 35 5 39 91 0 0 60 1 0 .
    12 35 7 26 95 0 0 64 1 0 .
    12 36 3 17 87 0 0 55 1 0 2
    12 36 4 34 89 0 0 57 1 0 .
    12 36 5 39 91 0 0 59 1 0 .
    12 36 7 26 95 0 0 63 1 0 .
    12 37 4 34 89 0 0 56 0 0 0
    12 38 3 17 87 0 0 46 0 0 2
    12 38 4 34 89 0 0 48 0 0 .
    23 40 3  7 60 2 0 31 1 0 2
    23 40 4 10 62 2 0 33 1 0 .
    23 40 5 31 64 2 0 35 2 0 2
    23 40 7 31 69 2 0 40 2 0 .
    23 41 3  7 60 2 0 29 1 0 2
    23 42 5 31 64 2 0 32 2 0 0
    23 42 7 31 69 2 0 37 2 0 .
    25 43 3  0 58 1 1 31 1 0 2
    25 43 5  6 62 1 1 35 1 0 2
    25 43 7  6 67 1 1 40 2 0 .
    25 44 3  0 58 1 1 29 1 0 2
    25 45 5  6 62 1 1 32 2 0 0
    end
    label values wave lab_wave_year
    label values parent_age num
    label values parent_edu lab_isced
    label values ch_edu lab_isced
    label def lab_isced 0 "Low", modify
    label def lab_isced 1 "Med", modify
    label def lab_isced 2 "High", modify
    label values parent_gender gender
    label def gender 0 "Fathers", modify
    label def gender 1 "Mothers", modify
    label values ch_gender ch_gender
    label def ch_gender 0 "Son", modify
    label def ch_gender 1 "Daughter", modify
    label values ch_mstat combined_status_label
    label def combined_status_label 0 "Married", modify
    label def combined_status_label 1 "Cohabiting (not married)", modify
    label def combined_status_label 2 "Not cohabiting", modify

    I am considering using the mixed command in Stata to fit a two-level random-effects model, treating the variance-covariance structure as unstructured and the standard errors clustered at the parental level. Here is an example of the command I am planning to use:

    Code:
    mixed parent_lsat i.ch_edu i.ch_gender ch_age i.parent_edu i.parent_gender || dyad_id:, vce(cluster parent_id) cov(un)
    I would appreciate any feedback on this approach. Are there alternative models or methods that might be more appropriate for my data? Also, are there any potential issues I should be aware of when fitting this model?

    Many thanks in advance for your help.

  • #2
    A couple of things to start off. First, you say that the outcome is at the parent level, which would suggest that it is stable within parents, however, this is not the case. Compare it in the output below to the parent_gender variable, which doesn't vary within parent_id:
    Code:
    rename parent* p*
    sort p_id wave
    list p_id dyad_id wave p_lsat p_gender in 1/19, sepby(p_id) noobs
     +-------------------------------------------+
      | p_id   dyad_id   wave   p_lsat   p_gender |
      |-------------------------------------------|
      |    7        20      3        3    Fathers |
      |    7        21      3        3    Fathers |
      |    7        22      3        3    Fathers |
      |    7        20      4        4    Fathers |
      |    7        21      4        4    Fathers |
      |    7        22      4        4    Fathers |
      |    7        20      5        6    Fathers |
      |    7        21      5        6    Fathers |
      |    7        22      5        6    Fathers |
      |    7        19      7        6    Fathers |
      |    7        20      7        6    Fathers |
      |    7        21      7        6    Fathers |
      |    7        22      7        6    Fathers |
      |-------------------------------------------|
      |    8        23      1        3    Mothers |
      |    8        24      1        3    Mothers |
      |-------------------------------------------|
      |    9        25      1        2    Fathers |
      |    9        26      1        2    Fathers |
      |    9        25      2        8    Fathers |
      |    9        26      2        8    Fathers |
      +-------------------------------------------+
    parent_lsat not only varies within parent_id, it also varies within dyad_id. If it didn't vary within dyad_id, you would get no results from the mixed effects model with a dyad_id random intercept. That said, parent_lsat is the same for each wave of data, irrespective of the dyad_id. That tells me that parents rated their life satisfaction at a point in time and that it is being assigned to the all the dyad_ids at that time. That means that you are not likely to get useful variation in the parent_lsat outcome at the dyad_id level.

    As might be expected given the above, it appears that the most important clustering variable is parent_id, not dyad_id. This makes sense to me if what the parent_lsat measures is indeed the parent's subjective life satisfaction. I would expect that to be more about parents, especially given how the all dyad pairs are given the same parent_lsat at each wave.
    Code:
    mixed p_lsat || p_id:, reml
    eststo p_only 
    
    mixed p_lsat || dyad_id:, reml
    eststo dyad_only 
    
    mixed p_lsat || p_id: || dyad_id:, reml
    eststo p_dyad
    
    esttab p_only p_dyad , se stats(N aic bic) transform(ln*: exp(2*@) 2*exp(2*@)) ///
        eqlabels("" "var(p_id)" "var(Residual)" "var(d_id)", none) varwidth(13) nostar
    
    ---------------------------------------
                           (1)          (2)
                        p_lsat       p_lsat
    ---------------------------------------
    _cons                12.84        12.84
                       (4.110)      (4.110)
    
    var(p_id)            105.5        105.5
                       (67.30)      (67.30)
    
    var(Residual)        68.76        68.76
                       (14.63)      (14.63)
    
    var(d_id)                      5.95e-23
                                 (4.95e-22)
    ---------------------------------------
    N                       51           51
    aic                  377.4        379.4
    bic                  383.2        387.2
    ---------------------------------------
    Standard errors in parentheses
     
    esttab dyad_only , se stats(N aic bic) transform(ln*: exp(2*@) 2*exp(2*@)) /// 
        eqlabels("" "var(d_id)" "var(Residual)" , none) varwidth(13) nostar
    --------------------------
                           (1)
                        p_lsat
    --------------------------
    _cons                14.47
                       (2.359)
    
    var(d_id)            88.63
                       (38.57)
    
    var(Residual)        76.36
                       (20.02)
    --------------------------
    N                       51
    aic                  395.2
    bic                  401.0
    --------------------------
    Standard errors in parentheses
    As you can see from these models with no predictors, the AIC and BIC both favor the model with a single parent_id random intercept. A likelihood ratio test comparing the model with parent_id only vs a model with parent_id and dyad_id favors the more complicated model, however, the variance estimate for dyad_id in that model is so small as to be ignorable.

    Once you add the predictors to your model, it wipes out any between parent variance (or between dyad variance within parents), so it seems that modeling this example data with a random intercept is not even necessary. The predictors explain nearly all variance that is due to differences in parents. In the larger data, this may not be the case, so you may still need to have random intercepts for parent_id. Alternatively, given a large enough sample size, you could use OLS with standard error adjustment.

    You should think carefully about the wave/timing aspect of the parent reports and how or whether it should be incorporated into any modeling.

    One thing to look at in the larger data is the residuals of any linear model (OLS or mixed), as the outcome seems to be somewhat bimodal (or at least clearly non-normal) in the example data. As such, another approach may be necessary, including something like an ordinal model.

    Comment


    • #3
      Since you say that the dyads are themselves nested in parents, then this should really be a three-level model, no?

      Code:
      mixed parent_lsat i.ch_edu i.ch_gender ch_age i.parent_edu i.parent_gender ///
          || parent_id: || dyad_id:, vce(cluster parent_id) cov(un)
      That said, when I run the code with the example data, the variance components at both the parent_id and dyad_id level are extremely small, suggesting that a single-level model using plain old -regress- would be satisfactory. But in principle, the modeling should, at least initially, follow the structure of the data--then you can remove superfluous levels after the fact if their variance components are too small to meaningfully contribute to the model.

      (Of course, it may be that your real data set is larger than the example shown and the results might be quite different and support the need for all three levels.)

      The other thing you might want to consider is whether to look for cross-level interactions (i.e. random slopes) for some of your variables. That's a subject matter question, and not myself knowing much about this area, I can't offer you specific advice on this. I just raise it as a general point in model development.

      Added: Crossed with #2, which gives a more thorough analysis of the lack of variation at the dyad and parent levels.
      Last edited by Clyde Schechter; 22 Nov 2023, 09:48.

      Comment


      • #4
        Dear Erik and Clyde, thank you so much for your suggestions!

        All Erik’s observations are correct: the outcome (parent_lsat) is self-reported by the parent at each wave and therefore it varies within parents across time. The value of parent_lsat, for each specific wave, is assigned to all the dyad_ids (i.e., the identifier for the parent-child dyad). Therefore, within each dyad and wave, the value of parent_lsat is constant.

        Since the main independent variable of interest (i.e., ‘ch_edu’, children’s education) vary between dyads and within dyads across time, I would be inclined towards Clyde’s suggestion to initially consider a model with three levels. However, this is what the full sample indicates:

        Code:
        * ======================================================================= * 
        *     mixed p_lsat || p_id:, reml                   // "p_id_only"    
        *     mixed p_lsat || dyad_id:, reml                // "dyad_id_only" 
        *     mixed p_lsat || p_id: || dyad_id:, reml       // "both"         
        *     
        *     esttab p_only p_dyad , se stats(N aic bic) transform(ln*: exp(2*@) 2*exp(2*@)) ///
        *         eqlabels("" "var(p_id)" "var(Residual)" "var(d_id)", none) varwidth(13) nostar
        *     
        *     esttab dyad_only , se stats(N aic bic) transform(ln*: exp(2*@) 2*exp(2*@)) /// 
        *         eqlabels("" "var(d_id)" "var(Residual)" , none) varwidth(13) nostar
        *     
        * ======================================================================= * 
        
        p_id_only                        dyad_id_only                   both
        --------------------------     --------------------------     --------------------------
        _cons                13.69     _cons                13.89     _cons                13.69
                          (0.0345)                       (0.0226)                       (0.0345)
        var(p_id)             110.1    var(d_id)            96.44     var(p_id)            110.1
                           (0.537)                        (0.363)                        (0.537)
        var(Residual)        27.84     var(Residual)        39.25     var(Residual)        27.84
                          (0.0609)                        (0.106)                       (0.0609)
        --------------------------     --------------------------     var(d_id)         7.52e-10
        N                   519715     N                   519715                     (1.51e-08)
        aic              3484609.0     aic              3789414.2     --------------------------
        bic              3484642.5     bic              3789447.7     N                   519715
        --------------------------     --------------------------     aic              3484611.0
                                                                      bic              3484655.7
                                                                      --------------------------
        Do you think a three-level model is problematic in this case, considering there's no variance at the dyad level? Many thanks again for the suggestions.

        Comment


        • #5
          I wouldn't say that a three-level model is problematic. But it's clear that dyad id is adding nothing meaningful to the model. So, yes, I would revert to just the two-level model with pid.

          Comment


          • #6
            Thank you so much for the feedback!

            Comment

            Working...
            X