
  • to account for clustering or not to account for clustering?

    I’m doing an analysis of applicants for grants over several years. In a given year, duplicate people have been dropped (those who submit more than one application in a given year). However, it is quite common for the same person to be found in multiple years, and the number of years in the data can vary across people.

    The question I am trying to answer: is there a "statistically significant" linear trend over time in the percentage of females, % of males, and % unknown? Also, I’d like to show the regression trends in a graph with the confidence intervals. I realize that, with an unknown category, increases in the % females and % males over time need to be interpreted with caution.

    Proposed set up: 3 separate logistic regression models. Outcome is 1) female (vs not female), 2) male (vs not male), 3) unknown (vs not unknown).

    The explanatory variable is year, coded as 1, 2, 3, etc. (used to estimate the linear trend)

    Question:

    1) Does one need to account for the fact that the same person can be found in different years? For my purposes, I just want to know if the overall percentage increased over the years, regardless of whether some were the same people or not. Also, the outcome (gender) does not change over time within a given person. Therefore, it seems like my goal is to treat the observations as independent, even though some of the same people appear in multiple years. Can one run a regular logit, or does one need something like a GEE to account for the panel structure of the data?

    Note, question cross posted here (no replies as of now): https://stats.stackexchange.com/ques...-for-clusterin

  • #2
    MJ:
    I'm not clear why you would run three separate logistic regressions instead of including a three-level predictor for -gender-.
    In addition, I suspect that you have a panel dataset.
    Clustering can make sense, depending on the number of clusters at hand.
    As far as testing for trend in logistic regression is concerned, see Joseph Coveney's helpful reply at https://www.statalist.org/forums/for...ting-for-trend.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Thank you, Carlo! An example of my data:
      person year gender time
      1 2005 M 2
      1 2006 M 3
      1 2007 M 4
      2 2004 F 1
      2 2006 F 3
      3 2004 M 1
      4 2005 U 2
      4 2006 U 3
      4 2007 U 4
      5 2008 M 5
      6 2008 U 5
      6 2009 U 6

      The question: did the overall proportion of females in this population increase “significantly” between 2004 and 2009? Therefore, I was treating gender as the dependent variable. Maybe a multinomial logit would be better, since gender has 3 categories (M, F, Unknown). There was one independent variable for time, a continuous variable I created based on year.

      I’m wondering if I need to account for clustering (the same person in multiple years). Based on the data it seems like I should, but my question is whether there was an overall increase in females in the population, regardless of whether some are the same people. The dependent variable, gender, does not change within a person over time.
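A minimal sketch of what clustering does and does not change, using only the 12 example rows listed above. It is not part of the original posts. The variable `female` and the scalar names are hypothetical, and 6 persons is far too few clusters for reliable inference, so this only illustrates the mechanics: `vce(cluster person)` leaves the estimated slope untouched and changes only its standard error.

```stata
* Sketch only: the 12 example rows from the post above, typed in by hand
clear
input person year str1 gender time
1 2005 M 2
1 2006 M 3
1 2007 M 4
2 2004 F 1
2 2006 F 3
3 2004 M 1
4 2005 U 2
4 2006 U 3
4 2007 U 4
5 2008 M 5
6 2008 U 5
6 2009 U 6
end

* Hypothetical indicator for the first proposed outcome (female vs. not female)
generate byte female = (gender == "F")

* Ordinary logit: treats all 12 rows as independent
logit female c.time
scalar b_plain  = _b[time]
scalar se_plain = _se[time]

* Same model, standard errors clustered on person
logit female c.time, vce(cluster person)
scalar b_clust  = _b[time]
scalar se_clust = _se[time]

* The slope is the same in both fits; only its standard error differs
display "slope: " b_plain " vs " b_clust
display "SE:    " se_plain " vs " se_clust
```

So the choice is about how uncertain the trend estimate is, not about the estimate of the overall trend itself.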



      • #4
        MJ:
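        An idea might be:
        Code:
        * convert the string variable gender to a labelled numeric variable
        encode gender, g(gender_n)
        * year as categorical; standard errors clustered on person
        mlogit gender_n i.year, vce(cluster person)
        or:
        Code:
        * convert the string variable gender to a labelled numeric variable
        encode gender, g(gender_n)
        * year as continuous with a quadratic term; standard errors clustered on person
        mlogit gender_n c.year##c.year, vce(cluster person)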
        Kind regards,
        Carlo
        (Stata 19.0)
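A hedged sketch of how the linear-trend question could be tested once year enters as a single continuous term. This is not from the original posts: the data are simulated, the coding 1=F, 2=M, 3=U is an assumption, and the simulation does not enforce one row per person-year as the real data do. It assumes Stata 14 or later, where -margins- after -mlogit- reports every outcome by default.

```stata
* Sketch only: simulated data with hypothetical names and coding
clear
set seed 2024
set obs 1000                                   // 1,000 hypothetical persons
generate person = _n
generate double u = runiform()
generate byte gender_n = 1 + (u > 0.45) + (u > 0.85)
label define gender_lbl 1 "F" 2 "M" 3 "U"      // assumed coding
label values gender_n gender_lbl
expand 2                                       // two applications per person
generate year = 2007 + floor(7 * runiform())   // years 2007 to 2013

* Linear trend in year, standard errors clustered on person
mlogit gender_n c.year, vce(cluster person)

* Joint Wald test that the year slope is zero in every equation
test year
scalar df_trend = r(df)
scalar p_trend  = r(p)

* Average change per year in each predicted outcome probability
margins, dydx(year)
```

The -test- line answers "is there any trend in the gender mix?", while -margins, dydx(year)- gives a per-outcome trend on the probability scale, which is closer to the question about the percentage of females.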



        • #5
          I will try that. Thank you SO much!!!



          • #6
            Just had a follow-up question to this. I ran 3 models with different treatments of “year”: c.year, i.year, and c.year##c.year. Below is the code for the models along with plots showing predicted probabilities vs. observed data. Based on the plots, does using c.year seem better than i.year? In the c.year##c.year model, c.year#c.year is omitted from the model, so I have ruled out that option. I noticed that in the c.year plot the lines seemed curved; I wasn’t sure whether that’s an issue or not.

            Code below:

            Code:
            . **FIRST ANALYSIS: YEAR IS CONTINUOUS
            .
            . quietly mlogit gender_n c.year, rrr vce(cluster person)
            
            . quietly margins, at(year = (2007(1)2013)) post
            
            .
            . marginsplot, legend(pos(3)) recast(line) plot1opts(lcolor(gs8)) ciopt(color(black%20)) recastci(rarea) title("Marketing"
            > ) ytitle("Proportion") xlabel(2007 "2007" 2008 "2008" 2009 "2009" 2010 "2010" 2011 "2011" 2012 "2012" 2013 "2013") plotd
            > (,label( "F" "M" "U")) addplot((scatter prop year if gender_n==4, msymbol(circle) mcolor(black) msize(vsmall)) (scatter
            > prop year if gender_n==5, msymbol(circle) mcolor(black) msize(vsmall)) (scatter prop year if gender_n==6, msymbol(circle
            > ) mcolor(black) msize(vsmall)))
            
              Variables that uniquely identify margins: year _outcome
            
            [Figure: fakedata2.png (predicted probabilities vs. observed proportions, year continuous)]
              
            
            
            
            . ****SECOND ANALYSIS: YEAR IS CATEGORICAL
            .
            . quietly mlogit gender_n i.year, rrr vce(cluster person)
            
            . quietly margins i.year
            
            . marginsplot, legend(pos(3)) recast(line) plot1opts(lcolor(gs8)) ciopt(color(black%20)) recastci(rarea) title("Marketing"
            > ) ytitle("Proportion") xlabel(2007 "2007" 2008 "2008" 2009 "2009" 2010 "2010" 2011 "2011" 2012 "2012" 2013 "2013") plotd
            > (,label( "F" "M" "U")) addplot((scatter prop year if gender_n==4, msymbol(circle) mcolor(black) msize(vsmall)) (scatter
            > prop year if gender_n==5, msymbol(circle) mcolor(black) msize(vsmall)) (scatter prop year if gender_n==6, msymbol(circle
            > ) mcolor(black) msize(vsmall)))
            
              Variables that uniquely identify margins: year _outcome
            
             
            
            [Figure: fakedata.png (predicted probabilities vs. observed proportions, year categorical)]
            
            
            
            . **THIRD ANALYSIS: QUADRATIC
            .
            . mlogit gender_n c.year##c.year, vce(cluster person)
            
            note: c.year#c.year omitted because of collinearity
            Iteration 0:   log pseudolikelihood = -12589.199  
            Iteration 1:   log pseudolikelihood = -12440.741  
            Iteration 2:   log pseudolikelihood =  -12435.84  
            Iteration 3:   log pseudolikelihood = -12435.832  
            Iteration 4:   log pseudolikelihood = -12435.832  
            
            Multinomial logistic regression                 Number of obs     =     15,134
                                                            Wald chi2(2)      =     226.26
                                                            Prob > chi2       =     0.0000
            Log pseudolikelihood = -12435.832               Pseudo R2         =     0.0122
            
                                          (Std. Err. adjusted for 8,237 clusters in person)
            -------------------------------------------------------------------------------
                          |               Robust
                 gender_n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            --------------+----------------------------------------------------------------
            F             |  (base outcome)
            --------------+----------------------------------------------------------------
            M             |
                     year |   .0341617   .0102538     3.33   0.001     .0140646    .0542589
                          |
            c.year#c.year |          0  (omitted)
                          |
                    _cons |  -69.78338   20.61233    -3.39   0.001    -110.1828   -29.38397
            --------------+----------------------------------------------------------------
            U             |
                     year |  -.2563419   .0182097   -14.08   0.000    -.2920322   -.2206515
                          |
            c.year#c.year |          0  (omitted)
                          |
                    _cons |   513.2264   36.57276    14.03   0.000     441.5451    584.9077
            -------------------------------------------------------------------------------
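A note on the omitted c.year#c.year term, with a hedged sketch (not from the original posts). A likely explanation is numerical: over a range such as 2007 to 2013, year and year squared are almost perfectly correlated, so Stata's collinearity check can drop the squared term. Centering year before squaring usually avoids this. The data below are simulated and `year_c` is a hypothetical variable name.

```stata
* Sketch only: simulated data with hypothetical names and coding
clear
set seed 2024
set obs 1000                                   // 1,000 hypothetical persons
generate person = _n
generate double u = runiform()
generate byte gender_n = 1 + (u > 0.45) + (u > 0.85)   // assumed 1=F, 2=M, 3=U
expand 2                                       // two applications per person
generate year = 2007 + floor(7 * runiform())   // years 2007 to 2013

* Raw year and its square are almost perfectly correlated
generate double year_sq = year^2
correlate year year_sq
scalar rho_raw = r(rho)

* Center year at its sample mean (year_c is a hypothetical new variable)
summarize year, meanonly
generate double year_c = year - r(mean)
generate double year_c_sq = year_c^2
correlate year_c year_c_sq
scalar rho_centered = r(rho)

* Quadratic trend using the centered variable
mlogit gender_n c.year_c##c.year_c, vce(cluster person)
```

With centering, the coefficient on year_c is the slope at the mean year, and the squared term can be tested directly to judge whether a quadratic is needed.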





            • #7
              MJ:
              I would go with the -c.year- model (without interaction).
              Kind regards,
              Carlo
              (Stata 19.0)
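A hedged sketch of one way to compare the c.year and i.year specifications more formally than by eye (not from the original posts; simulated data, assumed coding 1=F, 2=M, 3=U). Note that a model that is linear in year on the log-odds scale generally gives curved lines on the probability scale, so curvature in the c.year plot is expected and is not a problem in itself. The idea below is to include both the linear term and the year indicators: Stata is expected to omit one additional indicator as collinear, and a joint Wald test of the remaining indicators then tests departure from linearity.

```stata
* Sketch only: simulated data with hypothetical names and coding
clear
set seed 2024
set obs 1000                                   // 1,000 hypothetical persons
generate person = _n
generate double u = runiform()
generate byte gender_n = 1 + (u > 0.45) + (u > 0.85)   // assumed 1=F, 2=M, 3=U
expand 2                                       // two applications per person
generate year = 2007 + floor(7 * runiform())   // years 2007 to 2013

* Linear term plus year indicators; one extra indicator is dropped as collinear
mlogit gender_n c.year i.year, vce(cluster person)

* Joint Wald test of the remaining indicators: departure from a linear trend
testparm i.year
scalar p_nonlin = r(p)
display "p-value for departure from linearity: " p_nonlin
```

A large p-value here would be consistent with keeping the simpler c.year model; a small one would suggest the yearly pattern is not well described by a single slope.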



              • #8
                Great - thank you, Carlo!

