to account for clustering or not to account for clustering?

MJ Smith

Join Date: Oct 2018

Posts: 77
#1

20 Sep 2021, 16:42

I’m doing an analysis of applicants for grants over several years. In a given year, duplicate people have been dropped (those who submit more than one application in a given year). However, it is quite common for the same person to be found in multiple years, and the number of years in the data can vary across people.

The question I am trying to answer: is there a "statistically significant" linear trend over time in the percentage of females, % of males, and % unknown? Also, I’d like to show the regression trends in a graph with the confidence intervals. I realize that with having an unknown category, increases in the % females and % males over time need to be interpreted with caution.

Proposed set up: 3 separate logistic regression models. Outcome is 1) female (vs not female), 2) male (vs not male), 3) unknown (vs not unknown).

The explanatory variable is year, coded as 1, 2, 3, etc. (used to determine the linear trend).

Question:

1) Does one need to account for the fact that the same person can appear in different years? For my purposes, I just want to know whether the overall percentage increased over the years, regardless of whether some applicants were the same people. Also, the outcome (gender) does not change over time within a given person. So my goal seems to call for treating the observations as independent, yet the data contain some of the same people in multiple years. Can one fit a regular logit, or does one need something like a GEE that accounts for the panel structure?

Note, question cross posted here (no replies as of now): https://stats.stackexchange.com/ques...-for-clusterin
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

21 Sep 2021, 01:36

MJ:
I'm not clear why you would fit three separate logistic regressions instead of including a three-level predictor for -gender-.
In addition, I suspect that you have a panel dataset.
Clustering can make sense conditional on the number of clusters at hand.
As far as testing for trend in logistic regression is concerned, see Joseph Coveney's helpful reply at https://www.statalist.org/forums/for...ting-for-trend.

Kind regards,
Carlo
(Stata 19.0)
MJ Smith

Join Date: Oct 2018

Posts: 77
#3

21 Sep 2021, 07:31

Thank you, Carlo! An example of my data:

person  year  gender  time
1       2005  M       2
1       2006  M       3
1       2007  M       4
2       2004  F       1
2       2006  F       3
3       2004  M       1
4       2005  U       2
4       2006  U       3
4       2007  U       4
5       2008  M       5
6       2008  U       5
6       2009  U       6

The question: did the overall proportion of females in this population increase “significantly” between 2004 and 2009? That is why I was treating gender as the dependent variable. Maybe a multinomial logit would be better, since gender has 3 categories (M, F, Unknown). There is one independent variable, time, a continuous variable I created from year.

I’m wondering if I need to account for clustering (the same person in multiple years). Based on the data it seems I should, but my question is whether there was an overall increase in females in the population, regardless of whether some are the same people. The dependent variable, gender, does not change within a person over time.
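
For readers following along, the two options raised in the question can be sketched as below. This is only a sketch under stated assumptions: it uses the variable names from the example data (person, gender as a string, time), and the indicator `female` is a hypothetical name introduced here. A pooled logit and a GEE with an independence working correlation should give the same population-averaged point estimates, so the choice mainly concerns how the standard errors are computed.

```stata
* Assumes: person (id), gender (string "M"/"F"/"U"), time (1, 2, 3, ...)
* female is a hypothetical indicator created for this sketch
generate byte female = (gender == "F")

* Option 1: pooled logit, standard errors clustered on person
logit female c.time, vce(cluster person)

* Predicted proportion female at each time point, with confidence intervals
margins, at(time = (1(1)6))
marginsplot, recast(line) recastci(rarea)

* Option 2: GEE with an independence working correlation and robust SEs
xtset person
xtgee female c.time, family(binomial) link(logit) corr(independent) vce(robust)
```

The same pattern would apply to the male and unknown indicators if the three-model setup is kept.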

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17707

21 Sep 2021, 07:47

MJ:
an idea might be:

Code:

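* Convert the string variable gender to a labeled numeric variable
encode gender, generate(gender_n)
* Year as categorical; standard errors clustered on person
mlogit gender_n i.year, vce(cluster person)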

or:

Code:

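* Convert the string variable gender to a labeled numeric variable
encode gender, generate(gender_n)
* Year as continuous with a quadratic term; standard errors clustered on person
mlogit gender_n c.year##c.year, vce(cluster person)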
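
As a possible follow-up to the categorical-year model, a joint Wald test can check whether the outcome distribution differs across years at all. This is a sketch that assumes gender_n was created by -encode- as above. It tests for any difference between years, not specifically a linear trend, and it uses a Wald test because -lrtest- is not appropriate after vce(cluster).

```stata
* Assumes gender_n was created by -encode- as in the code above
mlogit gender_n i.year, vce(cluster person)

* Joint Wald test that every year indicator is zero in all outcome equations
testparm i.year
```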

Kind regards,
Carlo
(Stata 19.0)


MJ Smith

Join Date: Oct 2018

Posts: 77
#5

21 Sep 2021, 15:46

I will try that. Thank you SO much!!!

MJ Smith

Join Date: Oct 2018
Posts: 77

29 Oct 2021, 10:06

Just had a followup question to this. I ran 3 models with different treatment of “year” - c.year, i.year, and c.year##c.year. Below is the code for the models along with plots showing predicted probabilities vs. observed data. Based on the plots, does using c.year seem better than i.year? In the c.year##c.year model, c.year#c.year is omitted from the model, so I have ruled out that option. I noticed in the c.year plot, the lines seemed curved – wasn’t sure if that’s an issue or not.

Code below:

Code:

. **FIRST ANALYSIS: YEAR IS CONTINUOUS
.
. quietly mlogit gender_n c.year, rrr vce(cluster person)

. quietly margins, at(year = (2007(1)2013)) post

.
. marginsplot, legend(pos(3)) recast(line) plot1opts(lcolor(gs8)) ciopt(color(black%20)) recastci(rarea) title("Marketing"
> ) ytitle("Proportion") xlabel(2007 "2007" 2008 "2008" 2009 "2009" 2010 "2010" 2011 "2011" 2012 "2012" 2013 "2013") plotd
> (,label( "F" "M" "U")) addplot((scatter prop year if gender_n==4, msymbol(circle) mcolor(black) msize(vsmall)) (scatter
> prop year if gender_n==5, msymbol(circle) mcolor(black) msize(vsmall)) (scatter prop year if gender_n==6, msymbol(circle
> ) mcolor(black) msize(vsmall)))

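  Variables that uniquely identify margins: year _outcome

[Plot "Marketing": predicted proportions of F, M, and U by year (2007-2013) from the c.year model, with confidence bands and observed proportions as points]
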
. ****SECOND ANALYSIS: YEAR IS CATEGORICAL
.
. quietly mlogit gender_n i.year, rrr vce(cluster person)

. quietly margins i.year

. marginsplot, legend(pos(3)) recast(line) plot1opts(lcolor(gs8)) ciopt(color(black%20)) recastci(rarea) title("Marketing"
> ) ytitle("Proportion") xlabel(2007 "2007" 2008 "2008" 2009 "2009" 2010 "2010" 2011 "2011" 2012 "2012" 2013 "2013") plotd
> (,label( "F" "M" "U")) addplot((scatter prop year if gender_n==4, msymbol(circle) mcolor(black) msize(vsmall)) (scatter
> prop year if gender_n==5, msymbol(circle) mcolor(black) msize(vsmall)) (scatter prop year if gender_n==6, msymbol(circle
> ) mcolor(black) msize(vsmall)))

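  Variables that uniquely identify margins: year _outcome

[Plot "Marketing": predicted proportions of F, M, and U by year (2007-2013) from the i.year model, with confidence bands and observed proportions as points]

. **THIRD ANALYSIS: QUADRATIC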
.
. mlogit gender_n c.year##c.year, vce(cluster person)

note: c.year#c.year omitted because of collinearity
Iteration 0:   log pseudolikelihood = -12589.199  
Iteration 1:   log pseudolikelihood = -12440.741  
Iteration 2:   log pseudolikelihood =  -12435.84  
Iteration 3:   log pseudolikelihood = -12435.832  
Iteration 4:   log pseudolikelihood = -12435.832  

Multinomial logistic regression                 Number of obs     =     15,134
                                                Wald chi2(2)      =     226.26
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -12435.832               Pseudo R2         =     0.0122

                              (Std. Err. adjusted for 8,237 clusters in person)
-------------------------------------------------------------------------------
              |               Robust
     gender_n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
F             |  (base outcome)
--------------+----------------------------------------------------------------
M             |
         year |   .0341617   .0102538     3.33   0.001     .0140646    .0542589
              |
c.year#c.year |          0  (omitted)
              |
        _cons |  -69.78338   20.61233    -3.39   0.001    -110.1828   -29.38397
--------------+----------------------------------------------------------------
U             |
         year |  -.2563419   .0182097   -14.08   0.000    -.2920322   -.2206515
              |
c.year#c.year |          0  (omitted)
              |
        _cons |   513.2264   36.57276    14.03   0.000     441.5451    584.9077
-------------------------------------------------------------------------------
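
The "omitted because of collinearity" note is what one would expect when year is left on its calendar scale: with values such as 2007 to 2013, year and its square are almost perfectly correlated. A common workaround is to center year before squaring. The sketch below assumes the 2007-2013 range used in the -margins- calls above; `year_c` is a hypothetical variable name and 2010 is assumed to be the midpoint.

```stata
* year_c is a hypothetical name; 2010 is assumed to be the middle of 2007-2013
generate year_c = year - 2010

* Quadratic trend on the centered scale
mlogit gender_n c.year_c##c.year_c, vce(cluster person)

* Wald test of the squared term across the outcome equations
testparm c.year_c#c.year_c

* Predicted probabilities over the centered range (-3 = 2007, 3 = 2013)
margins, at(year_c = (-3(1)3))
```

If the squared term is not needed according to this test, that would support the simpler linear specification.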


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#7

29 Oct 2021, 10:15

MJ:
I would go with the -c.year- model (without interaction).
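
On the curved lines in the -c.year- plot: that is expected from the model and not a sign of trouble. With one outcome as the base (F in the output above) and a linear term in year, a multinomial logit implies the probabilities below. The symbols \alpha_j and \beta_j are notation introduced here for the intercept and year coefficient of outcome j.

```latex
\[
\Pr(y = j \mid \text{year}) =
\frac{\exp(\alpha_j + \beta_j\,\text{year})}
     {\sum_{k \in \{F, M, U\}} \exp(\alpha_k + \beta_k\,\text{year})},
\qquad \alpha_F = \beta_F = 0 .
\]
```

The linear predictor is linear in year, but each probability is a nonlinear function of it, so the fitted lines bend on the probability scale.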

Kind regards,
Carlo
(Stata 19.0)
MJ Smith

Join Date: Oct 2018

Posts: 77
#8

29 Oct 2021, 16:19

Great - thank you, Carlo!