Assessing glm model fit and interaction terms when using multiple imputed, multiple-stage survey data with specified subpopulation

Cait Dixon

Join Date: Dec 2022

Posts: 1
#1

05 Dec 2022, 18:37

I'm doing a secondary analysis of a cross-sectional study with a multistage sampling design (over 40,000 observations). I'm assessing the association between a binary outcome and a predictor variable, and whether the interaction between the predictor and a third variable is significant. This is how I svyset the data:

Code:

svyset PSU [pweight=finalweight], strata(strata) fpc(samplingfraction) vce(linearized) singleunit(missing)

Based on what I've read on this forum and elsewhere, it seems that defining a subpopulation of the observations you want to include is a better approach than outright dropping the observations with missing values. I created an indicator variable (exclusion) where '1' is assigned to observations with missing values for the outcome and main predictor variable.
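For reference, a minimal sketch of how such an indicator can be generated. It assumes the rule is "missing on either variable"; if the intended rule is "missing on both", the condition would need to change.

Code:

* exclusion = 1 when the outcome or the main predictor is missing (assumed rule)
generate byte exclusion = missing(outcome, predictor)
* check how many observations fall outside the subpopulation
tabulate exclusion, missing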
I've run a multivariable logistic regression and assessed model fit and the interaction terms using the following simplified approach:

Code:

use ".dta", clear
svy, subpop(if exclusion != 1): logit outcome i.predictor i.third_variable i.fourth_variable i.fifth_variable i.predictor#i.third_variable, or
contrast predictor#third_variable
svylogitgof
linktest

However, I no longer feel that this is the most appropriate approach, as approximately 20% of my observations have missing values for variables I want to include in the multivariable model (7 variables), and 9% have missing values for the variables used to declare the survey design. Furthermore, my outcome is common (>50%). After researching a bit, a log-binomial model with multiple imputation seems to be a good approach to deal with the missing data and the common binary outcome.
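As I understand it, the reason the link matters with a common outcome is the following (a short sketch in my own notation, where p_i is the probability of the outcome for observation i, x_i its covariate vector, and \beta the coefficients):

\[
\text{logit link: } \log\frac{p_i}{1-p_i} = x_i^{\top}\beta
\quad\Rightarrow\quad e^{\beta_j} \text{ is an odds ratio}
\]
\[
\text{log link: } \log p_i = x_i^{\top}\beta
\quad\Rightarrow\quad e^{\beta_j} \text{ is a prevalence ratio}
\]

With made-up prevalences of 0.6 in one group and 0.5 in the other (illustrative numbers, not from my data), the prevalence ratio is 0.6/0.5 = 1.2, while the odds ratio is (0.6/0.4)/(0.5/0.5) = 1.5, so the odds ratio overstates the ratio of prevalences when the outcome is this common.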

Code:

use ".dta", clear
drop if PSU == "" | strata == . | samplingfraction == . | finalweight == .
mi set mlong
mi register imputed third_variable fourth_variable fifth_variable
mi register regular outcome predictor
mi impute chained (mlogit) third_variable fourth_variable fifth_variable, add(5) rseed(1234) augment noisily
mi svyset PSU [pweight=finalweight], strata(strata) fpc(samplingfraction) vce(linearized) singleunit(missing)
mi estimate, eform: svy, subpop(if exclusion != 1): glm outcome i.predictor i.third_variable i.fourth_variable i.fifth_variable i.predictor#i.third_variable, family(binomial) link(log)

To test the significance of the interaction term, I use:

Code:

mi test 2.predictor#2.third_variable 2.predictor#3.third_variable 3.predictor#2.third_variable 3.predictor#3.third_variable

I found a helpful thread that included a manual version of linktest, which I adapted as follows:

Code:

mi estimate, saving(miest, replace): svy, subpop(if exclusion != 1): glm outcome i.predictor i.third_variable i.fourth_variable i.fifth_variable i.predictor#i.third_variable, family(binomial) link(log)
mi predict _hat using miest, xb
mi passive: gen _hatsq = _hat*_hat
mi estimate: svy, subpop(if exclusion != 1): glm outcome _hat _hatsq, family(binomial) link(log)
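As I understand the logic of linktest, the check rests on the coefficient of _hatsq: if the model is adequately specified, the squared linear predictor should add nothing beyond _hat. A sketch of testing that single coefficient right after the second mi estimate, assuming mi test handles it the same way it handles the interaction terms:

Code:

* Wald-type test that the coefficient on the squared linear predictor is zero
mi test _hatsq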

I'd appreciate any input on:
Does my approach of not imputing my outcome and predictor variables, and instead handling their missing values through the subpopulation, make sense? Both are generated from other variables, and both have approximately 7% missing values.

Are the mi test command and the adapted linktest the most appropriate tests to use in this case? Not sure if this is relevant, but interestingly, linktest always suggested there was no link error when I ran it for every combination of interaction terms in the multivariable logistic regression, even though the svylogitgof result did change.

I'm still pretty new to Stata (first post here) and starting to get deeper into stats, so I hope I've included all the relevant information.