  • Assessing glm model fit and interaction terms when using multiply imputed, multistage survey data with a specified subpopulation

    I'm doing a secondary data analysis of a cross-sectional study with a multistage sampling design (over 40,000 observations). I'm assessing the association between a binary outcome and a predictor variable, and whether the interaction between the predictor and a third variable is significant. This is how I svyset the data:
    Code:
    svyset PSU [pweight=finalweight], strata(strata) fpc(samplingfraction) vce(linearized) singleunit(missing)
    Based on what I've read on this forum and elsewhere, it seems that defining a subpopulation of the observations you want to include is a better approach than outright dropping the observations with missing values. I created an indicator variable (exclusion) that equals 1 for observations with a missing value on the outcome or the main predictor variable.
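    For concreteness, an indicator of this kind could be built roughly as follows. This is only a minimal sketch on a made-up four-row toy dataset so that it runs on its own; the names exclusion, outcome and predictor match the models below, and it assumes the flag should be 1 when either variable is missing.

    ```stata
    * Toy data for illustration only (not the survey data)
    clear
    input outcome predictor
    1 1
    0 .
    . 2
    1 3
    end

    * missing() returns 1 if any of its arguments is missing, 0 otherwise
    generate byte exclusion = missing(outcome, predictor)

    * Rows 2 and 3 are flagged; rows 1 and 4 stay in the subpopulation
    tabulate exclusion
    ```

    The flag is then passed to svy through subpop(if exclusion != 1), so the flagged observations stay in the dataset for variance estimation but are left out of the subpopulation being analysed.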
    I've run a multivariate logistic regression and assessed model fit and interaction terms using the following simplified approach:
    Code:
    use ".dta", clear
    
    svy, subpop(if exclusion != 1): logit outcome i.predictor i.third_variable i.fourth_variable i.fifth_variable i.predictor#i.third_variable, or
    
    contrast predictor#third_variable
    svylogitgof
    linktest
    However, I no longer feel that this is the most appropriate approach: approximately 20% of my observations have missing values on the variables I want to include in the multivariate stage (7 variables), and 9% have missing values on the variables needed to declare the survey design. Furthermore, my outcome is common (>50%), so odds ratios from a logistic model would overstate the corresponding risk ratios. After researching a bit, a log-binomial model with multiple imputation seems to be a good approach for dealing with both the missing data and the common binary outcome.
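    To illustrate why a common outcome matters here, a small worked example with made-up risks (60% in one group and 50% in the other; these are not figures from my data) shows how far the odds ratio can drift from the risk ratio:

    ```latex
    \[
    \mathrm{RR} = \frac{0.60}{0.50} = 1.20,
    \qquad
    \mathrm{OR} = \frac{0.60 / 0.40}{0.50 / 0.50} = \frac{1.5}{1.0} = 1.50
    \]
    ```

    With an outcome this common, the odds ratio (1.50) is noticeably larger than the risk ratio (1.20), which is why I moved to the log link in the setup that follows.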
    Code:
    use ".dta", clear
    
    drop if PSU == "" | strata ==. | samplingfraction ==. | finalweight ==.
    
    mi set mlong
    mi register imputed third_variable fourth_variable fifth_variable
    mi register regular outcome predictor
    mi impute chained (mlogit) third_variable fourth_variable fifth_variable, add(5) rseed(1234) augment noisily
    
    mi svyset PSU [pweight=finalweight], strata(strata) fpc(samplingfraction) vce(linearized) singleunit(missing)
    
    mi estimate, eform: svy, subpop(if exclusion != 1): glm outcome i.predictor i.third_variable i.fourth_variable i.fifth_variable i.predictor#i.third_variable, family(binomial) link(log)
    To check for the significance of the interaction term I use:
    Code:
    mi test 2.predictor#2.third_variable 2.predictor#3.third_variable 3.predictor#2.third_variable 3.predictor#3.third_variable
    I found a helpful thread that included a manual version of linktest, which I adapted as follows:
    Code:
    mi estimate, saving(miest, replace): svy, subpop(if exclusion != 1): glm outcome i.predictor i.third_variable i.fourth_variable i.fifth_variable i.predictor#i.third_variable, family(binomial) link(log)
    
    mi predict _hat using miest, xb
    mi passive: gen _hatsq = _hat*_hat
    
    mi estimate: svy, subpop(if exclusion != 1): glm outcome _hat _hatsq, family(binomial) link(log)
    I'd appreciate any input on:
    1. Does my approach make sense: not imputing my outcome and predictor variables, and instead handling their missing values through the subpopulation? Both are generated from other variables, and each has approximately 7% missing values.
    2. Are the mi test command and the adapted linktest the most appropriate tests to use in this case? Not sure if this is relevant, but interestingly, linktest never suggested a link error when I ran it for every combination of interaction terms in the multivariate logistic regression, even though the svylogitgof results did change.
    I'm still pretty new to Stata (first post here) and just starting to get deeper into stats, so I hope I've included all the relevant information.