  • Specification of svyset for melogit model of complex survey data

    Hi,

    I'm looking for assistance in correctly specifying a random effects logistic model using complex survey data.

    I am using Demographic and Health Survey (DHS) data, which are collected using a two-stage cluster design. Clusters are selected within strata with probability proportional to size, and households are then randomly selected within clusters. The data of interest are collected from individuals within these households. The only available weight variable is for households (all individuals within a household are assigned the same weight). DHS does not produce cluster- or strata-level weights. A final complication is that I am trying to run my model with pooled data from 29 surveys (147,175 observations).

    The end goal is to adjust for the random effect of the survey (n=29). It is unclear to me how to properly svyset my data to account for the pooled data as well as for the survey design with the limitations of the available weight variables.

    For any individual survey I would normally use:
    Code:
    svyset psu [pw=hhwgt], strata(strata)
    svy: logistic outcome exposure a b c d, or
    In a melogit model I am having trouble figuring out the proper svyset command. I want to be able to account for random effects of the survey in my pooled model. For example:
    Code:
    svyset survey || psu, strata(strata) || _n, weight(hhwgt)
    svy: melogit outcome exposure a b c d, or || survey:
    This code produces an error: "too many weight variables svyset; there are more svyset weight variables than levels specified in the model. an error occurred when svy executed melogit"

    If I run instead:
    Code:
    svy: melogit outcome exposure a b c d, or || survey: || psu:
    I get the following error:
    "numerical overflow;
    You have attempted something that, in the midst of the
    necessary calculations, has resulted in something too large
    for Stata to deal with accurately. Most commonly, this is
    an attempt to estimate a model (say with regress) with more
    than 2,147,483,647 effective observations. This effective
    number could be reached with far fewer observations if you
    were running a frequency-weighted model."

    If I group the survey and strata:
    Code:
    egen stratagroup=group(survey strata)
    svyset stratagroup || _n, weight(hhwgt)
    svy: melogit outcome exposure a b c d, or || stratagroup:
    or if I group survey and psu:
    Code:
    egen psugroup=group(survey psu)
    svyset psugroup || _n, weight(hhwgt)
    svy: melogit outcome exposure a b c d, or || psugroup:
    then the model converges and gives output. For example:
    [Image: Capture.PNG, model output]

    I'm just uncertain that either of these provides the proper specification of the survey design.

    Advice would be most welcome.


  • #2
    Welcome to Statalist, Lia. Exactly what do you mean by saying the goal is "to adjust for the random effect of the survey"?

    And what do you mean by "survey"? There can be surveys in different countries in the same year, or surveys in the same country in multiple years. So, if "year" and "country" define a survey, how many years are there and how many countries?

    Steve Samuels
    Statistical Consulting

    Stata 14.2



    • #3
      Hi Steve,

      Thank you for your reply.

      I am looking at nationally-representative household survey data from 29 surveys in sub-Saharan Africa. Each survey used a similar sample design, employing the most recent census as a sample frame and selecting clusters using probability proportional to size sampling among the identified strata (usually region and urban/rural). Households are then randomly selected within those clusters. I am using individual-level data on children within the selected households. The weight variables available represent both the likelihood of being sampled and the likelihood of non-response.

      I am interested in associations between several household and individual-level variables and my individual-level outcome, but would like to adjust for the non-independence of the pooled data at the survey level, in addition to using svy commands to account for the survey design. Although you are correct that there is likely to be a time effect, as the surveys were not all conducted in the same year, I am not particularly interested in that element.

      I have explored using meta-analysis (metan macro) with the re option but it is unclear to me what assumptions are being made about the data structure and whether or not this option would be preferable to an melogit model. In addition, it would be nice to have a table of the adjusted coefficients of all of the covariates in the pooled model which is not produced in the meta-analysis. My understanding is that the melogit model may produce biased estimates without specifying weights at both the individual level and the cluster level (perhaps the survey level as well?). The only weight variable I have at my disposal is a combination of individual and cluster level adjustment. I found a paper from 2009 that suggests options for how to address this situation (scaling weights or using unweighted data) (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2717116/) but was hoping that updates had been made since that time that might help.
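      To make the weight-scaling idea concrete, here is a minimal sketch of one commonly described rescaling, in which the weights within each cluster are rescaled to sum to the cluster's sample size. It runs on toy data; the values and the variable hhid are made up for illustration, and psugroup and hhwgt simply reuse the names from my earlier post. I have not confirmed that this is exactly the method recommended in the linked paper, and it treats the overall household weight as if it were a within-cluster weight, which is an assumption.

      ```stata
      * Toy data (hypothetical values): two clusters, weight constant within household
      clear
      input psugroup hhid hhwgt
      1 1 1.5
      1 1 1.5
      1 2 0.5
      2 3 2.0
      2 4 1.0
      2 4 1.0
      end

      * Rescale so that the weights in each cluster sum to the cluster sample size
      bysort psugroup: egen double sumw = total(hhwgt)
      by psugroup: gen long n_j = _N
      gen double w_scaled = hhwgt * n_j / sumw

      * Check: the scaled weights add up to n_j within every cluster
      by psugroup: egen double chk = total(w_scaled)
      assert abs(chk - n_j) < 1e-8
      ```

      The rescaling leaves the relative weights within a cluster unchanged; only their total changes.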

      Thanks,
      Lia



      • #4
        Countries are not sampled. Therefore, if you want to study between-country variation, I don't think that you can do a survey analysis. You'll want to fit at least a three-level meglm model, with levels country, PSU, and HH. I don't see how to add the sampling region information to a multilevel model, but you should include urban/rural as a covariate.
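        A rough sketch of such a model follows, assuming the pooled file is in memory. The names survey, psu, outcome, exposure and a b c d are taken from the first post; hhid and urban are hypothetical names for the household identifier and the urban/rural indicator, and survey stands in for the country level. The model is unweighted and is meant only to show the nesting, not a recommended final specification.

        ```stata
        * Sketch only: hhid and urban are assumed variable names.
        * PSU and household codes usually repeat across surveys, so build unique IDs.
        egen long psu_id = group(survey psu)
        egen long hh_id  = group(survey psu hhid)

        * Unweighted random-intercept logistic model: survey > PSU > household
        meglm outcome exposure a b c d i.urban ///
            || survey: || psu_id: || hh_id:, family(bernoulli) link(logit)
        ```

        With 147,175 observations and three sets of random intercepts, estimation may be slow, so it could be worth trying the model on a subset of surveys first.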

        Some questions:
        You refer to "weight variables" that represent sampling and non-response probabilities. What are they? (Use their names.) Is there a separate variable for HH and child non-response?

        To compute the PSU-level weights, we would need: 1) a variable containing the "size" of the study PSU that was used in the selection process (I'm guessing that this was the census number of HH, but if not, give details); and 2) a variable containing the number of HH selected in each PSU.
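        As an illustration of the arithmetic, here is a sketch on toy data. It assumes PPS selection of PSUs within a stratum followed by an equal-probability sample of listed HH within each PSU. All variable names (n_psu_sel, psu_size, stratum_size, hh_listed, hh_selected) and all values are hypothetical, and non-response adjustments are ignored.

        ```stata
        * Toy data (hypothetical): two PSUs in a stratum of 4000 census HH
        clear
        input psu_id n_psu_sel psu_size stratum_size hh_listed hh_selected
        1 2 200 4000 210 21
        2 2 100 4000  90 18
        end

        * Stage 1: P(PSU selected) = n_psu_sel * psu_size / stratum_size
        gen double w_psu = stratum_size / (n_psu_sel * psu_size)
        * Stage 2: P(HH selected | PSU) = hh_selected / hh_listed
        gen double w_hh  = hh_listed / hh_selected
        * Overall HH weight is the product of the stage weights
        gen double w_all = w_psu * w_hh
        list psu_id w_psu w_hh w_all

        * With one weight per stage, svyset could then take a form such as:
        * svyset psu_id, weight(w_psu) strata(strata) || _n, weight(w_hh)
        * svy: melogit outcome exposure a b c d, or || psu_id:
        ```

        The commented svyset line gives one weight for each level in the model, which is the pairing that the "too many weight variables" error in the first post is complaining about.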
        Last edited by Steve Samuels; 17 Jul 2016, 17:06.
        Steve Samuels
        Statistical Consulting

        Stata 14.2
