
  • How to set -svyset- when multilevel modeling a binary outcome to take account of complex survey design?

    Dear all,

    I am reposting my previous unanswered query with more information in the hope I might get a response this time.

    My question essentially boils down to how to set -svyset- in order to run a 2-level logit model taking account of complex survey design (psu, strata, and weights).

    To give you a bit of background, I have gone through a variety of material including the Stata Manuals (here & here) as well as the book entitled 'Multilevel and Longitudinal Modelling using Stata' (2005 ed.) by Sophia Rabe‐Hesketh, but I am looking for someone to confirm (or not) my understanding. I tried to engage with another piece (Rabe‐Hesketh, S., and Skrondal, A. (2006). Multilevel modeling of complex survey data) but wasn't able to fully digest it, particularly once the discussion turned to the need for weights at both levels. As I mentioned, I also tried to get assistance from Stata users via the forum a couple of months ago, but unfortunately did not have much luck.

    Put succinctly, I am trying to run a 2-level logistic regression taking into account complex survey design, but I'm not quite sure if my Stata code is correct. I have chosen to run a 2-level logistic regression (i.e. using -melogit- because I cannot use -svyset- with -xtlogit-). I am using Stata 15.1.

    My data is 2-level in that it reflects multiple observations over time (2009-2018) nested within an individual (identified by the pidp variable in the code below). My dependent variable, empstatus, is coded as 1=employed and 0=unemployed. My dataset comprises approximately 160,000 observations.

    As stated above, my question essentially boils down to how to set -svyset- in order to run a 2-level logit model taking account of complex survey design (psu, strata, and weights). I think I have understood how to apply weights in a multilevel logit model, notably through the following code,

    Code:
    melogit empstatus i.gender age i.race [pweight=weight] || pidp: , allbase
    However, unless I am mistaken, this code doesn't take into account psu and stratification (identified through my 'strata' and 'psu' variables in the code below), which is what I would like to do. Please note the values for the 'strata', 'psu' and 'weight' variables are provided by the survey provider when I download the data, so I don't actually calculate them.

    From all the code I have played around with, Model 1 below seems to make the most sense to me (but as I said I'm not 100% certain it is indeed doing what I want).

    Model 1:

    Code:
    svyset, clear
    svyset psu, strata(strata) weight(weight) singleunit(scaled)
    svy: melogit y x1 x2 x3 ||pidp: , or
    However, I have also played around and found other possible ways I could set -svyset-, for example:

    Model 2: SAME AS MODEL 1 BUT 'psu' REPLACED BY 'pidp' IN THE SVYSET COMMAND


    Code:
    svyset, clear
    svyset pidp, strata(strata) weight(weight) singleunit(scaled)
    svy: melogit y x1 x2 x3 ||pidp: , or
    Model 3: THIS MODEL SETS -SVYSET- FOLLOWING THE METHOD IN THE STATA MANUAL (link 1 in this email p.107) BY CREATING A NEW WEIGHT 'pst1s1'


    Code:
    svyset, clear
    svyset pidp, weight(weight) singleunit(scaled) || _n, weight(pst1s1)
    svy: melogit y x1 x2 x3 ||pidp: , or
    Formula used to create 'pst1s1':


    Code:
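    sort pidp
    * squared observation-level weight
    generate sqw = weight * weight
    * within-person totals; egen's total() is the documented name for sum()
    by pidp: egen sumw = total(weight)
    by pidp: egen sumsqw = total(sqw)
    * rescaled observation-level weight
    generate pst1s1 = weight*sumw/sumsqw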
    Please note: when I create the 'new weight', pst1s1, it ends up having a value of 1 or 0.999999 for all observations.
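    One possible reason, assuming the supplied weight takes the same value on every row of a given pidp (an assumption, not confirmed for these data): if a person has n rows that all carry the weight w, then sumw = n*w and sumsqw = n*w^2, so pst1s1 = w*(n*w)/(n*w^2) = 1. Values displayed as 0.999999 would then be consistent with rounding in float storage. A minimal sketch of how this could be checked, where 'wvaries' is a hypothetical helper variable:

    Code:
    * flag persons whose weight is not constant across their rows
    * (sorting by weight within pidp puts the smallest first, the largest last)
    bysort pidp (weight): generate byte wvaries = weight[1] != weight[_N]
    tabulate wvaries
    * look at the spread of the rescaled weight
    summarize pst1s1, detail
    If wvaries is 0 throughout, the rescaled weight carries no information beyond a constant 1, which would fit the pattern described above.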


    Model 4: THIS MODEL SETS -SVYSET- FOLLOWING THE METHOD IN THE STATA MANUAL (link 1 in this email p.107) BY CREATING A NEW WEIGHT 'pst1s1' (see above) BUT, UNLIKE MODEL 3, I INCLUDE 'strata' IN THE -SVYSET- COMMAND


    Code:
    svyset, clear
    svyset pidp, strata(strata) weight(weight) singleunit(scaled) || _n, weight(pst1s1)
    svy: melogit y x1 x2 x3 ||pidp: , or
    Model 5: THIS MODEL IS THE SAME AS MODEL 4, EXCEPT THAT 'pidp' IS REPLACED BY 'psu' IN THE CODE FOR -SVYSET-


    Code:
    svyset, clear
    svyset psu, strata(strata) weight(weight) singleunit(scaled) || _n, weight(pst1s1)
    svy: melogit y x1 x2 x3 || pidp: , or
    Please note: Because the 'new weight', pst1s1, has a value of 1 or 0.999999 for all observations, the odds ratios and standard errors from Model 5 are identical to those in Model 1.


    Model 6: AN ALTERNATIVE METHOD WITHOUT USING SVYSET BUT ADDING ONLY WEIGHTS


    Code:
    melogit y x1 x2 x3 [pweight=weight] || pidp: , or


    All the above commands run with no error in Stata 15.1, and all models (bar model 6) yield the same odds ratio coefficients with slight differences in the standard errors. That said, I'm not sure which is the correct command to execute, and I'm keen to understand what exactly I am doing rather than simply 'push buttons' in Stata.
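    One way to see exactly where the specifications differ is to store each fit and tabulate the results side by side. This is only a sketch reusing the variable names above; the stored names m1 and m2 are illustrative, and only Models 1 and 2 are shown.

    Code:
    * Model 1 design: psu as the sampling unit
    svyset, clear
    svyset psu, strata(strata) weight(weight) singleunit(scaled)
    svy: melogit y x1 x2 x3 || pidp: , or
    estimates store m1

    * Model 2 design: pidp as the sampling unit
    svyset, clear
    svyset pidp, strata(strata) weight(weight) singleunit(scaled)
    svy: melogit y x1 x2 x3 || pidp: , or
    estimates store m2

    * coefficients (log odds, not odds ratios) with standard errors, side by side
    estimates table m1 m2, b(%9.4f) se(%9.4f)
    The point estimates in the two columns should match if only the variance estimation differs, which would isolate the effect of the first-stage unit on the standard errors.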


    Therefore, I would be very grateful if you could advise which of the above models is the correct one to execute if I want to run a 2-level logit regression taking into account complex survey design (i.e. strata, psu, & weights).

    Thank you in advance, and have a pleasant day.

    Samir Sweida-Metwally
    Last edited by Samir Sweida-Metwally; 27 Sep 2019, 11:08. Reason: Multilevel Modelling; complex survey design; svyset

  • #2
    Hello Samir,

    I have the same exact question and I was wondering which approach you eventually settled on. Any guidance would be greatly appreciated.

    Thanks,
    Petyr



    • #3
      Hi, Snap!

      I've been trying to figure this out with regard to analysing UK Household Longitudinal Study data, which has PSU, strata and individual-level longitudinal weights.

      I also read that the weights are required at different levels.

      So, obviously the individual weight will be entered for the individual level (below which is time, in my model).

      Now, with the higher levels this is tricky - I was told:

      "The cluster variable (PSU) should be one of your levels, usually the highest."

      But the PSU unit seems to be geographical and to be within one of my higher spatial units, so that's not going to work.

      - So I'm thinking, maybe a cross-classification MLM where individuals are members of both groups to allow for this?

      Also advised:
      "Sometimes stratification is allowed for in MLMs, but if not - excluding stratification makes estimates conservative - so not a problem."
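      For the simpler case where individuals are fully nested within PSUs, the first piece of advice quoted above (PSU as the highest level) could be sketched as a three-level model. This is an illustration only: it borrows the variable names y, x1 x2 x3, psu and pidp from the first post, assumes each pidp belongs to exactly one psu, and ignores weights and strata.

      Code:
      * occasions nested within persons (pidp), persons nested within PSUs (psu)
      * highest level is listed first; weights and strata are not handled here
      melogit y x1 x2 x3 || psu: || pidp: , or
      This does not cover the situation where PSUs cut across another higher-level spatial unit, which is the case that motivates the cross-classified idea.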

      Hope this is of some help to you!



