
  • How to keep the same analytic sample across specifications in data with a complex survey design

    Hi everyone,

    So here is the issue. I want to run different specifications on the same set of cases (i.e., the same analytic sample), and I think this is usually straightforward with non-survey data. But I'm using a dataset to which I have applied svyset to account for the complex survey design. I have set the psu, sampling weight, and strata. I think I have run different specifications on the same set of cases, but across specifications I get different numbers for the "Number of obs," "Population size," "Subpop. no. obs," and "Subpop. size."

    My code looks like the following:

    Line
    1 svy, subpop(if memberOfSiblings==1): reg outcome indepVar1 indepVar2 indepVar3 i.siblingSet_id
    2 gen esmpl = e(sample)
    3 svy, subpop(esmpl==1): reg outcome indepVar1
    4 svy, subpop(esmpl==1): reg outcome indepVar1 indepVar2 indepVar3 i.siblingSet_id

    In Line 1, there are two variables I want to be clear about. First, memberOfSiblings is an indicator variable I created to identify people who are part of my siblings sample. So memberOfSiblings==0 for people who are the only child in their family, and 1 otherwise. To be clear, my sub-population consists of people who have one or more siblings; I don't think this is any different than defining a subpopulation based on race or gender, although correct me if I'm wrong. Second, notice also that in Line 1, I think I'm controlling for sibling fixed effects by including i.siblingSet_id, where siblingSet_id is a numeric variable (e.g., 1, 2, ..., 300, 400) that groups siblings and assigns those from the same family the same unique identifier. By controlling for sibling fixed effects I think I'm comparing siblings within families.

    In Line 2, I mark only those cases that were used in the estimation in Line 1. But when I run the statement in Line 3, which is the less restrictive specification (relative to the specification in Line 4), the "Number of obs," "Population size," "Subpop. no. obs," and "Subpop. size" are all larger than the values I obtain in Line 4.

    My questions
    1. Why might Number of obs (which I take to refer to the number of cases in my dataset) differ across Lines 3 and 4? I trust the way I've coded memberOfSiblings.
    2. When you are using complex survey data and want to run different specifications on the same cases, is the procedure to do this (which I outlined in Lines 1-4 above) the same as the one you'd follow with data that you're not weighting?
    I guess I'm just wondering whether running different specifications on the same cases in complex survey data is really conceptually quite different than running different specifications on say administrative data, since in the former each case is weighted to represent a larger number of cases.
    Last edited by Elc Estrera; 09 Sep 2018, 15:34. Reason: Edit: Fixed a typo.

  • #2
    The most likely explanation is that the discrepancy is due to missing values of indepVar2, indepVar3, or siblingSet_id. Remember that -regress- omits any observation in which any variable mentioned in the command has a missing value. So if there are observations that have non-missing values for indepVar1 but are missing a value on indepVar2, indepVar3, or siblingSet_id, those observations are included in the third command you show but excluded in the fourth.
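
    The effect of that casewise deletion can be seen in a small sketch. The data and the variable names y, x1, and x2 below are invented for illustration and are not the poster's; plain -regress- is used rather than -svy- so the example runs without a survey design.

```stata
* Toy data, invented for illustration: x2 is missing in two of the eight rows
clear
input y x1 x2
1 1 2
2 2 .
3 3 1
4 5 3
5 4 .
6 6 5
7 8 4
8 7 6
end

* Shorter specification: x2 is not mentioned, so all eight rows are used
regress y x1
scalar n_short = e(N)

* Longer specification: the two rows with missing x2 are dropped
regress y x1 x2
scalar n_long = e(N)

display "N without x2: " n_short "   N with x2: " n_long
```

    Here e(N) is 8 for the first model and 6 for the second, which mirrors the gap between the third and fourth commands.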

    The way to ensure that the same sample is used for different nested specifications is to run the specification containing the most variables first, and then use the e(sample) results from that analysis to restrict the sample in the other regressions (which contain fewer variables). In this case, with just two different specifications, you just need to run them in the reverse order.
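
    One way to make this explicit with survey data is to build the analytic-sample marker yourself and pass it to subpop() in every model. The sketch below is only an illustration on Stata's shipped nhanes2 example data (which is already svyset), not on the poster's dataset. The missing values in height are introduced artificially, and the names complete and analytic are hypothetical. With the poster's data, the analogous marker would combine memberOfSiblings==1 with non-missing values on every variable in the fullest model.

```stata
* Example dataset shipped with Stata; already svyset (needs web access)
webuse nhanes2, clear

* Introduce some missing values in one regressor, purely for illustration
replace height = . if mod(_n, 10) == 0

* Complete-case flag over every variable in the fullest specification
generate byte complete = !missing(bpsystol, age, weight, height)

* Analytic subpopulation: the substantive subgroup AND complete data
generate byte analytic = (female == 1) & complete

* Shorter specification, restricted to the analytic subpopulation
svy, subpop(analytic): regress bpsystol age
scalar n_small = e(N_sub)

* Fullest specification, same subpopulation marker
svy, subpop(analytic): regress bpsystol age weight height
scalar n_full = e(N_sub)

display "Subpop. no. obs: " n_small " vs " n_full
```

    Because the marker already excludes cases with missing values on any model variable, "Subpop. no. obs" should agree across the two models. Using subpop() rather than an if qualifier keeps the full design information available for variance estimation.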
