Hi everyone,
So here is the issue. I want to run different specifications on the same set of cases (i.e., the same analytic sample), and I think this is usually straightforward with non-survey data. But I'm using a dataset to which I have applied svyset to account for the complex survey design. I have set the psu, sampling weight, and strata. I think I have run different specifications on the same set of cases, but across specifications I get different numbers for the "Number of obs," "Population size," "Subpop. no. obs," and "Subpop. size."
My code looks like the following:
Line
1 svy, subpop(if memberOfSiblings==1): reg outcome indepVar1 indepVar2 indepVar3 i.siblingSet_id
2 gen esmpl = e(sample)
3 svy, subpop(esmpl==1): reg outcome indepVar1
4 svy, subpop(esmpl==1): reg outcome indepVar1 indepVar2 indepVar3 i.siblingSet_id
In Line 1, there are two variables I want to be clear about. First, memberOfSiblings is an indicator variable I created to identify people who are part of my siblings sample. So memberOfSiblings==0 for people who are the only child in their family, and 1 otherwise. To be clear, my sub-population consists of people who have one or more siblings; I don't think this is any different than defining a subpopulation based on race or gender, although correct me if I'm wrong. Second, notice also that in Line 1, I think I'm controlling for sibling fixed effects by including i.siblingSet_id, where siblingSet_id is a numeric variable (e.g., 1, 2, ..., 300, 400) that groups siblings and assigns those from the same family the same unique identifier. By controlling for sibling fixed effects I think I'm comparing siblings within families.
In Line 2, I mark only those cases that were used in the estimation in Line 1. But when I run the statement in Line 3, which is the less restrictive specification (relative to the specification in Line 4), the "Number of obs," "Population size," "Subpop. no. obs," and "Subpop. size" are all larger than the values I obtain in Line 4.
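A minimal sketch of one way to compare the cases behind Lines 3 and 4, assuming the variables named above exist and svyset has already been applied. The flag names s3 and s4 are hypothetical and not part of my original code, and subpop() is written here in its "if" form:

```stata
* Sketch only: s3 and s4 are hypothetical flag variables, not in the original code.
* Re-run the specification from Line 3 and flag the observations it used.
svy, subpop(if esmpl==1): reg outcome indepVar1
gen byte s3 = e(sample)

* Re-run the specification from Line 4 and flag the observations it used.
svy, subpop(if esmpl==1): reg outcome indepVar1 indepVar2 indepVar3 i.siblingSet_id
gen byte s4 = e(sample)

* Cross-tabulate the two flags to see whether they mark the same cases.
tab s3 s4, missing

* See how each flag lines up with the subpopulation indicators.
tab s3 memberOfSiblings, missing
tab s4 memberOfSiblings, missing
tab esmpl memberOfSiblings, missing
```

Any off-diagonal cells in the first table would point to cases that enter one specification but not the other.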
My questions
- Why might "Number of obs" (which I take to refer to the number of cases in my dataset) differ across Lines 3 and 4? I trust the way I've coded memberOfSiblings.
- When you are using complex survey data and want to run different specifications on the same cases, is the procedure to do this (which I outlined in Lines 1-4 above) the same as the one you'd follow with data that you're not weighting?