  • Svy subpop - number of subpop observations

    Hi there,

    I am currently working on a longitudinal analysis of survey data. My analysis is only interested in a subsample of the full sample; therefore, I created a dummy variable that indicates whether an observation meets the criteria for the subsample (1 = eligible observation, 0 = non-eligible observation).

    When I use the count command to determine the number of eligible observations, the sample size is 2,576. Yet 435 of these observations have a weight of 0, so I anticipated that the sample size for the subsample would be 2,141 (2,576 minus 435).
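    For reference, a minimal sketch of how these two counts can be obtained. The variable names are placeholders, not the real ones from the data set: eligible stands for the subsample dummy and wt for the survey weight.

    ```stata
    * Hypothetical names: eligible = subsample dummy (1/0), wt = survey weight
    count if eligible == 1              // all eligible observations
    count if eligible == 1 & wt == 0    // eligible observations with a zero weight
    count if eligible == 1 & wt > 0 & !missing(wt)   // eligible with a positive, nonmissing weight
    ```

    The last count is the figure to compare with the subpopulation size that svy reports.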

    However, when I generate descriptive and inferential statistics using svy with the subpop() option, the sample sizes differ.



    When I generate the mean age for my subsample (see below), the sample size for the subsample is 2141, as expected.



    [Image: Mean.png]




    However, when I go to generate the proportions for racial discrimination (one of my predictors), the sample size drops to 2127. I thought this may be due to missingness in the racial discrimination variable, and I identified 15 observations that had .a (our code for not applicable) for this variable. However, the removal of these observations for the calculation of the proportions would result in a sample size of 2126, not 2127.
    [Image: W5RD.png]




    Further to this, when I ran nested logistic regression models on the same subsample, the sample size was reduced even further, to 2095. Again, I checked whether this was due to missingness in my analytical variables and found that 68 observations had .a in one or more of them. However, if these observations were excluded from the logistic models, the sample size would be 2073, not 2095.

    [Image: logistic.png]




    If anybody has any insight into why these sample sizes differ, it would be greatly appreciated.
    Last edited by Evie Gates; 17 Jun 2025, 11:48.

  • #2
    I thought this may be due to missingness in the racial discrimination variable, and I identified 15 observations that had .a (our code for not applicable) for this variable. However, the removal of these observations for the calculation of the proportions would result in a sample size of 2126, not 2127.

    Perhaps one of those 15 observations with w5RDindication == .a was not part of the estimation sample for the mean age calculation?

    Code:
    // RE-RUN THE ESTIMATION OF MEAN AGE HERE
    // Count the .a observations that were in that estimation sample
    count if w5RDindication == .a & e(sample)
    Perhaps this result is 14, not 15.
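
    A slightly fuller version of the same check, as a sketch only. The name eligible is a placeholder for the subpopulation dummy, which is not named in the thread. As far as I know, after a svy, subpop(): command e(sample) can also flag observations outside the subpopulation, so the extra condition restricts the check to the subpopulation itself.

    ```stata
    // Run right after re-estimating mean age, so that e(sample) refers to that model
    // eligible is a hypothetical name for the subpopulation dummy
    count if w5RDindication == .a & e(sample) & eligible == 1

    // Show every value, including extended missing values such as .a
    tabulate w5RDindication if e(sample) & eligible == 1, missing
    ```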



    • #3
      Hi Clyde,

      I have looked into the issue further, and I think it may be related to using the nestreg command together with subpop. The number of observations (full sample) for the logistic regression models is much lower than we would expect, whereas when I run these logistic regressions as separate models, the number of observations (full sample) is comparable to that reported in the descriptive statistics. So I think nestreg may be using only the subsample to calculate both the coefficients and the standard errors, when the subpop() option of the svy prefix is supposed to use the full sample for standard error calculations. However, I cannot find any information on using nestreg and subpop together.



      • #4
        I also do not know how -nestreg- calculates standard errors when applied to a -svy, subpop(...):- command. What I do know about -nestreg- is that in all circumstances, it first identifies the estimation sample that is available when all of the groups of predictor variables are used together. That is the set of observations that can be included in the estimation of all of the nested models that -nestreg- will work on. This is important because the comparisons of the nested models can only be valid when all of them are estimated on the same estimation sample. That common estimation sample is, necessarily, the smallest estimation sample that can be configured with those variables. So the sample size that -nestreg- works with is at most equal to the sample sizes you would get carrying out the nested estimates separately, and, much more commonly, it is smaller.
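
        One way to see this in practice is to build the common estimation sample by hand and fit the nested models separately on it. This is a minimal sketch under assumed names, none of which come from the thread: y is the outcome, x1 and x2 form the first block of predictors, x3 and x4 the second, and eligible is the subpopulation dummy. It illustrates the common-sample idea only and does not claim to reproduce what -nestreg- does internally with -svy-.

        ```stata
        * Hypothetical names: y, x1-x4, eligible
        * Number of missing values (including .a) across all variables used in any model
        egen byte nmiss = rowmiss(y x1 x2 x3 x4)

        * Subpopulation restricted to complete cases; the data set itself is not subsetted
        generate byte sub_complete = (eligible == 1 & nmiss == 0)
        count if sub_complete == 1

        * Fit each nested model on the same subpopulation
        svy, subpop(sub_complete): logistic y x1 x2
        svy, subpop(sub_complete): logistic y x1 x2 x3 x4
        ```

        Restricting through subpop() rather than with -if- or -keep- leaves the full design available to -svy- for variance estimation, and the subpopulation sizes reported by the two models can then be compared directly with the one -nestreg- reports.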
