  • Sample size for different models

    I am working on my Master's thesis and am unsure which sample I should use for different models.

    My original dataset, after restricting age to the working population, has 80,000 observations. I plan to estimate two models.

    The first model estimates wage equations for workers in different locations:
    1. reg wage x1 x2 x3 if location=="England" ; obs: 21000
    2. reg wage x1 x2 x3 if location=="Wales" ; obs: 18000
    3. reg wage x1 x2 x3 if location=="Ireland" ; obs: 22000

    The second model estimates wellbeing equations:
    1. reg wellbeing x1 x2 x3 x4 x5 x6 if location=="England" ; obs: 7000
    2. reg wellbeing x1 x2 x3 x4 x5 x6 if location=="Wales" ; obs: 3000
    3. reg wellbeing x1 x2 x3 x4 x5 x6 if location=="Ireland" ; obs: 6000

    Because my second model has more control variables, and some of them have missing values, it has fewer observations.

    My question is: when I describe my sample in the descriptive statistics section, do I report it as 80,000 observations even though the majority are not used in the regressions? Should I consider dropping some variables with missing values? Thanks in advance.

  • #2
    Welcome!

    The second analysis loses 80% of the sample (16,000 of 80,000 observations remain), and that is a lot. The first question readers will ask is "Are the retained 20% similar to the full sample?" So you should add a column to your descriptive statistics table that covers only the 16,000 people in the second analysis. Another preemptive step is to rerun the first analysis on the same retained 20% and check whether the two sets of results are comparable.
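
    The two checks above could be sketched roughly as follows. This is only an illustration: it assumes the variable names from the original post (wage, wellbeing, x1 to x6, and a string variable location), and wb_sample is a hypothetical indicator name introduced here.

    ```stata
    * Assumed names: wage, wellbeing, x1-x6, and a string variable location.
    * wb_sample is a hypothetical flag: 1 if every wellbeing-model variable is nonmissing.
    generate byte wb_sample = !missing(wellbeing, x1, x2, x3, x4, x5, x6)

    * Descriptive statistics: full sample first, then the retained subsample.
    summarize wage x1 x2 x3 x4 x5 x6
    summarize wage x1 x2 x3 x4 x5 x6 if wb_sample

    * Wage equation for one location: all available data vs. the retained subsample.
    regress wage x1 x2 x3 if location == "England"
    estimates store wage_full
    regress wage x1 x2 x3 if location == "England" & wb_sample
    estimates store wage_sub

    * Side-by-side coefficients, standard errors, and N for the two versions.
    estimates table wage_full wage_sub, b se stats(N)
    ```

    The same pair of regressions would be repeated for Wales and Ireland. If the coefficients differ sharply between the two columns, that suggests the retained observations are not representative of the full sample.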

    As for whether you should drop the variable that causes so many missing values, that is better discussed with your advisor. It is often possible to identify proxy variables that can stand in for a variable with a large share of missing values. Identifying them, however, requires some understanding of the causal framework and the literature, so you will need your advisor's input.

    Lastly, check again why those values are missing. Sometimes it is due to logical skip patterns built into the survey. (For example, someone might mistakenly use "number of cigarettes smoked" to represent smoking without knowing that a screening question before it asked "do you ever smoke?" All the non-smokers would then be missing on "number of cigarettes smoked," when they should actually have a 0.)
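
    A minimal sketch of that check, using the smoking example. The variable names ever_smoke (1 = yes, 0 = no) and num_cigs are hypothetical, and the coding of the screening question is an assumption; your survey's codebook will say how it is actually coded.

    ```stata
    * Hypothetical variables: ever_smoke (1 = yes, 0 = no) and num_cigs.
    * Overview of how many values are missing in each model variable.
    misstable summarize wellbeing x1 x2 x3 x4 x5 x6

    * See how the screening question is distributed, including missing values.
    tabulate ever_smoke, missing

    * Count follow-up values that are missing only because the question was skipped.
    count if missing(num_cigs) & ever_smoke == 0

    * Recode those structural missings to 0; genuine nonresponse stays missing.
    replace num_cigs = 0 if missing(num_cigs) & ever_smoke == 0
    ```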



    • #3
      (In addition to the great points by Ken.)

      Dear Albert, the way I see your study (assuming no errors) is that your overall sample is the 80,000 observations for which you collected data, and missing values reduce it for the individual regressions in each of the three locations. Each regression answers a question specific to one location, and the number of observations you report for it is specific to that regression equation. The overall sample size of your study is still 80,000 if your intention was to have complete data on all variables for these 80,000 observations spread over the three locations, and reporting it reflects your methodology and data collection. The analysis and the choice of explanatory variables should depend on your hypotheses rather than be driven by the data.
