  • Number of observations varying across years

    Hi,

    the dataset I am using contains observations from 18 years; in every year there are between 200 and 500 observations. But in the most recent year, there are 1,000 observations. I am not sure whether I should exclude this most recent year from my analysis, since I have so many more observations here, which might bias the outcome of my regression analysis (especially if the influence of certain regressors varies across time).
    It would be great if you could tell me your opinion on this problem. Do you think using weights (sampling or frequency?) could allow including all observations (even the 1,000 from the most recent year)?

    Thank you very much for your comments & help!
    Ally

  • #2
    Ally:
    Excluding observations is, in general, a dangerous approach, as you may well end up with a made-up sample that has little to do with the original dataset.
    That said, you do not explain what the goals of your research are or why older years report fewer observations than more recent ones.
    As far as sampling weights are concerned, in instances like yours they are worth considering if you can rely upon sound assumptions that the probability of drawing observations differs across years.
    As an aside, if you have time-varying predictors in your (I assume) panel dataset, in order to increase your chances of getting helpful replies you should also provide more details about the way you intend to develop your regression model (e.g., fixed- or random-effects specification?).
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      In addition to Carlo's good advice, I think an important consideration here is understanding why there is so much more data for this most recent year. That, in conjunction with your research goals, would be crucial for figuring out whether including or excluding that year, or some other approach such as weighting, would lead to the most appropriate analysis.


      • #4
        Thank you very much for your advice!

        The reason why there is so much more data for the most recent year is my method of data editing.
        I have panel data but I just want to keep the most recent observation for each person. Therefore, first of all, I edited my data in the way I needed it (reducing it to relevant age groups, education, occupation and so on) and in the end I only kept the most recent observation for each person. As a result, I have many observations for the "oldest" age group as well as for the most recent survey year.
        Of course, this pattern does not change, no matter which survey years I include or exclude from my data. Therefore, I was thinking that it might be advantageous to simply delete all observations from the most recent survey year as well as all observations of persons in the "oldest" age group, so that I do not have one survey year or age group with many more observations than the rest.
        Do you think this is a good idea or does it harm the quality of my data/analyses?
        I hope my explanation is fairly comprehensible.

        Thanks a lot!
        Ally
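
The pile-up described in #4 can be reproduced with a small simulation. The following is a minimal Python sketch, not the poster's actual data or code: the person count, the years 2001 to 2018, the 90% retention rate, and the function names are all assumptions made for illustration. It builds a toy panel in which people enter in a random year and may drop out later, keeps only each person's most recent observation, and tabulates the result by year.

```python
import random
from collections import Counter


def simulate_panel(n_persons=1000, first_year=2001, last_year=2018,
                   stay_prob=0.9, seed=1):
    """Build a toy panel as a list of (person, year) rows.

    All parameters are hypothetical: each person enters in a random year
    and then remains in the survey each following year with probability
    stay_prob, up to the final survey year.
    """
    rng = random.Random(seed)
    panel = []
    for person in range(n_persons):
        year = rng.randint(first_year, last_year)
        while True:
            panel.append((person, year))
            if year == last_year or rng.random() > stay_prob:
                break
            year += 1
    return panel


def last_obs_per_person(panel):
    """Return {person: most recent year observed}."""
    latest = {}
    for person, year in panel:
        if person not in latest or year > latest[person]:
            latest[person] = year
    return latest


panel = simulate_panel()
latest = last_obs_per_person(panel)
counts = Counter(latest.values())

# Tabulate how many "most recent" observations fall in each survey year.
for year in sorted(counts):
    print(year, counts[year])
```

Under these assumptions, everyone still in the panel at the end contributes their final-year row, so the last survey year dominates the table regardless of which years are dropped afterwards. The imbalance is a consequence of the selection rule rather than of the survey itself.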


        • #5
          Ally:
          I'm not clear about the approach you followed, nor about your research goal.
          Anyway, it seems that you have created a subsample from the original dataset.
          Whether this is what you need for your research I cannot say, but the suspicion that you are going to analyze a different beast from the original one still holds.
          Kind regards,
          Carlo
          (Stata 19.0)
