  • -xtnbreg, fe- minimum observations per group

    Hello dear forum members,

    I am estimating a panel model with fixed effects using yearly data from 2009 to 2013 (N=1848). Whereas 80% of the data have less than 5% missing values, the remaining 20% account for 75% of the missing values. These "problematic" 20% of the data are review-based rating scores, which (due to their nature) are not given every year to every entity in the data set. For example, an entity may be reviewed (and given a rating) once or twice in 5 years (and not necessarily in consecutive years), resulting in missing values.

    -xtnbreg, fe- estimation results in a total of N=751 groups and 2,049 observations: minimum observations per group = 2, maximum = 5, average = 2.7.
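    For reference, a minimal sketch of the setup (the entity identifier and variable names here are placeholders, not the actual ones in my data):

```stata
* declare the panel structure (identifiers are hypothetical)
xtset entity_id year

* fixed-effects negative binomial estimation;
* observations with a missing rating_score are dropped listwise
xtnbreg outcome rating_score control1 control2, fe
```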

    Does this relatively low average number of observations per group (i.e., 2.7 out of 5) bias the estimates substantially? Or does it pose any other problems?

    Thank you in advance,
    Anton
    Last edited by Anton Ivanov; 07 Jun 2015, 13:37.

  • #2
    The number of observations that are missing the review-based rating scores isn't all that important in terms of bias (precision is a different matter). The two most relevant questions are:

    1. Is this score the dependent variable or one of the predictors?

    2. Why are these rating scores missing in the particular years they are missing? If the scheduling of these rating sessions is at regular intervals, or at intervals selected by random assignment, or is based only on factors that are independent of the unobserved rating score, then no bias is introduced through the missingness. But if it's like an accreditation review, where you get re-evaluated at a shorter interval if your last review wasn't so great, then the missingness is by no means ignorable and you likely have a bias problem on your hands. If you have appropriate variables such that the missingness is independent of the unobserved values conditional on those other variables, then you may be able to reduce the bias using multiple imputation.
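    A minimal sketch of that multiple-imputation workflow in Stata (variable names are placeholders, and the imputation model here is purely illustrative; it assumes you have covariates that plausibly support a missing-at-random argument, and that -xtnbreg- is among the commands supported by -mi estimate- in your Stata version):

```stata
* set up the mi data and register the partially observed variable
mi set mlong
mi register imputed rating_score
mi xtset entity_id year

* impute from fully observed covariates (model choice is illustrative)
mi impute regress rating_score control1 control2 i.year, add(20) rseed(12345)

* fit the model in each imputed data set and combine the results
mi estimate: xtnbreg outcome rating_score control1 control2, fe
```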

    • #3
      Thank you for your reply, Professor Schechter. Please see my responses below:

      Originally posted by Clyde Schechter View Post
      The number of observations that are missing the review-based rating scores isn't all that important in terms of bias (precision is a different matter). The two most relevant questions are:

      1. Is this score the dependent variable or one of the predictors?
      The rating score is an independent variable.

      Originally posted by Clyde Schechter View Post
      2. Why are these rating scores missing in the particular years they are missing? If the scheduling of these rating sessions is at regular intervals, or at intervals selected by random assignment, or is based only on factors that are independent of the unobserved rating score, then no bias is introduced through the missingness. But if it's like an accreditation review, where you get re-evaluated at a shorter interval if your last review wasn't so great, then the missingness is by no means ignorable and you likely have a bias problem on your hands.
      Consider, for example, ratemds.com or vitals.com: any online user can review a doctor at any time, completely at random. Whereas some doctors have multiple user reviews (and corresponding star rating scores) across a number of years, others may have only a single review in all the years.

      Originally posted by Clyde Schechter View Post
      If you have appropriate variables such that the missingness is independent of the unobserved values conditional on those other variables, then you may be able to reduce the bias using multiple imputation.
      I did consider MI, yet I am afraid that 75% missingness is too much, and the reviewers may raise objections to it.


      • #4
        Well, it is often said of consumer ratings, though I don't know if it's really true, that people are more likely to rate if their experience was particularly good or particularly bad. In any case, I think it would be very difficult to assert that missingness is independent of the unobserved value and keep a straight face. So I think ignoring the missingness is out of the question.

        There may be people on the forum who know more about consumer ratings and can provide better advice on how to handle it. It may be that there are other variables that you might be able to rely upon to claim missingness at random and use multiple imputation. But I wouldn't know. Unless somebody can give you a credible argument that these ratings are missing at random, and identify the covariates that support that, MI won't really help with bias reduction.

        If I were left to my own devices and faced this situation I'd probably do two analyses: one which used the full data set and omitted the ratings variable altogether, and another which included it (and, necessarily, included only the observations with the ratings). And I would hope that the results were, in other respects, highly similar.
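        In Stata terms, that comparison might look like the following sketch (variable names are hypothetical):

```stata
* analysis 1: full sample, rating omitted
xtnbreg outcome control1 control2, fe
estimates store no_rating

* analysis 2: restricted sample with the rating included
xtnbreg outcome rating_score control1 control2, fe
estimates store with_rating

* side-by-side comparison of coefficients and standard errors
estimates table no_rating with_rating, b se
```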

        Actually, although some reviewers do tend to look askance at analyses with only a small fraction of the data being unimputed, there is nothing in the theorems underlying how multiple imputation reduces bias that says this is a problem. What is true is that the less real data you have, the more imputations you have to run to get results with useful precision. But that's just a matter of putting in the time and effort.

        • #5
          Originally posted by Clyde Schechter View Post
          Well, it is often said of consumer ratings, though I don't know if it's really true, that people are more likely to rate if their experience was particularly good or particularly bad. In any case, I think it would be very difficult to assert that missingness is independent of the unobserved value and keep a straight face. So I think ignoring the missingness is out of the question.
          Sir, you are correct about what people say about the ratings. I include a qplot of the ratemds ratings as an example (vitals are very similar). And I cannot ignore the missingness; this is indeed a problem.

          [Image: qplot_rmd.png — quantile plot of ratemds rating scores]

          Originally posted by Clyde Schechter View Post
          If I were left to my own devices and faced this situation I'd probably do two analyses: one which used the full data set and omitted the ratings variable altogether, and another which included it (and, necessarily, included only the observations with the ratings). And I would hope that the results were, in other respects, highly similar.
          I did conduct these two analyses, and the estimates for all the predictors (except the rating score) are quite similar. However, it is precisely the impact of the rating score that I am interested in; the rest of the regressors serve mainly as controls.

          On a side note: I have collected review-based rating scores from both ratemds.com and vitals.com. The correlation between them is 0.35 (lower than I expected). The mean scores vary interestingly across years (see below; sorry for the accidentally switched axis names). Included in the model separately, neither has a significant effect. However, when both are included, both are significant, with opposite signs.

          [Image: rmd_vitals_mean_years.png — mean rating scores by year for ratemds and vitals]

          Originally posted by Clyde Schechter View Post
          Actually, although some reviewers do tend to look askance at analyses with only a small fraction of the data being unimputed, there is nothing in the theorems underlying how multiple imputation reduces bias that says this is a problem. What is true is that the less real data you have, the more imputations you have to run to get results with useful precision. But that's just a matter of putting in the time and effort.
          I absolutely agree with you, Sir. This may be my only remaining option, since collecting more data (i.e., increasing N) may not yield a higher number of valid observations anyway.
          Last edited by Anton Ivanov; 07 Jun 2015, 16:46.

          • #6
            Very interesting. That quantile plot is really dramatic!

            The finding that when both are included they are both significant but with opposite signs suggests that, at least under the conditions when both ratings are available, the outcome you are studying is, in a sense, predicted by a weighted difference between them!
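            One simple way to probe that idea, as a sketch with hypothetical variable names (this imposes an unweighted difference, a constrained version of letting the two coefficients differ freely):

```stata
* replace the two separate ratings with a single difference term
gen rating_diff = rmd_rating - vitals_rating
xtnbreg outcome rating_diff control1 control2, fe
```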

            Good luck. I can't think of anything more that would be helpful at this point.
