  • Using logistic regression for (small n) census data

    Hello everyone,

    I'm currently studying how ambitious the national implementations of the EU Emissions Trading Scheme (EU ETS) are as a function of various independent variables (GDP per capita, industries' share of GDP in %, members of environmental NGOs as a % of the national population, etc.). As the EU directive only set the outer limits, the Member States had plenty of leeway to tailor the national implementation to their national needs. My goal is to demonstrate that the greater the resources of the affected industries, the laxer the national implementation of the EU ETS will be. Conversely, the greater the resources of environmental NGOs, the more ambitious the national implementation will be. The theoretical model is actually a lot more complicated, but I don't want to bother you unnecessarily. If something is unclear, however, I'd be happy to elaborate on my theoretical model.

    I collected data for all 27 Member States (as of that point) and therefore have a full census. This leads me to the following questions:

    - If I use the data to draw conclusions about Member States' general behavior in EU environmental policy, will that justify the use of a census?
    - The census only has 27 cases and therefore does not satisfy the rule of thumb of 10 observations per independent variable. Can I multiply the cases by, e.g., 5? As it's a census, the distribution of values shouldn't change, but maybe I'm missing other details.

    I'd be really grateful for any comments and indications!
    Best regards,
    Maya

  • #2
    This is something of a philosophical question. If your goal is to calculate the relationship between the resources of the affected industries and the laxity of national implementation of the EU ETS specifically for exactly these 27 countries during exactly the time period of your data, and if you are willing to suspend disbelief and assert that your data contain no measurement error, then you have a census and all you need to do is run your model and use the coefficients. (You also have to believe that your model is correctly specified.) The standard errors and the usual apparatus of statistical inference (CIs, p-values) are irrelevant. The ratio of observations to variables would also be irrelevant as that, too, refers to the applicability of your findings outside of your sample.

    More briefly put, if you are satisfied that your data are perfect and complete measures of all information bearing on the question, then ordinary statistical inference goes out the window.

    On the other hand, you may wish to make generalizations beyond your sample, such as to different time periods, or to imagine that the same ETS might be applied elsewhere. In that case, sampling theory does apply, though as far as inference to other times and places is concerned, your sample is highly restricted and you are engaging in some risky extrapolation. And you may wish to consider that your variables contain measurement error that represents a sample of the possible values of the variables. For this context, statistical inference would be as appropriate as any ordinary application of standard inferential statistics and the usual "rules" would apply.



    • #3
      The advice in Clyde's first paragraph applies only if you believe that your model is purely descriptive. So, I go with Clyde's second paragraph, which invokes what is known as a super-population approach. In that case, you really are limited to models with 3-4 predictors. Overfitting means that your model is so tailored to the current data set that it will not replicate in others. Adding copies of the data will not protect against this: the regression coefficients would be identical, but it will invalidate all reported significance levels.
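
To see why copying the data cannot help, here is a minimal sketch in Python with NumPy (not Stata, and not anything posted in this thread). The 27 data points are made up purely for illustration, and `fit_logit` is a hypothetical helper that fits a logistic regression by Newton-Raphson. Stacking five copies of the data multiplies both the score and the information matrix by 5, so the fitted coefficients are unchanged while the reported standard errors shrink by a factor of sqrt(5).

```python
import numpy as np

def fit_logit(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic regression via Newton-Raphson.

    Returns (coefficients, model-based standard errors).
    Hypothetical helper for illustration only.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities
        grad = X.T @ (y - p)                       # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])  # information matrix
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hess = X.T @ (X * (p * (1 - p))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(hess)))
    return beta, se

# Made-up data: 27 "countries", one predictor, 13 events and 14 non-events.
x = np.linspace(-2.0, 2.0, 27)
y = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1,
              0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])  # intercept + predictor

beta_1, se_1 = fit_logit(X, y)

# "Multiply the cases by 5": stack five identical copies of the data.
X5 = np.tile(X, (5, 1))
y5 = np.tile(y, 5)
beta_5, se_5 = fit_logit(X5, y5)

print("coefficients, original:  ", beta_1)
print("coefficients, 5 copies:  ", beta_5)
print("SE ratio original/copies:", se_1 / se_5)  # sqrt(5) for every coefficient
```

The smaller standard errors in the duplicated fit reflect no new information, which is why the resulting p-values and confidence intervals cannot be trusted.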
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2



      • #4
        I overlooked your title. With logistic regression, the relevant n for the rules of thumb is not the number of observations, but the minimum of the number of events and non-events. With n = 27 that minimum will be at most 13, and you can plan on fitting 1 or 2 predictors. My suggestion would be to switch to a continuous or scored outcome, so that you can use linear regression.
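
The arithmetic behind this can be sketched in a few lines of Python. The function names and the default threshold of 10 per predictor are assumptions for illustration: the 10 is taken from the rule of thumb quoted in the original question, and the looser value of 5 is only one possible relaxation.

```python
def limiting_sample_size(n_events, n_nonevents):
    """Effective n for logistic-regression rules of thumb:
    the smaller of the number of events and non-events."""
    return min(n_events, n_nonevents)

def max_predictors(n_events, n_nonevents, per_predictor=10):
    """Number of predictors supported under an 'events per predictor' rule.
    The threshold is a rule of thumb, not a hard limit."""
    return limiting_sample_size(n_events, n_nonevents) // per_predictor

n = 27
# Most balanced split possible with 27 cases: 13 versus 14.
best_case = limiting_sample_size(n // 2, n - n // 2)

print(best_case)                                # 13
print(max_predictors(13, 14))                   # 1 predictor at 10 per predictor
print(max_predictors(13, 14, per_predictor=5))  # 2 predictors under a looser threshold
```

Any less balanced split of the 27 cases lowers the limiting n further, so 13 is the best case.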
        Steve Samuels



        • #5
          With a small number of events or non-events, exact logistic regression (exlogistic) should be a better option than the standard logistic programs.
          Steve Samuels



          • #6
            Thanks for the helpful input, Clyde and Steve! A linear regression isn't really an option, as not all of my independent variables have a strong linear relationship to my dependent variable. I'll try using the exact logistic regression and see how that works with my data. Thanks again!
