  • Strategy for analyzing 90 observations with a dichotomous dependent variable

    Dear statalisters,

    I am involved in a project analyzing the effect of Environmental Impact Assessments (EIA) on wind-power plant applications being granted or not, the expectation being that a higher score (reflecting a more negative EIA) decreases the chances of an application being granted. We have a dataset of 86 observations with a dichotomous dependent variable which has 32 0's and 60 1's, and we are using Stata SE 15.1. Since this is the whole population of applications for the relevant period (Norway 1999-2018), collecting more data is not an option. We simply aim to test the hypothesis that a worse EIA reduces the chances of a concession being granted. We will at least try to publish a paper on the dataset, which is quite innovative, but we also want to test our most basic hypothesis. What we ask for is your opinion on our approach. Any comments - critical or constructive - are very welcome.

    We have thought to do the following:

    1. Dichotomize the 8-category EIA variable so as to reduce the chance of perfect separation/sparse cells. The 2x2 table with the dependent variable looks like this (a quick way to inspect it in Stata is sketched right after the table):
                 EIA = 0   EIA = 1   Total
    Rejected          12        18      30
    Granted           47         9      56
    Total             59        27      86
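    A minimal sketch of how one might inspect this cross-tabulation and its cell counts directly in Stata (the variable names follow the code in point 4 below; the exact option adds Fisher's exact test, which can be useful with small cells):

    Code:
    *Sketch only: check the 2x2 table and cell counts before modelling
    tabulate conc_1 revKU_nat if included == 1, column exact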

    2. When we run regressions, never use more than two covariates in addition to the IV of interest, and be more wary than usual of separation issues, collinearity, and instability across specifications and when dropping cases. We have done some preliminary analysis on this (see the sketch below), and the estimate for the IV of interest changes little when introducing one control at a time.
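    A minimal sketch of that stability check, with control1 and control2 as hypothetical placeholder names for whatever covariates are used:

    Code:
    *Sketch: add one control at a time and watch the estimate for revKU_nat
    *(control1 and control2 are hypothetical placeholders)
    logistic conc_1 revKU_nat if included == 1
    logistic conc_1 revKU_nat control1 if included == 1
    logistic conc_1 revKU_nat control1 control2 if included == 1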

    3. Since no single model is optimal for such a small dataset, we have opted to use several estimators: logit, logit with robust std. err., rare events logit (King & Zeng's ReLogit), firthlogit, exact logit, and (mainly to probe some assumptions) the good old LPM. I trust the Firth logit and exact logit the most, but from what I understand these have different strengths and weaknesses: Firth logit has the most reliable point estimate, whereas its standard error can be misleading, and, if I got it right, the converse holds for exact logit. From what I can judge (see below), our preliminary analyses show little difference in the magnitudes and standard errors across models (the exception being OLS, which is on a different scale and of course not in odds ratios). However, I have some qualms as to how reliable any model would be in such a small sample and how much can be done, in particular when it comes to judging substantive impact.

    4. Here's our code
    Code:
    *Model 1. Logistic
    logistic conc_1 revKU_nat if included == 1
    *Model 2. Logistic with robust std. err.
    logistic conc_1 revKU_nat if included == 1, robust
    *Model 3. ReLogit
    relogit conc_1 revKU_nat if included == 1
    *Model 4. OLS
    reg conc_1 revKU_nat if included == 1
    *Model 5. Firth logit
    firthlogit conc_1 revKU_nat if included == 1, or
    *Obtaining a reliable significance value for the coefficient of interest
    *a la Heinze and Schemper (2002): a likelihood-ratio test of the full model
    *against a nested model that constrains the variable of interest to zero
    estimates store Full
    constraint 1 revKU_nat = 0
    firthlogit conc_1 revKU_nat if included == 1, constraint(1)
    estimates store Constrained
    lrtest Full Constrained
    *lr-test: test statistic/p-value = 16.92/0.0000
    *Model 6. Exact logit
    exlogistic conc_1 revKU_nat if included == 1, memory(2g) test(prob)
    And the results (given in odds ratios except for Model 4):
                    Model 1     Model 2            Model 3     Model 4      Model 5        Model 6
                    Logistic    Logistic (robust)  ReLogit     OLS          Firth logit    Exact logit
    EIA             0.128***    0.128***           0.135***    -0.463***    0.135***       0.132***
                    (0.0665)    (0.0669)           (0.0693)    (0.1000)     (0.0690)       NA (see test stat.)
    Constant        3.917***    3.917***           3.797***    0.797***     3.800***       NA
                    (1.267)     (1.274)            (1.207)     (0.0560)     (1.208)        NA
    Observations    86          86                 86          86           86             86
    prob-test                                                                              0.000041/0.0001
    lr-test                                                                 16.92/0.0000
    R-squared                                                  0.204
    Standard errors in parentheses; *** p<0.01, ** p<0.05, * p<0.1

    Again, thanks for your comments!
    Best,
    Ole Magnus Theisen

  • #2
    You didn't get a quick answer. Your question is long and complex. You'll increase your chances of a helpful answer by following the FAQ on asking questions and by trying to ask one thing at a time.

    First, you don't tell us what EIA is although you talk about better or worse EIA. Is EIA metric or ordinal or something else? If EIA is metric, then stay with one variable. Indeed, many folks use continuous x methods when they have ordinal x's. If EIA is ordinal, then you might do the dummy approach but that adds a bunch of parameters. As I said, many use continuous x methods when they have an ordinal x.
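    To illustrate the two options, a minimal sketch might be the following (EIA_score is a hypothetical name for the underlying ordinal variable; factor-variable notation requires nonnegative integer codes):

    Code:
    *Sketch: the ordinal score treated as continuous vs. as category indicators
    logistic conc_1 c.EIA_score if included == 1
    logistic conc_1 i.EIA_score if included == 1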

    I don't see that you have much of a problem. You've done this many ways and get very similar results (.128 looks a lot like .132 to me). Normally, undersized samples just mean you don't get statistical significance (until you get very very small samples). Even with logits adapted for small samples, you still get the same results. I'd just report them all and say the results are robust.
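    If each model is stored after fitting, one way to report them side by side might be a sketch along these lines (the stored names are placeholders; eform puts the logit-type models on the odds-ratio scale):

    Code:
    *Sketch: store fitted models and display them together
    logistic conc_1 revKU_nat if included == 1
    estimates store m_logit
    firthlogit conc_1 revKU_nat if included == 1
    estimates store m_firth
    estimates table m_logit m_firth, eform se stats(N) b(%9.3f)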

    • #3
      In reply to Phil Bromley (#2):

      1. Thanks for pointing out that I failed to explain what EIA means. EIA is an abbreviation for Environmental Impact Assessment which is required in any power plant application in Norway (it is more or less standardized within the whole EU+-area). I guess I fell into the jargon trap.

      2. Thanks also for asking about its operationalization and suggesting testing it as continuous. The EIA variable originally takes values from 0 (no or insignificant impact of the plant being built compared to the present situation) to 4 (a very big negative impact of the plant being built), with half-scores included, so that it runs 0, 0.5, 1, 1.5, and so on up to 4. We did test it as it stood, but that led to sparse cells and extremely high values for both the point estimate and the standard error of the constant term, while the variable of interest behaved quite as expected. That operationalization also led to larger differences between the estimators than in the models shown above, possibly indicating instability. We therefore settled on recoding it into a dummy taking the value 0 for scores ranging from 0 (no or insignificant changes) up to and including 2 (medium negative impact), and the value 1 for scores of 2.5 (medium to large negative impact) and higher (4 being a very big negative impact). This is the operationalization used in the models shown above. We also tested the model with the cutoff between 1.5 and 2, with similar but weaker results (still significant). I think we have to settle for the dichotomization approach instead of coding each value as a dummy, since we quickly run into a sparse-cells problem. Instead we will have to conduct robustness checks moving the cutoff up and down, as we have already done to some extent (a sketch is shown below).
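      A sketch of that cutoff robustness check, assuming the original 0-4 half-point score is stored in a variable called EIA_raw (a hypothetical name):

      Code:
      *Sketch: dichotomize the half-point score at the two cutoffs described above
      gen byte revKU_nat25 = (EIA_raw >= 2.5) if !missing(EIA_raw)   // main cutoff
      gen byte revKU_nat20 = (EIA_raw >= 2) if !missing(EIA_raw)     // alternative cutoff
      logistic conc_1 revKU_nat25 if included == 1
      logistic conc_1 revKU_nat20 if included == 1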

      3. Thanks for suggesting reading through the FAQs. I did, but I guess I was a bit blind to my own question. I will try to break it up into more digestible pieces.

      4. I am relieved to hear that you don't see any other fundamental challenges than those I was already aware of.

      Best,
      Ole Magnus Theisen

