  • Strategy for analyzing 90 observations with a dichotomous dependent variable

    Dear statalisters,

    I am involved in a project analyzing the effect of Environmental Impact Assessments (EIA) on wind-power plant applications being granted or not, the expectation being that a higher score (reflecting a more negative EIA) decreases the chances of an application being granted. We have a dataset of 86 observations with a dichotomous dependent variable which has 32 0's and 60 1's, and we are using Stata SE 15.1. Since this is the whole population of applications for the relevant period (Norway 1999-2018), collecting more data is not an option. We simply aim to test the hypothesis that a worse EIA reduces the chances of a concession being granted. We will at least try to publish a paper on the dataset, which is quite innovative, but we also want to test our most basic hypothesis. What we ask for is your opinion on our approach. Any comments - critical or constructive - are very welcome.

    We have thought to do the following:

    1. Dichotomize the 8-category EIA variable so as to reduce the chance of perfect separation/sparse cells. The 2x2 table with the dependent variable looks like this (a quick way to inspect it in Stata is sketched right after the table):
                 EIA = 0   EIA = 1   Total
    Rejected          12        18      30
    Granted           47         9      56
    Total             59        27      86
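    A minimal sketch of how one might inspect this cross-tabulation and its cell counts directly in Stata (the variable names follow the code in point 4 below; the exact option adds Fisher's exact test, which can be useful with small cells):

    Code:
    *Sketch only: check the 2x2 table and cell counts before modelling
    tabulate conc_1 revKU_nat if included == 1, column exact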

    2. When we run regressions, never use more than two covariates in addition to the IV of interest, and be more wary than usual of separation issues, collinearity, and instability across specifications and when dropping cases. We have done some preliminary analysis on this (see the sketch below), and the estimate for the IV of interest changes little when introducing one control at a time.
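    A minimal sketch of that stability check, with control1 and control2 as hypothetical placeholder names for whatever covariates are used:

    Code:
    *Sketch: add one control at a time and watch the estimate for revKU_nat
    *(control1 and control2 are hypothetical placeholders)
    logistic conc_1 revKU_nat if included == 1
    logistic conc_1 revKU_nat control1 if included == 1
    logistic conc_1 revKU_nat control1 control2 if included == 1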

    3. Since no single model is optimal for such a small dataset, we have opted to use several estimators: logit, logit with robust std. err., rare events logit (King & Zeng's ReLogit), firthlogit, exact logit, and (mainly to probe some assumptions) the good old LPM. I trust the Firth logit and exact logit the most, but from what I understand these have different strengths and weaknesses: Firth logit has the most reliable point estimate, whereas its standard error can be misleading, and, if I got it right, the converse holds for exact logit. From what I can judge (see below), our preliminary analyses show little difference in the magnitudes and standard errors across models (the exception being OLS, which is on a different scale and of course not in odds ratios). However, I have some qualms as to how reliable any model would be in such a small sample and how much can be done, in particular when it comes to judging substantive impact.

    4. Here's our code
    Code:
    *Model 1. Logistic
    logistic conc_1 revKU_nat if included == 1
    *Model 2. Logistic with robust std. err.
    logistic conc_1 revKU_nat if included == 1, robust
    *Model 3. ReLogit
    relogit conc_1 revKU_nat if included == 1
    *Model 4. OLS
    reg conc_1 revKU_nat if included == 1
    *Model 5. Firth logit
    firthlogit conc_1 revKU_nat if included == 1, or
    *Obtaining a reliable significance value for the coefficient of interest
    *a la Heinze and Schemper (2002): a likelihood-ratio test of the full model
    *against a nested model that constrains the variable of interest to zero
    estimates store Full
    constraint 1 revKU_nat = 0
    firthlogit conc_1 revKU_nat if included == 1, constraint(1)
    estimates store Constrained
    lrtest Full Constrained
    *lr-test: test statistic/p-value = 16.92/0.0000
    *Model 6. Exact logit
    exlogistic conc_1 revKU_nat if included == 1, memory(2g) test(prob)
    And the results (given in odds ratios except for Model 4):
                    Model 1     Model 2            Model 3     Model 4      Model 5        Model 6
                    Logistic    Logistic (robust)  ReLogit     OLS          Firth logit    Exact logit
    EIA             0.128***    0.128***           0.135***    -0.463***    0.135***       0.132***
                    (0.0665)    (0.0669)           (0.0693)    (0.1000)     (0.0690)       NA (see test stat.)
    Constant        3.917***    3.917***           3.797***    0.797***     3.800***       NA
                    (1.267)     (1.274)            (1.207)     (0.0560)     (1.208)        NA
    Observations    86          86                 86          86           86             86
    prob-test                                                                              0.000041/0.0001
    lr-test                                                                 16.92/0.0000
    R-squared                                                  0.204
    Standard errors in parentheses; *** p<0.01, ** p<0.05, * p<0.1

    Again, thanks for your comments!
    Best,
    Ole Magnus Theisen

  • #2
    You didn't get a quick answer. Your question is long and complex. You'll increase your chances of a helpful answer by following the FAQ on asking questions and by trying to ask one thing at a time.

    First, you don't tell us what EIA is although you talk about better or worse EIA. Is EIA metric or ordinal or something else? If EIA is metric, then stay with one variable. Indeed, many folks use continuous x methods when they have ordinal x's. If EIA is ordinal, then you might do the dummy approach but that adds a bunch of parameters. As I said, many use continuous x methods when they have an ordinal x.
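    To illustrate the two options, a minimal sketch might be the following (EIA_score is a hypothetical name for the underlying ordinal variable; factor-variable notation requires nonnegative integer codes):

    Code:
    *Sketch: the ordinal score treated as continuous vs. as category indicators
    logistic conc_1 c.EIA_score if included == 1
    logistic conc_1 i.EIA_score if included == 1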

    I don't see that you have much of a problem. You've done this many ways and get very similar results (.128 looks a lot like .132 to me). Normally, undersized samples just mean you don't get statistical significance (until you get very very small samples). Even with logits adapted for small samples, you still get the same results. I'd just report them all and say the results are robust.
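    If each model is stored after fitting, one way to report them side by side might be a sketch along these lines (the stored names are placeholders; eform puts the logit-type models on the odds-ratio scale):

    Code:
    *Sketch: store fitted models and display them together
    logistic conc_1 revKU_nat if included == 1
    estimates store m_logit
    firthlogit conc_1 revKU_nat if included == 1
    estimates store m_firth
    estimates table m_logit m_firth, eform se stats(N) b(%9.3f)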

    • #3
      In reply to Phil Bromley (#2):

      1. Thanks for pointing out that I failed to explain what EIA means. EIA is an abbreviation for Environmental Impact Assessment which is required in any power plant application in Norway (it is more or less standardized within the whole EU+-area). I guess I fell into the jargon trap.

      2. Thanks also for asking about its operationalization and suggesting testing it as continuous. The EIA variable originally takes values from 0 (no or insignificant impact of the plant being built compared to the present situation) to 4 (a very big negative impact of the plant being built), with half-scores included, so that it runs 0, 0.5, 1, 1.5, and so on up to 4. We did test it as it stood, but that led to sparse cells and extremely high values for both the point estimate and the standard error of the constant term, while the variable of interest behaved quite as expected. That operationalization also led to larger differences between the estimators than in the models shown above, possibly indicating instability. We therefore settled on recoding it into a dummy taking the value 0 for scores ranging from 0 (no or insignificant changes) up to and including 2 (medium negative impact), and the value 1 for scores of 2.5 (medium to large negative impact) and higher (4 being a very big negative impact). This is the operationalization used in the models shown above. We also tested the model with the cutoff between 1.5 and 2, with similar but weaker results (still significant). I think we have to settle for the dichotomization approach instead of coding each value as a dummy, since we quickly run into a sparse-cells problem. Instead we will have to conduct robustness checks moving the cutoff up and down, as we have already done to some extent (a sketch is shown below).
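      A sketch of that cutoff robustness check, assuming the original 0-4 half-point score is stored in a variable called EIA_raw (a hypothetical name):

      Code:
      *Sketch: dichotomize the half-point score at the two cutoffs described above
      gen byte revKU_nat25 = (EIA_raw >= 2.5) if !missing(EIA_raw)   // main cutoff
      gen byte revKU_nat20 = (EIA_raw >= 2) if !missing(EIA_raw)     // alternative cutoff
      logistic conc_1 revKU_nat25 if included == 1
      logistic conc_1 revKU_nat20 if included == 1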

      3. Thanks for suggesting reading through the FAQs. I did, but I guess I was a bit blind to my own question. I will try to break it up into more digestible pieces.

      4. I am relieved to hear that you don't see any other fundamental challenges than those I was already aware of.

      Best,
      Ole Magnus Theisen

