Model selection for a binary time-invariant dependent variable in panel data / tackling heteroskedasticity at multiple levels

Zubai Khan

Join Date: Sep 2019

Posts: 16
#1

20 May 2020, 12:18

I am using a time-invariant binary DV with continuous IVs. The DV is whether a financial services company (a bank or any other) offers venture capital (VC) (dummy = 1) or not (0). A company that offers VC does so during the entire period covered by the data, and vice versa (i.e., the DV is time-invariant). All IVs are continuous and most of them are time-varying. I have more than 100,000 observations on more than 4,500 companies operating in 60+ countries (the usable observations fall to around 35,000 because not all variables have data for all groups and periods). IVs include company characteristics (such as return, size, debt) and country-level characteristics (e.g., GDP, R&D, financial development and so forth). I tried logistic regression, but the pseudo R2 is too low.

logit vc rdexpend lnintan lnass lnextdebt lnroa lndeps lnrgdpna lnemp irr lnxr lntax ef, vce(cluster pid)

Iteration 0:   log pseudolikelihood = -9214.4682
Iteration 1:   log pseudolikelihood = -8856.4005
Iteration 2:   log pseudolikelihood = -8790.3987
Iteration 3:   log pseudolikelihood = -8790.3133
Iteration 4:   log pseudolikelihood = -8790.3133

Logistic regression                             Number of obs     =     32,976
                                                Wald chi2(12)     =     107.77
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -8790.3133               Pseudo R2         =     0.0460

                                (Std. Err. adjusted for 4,497 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
          vc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rdexpend |   .1669236   .0852872     1.96   0.050    -.0002362    .3340833
     lnintan |   .2751894   .0486187     5.66   0.000     .1798985    .3704803
       lnass |  -.0208578    .026323    -0.79   0.428    -.0724499    .0307343
   lnextdebt |  -.3319592   .0498951    -6.65   0.000    -.4297518   -.2341666
       lnroa |   .0638105   .5441491     0.12   0.907    -1.002702    1.130323
      lndeps |   .2949721   .2057622     1.43   0.152    -.1083144    .6982586
    lnrgdpna |  -.1490657    .063379    -2.35   0.019    -.2732862   -.0248451
       lnemp |  -.7388525    .474698    -1.56   0.120    -1.669244    .1915385
         irr |  -1.309219   2.035959    -0.64   0.520    -5.299625    2.681188
        lnxr |   -.102139   .0455526    -2.24   0.025    -.1914204   -.0128576
       lntax |   .2445949   .3868812     0.63   0.527    -.5136783    1.002868
          ef |  -.0248256   .0088889    -2.79   0.005    -.0422476   -.0074036
       _cons |   1.184486   2.692227     0.44   0.660    -4.092182    6.461154
------------------------------------------------------------------------------

When I run simple pooled OLS, the R-squared is well below 0.1 even when I include a variety of variables. I am not sure whether I should use xtlogit, re or logit. I feel that the data require logit because the DV is time-invariant. When I run a "between" regression, the variables appear significant, as expected, because the DV is time-demeaned. I read in a book that if you are sure there are no individual effects in your data (or the usual OLS assumptions are not violated), you should use logit rather than xtlogit. When I run xtlogit, the results look strange (unexpectedly, the coefficients are not significant either).
Could anybody comment on the following?
1. Which model would be appropriate: logit or xtlogit, re?
2. Should I cluster on pid (the company code) or on cid (the country code) to deal with heteroskedasticity, or would vce(robust) alone be enough given that I already control for company size and country size? (Importantly, clustering on cid works for logit but does not work for xtreg, re.)
My code is:

logit vc rdexpend lnintan lnass lnextdebt lnroa lndeps lnrgdpna lnemp irr lnxr lntax ef ,vce(cluster pid)

The Stata file is attached.
I apologize if I have not explained things clearly enough.
Attached Files

abc.dta (19.14 MB, 1 view)
Tags: binary DV, logit, model selectio, multi-level hetero, panel data
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

21 May 2020, 14:25

You did not get a quick answer. You will increase your chances of a useful answer by following the FAQ on asking questions: provide Stata code in code delimiters, readable Stata output, and sample data using dataex. Please do not post files; many of us will not open files from people we don't know.
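As an illustration of what the FAQ asks for, a small data extract can be produced with dataex. This is only a sketch: it uses the variable names from post #1, and the choice of columns and the 20-row limit are arbitrary.

```stata
* On older Stata versions dataex may need installing first: ssc install dataex
* Print a small, copy-pasteable extract (columns and row range are illustrative)
dataex pid cid vc rdexpend lnintan lnass lnextdebt in 1/20
```

The output can be pasted into a post between code delimiters, so that others can recreate the example data without opening an attachment.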

Pseudo R-squared is not a criterion for deciding whether you need a logit estimator. The fact that pooled OLS also gives a low R-squared strongly suggests that your variables don't explain much of the variance in the dependent variable.
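For reference, the pseudo R2 that logit reports is McFadden's measure: one minus the ratio of the fitted log likelihood to the constant-only log likelihood (iteration 0 in the log). A quick check with the two values copied from the output in post #1:

```stata
* McFadden pseudo R2 = 1 - ll(fitted model) / ll(constant-only model)
* Both log pseudolikelihoods are taken from the iteration log in post #1
display 1 - (-8790.3133 / -9214.4682)
```

This comes to roughly 0.046, in line with the reported Pseudo R2 of 0.0460. It describes the improvement over an intercept-only model, not whether logit is the right estimator.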

It's not clear to me what you mean about the between regression, since normal panel estimators don't.

Whether logit or xtlogit is appropriate depends a lot on your problem. It sounds like you have panel data, so I would lean toward a panel estimator. However, if the dependent variable does not vary within panels, then you should probably use a between estimator. You don't get the nice econometric properties of the within estimator, but the within estimator cannot use the between-panel variation in the dependent variable.
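One way to act on the between-estimator suggestion is sketched below, using the variable names from post #1. It rests on assumptions that would need checking: that vc really is constant within pid, that each company belongs to a single country (so cid is constant within pid), and that company-level averages of the regressors are a sensible summary. The collapse-then-logit step is an illustrative alternative, not something proposed in this thread.

```stata
* Run from a do-file (the /// line continuation does not work interactively)

* Check the premise: vc must not vary within a company.
* The assert stops with an error if it does, or if vc is missing in some rows.
bysort pid (vc): assert vc[1] == vc[_N]

* Between estimator: a linear probability model on company means
xtset pid
xtreg vc rdexpend lnintan lnass lnextdebt lnroa lndeps lnrgdpna lnemp irr lnxr lntax ef, be

* Illustrative alternative: one row per company, then a cross-sectional logit
preserve
collapse (mean) rdexpend lnintan lnass lnextdebt lnroa lndeps lnrgdpna lnemp irr lnxr lntax ef ///
    (first) vc cid, by(pid)
* Assumption: standard errors clustered by country, since companies are nested in countries
logit vc rdexpend lnintan lnass lnextdebt lnroa lndeps lnrgdpna lnemp irr lnxr lntax ef, vce(cluster cid)
restore
```

The two fits need not use the same sample: collapse averages each variable over whatever observations are available, while xtreg, be works from observations that are complete on all variables.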
Zubai Khan

Join Date: Sep 2019

Posts: 16
#3

21 May 2020, 19:54

Thank you for your advice. Every time I post, I go to the FAQ, but somehow I have not yet gotten a grasp of the things expected on the forum, like code delimiters, readable Stata output, and sample data using dataex. Your comments are useful. I am also reading the relevant chapters of econometrics books; once I understand more, I will come back with questions and possibly my own answers. I will consider your opinion. The point is that the between estimation also gives a low R-squared, and its estimates are close to those from logit.
