Heckman Selection Model Error - Urgently need help for my dissertation in Venture Capital/ Entrepreneurship

Max Schmitz

Join Date: Aug 2015

Posts: 13
#1

01 Aug 2015, 02:02

Hey guys,

I am currently writing my dissertation in the field of Venture Capital and Entrepreneurship and got stuck on a robustness check (Heckman approach): an error in Stata prevents me from proceeding. I read a lot of forum entries, but none of them really addressed my problem:

1. Background information:
There has been research on investor types such as Venture Capital Firms (VC), Corporate Venture Capital Firms (CVC) and Business Angels (BA), showing that those investors add value to startups beyond their financial investment, e.g. access to networks, professionalisation, expertise, experience etc.

Since start-ups go through a lifecycle divided into 3 stages (Seed, Expansion and Later Stage), I was wondering whether certain investor types (VC, CVC or BA) are better at certain stages.

Therefore these are the three hypotheses I would like to test:
Hypothesis 1. A significant investment of a Business Angel in the seeding stage increases the likelihood of survival for start-ups.

Hypothesis 2. A significant investment of a Corporate Venture Capital firm in the second stage increases the likelihood of survival for start-ups.

Hypothesis 3. A significant investment of a Venture Capital firm in the third stage increases the likelihood of survival for start-ups.

Start-Up survival can be measured by looking at the binary variable: success (company status was active, acquired, went IPO) and failure (company status was defunct, bankruptcy)

Other variables are defined in the attached excel data sets that are related to start-up characteristics (size, age, industry) and investor characteristics (size, age, type)

2. Methodology employed:

Probit regression is used, since the dependent variable will be success or failure (1 or 0);

Independent variables: investor type (dummies for BA, VC and CVC), venture characteristics (size, age, industry), investor characteristics (size, age, type) and time (date of investment)

3. Selection bias:
VC, BA or CVC investor types can increase the likelihood of survival due to value adding activities or it could be that VCs just screen the market better and invest only in more successful startups (endogeneity, selection bias)

How I want to employ Heckman selection approach:
Outcome equation: dependent variable = success of startup (binary) and independent variables (BA, VC or CVC dummies; and control variables)

Selection equation: dependent variable = BA dummy (binary) and independent variables (control such as venture characteristics and investor characteristics)

4. Stata command used and error type:
Stata command: heckman performance_outstanding_exit BA_dummy time2_dummy1 time2_dummy2 time2_dummy3 time2_dummy4 time2_dummy5 time2_dummy6 time2_dummy7 time2_dummy8 lninvage lninvsize lncompage, select(BA_dummy = VC_dummy CVC_dummy lninvage lninvsize lncompage time2_dummy1 time2_dummy2 time2_dummy3 time2_dummy4 time2_dummy5 time2_dummy6 time2_dummy7 time2_dummy8) twostep

Error: note: BA_dummy omitted because of collinearity

This error also occurs when I use the VC dummy or the CVC dummy instead of BA_dummy. If I run a plain probit regression with all the variables of the outcome equation, including BA_dummy, there is no warning about collinearity.

My guess is that the collinearity is there because Stata does something with the outcome and selection equations. But my knowledge about probit, Heckman etc. is quite limited. I am at the end of my BSc, my statistics knowledge is quite limited, and I hope to get some help from you guys who have expertise in the field.

Thank you very much in advance,

Max
Tags: None
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#2

01 Aug 2015, 07:17

In the Heckman model you cannot use the dependent variable of the selection equation as an explanatory variable in the outcome equation, because there won't be any variation in it. Recall that in the Heckman model performance is only treated as observed when the selection indicator equals 1. The equation for performance therefore only uses observations where BA_dummy = 1, so BA_dummy is constant in that sample and is dropped as collinear with the constant. Bottom line: you cannot identify the effect of BA with this model.
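A minimal Stata sketch of why the note appears, added for illustration. It assumes the dataset from post #1 is in memory and reuses the variable names from that command. Restricting the sample to the selected observations shows that BA_dummy has no variation there, so a regression on that sample drops it in the same way:

```stata
* Assumption: the data from post #1 are loaded.
* heckman treats the outcome as observed only where the selection
* variable equals 1, i.e. where BA_dummy == 1 in that command.
count if BA_dummy == 1
summarize BA_dummy if BA_dummy == 1     // standard deviation is 0

* A regression on the selected sample omits BA_dummy for the same reason
regress performance_outstanding_exit BA_dummy lninvage lninvsize lncompage if BA_dummy == 1
```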
Comment
Max Schmitz

Join Date: Aug 2015

Posts: 13
#3

01 Aug 2015, 13:54

Hey Christophe,

First of all, thanks for your fast reply. I desperately tried to find a way today to test for selection bias or endogeneity. Another question for you: if Heckman doesn't work, do you have another suggestion for which models or tests I could use to test for selection bias or endogeneity? Especially in my model, performance could be enhanced by the investor type (BA, VC or CVC), or one of the types may simply pick better startups, so that ventures funded by a VC, for example, perform better.

Since this model is the heart of my thesis, it would be more than helpful if you have any additional advice for me.

Best,

Max
Comment
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#4

02 Aug 2015, 02:33

First of all, the question you are addressing seems quite complex and I am not sure I have understood every element of the setup. You write that there are 3 stages in the life-cycle. How does that relate to the definition of your dependent variable performance_outstanding_exit? Why do you include VC_dummy and CVC_dummy in the BA_dummy equation (and not in the performance equation)? Are VC or CVC an alternative to BA, in other words are they mutually exclusive, in which case one could interpret them as different treatments?

Now if we abstract from VC and CVC and want to test endogeneity of BA with respect to performance, I think that a biprobit would be more appropriate. But to identify the effect of BA on performance and test for its endogeneity you will need an instrument, i.e. a variable that, conditional on the other (exogenous) variables of the model, affects BA but not performance. Actually an instrument is also required in the Heckman model in order to identify the model without relying on the normality assumption. You can test for endogeneity of BA by using a control function approach, that is, constructing generalized residuals from the BA equation and running a probit for performance on BA, the exogenous variables and the residuals. If the estimated parameter for the residuals is statistically different from zero, that is evidence that BA is endogenous. If you estimate the full model, then a non-zero correlation between the error terms of the two equations would also be evidence of endogeneity. This approach is described in Wooldridge, J., Econometric Analysis of Cross Section and Panel Data, 2nd Edition, 2010. Look at the chapter on binary outcomes.

If you want to see if there is selection on observables, you could compare the different covariates measured before treatment for those who have chosen BA with those who have chosen VC or CVC and see if there are any significant differences. You could read Wooldridge's chapter on treatment effects.
Comment
Max Schmitz

Join Date: Aug 2015

Posts: 13
#5

02 Aug 2015, 10:25

Thanks again Christophe, very helpful. I still have some questions, since my statistics knowledge is very limited (only the basics).

Regarding your first point, here a few more points on the setup:

My data contains all PE deals from the Thomson One Banker databank from 1995 - 2012. I have basically three sets of data: the first set includes all deals in the seed stage (first stage of a new venture getting funding), second set includes all deals in the expansion stage (second stage) and the third one all deals in the later stage (third stage of a startup getting funding).

Then I included only deals that were funded by a BA, VC, CVC or two or three of them at the same time (so they are not mutually exclusive). The heckman screenshot I attached is just a trial and error, I had VC and CVC dummy in the performance equation before. I just tried to get rid of the omitted error.

Then I try to test my hypothesis that BA funding in the first stage increases the likelihood of survival compared to VC or CVC (that's why I included the three investor type dummies). I did the same with the two other hypotheses, saying that CVC funding in the second stage increases the likelihood compared to other kinds of funding, and VC funding in the third stage increases the likelihood compared to the other two.

If the BA dummy is significant in the first probit regression and the VC and CVC dummies are not, that does not mean that BA is better than the other two types, since it could be due to selection bias or endogeneity. Therefore I wanted to include some robustness checks, as other papers did before.

1. Model: first stage venture data

IPO of venture (1 or 0) = BA dummy + VC dummy + CVC dummy + Time dummies for each year ('95-'12) + ln(size of Investor) + ln(age of investor) + ln(age of venture) + ln(total funding to date) + total number of investors

--> In order to find evidence for my first hypothesis the BA dummy should show significance and the other two none

2. Model: Second stage venture data

IPO of venture (1 or 0) = BA dummy + VC dummy + CVC dummy + Time dummies for each year ('95-'12) + ln(size of Investor) + ln(age of investor) + ln(age of venture) + ln(total funding to date) + total number of investors

--> In order to find evidence for my second hypothesis the CVC dummy should show significance and the other two none

3. Model: Third stage venture data

IPO of venture (1 or 0) = BA dummy + VC dummy + CVC dummy + Time dummies for each year ('95-'12) + ln(size of Investor) + ln(age of investor) + ln(age of venture) + ln(total funding to date) + total number of investors

--> In order to find evidence for my third hypothesis the VC dummy should show significance and the other two none

Regarding the second point you addressed:

A few questions from my side again (sorry, but I am very thankful for your advices, I just need some additional information on specifics to really implement your suggestions):

1. Instrumental variable:
I created an instrumental variable (local availability of BA funding compared to other deals). It is correlated with the endogenous variable BA dummy (whether or not a venture gets BA funding) but uncorrelated with the performance of a venture.

My question now is: should I just run the probit excluding the BA dummy and including the local availability variable for BA? What about the other two dummies, VC and CVC? And if the instrumental variable is significant, what does it tell me about the endogeneity of the BA dummy?

2. Control function approach (in case the instrumental variable is not strong enough):
Could you be more specific on the steps I need to do one by one? I couldn't really follow which two equations I need to set up and compare, and how to test the significance of the residuals.

3. Selection on observables:
Do you mean selection bias? I read the part in the pdf of Wooldridge but it is not 100% clear how I can test this for my model in specific. Would be super nice if you could be more specific on this as well.

Again, thank you very much; I am happy for every piece of advice you have for me. None of the students in my study group has worked with probit regressions yet or done any kind of robustness checks regarding endogeneity and selection bias.

Max
Comment
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#6

04 Aug 2015, 08:15

1. You should first run a regression of BA on your instrument and the other exogenous variables to see if your instrument is strongly correlated with BA. If it is, and you (or your referee) are convinced that it is uncorrelated with performance, then you can use it as an instrument in an instrumental variable estimator. In the linear case you would use 2SLS.
As far as I understand, the three variables BA, VC and CVC are all potentially endogenous, so you need at least 3 instruments. The significance of the instrument does not tell you anything about endogeneity.
You postulate that BA is endogenous and therefore need an instrument both to test it and to obtain consistent estimates.
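The relevance check in point 1 could be sketched in Stata as follows. This is only an illustration: `instrument` is a placeholder for the local-availability variable (its full name is not given in the thread), and the control list is taken from the command in post #1.

```stata
* Assumptions: data from post #1 in memory; "instrument" is a placeholder name.
local controls lninvage lninvsize lncompage time2_dummy*

* Linear first stage: F test on the excluded instrument
regress BA_dummy instrument `controls', vce(robust)
test instrument

* The same check with a probit first stage
probit BA_dummy instrument `controls'
test instrument
```

A large test statistic on `instrument` speaks to relevance only; whether the instrument is uncorrelated with performance still has to be argued, not tested, when there is exactly one instrument per endogenous variable.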

2. If the model were completely linear, you could estimate it by IV (2SLS).
In the linear case the control function is actually equivalent to the IV estimator, but it offers the possibility to test the endogeneity.

y1 = a1*y2 + x*b1 + e1 (y2 is the potentially endogenous variable)
y2 = z*b21 + x*b22 + e2 (z is your instrument, x the other exogenous variables)

You perform the test by following these two steps:
1. Regress y2 on z and x, and compute the residuals u2.
2. Regress y1 on y2, x and u2.

If the estimated parameter for u2 is statistically different from 0, then it is evidence that y2 is endogenous.
Your model, however, is non-linear, which is why I suggested the solution proposed in Wooldridge; it is very similar to the procedure sketched above.
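For the linear case, the two steps can be tried out on simulated data. The sketch below is an illustration with made-up numbers, not taken from the thread: `c` is an unobserved factor that enters both equations, so y2 is endogenous by construction and the coefficient on u2 should be clearly different from zero.

```stata
clear
set seed 12345
set obs 2000
gen double x  = rnormal()
gen double z  = rnormal()                      // instrument
gen double c  = rnormal()                      // unobserved common factor
gen double y2 = 0.8*z + 0.5*x + c + rnormal()
gen double y1 = 1.0*y2 + 0.5*x + c + rnormal()

* Step 1: first stage and residuals
regress y2 z x
predict double u2, residuals

* Step 2: outcome equation with the residuals added
regress y1 y2 x u2
test u2                                        // H0: y2 is exogenous

* Built-in counterpart: 2SLS followed by the endogeneity tests
ivregress 2sls y1 x (y2 = z)
estat endogenous
```

The coefficient on y2 in step 2 equals the 2SLS estimate, but the step 2 standard errors ignore that u2 is itself estimated, so `ivregress` is the safer source for inference on y2.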

Your model at the seed stage with BA, VC and CVC potentially endogenous

P = 1(a1 * BA + a2*VC + a3*CVC + X*b1 + e_p > 0 )
BA = 1( Z*b2 + e_ba > 0)
VC = 1( Z*b3 + e_vc > 0)
CVC = 1( Z*b4 + e_cvc > 0)

Z =(W,X)

W = the set of instruments for BA, VC and CVC (at least three instruments, one for each variable)
X = other exogenous variables.

You could estimate the equation for each potentially endogenous variable separately, construct the residuals and run the first equation with those
residuals as additional explanatory variables.

Unless I have misunderstood the set-up, I do not see why you chose the Heckman model for your problem, so I think you should be more thorough in explaining that choice. The Heckman model solves the problem that the dependent variable is not observed for part of the sample and that this selection process is not random. But I don't think that is the case here.
You do not have a case where you don't observe the survival of a firm if it has no BA.
That is, survival can also be observed if BA = 0.

IV methods try to handle selection on UNobservables, that is, common unobserved factors which affect both the outcome and the treatment variable.
This results in a violation of the linear regression assumptions, which invalidates any causal interpretation of an OLS estimate, for example.
This is what you postulate in your project. Maybe you should be more explicit about what you mean by endogeneity and selection in your project.
Where does the endogeneity come from? What exactly do you mean by selection? Selection can have different meanings and can result in different models depending on the selection mechanism. It seems that you interpret endogeneity and selection to be the same thing. If you haven't done so, try to read the chapter on IVs from Mostly Harmless Econometrics by Angrist and Pischke.

3. With selection on observables we think of a model where, by conditioning on a set of variables, we can postulate that the treatment (endogenous) variable is unrelated to the unobserved determinants of the outcome. If there is some selection on observable characteristics, the different groups of firms should differ on these observables.
Therefore, to see if there is selection on observables, you can for example see how the means of the observed variables differ across the different groups. But that does not tell you whether there is selection on unobservables.
But I don't think you should worry too much about that.
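A short Stata sketch of the comparison of means described in point 3, assuming the dataset from post #1 is in memory and using three of its covariates as examples:

```stata
* Assumption: data from post #1 in memory.
* Group means, standard deviations and counts by BA funding
tabstat lninvage lninvsize lncompage, by(BA_dummy) statistics(mean sd n)

* Two-sample t test of each covariate across the two groups
foreach v of varlist lninvage lninvsize lncompage {
    ttest `v', by(BA_dummy)
}
```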
Comment
Max Schmitz

Join Date: Aug 2015

Posts: 13
#7

05 Aug 2015, 03:41

Christophe,

thanks again for your detailed instructions. I will definitely try it out right now. I also sent you a pm via this forum. Don't know whether you received it or not. I just said thank you again.

Since my deadline is very soon and in case I am allowed to ask you follow up questions that might pop up after I did all your steps, email might be a faster way to communicate. So here is mine: [email protected] .

Honestly, I am already happy with the feedback you gave me and don't want to further spam you with questions, but I think it would be sad if minor problems pop up which prevents me from implementing the whole procedure we just discussed now. It is the first time for me that I am trying a more sophisticated statistical approach and it is fun and really interesting.

In case it is okay to further keep in touch, would be nice if you can send me an email, that I can reply to you in case I need to.

Anyways, thank you so much.

Best,

Max
Comment
Max Schmitz

Join Date: Aug 2015

Posts: 13
#8

11 Aug 2015, 03:17

Hey Christophe,

I am almost there. I just followed your instructions and have two remaining questions:

1) How do I generate residuals in STATA after a probit? I googled certain commands, but none of them seem to work (e.g. glm, predict etc.). I attached a screenshot of the error I get when trying to compute the residuals; I think the "predict residual" command is wrong, since I can write any name instead of residual with the same result.

2) Whether the approach is correct in STATA. I attached screenshots of your suggested regressions here as well. The final step you said is to include the residual variables in the outcome equation, should I include all three at the same time? And replace then the dummies I guess?

Background information: In STATA the variable "local_avail..." is always the instrument I generated, "performance_outstanding" is my dependent binary one

As a last question: you said "if the model was linear", but I thought I am using a probit instead of an OLS regression precisely because what I am doing is not linear and the dependent variable is binary? Again, I am a bit confused.

Again, thank you so much for helping me out!

best,

Max

Attached Files

Screenshots-Max-S.docx (608.9 KB, 1 view)
Comment
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#9

11 Aug 2015, 05:06

I will first advise you to read chapter 15 of Wooldridge (cited in a previous post) and particularly (and carefully) section 15.7.3.

1) +2) You can run a regression like this (here is an example with only one endogenous variable)

Code:

* first stage: probit for the potentially endogenous dummy
probit BA_dummy instrument exog_variables
predict index, xb
* residual: observed dummy minus its predicted probability
gen double resid = BA_dummy - normal(index)
* outcome probit with the residual added as a regressor
probit performance BA_dummy resid exog_variables

You can test the exogeneity of BA_dummy by testing whether the parameter for resid is statistically different from zero. If you can reject this hypothesis you can estimate the model by bivariate probit. Now, reading Wooldridge again, I can see that with this procedure you can test the exogeneity of BA_dummy but will not get consistent estimates. So you have to estimate the model with a bivariate probit (biprobit).

Code:

biprobit (performance = BA_dummy exog_variables) (BA_dummy=instrument exog_variables)

Now you have 3 potentially endogenous variables. To test the exogeneity of these variables, get residuals for each endogenous variable as explained before, run a final regression for performance with the endogenous variables AND the computed residuals, and test whether the parameters for the residuals are statistically different from zero. If more than one variable is endogenous, it will be difficult to get consistent estimates, since the full model becomes quite complicated (tri- or quadrivariate probit). One solution could be to ignore the non-linearity of the model and estimate it by 2SLS.
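With three potentially endogenous dummies, the test could be sketched like this. The instrument names `iv_ba`, `iv_vc` and `iv_cvc` are hypothetical placeholders (the thread only mentions an instrument for BA), `performance` stands for the binary outcome, and the residual is built the same way as in the one-variable code above.

```stata
* Assumptions: data in memory; iv_ba, iv_vc, iv_cvc are placeholder names.
local controls lninvage lninvsize lncompage time2_dummy*

* One first-stage probit per endogenous dummy, each with all instruments
foreach v in BA_dummy VC_dummy CVC_dummy {
    probit `v' iv_ba iv_vc iv_cvc `controls'
    predict double xb_`v', xb
    gen double res_`v' = `v' - normal(xb_`v')
}

* Outcome probit with the dummies AND their residuals
probit performance BA_dummy VC_dummy CVC_dummy ///
    res_BA_dummy res_VC_dummy res_CVC_dummy `controls'

* Joint test that all three residual coefficients are zero
test res_BA_dummy res_VC_dummy res_CVC_dummy
```

As noted above, this only tests exogeneity; the coefficients on the dummies from this probit should not be read as consistent estimates.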

Regarding your last question I took the linear case as an analogy on how you test the exogeneity of a variable.
Comment
Max Schmitz

Join Date: Aug 2015

Posts: 13
#10

11 Aug 2015, 05:46

Thanks again. Will read it again more thoroughly. Just scanned it quickly after your first post.

I did the biprobit regression just as a first trial. How can I see whether there is endogeneity or not? I have never worked with the output of a biprobit. Attached are the screenshots of my biprobit.

In addition, I asked a former professor of mine about testing my 3 hypotheses. He mentioned that a significant coefficient for the BA dummy together with two insignificant coefficients for the other two dummies in the first model (ventures in stage 1) does not show that H1 is true, i.e. that BA is the best investor type at stage 1 of a venture compared to other types of investors.

By saying this I thought of changing my model a bit and making it more simple, only including the BA dummy in Model 1 and not all three investor dummies. If the coefficient is significant and positive, I can say that BA has a positive impact on Startup survival (and leave out the relative part to other investors).

Same for the other two Hypotheses, by just including the CVC dummy in the second model, no BA or VC dummy, and the same for H3 by only including the VC dummy.

Then I can also test for endogeneity more "easily" by following your residual approach without the complexity you just mentioned in your last post. Please correct me if I am wrong. Happy to hear your thoughts.

Best

Max
Attached Files

Screenshots-Max-S2.docx (314.5 KB, 1 view)
Comment
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#11

12 Aug 2015, 01:52

If you estimate the model by biprobit and want to test the exogeneity of BA_dummy, then you can perform a Hausman test where you compare the estimates of the performance equation of the biprobit with the estimates of a probit for the same equation (see help hausman). The Hausman test is a formal statistical test, but you will be able to see evidence of endogeneity just by comparing the two sets of coefficients. Otherwise the residual approach should also work. It is a very good thing if you can simplify your model.
Comment
Max Schmitz

Join Date: Aug 2015

Posts: 13
#12

13 Aug 2015, 01:08

Christophe,

I am stuck again. I tried to understand how to do a Hausman test in STATA after running the biprobit, but it didn't work out. I also read that after running a biprobit it is possible to say something about endogeneity just by looking at the output. Unfortunately, I couldn't find any forum entries that elaborate on interpreting biprobit output in Stata, and I have no clue. Attached is the biprobit output.

It would be super cool if you could help me out again with two things:

1) how to interpret a biprobit output, maybe by referring to my output
2) explain how to use hausman in STATA after running the biprobit

Thank you very much again. You will definitely have a special spot in my acknowledgment section !

Attached Files
Comment
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#13

14 Aug 2015, 00:38

1) You have to compute the marginal effects of the model to be able to interpret your results (as in a plain probit, actually). Then you can quantify how the probability of success is affected by VC investment. Stata can do that for you relatively easily. See the command margins in biprobit postestimation.
Normally the sign of the coefficient of VC_dummy should tell you in which direction the probability of success in the first equation will go. For the other coefficients it is more complicated, as this is a simultaneous-equations model and the effect will depend on what happens in both equations and on the correlation coefficient of the error terms, rho (which is significant, by the way). The statistically significant correlation coefficient tells you that some common unobserved factors affect both equations, which means that there is some selection. Conditional on the observable factors, those who are more likely to have VC are also more likely to succeed. Does that make sense?

2) In biprobit postestimation see also the hausman entry to see how to perform the test in Stata. You will see some examples of how to test the model. The first example, with the Heckman model, is closest to what you need. Note that you only want to test the first equation of the biprobit model. A sketch of the code would be
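A hedged sketch of the marginal-effect step, assuming the data are in memory, `performance` is the binary outcome and `instrument` is a placeholder for the excluded variable. Entering the dummy as `i.VC_dummy` makes margins report a discrete 0-to-1 change instead of a derivative.

```stata
* Assumptions: data in memory; "performance" and "instrument" are placeholders.
local controls lninvage lninvsize lncompage time2_dummy*

biprobit (performance = i.VC_dummy `controls') (VC_dummy = instrument `controls')

* Average effect of VC_dummy on the marginal probability of success
margins, dydx(VC_dummy) predict(pmarg1)
```

The footer of the biprobit output also reports a test of rho = 0 with its own Prob > chi2, which is where the significance of rho can be read off.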

Code:

* probit: efficient if vc_dummy is exogenous
probit performance vc_dummy x
estimates store probit
* biprobit: consistent whether or not vc_dummy is exogenous
biprobit (performance = vc_dummy....) (vc_dummy = ....)
* hausman takes the always-consistent estimator first, then the efficient one
hausman . probit, equation(1:1)

The principle of the Hausman test is relatively simple and intuitive. You have one estimator which is consistent (the biprobit) whether vc_dummy is exogenous or endogenous, whereas the other (the probit) is consistent if vc_dummy is exogenous and not otherwise. If vc_dummy is exogenous, then the two sets of estimates should be close to each other (but not necessarily equal) because both are consistent. This is what the Hausman test does: it tests whether the two sets of estimates are close enough in a statistical sense. Note that you try to reject the null hypothesis of exogeneity of vc_dummy. Hope this helps.
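For reference, the statistic behind this comparison can be written as below. The notation is an assumption for illustration: subscript c denotes the always-consistent estimator (here the biprobit performance equation) and subscript e the estimator that is efficient under exogeneity (here the probit).

```latex
H = (\hat\beta_c - \hat\beta_e)' \left[ \hat V_c - \hat V_e \right]^{-1} (\hat\beta_c - \hat\beta_e)
\;\overset{a}{\sim}\; \chi^2_k \quad \text{under } H_0 \text{ (exogeneity)},
```

where k is the number of coefficients being compared. A large H (small p value) rejects exogeneity. In finite samples the difference of the two variance matrices need not be positive definite, in which case the reported statistic is unreliable.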
Comment
Max Schmitz

Join Date: Aug 2015

Posts: 13
#14

14 Aug 2015, 01:59

Thanks a lot Christophe. No clue how someone can be such a great expert in so specific tests. Everybody I asked had no idea since they weren't into the topic. So thanks again. Those tests which I can include now really spiced up my analysis section!

I did now exactly what you suggested to do and was wondering about some general points now after reflecting of what we did the last weeks:

(1)

I got the point that probit is needed if the dependent is a binary variable (1 or 0). Therefore I run the probit regression (performance =...) with performance outstanding (yes or no), and the VC dummy (1 or 0) as main independent var. The output shows that the VC dummy in the probit is highly insignificant (.570 p value). So I fail to reject that VC has no impact on venture performance.

However, when I run the biprobit (performance = ...) (VC dummy =....), the first equation in parentheses, which is exactly the probit regression, now gives me a significant VC dummy (.0014 p value). The reason why we said we need a biprobit was only for testing the VC dummy for endogeneity, wasn't it? If not, which VC dummy coefficient and p value should I take now to answer my hypothesis whether the presence of a VC increases venture performance? I guess the probit one. I am just wondering why the biprobit one is different then.

(2)

Since it is the first time I am running a Hausman test, I was wondering whether you can confirm my interpretation of the attached output (Prob>chi2 = 1.0000). Does this mean that we fail to reject H0 (insignificant), so we cannot say whether there is an unobserved factor, i.e. no proof for endogeneity of the VC dummy but also no proof of exogeneity? Then I would have a problem, since you said that the biprobit shows a significant rho, meaning there is endogeneity.

(3)

You said the biprobit above has a significant rho, but rho doesn't have a p value. How do you know that rho is significant? Or is Prob > chi2 = 0.0362 at the bottom of my output the p value?

(4)

One last thing, which is very specific to the analysis section of my thesis. Since we started with the control function approach (including the residuals from the first-stage equations for the dummies as regressors in the outcome equation for performance), I was wondering whether I can still include this as an additional measure of robustness, or whether you would suggest only including the biprobit plus Hausman test in my robustness section?

I would structure the part then as the following:

- describe the instrumental variable approach, including showing that it is a valid instrument (correlated with the endogenous variable and uncorrelated with the dependent variable)
- explain why I use a biprobit model (the endogenous variable is binary, so a normal probit with the residuals as regressors (2SLS-style) would give us inconsistent results)
- explain why Hausman on top of biprobit is needed (this is my question below)

My question here is: what does the biprobit test in the first place, or do we only use it in order to perform the Hausman test? You said the rho in my biprobit was significant, meaning there is an unobserved factor, so actually we already showed that there might be endogeneity of the VC dummy, didn't we? So why then Hausman on top?

A lot of text but I think for you it will be possible to answer with a few words. I just wanted to make sure that I now fully understood the approach and can start writing .

Thanks again!
Comment
Max Schmitz

Join Date: Aug 2015

Posts: 13
#15

17 Aug 2015, 01:39

Hey Christophe,

did you have any chance yet to have a look at my last post? sorry for bothering again, just wrapping up my robustness section and would be nice to have some final feedback from you .

Best,

Max
Comment
