xtset xtprobit

Matthias Enichlmayr

Join Date: Jun 2014

Posts: 31
#1


29 Jul 2014, 07:51

Dear all,

I have a simple question for you.
I have a data set on the German automobile market (2005 - 2011) and I want to regress the number of accidents by the policyholders on the type of insurance coverage they have. Besides type_of_insurance_coverage I also include control variables like sex, type_of_car, engine_power etc.
In a static model I used a bivariate probit model.
Now I want to measure the same effect using panel data.
After using the command "xtset id_of_policyholder year" I used the command "xtprobit accidents insurance_coverage sex engine_power age_of_the_car ... , re".
My question: is this approach econometrically correct? Or do I have to include xtset somewhere in the regression?
Another point is whether "random effects" is correct or whether I should use a fixed effects approach.

Appreciate your help

Matthias
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#2

29 Jul 2014, 08:54

Hi Matthias, a couple of questions and comments from me.

First, you mention that your dependent variable is the number of accidents, yet you're using binary probit models. The number of accidents seems like a count variable to me, not a binary one; how is this possible?

Now with respect to your first question, xtset tells Stata what the level 1 and level 2 variables are for all xt related commands. You only need to use it when you want to define what those two variables are. If they don't change throughout your analysis, you then only need to call it once.
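To illustrate the point about calling xtset only once, here is a minimal sketch using the variable names from the first post. It assumes id_of_policyholder and year are numeric and that each policyholder appears at most once per year; the regressor list is abbreviated and is not the full specification.

```stata
* Declare the panel structure once; the settings stay attached to the data in memory
xtset id_of_policyholder year

* Subsequent xt commands pick up the panel settings, so xtset is not repeated
xtprobit accidents insurance_coverage sex engine_power age_of_the_car, re
xtlogit  accidents insurance_coverage sex engine_power age_of_the_car, re
```

If the data are saved after xtset, the panel settings are saved with the file as well.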

Your second question is impossible to answer without knowing the data. Having said that, it is something that is testable, but not when using xtprobit, since it doesn't do a fixed effects estimation. You would have to use xtlogit for that: run fixed effects and random effects estimates and then do a Hausman specification test using the hausman command. No matter what the result of the Hausman test is, I always find it useful to do both fixed and random effects estimations and compare the coefficients to see how much bias the regressors may cause in the random effects estimation, assuming that fixed effects is appropriate, that is.
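The sequence described above could look roughly like the sketch below. The variable names come from the first post and the regressor list is only an assumption for illustration. Covariates that never change within a policyholder drop out of the fixed effects model, so only time-varying regressors are used here to keep both models comparable.

```stata
* Panel declaration (needed once, before any xt command)
xtset id_of_policyholder year

* Fixed effects (conditional) logit; store the results
xtlogit accidents insurance_coverage engine_power age_of_the_car, fe
estimates store fe

* Random effects logit with the same regressors; store the results
xtlogit accidents insurance_coverage engine_power age_of_the_car, re
estimates store re

* Hausman test: the consistent estimator (fe) goes first, the efficient one (re) second
hausman fe re

* Side-by-side comparison of the coefficients from both models
estimates table fe re, b se
```

Note that the fixed effects logit only uses policyholders whose outcome changes at least once over the years, so its estimation sample can be much smaller than the random effects one.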

I hope this helps,

Alfonso Sanchez-Penalver
Matthias Enichlmayr

Join Date: Jun 2014

Posts: 31
#3

29 Jul 2014, 10:09

Hi Alfonso,

thanks for your instant reply.
number_of_accidents is in my regression a dummy variable (1= at least one accident, 0 = no accident)

Using xtlogit is a good point, since it allows the fixed effects option.
I still don't understand your second argument: "xtset id_of_policyholder year" should tell Stata that I am using panel data, and that year is the time variable. That's how I interpret the help file.
So all I have to do is adjust the data first, then use the command xtset id_of_policyholder year to tell Stata that I am using panel data, and then run the xtlogit regression pretty much the same way as I did with my bivariate probit:

xtlogit number_of_accidents type_of_coverage sex engine_power garage age_of_the_policyholder age_of_the_car kilometers_travelled_per_year ...(further control variables), re

(The difference is, in my opinion, that I take one of the two dependent variables of the bivariate probit and add it as an independent variable in the xtlogit model [e.g. type_of_coverage].)
I am running it twice (once with the option re and once with the option fe).
Afterwards, I am using the Hausman test to figure out whether the fixed effects specification or the random effects specification better suits my regression, right?
Concerning the robustness of my model, I would further test for heteroskedasticity (e.g. Breusch-Pagan test) and serial correlation (Breusch-Godfrey test).
Do you agree?

Thanks for your patience with my patchy empirical knowledge

Matthias
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#4

29 Jul 2014, 10:52

A problem arises if a bivariate model is the true model and you include one of the response variables as an explanatory variable in the other equation: endogeneity. Notice that if the errors of the two equations are correlated in the bivariate model, then the response variable of one equation is also correlated with the errors of the other equation, which is the source of the endogeneity problem. I'm assuming (guessing) that you're doing this because there is no bivariate xtlogit command? In the other discussion I mentioned that the user-written command cmp (SSC) will do a bivariate probit with random effects, and how to account for fixed effects by including dummy variables. You can use that to see whether the errors are correlated across equations; if they are, that is a clear indication that you should not be including a response variable from one equation in the other one.

Having said that, you can always run the two equations separately without including the response variables as explanatory variables. Run cmp on each of them with random effects and cmp for the bivariate probit with random effects. Then you can compare coefficients (or marginal effects whatever you prefer) of both methods in both equations, as well as determine if there is correlation across errors of both equations at any level (the grouping level or the year level).

If you want to do fixed effects with the bivariate probit, you can still use cmp with the group dummies as I explained in the other discussion. So you can estimate the individual probit equations with the group dummies, and the bivariate probit with the group dummies in each equation, all with cmp, and compare. I don't know how to perform the Hausman test then, but it gives you a direct comparison of the common coefficients in both estimations, random and fixed effects.

After reading both discussions I believe this is the best way forward for you. It may have the learning curve of how to use cmp, but once you do it allows you to estimate single equations and bivariate models, as well as do random effects, fixed effects, or pooled (population averaged) models (i.e. an estimation of just a constant in the equation, so neither fixed nor random effects).

With respect to the BP test or the BG test... with binary variables it is not as straightforward as with continuous response variables, so I wouldn't try to do too much for now. Another thing to think about is that with four years of data (you mentioned 2008 - 2011 in the other discussion), there is not much dimension to the time series to capture serial correlation, and if the heteroskedasticity in the data is because of differences at the insuree level, you would have captured it already with the random or fixed effects.

Last edited by Alfonso Sánchez-Peñalver; 29 Jul 2014, 10:54.

Alfonso Sanchez-Penalver
Matthias Enichlmayr

Join Date: Jun 2014

Posts: 31
#5

29 Jul 2014, 13:44

Hi Alfonso,

Thanks for the additional information:

I think I understood how to implement a cmp biprobit:
According to "cmp help" I need the following command: (1) cmp (accidents_per_year=..explanatory variables...) (type_of_coverage =...explanatory variables...), ind($cmp_oprobit $cmp_oprobit) nolr tech(dfp)
Below you see my biprobit estimation:
As you can see it is quite extensive!

(2) biprobit accidents_per_year type_of_coverage_dummy km_travelled_per_year_tsd age_of_insuree sex_of_insuree initial_car_value age_car garage_dummy no_claims_bonus_d1 no_claims_bonus_d5 no_claims_bonus_d7 no_claims_bonus_d10 no_claims_bonus_d20 no_claims_bonus_d30 no_claims_bonus_d40 no_claims_bonus_d50 no_claims_bonus_d56 no_claims_bonus_d57 no_claims_bonus_d58 no_claims_bonus_d59 no_claims_bonus_d61 no_claims_bonus_d62 no_claims_bonus_d63 no_claims_bonus_d64 no_claims_bonus_d65 no_claims_bonus_d66 no_claims_bonus_d67 no_claims_bonus_d68 no_claims_bonus_d69 no_claims_bonus_d71 no_claims_bonus_d72 no_claims_bonus_d73 no_claims_bonus_d74 no_claims_bonus_d75 no_claims_bonus_d76 no_claims_bonus_d77 no_claims_bonus_d78 no_claims_bonus_d79 no_claims_bonus_d81 no_claims_bonus_d82 no_claims_bonus_d83 no_claims_bonus_d84 no_claims_bonus_d85 no_claims_bonus_d86 no_claims_bonus_d87 no_claims_bonus_d88 no_claims_bonus_d89 type_of_car_d0 type_of_car_d10 type_of_car_d11 type_of_car_d12 type_of_car_d13 type_of_car_d14 type_of_car_d15 type_of_car_d16 type_of_car_d17 type_of_car_d18 type_of_car_d19 type_of_car_d20 type_of_car_d21 type_of_car_d22 type_of_car_d23 type_of_car_d24 type_of_car_d25 type_of_car_d26 type_of_car_d27 type_of_car_d28 type_of_car_d29 type_of_car_d30 type_of_car_d31 type_of_car_d32 type_of_car_d33 type_of_car_d34 regional_class_d1 regional_class_d2 regional_class_d3 regional_class_d4 regional_class_d5 regional_class_d6 regional_class_d7 regional_class_d8 regional_class_d9, robust nolog

accidents_per_year and type_of_coverage are my two dependent variables.
My independent variables are:
km_travelled_per_year_tsd
age_of_insuree
sex_of_insuree
initial_car_value
age_car
garage_dummy
Additionally I have three ordinal variables: no_claims_bonus, type_of_car and regional_class
I included (n-1) dummies for these three variables.

When I include that many variables in the cmp command (1), it does not work. It only works when I leave, for instance, only two independent variables,
e.g. (3) cmp (accidents_per_year = km_travelled_per_year_tsd age_of_insuree)(type_of_coverage = km_travelled_per_year_tsd age_of_insuree), ind($cmp_oprobit $cmp_oprobit)
(3) works.

If I understood you correctly, I have to include the dummies for the different policyholders (in my case 140 dummy variables).
Do I still have to include all the explanatory variables from (2) or can I drop some of the independent variables from equation (2)?

I just fear that it does not work with that many variables like in the normal bivariate probit model......

Do you have any idea how to proceed?

Alfonso Sánchez-Peñalver

Join Date: Mar 2014
Posts: 432

30 Jul 2014, 08:43

What do you mean that cmp does not work? Please be specific.

You are correct to fear that it may not work, but if the data is right it should converge (eventually) and give you results.

As for your syntax of cmp, it's not right. You are indicating that the two equations are ordered probits, and they are not; they are simple probits. So you need ind($cmp_probit $cmp_probit). This may be why it's not working. I suggest you try without tech(dfp) first. If it doesn't converge, you may want to add the diff option first, and see how it goes. If it still doesn't converge, then you start playing with the optimization techniques with the tech() option.
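As a rough sketch of that escalation, the two-regressor example (3) from the previous post could be rewritten as follows. This assumes cmp and its dependency ghk2 are installed from SSC, and that the "diff option" refers to Stata's difficult maximization option; the lines use /// continuation, so they are meant to be run from a do-file.

```stata
* One-time installation from SSC (cmp relies on ghk2)
ssc install ghk2
ssc install cmp

* Define the $cmp_* indicator macros used inside ind()
cmp setup

* Step 1: plain probits in both equations, default optimizer
cmp (accidents_per_year = km_travelled_per_year_tsd age_of_insuree) ///
    (type_of_coverage   = km_travelled_per_year_tsd age_of_insuree), ///
    ind($cmp_probit $cmp_probit)

* Step 2: only if step 1 does not converge, add the difficult option
cmp (accidents_per_year = km_travelled_per_year_tsd age_of_insuree) ///
    (type_of_coverage   = km_travelled_per_year_tsd age_of_insuree), ///
    ind($cmp_probit $cmp_probit) difficult

* Step 3: only if step 2 still fails, switch the optimization technique
cmp (accidents_per_year = km_travelled_per_year_tsd age_of_insuree) ///
    (type_of_coverage   = km_travelled_per_year_tsd age_of_insuree), ///
    ind($cmp_probit $cmp_probit) difficult tech(dfp)
```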

For fixed effects yes, simply add the dummy variables for the insurees to the set of explanatory variables. You can estimate them with both biprobit and cmp, and compare.
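If the insuree dummies do not exist yet, one way to create them is sketched below. The identifier name id_of_policyholder is taken from the first post and the stub insureeDum is only an assumed naming choice; adjust both to the names used in your data.

```stata
* Creates insureeDum1, insureeDum2, ... with one indicator per policyholder
tabulate id_of_policyholder, generate(insureeDum)

* Quick check of how many dummies were created
describe insureeDum*, simple
```

Since the equations also contain a constant, one of the dummies is redundant; the estimation command may drop it for collinearity, or you can leave one out yourself.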

For the random effects and following your example:

Code:

cmp (accidents_per_year = explanatory variables || insuree_categorical_variable:) (type_of_coverage = explanatory variables || insuree_categorical_variable:), ind($cmp_probit $cmp_probit)

Remember that it is the categorical variable not the dummies for the insuree that go in the definition of each equation. This estimation may take some time and be difficult to converge. If this happens the first thing is to try it with the diff option, and if that still doesn't seem to converge then you can play with the techniques, and use the tech(dfp) you had before. But maybe it converges fine. Just letting you know that it may be time intensive, and then again it may be not.

Now some tip for simplicity. When I have so many explanatory variables and want to use them in different regressions I define global macros with them. In your case I would do

Code:

global x1 "km_travelled_per_year_tsd age_of_insuree sex_of_insuree initial_car_value age_car garage_dummy no_claims_bonus_d1 no_claims_bonus_d5 no_claims_bonus_d7 no_claims_bonus_d10 no_claims_bonus_d20 no_claims_bonus_d30 no_claims_bonus_d40 no_claims_bonus_d50 no_claims_bonus_d56 no_claims_bonus_d57 no_claims_bonus_d58 no_claims_bonus_d59 no_claims_bonus_d61 no_claims_bonus_d62 no_claims_bonus_d63 no_claims_bonus_d64 no_claims_bonus_d65 no_claims_bonus_d66 no_claims_bonus_d67 no_claims_bonus_d68 no_claims_bonus_d69 no_claims_bonus_d71 no_claims_bonus_d72 no_claims_bonus_d73 no_claims_bonus_d74 no_claims_bonus_d75 no_claims_bonus_d76 no_claims_bonus_d77 no_claims_bonus_d78 no_claims_bonus_d79 no_claims_bonus_d81 no_claims_bonus_d82 no_claims_bonus_d83 no_claims_bonus_d84 no_claims_bonus_d85 no_claims_bonus_d86 no_claims_bonus_d87 no_claims_bonus_d88 no_claims_bonus_d89 type_of_car_d0 type_of_car_d10 type_of_car_d11 type_of_car_d12 type_of_car_d13 type_of_car_d14 type_of_car_d15 type_of_car_d16 type_of_car_d17 type_of_car_d18 type_of_car_d19 type_of_car_d20 type_of_car_d21 type_of_car_d22 type_of_car_d23 type_of_car_d24 type_of_car_d25 type_of_car_d26 type_of_car_d27 type_of_car_d28 type_of_car_d29 type_of_car_d30 type_of_car_d31 type_of_car_d32 type_of_car_d33 type_of_car_d34 regional_class_d1 regional_class_d2 regional_class_d3 regional_class_d4 regional_class_d5 regional_class_d6 regional_class_d7 regional_class_d8 regional_class_d9"

global x2 "$x1 insureeDum*"

I don't remember what we called the dummies for the insurees in the other post, but notice I'm using the * wildcard to include all of them. Now in $x1 you have all the variables for the plain and random effects estimations, and in $x2 you have the variables for the fixed effects estimation. You can go and do

Code:

* Plain bivariate probits
biprobit accidents_per_year type_of_coverage_dummy $x1, robust nolog
cmp (accidents_per_year = $x1) (type_of_coverage_dummy $x1), vce(robust) ind($cmp_probit $cmp_probit)

* Fixed effects bivariate probits
biprobit accidents_per_year type_of_coverage_dummy $x2, robust nolog
cmp (accidents_per_year = $x2) (type_of_coverage_dummy $x2), vce(robust) ind($cmp_probit $cmp_probit)

* Random effects bivariate probit
cmp (accidents_per_year = $x1 || insuree_categorical_variable:) (type_of_coverage_dummy $x1 || insuree_categorical_variable:), vce(robust) ind($cmp_probit $cmp_probit)

Alfonso Sanchez-Penalver


Alfonso Sánchez-Peñalver

Join Date: Mar 2014
Posts: 432

30 Jul 2014, 15:09

Missing several equals in my last bit of code

Code:

* Plain bivariate probits
biprobit accidents_per_year type_of_coverage_dummy $x1, robust nolog
cmp (accidents_per_year = $x1) (type_of_coverage_dummy = $x1), vce(robust) ind($cmp_probit $cmp_probit)
 
* Fixed effects bivariate probits
biprobit accidents_per_year type_of_coverage_dummy $x2, robust nolog
cmp (accidents_per_year = $x2) (type_of_coverage_dummy = $x2), vce(robust) ind($cmp_probit $cmp_probit)
 
* Random effects bivariate probit
cmp (accidents_per_year = $x1 || insuree_categorical_variable:) (type_of_coverage_dummy = $x1 || insuree_categorical_variable:), vce(robust) ind($cmp_probit $cmp_probit)

Last edited by Alfonso Sánchez-Peñalver; 30 Jul 2014, 15:12.

Alfonso Sanchez-Penalver


Matthias Enichlmayr

Join Date: Jun 2014

Posts: 31
#8

31 Jul 2014, 15:37

Hi Alfonso,
again, thanks for your help. I really appreciate that.

Your commands with the global macros work perfectly, and they are very convenient when including a lot of dummies.
Unfortunately, only one of the regressions (the fixed effects bivariate probit with the cmp command) works, and only sometimes.
After adjusting all the explanatory variables, I have 532 observations left. Since we have four years, that leaves us with 133 individuals.
I tried your specifications:
biprobit accidents_per_year type_of_coverage_dummy $x1, robust nolog
Result: it does not converge! (not concave) Even after omitting the dummies for no_claims_bonus, type_of_car and regional_class!

* Plain bivariate probits
biprobit accidents_per_year type_of_coverage_dummy $x1, robust nolog
Result: fitting model does not converge (non concave)
cmp (accidents_per_year = $x1) (type_of_coverage_dummy = $x1), vce(robust) ind($cmp_probit $cmp_probit)
Result: error_message_2 pops up.

* Fixed effects bivariate probits
biprobit accidents_per_year type_of_coverage_dummy $x2, robust nolog
cmp (accidents_per_year = $x2) (type_of_coverage_dummy = $x2), vce(robust) ind($cmp_probit $cmp_probit)
Result: the cmp-specification does work, but only sometimes, and sometimes error_message_1 pops up

* Random effects bivariate probit
cmp (accidents_per_year = $x1 || insuree_categorical_variable:) (type_of_coverage_dummy = $x1 || insuree_categorical_variable:), vce(robust) ind($cmp_probit $cmp_probit)
Result: I didn't run this one so far. I guess that "insuree_categorical_variable" and "insureeDum*" is the same!!

If you have a spontaneous idea what might be the problem do not hesitate to make a proposal, in the meantime I will resume working out the problem with optional commands like the diff command for instance.

Kind regards
Matthias

PS: if the information I provided is a bit confusing, do not hesitate to ask me to clarify it.
Matthias Enichlmayr

Join Date: Jun 2014

Posts: 31
#9

31 Jul 2014, 15:38

Sorry about the smileys, they actually stand for a colon.
If this is too confusing I could attach the do-file with the descriptive statistics of the variables (so you can get an idea about the potential problems)

Last edited by Matthias Enichlmayr; 31 Jul 2014, 15:41.
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#10

21 Aug 2014, 05:16

Convergence, or nonconvergence in this case, is a tricky issue. Sometimes it's because of data misspecification, other times because of model misspecification, and yet others because the gods are against you. In your case, it seems that one variable (km_travelled_per_year) predicts the outcome perfectly, so there is no estimation possible. This means that all you need to predict the outcome is to know whether km_travelled_per_year is greater than 5. Sorry, but I can't help you further with that.
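A quick way to see this kind of perfect prediction in the data is a cross-tabulation, sketched below. It assumes the variable in question is the km_travelled_per_year_tsd regressor from the earlier posts and uses a hypothetical helper variable km_gt5; a cell count of zero in the table means the indicator separates the outcomes.

```stata
* Indicator for more than 5 (thousand) km per year; left missing where km is missing
generate byte km_gt5 = km_travelled_per_year_tsd > 5 if !missing(km_travelled_per_year_tsd)

* Empty cells in these tables signal that km_gt5 predicts the outcome perfectly
tabulate km_gt5 accidents_per_year
tabulate km_gt5 type_of_coverage_dummy
```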

Alfonso Sanchez-Penalver
Matthias Enichlmayr

Join Date: Jun 2014

Posts: 31
#11

21 Aug 2014, 09:43

You have been of great help, thanks for your efforts.