Hi everyone,
My name is Claudio. I am currently working on a credit risk project and need some advice on an econometric problem. Following the literature, I have previously estimated the probability of loans defaulting or entering arrears (1/0 dummy variables) with a logit model, and up to this point I have had no particular issues.
Now I want to complement the analysis by identifying the drivers of the arrears balance, i.e. whether the amount in arrears is related to certain risk drivers. My dependent variable is therefore the arrears balance for a large dataset of loans. The problem is that the arrears balance is zero for about 99% of the observations, so the mean over the whole sample is around 3 dollars. However, in the roughly 1% of cases where the arrears balance is non-zero, the values are distributed around a mean of about 1,000.
So far, I’ve tried OLS regression and, for obvious reasons, I obtain very small coefficients. I also considered running the regression only on the observations where the arrears amount is above zero, but this approach discards a significant amount of information and drastically reduces the number of observations.
I suspect that OLS might not be the best approach given the distribution of my dependent variable.
I’ve also tried a Tobit model since it handles the censored nature of the data (the lower limit is zero as the arrears balance cannot be negative).
Here’s the model I’ve been using in Stata 18:
Code:
tobit arrears_bal_numeric i.ltv_q1_pct_5bins time_to_maturity ib3.loan_purpose_bins_enc i.int_rate_type_bins_enc ib1.employment_bins_enc i.total_income_q1_3bins ib4.occup_type_primary_bins_enc i.rbld_coll_orig_val_3bins l1.house_price_change_last12m l1.unemployment_rate l1.hicp i.epc_kwh_3bins, vce(cluster georeg_3digits_enc) ll(0)
I have heard about zero-inflated and hurdle models, but I am unsure which is most suitable for my analysis.
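To make the hurdle idea concrete, this is roughly how I understand a two-part specification could be set up. It is only an untested sketch: any_arrears is a hypothetical indicator I would create, the covariate list is a shortened subset of the one in my tobit command, and the gamma family with a log link for the positive amounts is an assumption on my part, not something I have taken from the literature.
Code:
* Two-part sketch: part 1 models whether a loan has any arrears, part 2 models the amount given arrears > 0
* any_arrears is a hypothetical 1/0 indicator, left missing where the balance is missing
generate byte any_arrears = arrears_bal_numeric > 0 if !missing(arrears_bal_numeric)
* Shortened, illustrative covariate list (assumption: a subset of the tobit regressors)
local xvars i.ltv_q1_pct_5bins time_to_maturity i.int_rate_type_bins_enc
* Part 1: probability of a positive arrears balance
logit any_arrears `xvars', vce(cluster georeg_3digits_enc)
* Part 2: size of the balance among loans in arrears only (gamma GLM with log link, an assumed choice)
glm arrears_bal_numeric `xvars' if arrears_bal_numeric > 0, family(gamma) link(log) vce(cluster georeg_3digits_enc)
If this is the right direction, the second part would use only the non-zero observations, while the first part would still use the full sample, so the zeros would not be thrown away.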
I’m looking for advice on the best model to use for this type of data and how to implement it effectively in Stata. Additionally, if anyone has experience with similar data or can point me to relevant literature or resources, that would be incredibly helpful.
I have long read Statalist threads as a visitor, but this is my first post, so I may have overlooked something. Please don't hesitate to let me know if anything in my query should be changed or added.
Thanks so much for your help!
Best regards,
Claudio