Help with continuous numerical variables with predominantly zero values

Claudio Stoduto

Join Date: Aug 2024

Posts: 1
#1


01 Aug 2024, 03:52

Hi everyone,

My name is Claudio and I am currently working on a credit risk project and need some advice on an econometric problem. I have previously estimated the probability of loans defaulting or entering arrears (1/0 dummy variables) with a logit model following the literature, and up to this point I had no particular issues.

Now, I want to complement the analysis by finding the drivers of the arrears balance, i.e. whether there is a relation between the amount in arrears and certain risk drivers. So, my dependent variable is the arrears balance for a large dataset of loans. The problem is that the arrears balance is zero for about 99% of the observations, so the mean over the whole sample is around 3 dollars. However, in the roughly 1% of cases where the arrears balance is non-zero, the values are distributed around a mean of roughly 1,000.

So far, I’ve tried using OLS regression and for obvious reasons I obtain very small coefficients. I also considered running the regression only on the observations where the arrears amount is above zero, but this approach represents a significant loss of information and reduces the number of observations drastically.

I suspect that OLS might not be the best approach given the distribution of my dependent variable.

I’ve also tried a Tobit model since it handles the censored nature of the data (the lower limit is zero as the arrears balance cannot be negative).

Here’s the model I’ve been using in Stata 18:

Code:

tobit arrears_bal_numeric i.ltv_q1_pct_5bins time_to_maturity ib3.loan_purpose_bins_enc i.int_rate_type_bins_enc ib1.employment_bins_enc i.total_income_q1_3bins ib4.occup_type_primary_bins_enc i.rbld_coll_orig_val_3bins l1.house_price_change_last12m l1.unemployment_rate l1.hicp i.epc_kwh_3bins, vce(cluster georeg_3digits_enc) ll(0)

I have heard about zero-inflated models and hurdle models, but I am unsure which is the most suitable for my analysis.

I’m looking for advice on the best model to use for this type of data and how to implement it effectively in Stata. Additionally, if anyone has experience with similar data or can point me to relevant literature or resources, that would be incredibly helpful.

I have always read Statalist threads as a visitor and this is my first post here, so I may have overlooked something. Please don't hesitate to let me know if anything in my query should be changed or added.

Thanks so much for your help!

Best regards,

Claudio
Andrew Musau

Join Date: Oct 2014

Posts: 10180
#2

01 Aug 2024, 05:59

If you can model the censoring process (default/not-default) separately from the continuous process (how much default), then you can specify a Heckman selection model. This will require you to have a variable or variables that explain how much one defaults but do not explain whether or not one defaults (the so-called exclusion restrictions). Firm size could be an example if the degree of default is correlated with firm size and on the other hand, both large and small firms default, so there is no relationship between defaulting and size. See

Code:

help heckman

for examples of applications.
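
As a rough illustration of the syntax only, a selection model for this problem might be set up as sketched below. The variable names are taken from the command in #1, except for in_arrears and z_excl, which are hypothetical: in_arrears is an indicator built here, and z_excl is a placeholder for whatever exclusion-restriction variable is available. In Stata's heckman, the regressors of the amount equation follow the dependent variable and the regressors of the selection equation go in select(); the usual convention is that select() contains at least one variable that is left out of the amount equation. The regressor list is shortened for readability and is not a recommended specification.

```stata
* Hypothetical indicator: 1 if the loan has a positive arrears balance
generate byte in_arrears = arrears_bal_numeric > 0 if !missing(arrears_bal_numeric)

* Amount equation: arrears balance (used only where in_arrears == 1)
* Selection equation: whether the loan is in arrears at all
* z_excl is a placeholder name, not a variable from the original post
heckman arrears_bal_numeric time_to_maturity i.int_rate_type_bins_enc, ///
    select(in_arrears = time_to_maturity i.int_rate_type_bins_enc z_excl) ///
    vce(cluster georeg_3digits_enc)
```

The /// line continuations assume the commands are run from a do-file.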

Last edited by Andrew Musau; 01 Aug 2024, 06:05.
Joao Santos Silva

Join Date: Apr 2014

Posts: 3006
#3

01 Aug 2024, 06:10

Dear Claudio Stoduto,

Both the Tobit and the Heckman method recommended by Andrew are only valid under very strong assumptions. I would start with a simple Poisson regression, which is very robust (you need robust standard errors). Note that the Poisson regression assumes an exponential model, so you may want to log some of the continuous regressors.
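
A minimal sketch of what this could look like, reusing variable names from the command in #1 and assuming the data have already been declared as a panel so that the l1. lag operator works. The regressor list is shortened for readability, and ln_time_to_maturity is a hypothetical variable created here only to illustrate logging a continuous regressor.

```stata
* Hypothetical logged regressor; observations with time_to_maturity <= 0
* get a missing value and would drop out of the estimation sample
generate double ln_time_to_maturity = ln(time_to_maturity) if time_to_maturity > 0

* Poisson regression on the arrears balance, zeros included, with
* cluster-robust standard errors (the same clustering as the tobit in #1)
poisson arrears_bal_numeric i.ltv_q1_pct_5bins ln_time_to_maturity ///
    l1.unemployment_rate l1.hicp, vce(cluster georeg_3digits_enc)
```

Stata prints a note when the dependent variable is not a count; the command still runs. Under the exponential model, the coefficient on a logged regressor can be read as an elasticity of the expected arrears balance.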

Best wishes,

Joao