How to control for serial correlation in the presence of groups in cross sectional data?

Laiy Kho

Join Date: Oct 2022

Posts: 48
#1


01 Dec 2022, 11:37

Hello,

I have a cross-sectional loan dataset with 46 groups (banks) spanning 20 countries. Because the observations are grouped, I want to control for serial correlation. I am aware that cluster-robust standard errors are the common solution. However, when I cluster at the bank level or the country level, the VIF of my key explanatory variable is inflated immensely. Specifically, when I include a bank-level factor variable, the VIF for the key explanatory variable is 17,617. I have two research questions, which I examine with a linear regression and a logistic regression, and the problem persists in both sets of output. In addition, Stata automatically drops some groups (banks): 3 categories in the linear regression (research question 1, sample size: 1 million) and 6 categories, due to collinearity and the "predicts failure perfectly" error, in research question 2 (sample size: 250,000). I have attached the frequency of all banks/groups in the data.
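To illustrate the VIF problem, here is a minimal sketch in Python rather than Stata, using made-up toy data and not my actual loans. It assumes a regressor that varies almost entirely between banks and a model whose only other regressors are a full set of bank dummies; the function name and the bank labels are hypothetical. In that special case, regressing the variable on the dummies returns the bank means, so the VIF reduces to total variation divided by within-bank variation.

```python
from collections import defaultdict


def vif_given_group_dummies(x, groups):
    """VIF of x when the only other regressors are a full set of group dummies.

    Regressing x on group dummies yields fitted values equal to the group
    means, so R^2 = 1 - SS_within / SS_total and
    VIF = 1 / (1 - R^2) = SS_total / SS_within.
    """
    by_group = defaultdict(list)
    for value, g in zip(x, groups):
        by_group[g].append(value)

    grand_mean = sum(x) / len(x)
    ss_total = sum((v - grand_mean) ** 2 for v in x)
    ss_within = sum(
        (v - sum(vals) / len(vals)) ** 2
        for vals in by_group.values()
        for v in vals
    )
    if ss_within == 0:
        # x is constant within every group: perfectly collinear with the dummies
        return float("inf")
    return ss_total / ss_within


# Toy data: three hypothetical banks with three loans each.
groups = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
# Varies almost only between banks, so the VIF is very large.
x_mostly_between = [1.00, 1.01, 0.99, 5.00, 5.01, 4.99, 9.00, 9.01, 8.99]
# Same spread inside every bank, so the dummies explain nothing and the VIF is about 1.
x_within = [1.0, 5.0, 9.0, 1.0, 5.0, 9.0, 1.0, 5.0, 9.0]

vif_between = vif_given_group_dummies(x_mostly_between, groups)
vif_within = vif_given_group_dummies(x_within, groups)
print(vif_between, vif_within)
```

With other covariates in the model the exact VIF would differ, but the mechanism is the same: the less a variable moves within banks, the larger its VIF once bank dummies are included.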

Background on dataset:

The dataset (sample size: 1 million, period: 4 months) comes from an online firm, registered in one country, that allows multiple banks from different countries to sell loans originated on the platform. I have over 40 banks in the dataset, spanning 20 countries. However, not all banks have observations (i.e. sell loans) in all 4 months; some do, some don't. Each row in the dataset is one observation (one loan) with several characteristics: the start date, the interest rate, whether it has been paid back, etc. Hence, each observation is unique, as it represents a different loan originated by one of the banks. I do not regard the data as time series, since a time series would follow the same loans over several observations (daily, monthly, and so on). I do not consider the data a panel either, because I am assessing the performance of one online business platform and there is only one wave of data.

Given this issue, I am unable to address the possible serial correlation arising from the grouping in the data with cluster-robust standard errors, as I cannot cluster at the bank level. Although my key explanatory variable varies mainly at the bank level, I have also tried clustering at the country level, and the same issue persists.
1. Is there any other solution to remove serial correlation?
2. Is it necessary to cluster at the bank level? I currently control for bank-level heterogeneity by including variables such as bank age, size, and geographical region (Asia, America, Europe, Africa, etc.).
3. Does serial correlation matter in logistic regression? One of my two research questions uses logistic regression and the other OLS.
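To make concrete what I mean by clustering at the bank level, here is a minimal Python sketch for a one-regressor OLS model, again on made-up toy data. The function name and the four bank labels are hypothetical, and the cluster-robust formula is the basic sandwich estimator without the finite-sample correction that Stata applies, so the numbers would not match Stata output exactly.

```python
import math
from collections import defaultdict


def ols_slope_with_ses(x, y, clusters):
    """Slope of y on x (with intercept), plus conventional and cluster-robust SEs.

    The cluster-robust variance of the slope is
    sum_g (sum_{i in g} xd_i * e_i)^2 / (sum_i xd_i^2)^2,
    where xd is demeaned x and e are OLS residuals (no small-sample correction).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    xd = [xi - mean_x for xi in x]
    sxx = sum(d * d for d in xd)

    slope = sum(d * (yi - mean_y) for d, yi in zip(xd, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Conventional SE: assumes independent, homoskedastic errors.
    se_iid = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)

    # Cluster-robust SE: sum the scores xd_i * e_i within each cluster first.
    scores = defaultdict(float)
    for d, e, g in zip(xd, resid, clusters):
        scores[g] += d * e
    se_cluster = math.sqrt(sum(s * s for s in scores.values())) / sxx

    return slope, se_iid, se_cluster


# Toy data: four hypothetical banks, three loans each. x varies only by bank,
# and every loan in a bank shares the same shock (+1 or -1) in y.
clusters = ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3
x = [0.0] * 3 + [1.0] * 3 + [2.0] * 3 + [3.0] * 3
y = [1.0] * 3 + [1.0] * 3 + [3.0] * 3 + [7.0] * 3

slope, se_iid, se_cluster = ols_slope_with_ses(x, y, clusters)
print(slope, se_iid, se_cluster)
```

In this toy example the errors are perfectly correlated within each bank, and the cluster-robust standard error comes out larger than the conventional one while the slope estimate itself is the same under both.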