I have a large unbalanced panel dataset with more than 200 million observations (16GB on disk). In addition, I have a time-series dataset with 5 series (call them x1 to x5).
I want to run a regression that includes, as controls, 16 lags and 4 leads of each time series. It looks something like this: y_{i,t} = \sum_{\tau=-4}^{16} \beta_{1,\tau} x_{1,t-\tau} + \dots + \sum_{\tau=-4}^{16} \beta_{5,\tau} x_{5,t-\tau} + \epsilon_{i,t} *
The way I currently do it is by merging the panel data with the time-series data. Since an individual, i, can have its first observation at any time, t, including the lags (and leads) in the regression requires merging in not only the contemporaneous time-series values but also all of their lags and leads. This means adding 20 lag/lead variables per series to the large dataset.
The merge takes a very long time and produces a dataset of about 360GB. Needless to say, running even a single regression on a dataset of that size takes a very long time.
Is there a more efficient way to perform the analysis? I was wondering whether I can hold the time series data as a local matrix and then include the appropriate matrix values as regression covariates.
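To make the matrix idea concrete, here is a minimal sketch (in Python/numpy, purely illustrative; all names and sizes are made up) of what I have in mind: keep the small T×5 time-series matrix in memory, and build the lag/lead covariates for each panel row by indexing into that matrix at estimation time, instead of merging 20 extra columns per series into the 200M-row file.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small time-series matrix: T periods x 5 series (x1..x5).
T, K = 60, 5
X = rng.normal(size=(T, K))

# Toy panel: each observation has a calendar-period index t and an outcome y.
n_obs = 1000
t_idx = rng.integers(16, T - 4, size=n_obs)  # leave room for 16 lags / 4 leads
y = rng.normal(size=n_obs)

# Instead of merging 20 lag/lead columns per series into the panel,
# look the values up in X: the regressor block for an observation dated t
# is X[t - tau] for tau = -4..16 (negative tau = lead), flattened.
taus = np.arange(-4, 17)                    # tau = -4, ..., 16 (21 values)
design = X[t_idx[:, None] - taus[None, :]]  # shape (n_obs, 21, K)
design = design.reshape(n_obs, -1)          # (n_obs, 21 * K) covariates

# OLS on the looked-up design matrix; no merged 360GB file is created.
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Since the regressors depend only on t, the lookup only ever touches the small matrix; the question is whether something equivalent can be done inside the regression command itself.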
Any advice would be appreciated.
* In practice, I run 16 different regressions where the specification is y_{i,t+\tau} = \beta_{\tau} x_{1,t-\tau} + \sum_{j=1}^{4} \gamma_{2,j} x_{2,t-\tau-j} + \dots + \sum_{j=1}^{4} \gamma_{5,j} x_{5,t-\tau-j} + \epsilon_{i,t} for \tau = -4, \dots, 12, following the Jordà local-projections method.