
  • Technical issue - working with a large dataset on a virtual machine

    Dear Forum Users,

    I am working on a binomial logistic regression. My dataset consists of 16 million observations and 12,235 variables (195 GB in total). Currently I am running the regression in Stata on only 100,000 of the 16 million observations, and it takes around 24-36 hours to complete. I am working on a virtual machine (VMware Horizon Client). Now I need to analyze the full dataset. I understand that I need to improve the configuration of the virtual machine in order to finish the regression on all observations in a reasonable time, but I do not know what that means in terms of technical specifications, such as how many processors are needed. How do I identify how powerful this virtual machine needs to be? It is easier to give an IT specialist concrete requirements rather than just "please make it work faster". Do you have any idea how I can solve this issue?

    Kind regards,
    Firangiz

  • #2
    Some preliminaries first. How many variables are you using in your regression model? Which version of Stata do you have, and how many cores is it licensed for?

    I suspect you likely don't need all of the variables for your model, so you can reduce the set to only those required. You might further reduce memory use by -compress-ing your dataset to the smallest necessary data types. It's also very likely that you have exhausted the available RAM and Stata is heavily swapping to disk, which will always be thousands of times slower than when the dataset and temporary results fit in memory.
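
    As a rough illustration (the file name and covariate names below are placeholders, not your actual ones):

    * load only the variables the model actually needs
    use invest_dummy covar1 covar2 investor_id firm_id using mydata.dta, clear
    * recast every variable to the smallest data type that holds its values
    compress
    * see how much memory the data and Stata's overhead currently occupy
    memory
    describe, short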

    • #3
      Dear Leonardo,

      Thank you for your quick answer.

      I am using pretty much all the variables. That's the issue. My model includes investor and firm ID fixed effects: there are more than 7k investor and 4k firm ID dummies that I created for this model. I compressed the file, but the decrease is small; it is now 191 GB. I work with Stata 17, 4-core. The properties of the virtual machine are the following:
      Processor: AMD EPYC 7542 32-Core Processor (2 processors)
      RAM: 300 GB
      System type: 64-bit Operating System, x64-based processor.

      • #4
        In that case, there's not a huge amount of optimization that can be done with the model as it is. Stata usually recommends having twice as much RAM as the size of the working dataset, so you're still likely short on RAM. Your additional cores are of no use, since Stata will only use the 4 that are licensed.
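
        For reference, you can confirm from within Stata what it is licensed and configured to use, along these lines (illustrative only):

        * number of cores licensed and number currently in use
        display c(processors_lic)
        display c(processors)
        * current memory use and memory settings
        memory
        query memory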

        Could you switch to an easier-to-fit model? I am not an economist, but would it be sensible to consider random effects for either firm or investor, or both? I believe that would considerably decrease the size of the problem.

        • #5
          Could you please elaborate more on the alternative model with random effects?

          • #6
            The idea of representing each firm (say) by a fixed effect means estimating an intercept for each of them (plus or minus a reference category). This makes the column dimension of the design matrix very big. If instead the firm effects are assumed to follow a normal distribution, then a random intercept greatly reduces the dimensionality and should be faster to estimate. You'll need to figure out whether this approach is suitable/possible for your problem.
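
            In Stata terms that might look something like the sketch below (placeholder variable names; I have not tried this at anything near your data size):

            * random intercept for firms instead of thousands of firm dummies
            melogit y x1 x2 || firm_id:
            * crossed random intercepts for both firms and investors
            melogit y x1 x2 || _all: R.firm_id || investor_id:

            Crossed random effects can themselves be demanding to fit, so it would be worth testing on a subsample first.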

            • #7
              I understand. This makes sense. If this is a suitable solution to my problem, how should I change my model? My current model looks like this:
              logit invest_dummy SixD_hofV_1 Prior_relationship_dummy LP_GDP_currentmnLN LP_EFInvestmentFreedomLN Fund_Size_USDLN Fund_seq_numLN LP_WGILN Manager_WGILN Legal_origin_bias Fund_coinvestment_dummy LP_ID_dummy* Manager_ID_dummy*, vce(cluster double_clusterFVMC) noconstant
