  • Processing large dataset in Stata?

    I am using Stata/SE 17 and have 5,609,229 observations in my dataset. Even a simple command such as egen takes a long time to run.
    I tried freeing up memory, and my computer has a reasonable amount of it, but that made little difference.

    Is there any way to make my Stata commands run faster?

    Thank you; I look forward to your tips and suggestions.

  • #2
    Check out the user-written -gtools- suite; it substantially speeds up the type of operations you mention.
    https://gtools.readthedocs.io/en/latest/
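
    If you have not installed it yet, a minimal sketch of the drop-in usage (the variable names here are placeholders, not from your dataset):

    Code:
    ssc install gtools
    * gtools commands are the usual names prefixed with g
    gegen y = mean(x)        // instead of: egen y = mean(x)
    gegen t = tag(id1 id2)   // instead of: egen t = tag(id1 id2)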



    • #3
      You don't provide much information, such as the size of an observation, which egen functions are slow, or how slow is too slow. A simple egen, such as

      Code:
      egen y=mean(x)
      runs in about a fifth of a second on a dataset that size if there is sufficient memory. Some -egen- functions require sorting or searching and will be much more time-consuming.

      But I suspect you may be running out of physical memory. On Windows you can use the Task Manager to track memory usage while the Stata job runs. If Stata runs out of physical memory, it falls back on virtual memory, which is hundreds of times slower. -gtools-, suggested by Joro Kolev, is also very good, but it won't reduce memory requirements. I have some suggestions at http://www.nber.org/stata/efficient
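
      For instance, a minimal sketch for timing a single command and checking memory from inside Stata (the variable names are placeholders):

      Code:
      * time one command precisely
      timer clear 1
      timer on 1
      egen y = mean(x)
      timer off 1
      timer list 1
      * report Stata's current memory usage
      memory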



      • #4

        I really appreciate your suggestions. Thank you very much, Joro Kolev and Daniel Feenberg!
        I will try gtools.

        Daniel Feenberg, I used egen only as an example; I have several tasks to do on this dataset.



        For example, an egen call like

        Code:
        egen tag_varx = tag(VARX VARY VARZ)
        took over two hours; since the run would not complete, I stopped the task.

        Sometimes

        Code:
        egen tagVarX = tag(Varx) if VarH == 1
        took around 15 minutes.
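
        A minimal sketch of a by-group alternative to -egen, tag()- that sometimes runs faster, using the variable names from the first example above:

        Code:
        * mark the first observation in each (VARX, VARY, VARZ) group
        sort VARX VARY VARZ
        by VARX VARY VARZ: gen byte tag_varx = (_n == 1)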




        Other commands took 5 minutes, 10 minutes, or more, depending on the command.




        Merging a column from other datasets took especially long; I don't remember exactly how long (approximately 30-40 minutes).




        As you suggested, I will see about getting more memory for my computer. Right now, my laptop has 34 GP of free space.

        With kind regards,




        • #5
          I am surprised by the long run times given the tag example: here, it took less than one second to place 500,000 tags on a 5,000,000-observation dataset. I think something is going wrong on your end. I am not sure what 34GP could be; do you mean 34GB? That also seems unlikely in a laptop. Even 32GB is very unusual in a laptop, although it is possible. Can you create a small (10-line) test case that creates some data, runs a command, and takes an inordinate amount of time? My test case was:

          Code:
          set rmsg on
          set obs 5000000
          gen x=mod(_n,10)
          egen y=tag(x)
          Make something like that.
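
          If the slowdown depends on how wide the dataset is, a hedged variant of the same test, padded with filler variables, might reproduce it:

          Code:
          set rmsg on
          clear
          set obs 5000000
          * filler variables to approximate a wide dataset
          forvalues i = 1/50 {
              gen v`i' = runiform()
          }
          gen x=mod(_n,10)
          egen y=tag(x)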



          • #6
            A few suggestions from my side, as I have been dealing with a dataset of more than 6 million observations and did not experience any issues. I worked on a MacBook Air laptop with 8GB of memory and an Apple M2 chip. What system environment are you on?

            - Try the -compress- command to compress the dataset; it will use less memory (see the sketch after this post).
            - Make sure you work on a local drive, not a network drive. A network drive often slows down processing depending on connection stability, and Stata has nothing to do with that.
            - Power the device from the mains rather than the battery while working (something I have found slowed my work in the past).
            Roman
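
            A minimal sketch of the -compress- step from the first bullet:

            Code:
            memory      // data memory in use before
            compress    // recast each variable to the smallest storage type that holds its values
            memory      // compare after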



            • #7
              The OP might be confusing free space on the hard drive with available RAM.

              OP, the speed at which Stata operates, and the size of the datasets it can handle, depend on how much RAM you have and, for speed, on your processor.

              How much free space you have on your hard drive is largely irrelevant, as long as you have enough space to save your data.
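
              A minimal sketch of how to check this from within Stata:

              Code:
              query memory    // max_memory and related settings
              memory          // how much of it the data currently occupy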



              • #8

                Thank you for all the suggestions. We have decided to focus on a subset of my dataset for now, so I currently do not need to work with all 5 million observations.

                If I have to work with all 5 million observations again and run into issues, I will get back to you.




                To answer some of your questions:

                Daniel Feenberg: Yes, what I meant was 34GB on the local disk. I tried your suggestion:

                Code:
                set rmsg on
                set obs 5000000
                gen x=mod(_n,10)
                egen y=tag(x)




                It took 2-3 seconds to run all the lines, which seems fast.

                The issue could be that I not only have more than 5 million observations but also around 225 variables. The command could not run fast on my dataset, yet it ran fast on the lines above.
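
                If the 225 variables are the bottleneck, one option is to load only the variables a task needs; a minimal sketch (the filename and variable names are placeholders):

                Code:
                * read a subset of variables straight from disk
                use VARX VARY VARZ using "mydata.dta", clear
                * or, with the full dataset already loaded, keep only what is needed
                keep VARX VARY VARZ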




                Thank you for the suggestions and help.

                Roman Mostazir, thank you for the insight about compressing.

                Joro Kolev, thank you for the insight about RAM.

                Since I am not using all 5 million observations for now, I do not need a solution at the moment; if that changes, I will return to you.




                Best regards,




