  • Processing large dataset in Stata?

    I am using Stata/SE 17 and have 5,609,229 observations in my dataset. Even a simple command such as egen takes a long time to run.
    I tried freeing up memory, and my computer has a reasonable amount of it, but that made little difference.

    Is there any way to make my Stata commands run faster?

    Thank you; I look forward to your tips and suggestions.

  • #2
    Check out the user-written -gtools- suite; it substantially speeds up the type of operations you mention.
    https://gtools.readthedocs.io/en/latest/
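
    If you have not installed it yet, a minimal sketch of the drop-in usage (the variable names here are placeholders, not from your dataset):

    Code:
    ssc install gtools
    * gtools commands are the usual names prefixed with g
    gegen y = mean(x)        // instead of: egen y = mean(x)
    gegen t = tag(id1 id2)   // instead of: egen t = tag(id1 id2)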



    • #3
      You don't provide much information, such as the size of an observation, which egen functions are slow, or how slow is too slow. A simple egen, such as

      Code:
      egen y=mean(x)
      runs in about a fifth of a second on a dataset that size if there is sufficient memory. Some -egen- functions require sorting or searching and will be much more time-consuming.

      But I suspect you may be running out of physical memory. On Windows you can use the Task Manager to track memory usage while the Stata job runs. If Stata runs out of physical memory, it falls back on virtual memory, which is hundreds of times slower. -gtools-, suggested by Joro Kolev, is also very good, but it won't reduce memory requirements. I have some suggestions at http://www.nber.org/stata/efficient
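
      For instance, a minimal sketch for timing a single command and checking memory from inside Stata (the variable names are placeholders):

      Code:
      * time one command precisely
      timer clear 1
      timer on 1
      egen y = mean(x)
      timer off 1
      timer list 1
      * report Stata's current memory usage
      memory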



      • #4

        I really appreciate your suggestions. Thank you very much, Joro Kolev and Daniel Feenberg!
        I will try gtools.

        Daniel Feenberg, I used egen only as an example; I have several tasks to do on this dataset.



        For example, an egen call like

        Code:
        egen tag_varx = tag(VARX VARY VARZ)
        took over two hours; since the run would not complete, I stopped the task.

        Sometimes

        Code:
        egen tagVarX = tag(Varx) if VarH == 1
        took around 15 minutes.
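
        A minimal sketch of a by-group alternative to -egen, tag()- that sometimes runs faster, using the variable names from the first example above:

        Code:
        * mark the first observation in each (VARX, VARY, VARZ) group
        sort VARX VARY VARZ
        by VARX VARY VARZ: gen byte tag_varx = (_n == 1)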




        Other commands took 5 minutes, 10 minutes, or more, depending on the command.




        Merging a column from other datasets took especially long; I don't remember exactly how long (approximately 30-40 minutes).




        As you suggested, I will see about getting more memory for my computer. Right now, my laptop has 34 GP of free space.

        With kind regards,




        • #5
          I am surprised by the long run times given the tag example: here, it took less than one second to place 500,000 tags on a 5,000,000-observation dataset. I think something is going wrong on your end. I am not sure what 34GP could be; do you mean 34GB? That also seems unlikely in a laptop. Even 32GB is very unusual in a laptop, although it is possible. Can you create a small (10-line) test case that creates some data, runs a command, and takes an inordinate amount of time? My test case was:

          Code:
          set rmsg on
          set obs 5000000
          gen x=mod(_n,10)
          egen y=tag(x)
          Make something like that.
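
          If the slowdown depends on how wide the dataset is, a hedged variant of the same test, padded with filler variables, might reproduce it:

          Code:
          set rmsg on
          clear
          set obs 5000000
          * filler variables to approximate a wide dataset
          forvalues i = 1/50 {
              gen v`i' = runiform()
          }
          gen x=mod(_n,10)
          egen y=tag(x)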



          • #6
            A few suggestions from my side, as I have been dealing with a dataset of more than 6 million observations and did not experience any issues. I worked on a MacBook Air laptop with 8GB of memory and an Apple M2 chip. What system environment are you on?

            - Try the -compress- command to compress the dataset; it will use less memory (see the sketch after this post).
            - Make sure you work on a local drive, not a network drive. A network drive often slows down processing depending on connection stability, and Stata has nothing to do with that.
            - Power the device from the mains rather than the battery while working (something I have found slowed my work in the past).
            Roman
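
            A minimal sketch of the -compress- step from the first bullet:

            Code:
            memory      // data memory in use before
            compress    // recast each variable to the smallest storage type that holds its values
            memory      // compare after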



            • #7
              The OP might be confusing free space on the hard drive with available RAM.

              OP, the speed at which Stata operates, and the size of the datasets it can handle, depend on how much RAM you have and, for speed, on your processor.

              How much free space you have on your hard drive is largely irrelevant, as long as you have enough space to save your data.
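
              A minimal sketch of how to check this from within Stata:

              Code:
              query memory    // max_memory and related settings
              memory          // how much of it the data currently occupy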



              • #8

                Thank you for all the suggestions. We have decided to focus on a subset of my dataset for now, so I currently do not need to work with all 5 million observations.

                If I have to work with all 5 million observations again and run into issues, I will get back to you.




                To answer some of your questions:

                Daniel Feenberg: Yes, what I meant was 34GB on the local disk. I tried your suggestion:

                Code:
                set rmsg on
                set obs 5000000
                gen x=mod(_n,10)
                egen y=tag(x)




                It took 2-3 seconds to run all the lines, which seems fast.

                The issue could be that I not only have more than 5 million observations but also around 225 variables. The command could not run fast on my dataset, yet it ran fast on the lines above.
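
                If the 225 variables are the bottleneck, one option is to load only the variables a task needs; a minimal sketch (the filename and variable names are placeholders):

                Code:
                * read a subset of variables straight from disk
                use VARX VARY VARZ using "mydata.dta", clear
                * or, with the full dataset already loaded, keep only what is needed
                keep VARX VARY VARZ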




                Thank you for the suggestions and help.

                Roman Mostazir, thank you for the insight about compressing.

                Joro Kolev, thank you for the insight about RAM.

                Since I am not using all 5 million observations for now, I do not need a solution at the moment; if that changes, I will return to you.




                Best regards,




