
    #16
    Originally posted by Clyde Schechter
    Re #12: Yes, if the included values should match on fpedats as well as cusip, then change -by(cusip)- to -by(cusip fpedats)-.
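
    For concreteness, here is a minimal sketch of what that call might look like. The key variable -edate-, the interval bounds -low- and -high-, and the file names are hypothetical placeholders; only the -by(cusip fpedats)- option comes from the advice above.

    Code:
        * Minimal sketch; -edate-, -low-, -high-, and the file names are
        * hypothetical placeholders. -rangejoin- (from SSC) pairs each
        * observation in memory with every observation in other_data.dta
        * whose edate falls in [low, high], matching within groups
        * defined by cusip and fpedats.
        use master_data, clear
        rangejoin edate low high using other_data, by(cusip fpedats)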

    Do try again with this. If -rangejoin- is truly freezing up your computer, it is due to memory issues. Using -by(cusip fpedats)- will somewhat decrease the memory needed, especially if many cusips have many different fpedats associated with them.

    But don't be so quick to assume that your computer is freezing up. With a data set this size, the computational task for -rangejoin- is very large, and it may just be taking a long time. There is no interim output from -rangejoin- to show that it is still running, and the way it works, it pretty much locks out attempts to do other things in Stata. My experience on this Forum is that people often seriously underestimate how long some calculations can take. If you are accustomed to calculations that complete in seconds or a few minutes, a calculation that needs to run for days or weeks may be hard to imagine. But such calculations exist, and calculations that require pairing up observations can fall into that range.

    If you are running on a Windows machine, here's a way to tell whether your computer is truly frozen, as opposed to just grinding out a very long calculation: open the Task Manager and find the entry for Stata. If the memory or CPU usage numbers are changing several times a minute, Stata is still running. If those numbers are not changing, there is a good chance Stata is frozen and you need to kill the process.

    I cannot think of a less computation-intensive way to do this. You might be able to speed up the -collapse- part (although I'm guessing you haven't even gotten to that yet, because the really slow part will be -rangejoin-) by replacing -collapse- with the user-written -gcollapse-, part of the -gtools- suite available from SSC. But I doubt that will make an enormous difference. So I think your options are either to be very patient or to find a faster machine to run on.
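
    If it helps, a minimal sketch of that swap, assuming (hypothetically) that your -collapse- step computes a mean of a variable called estimate; -gcollapse- takes the same syntax as -collapse-:

    Code:
        * One-time installation of the -gtools- suite from SSC.
        ssc install gtools
        * Drop-in replacement for -collapse-; the variable names are
        * hypothetical placeholders.
        gcollapse (mean) mean_est = estimate, by(cusip fpedats)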

    But I do think you may be up against a really difficult memory problem. That is, you may find yourself waiting a long time for this to run, only for it to end with an "op. sys. refuses to provide memory" error. One thing that would solve that problem is to break up your data set into a separate data set per cusip, or even per cusip-fpedats combination. Then run the code separately on each of those smaller data sets and, at the end, -append- all the results together. This will overcome the memory problem. It might also save time: the calculations for 100 cusips in a single file take much more than 100 times as long as the calculations for a single cusip in a single file (and likewise for cusip-fpedats combinations), because the work of pairing up observations grows faster than linearly with the size of the file. So calculation time will be substantially reduced this way. But you have to do a lot more disk thrashing, which will slow you down. I don't have a good feel for which of these effects will win out overall; in part it depends on your hardware.
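
    To make the split-run-append idea concrete, here is a minimal sketch. All file and variable names are hypothetical placeholders, cusip is assumed to be a string variable, and the -rangejoin- and -collapse- calls stand in for whatever you are already running.

    Code:
        * Minimal sketch of split-run-append; names are hypothetical.
        use master_data, clear
        levelsof cusip, local(cusips)    // distinct cusip values
        tempfile results
        local first 1
        foreach c of local cusips {
            * load only this cusip's observations
            use if cusip == "`c'" using master_data, clear
            * the slow pairing step, now on a much smaller data set
            rangejoin edate low high using other_data, by(cusip fpedats)
            * summarize the matched pairs (placeholder statistic)
            collapse (mean) mean_est = estimate, by(cusip fpedats)
            * accumulate the per-cusip results
            if `first' {
                save `results'
                local first 0
            }
            else {
                append using `results'
                save `results', replace
            }
        }
        use `results', clear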

    Hi Clyde,
    I divided up my data into several chunks and ran the command on them separately, and every result was correct. I really appreciate your help!! Thank you so much!
