
  • Can Stata be used to analyze very large datasets such as Revelio Labs?

    Hi! I have to use granular, individual-level employee data provided by Revelio Labs for my research. I was wondering whether Stata can handle this task. If not, do you have any recommendations on what platform or software I should use? Thanks!!

  • #2
    super large dataset such as Revelio Labs
    How large in gigabytes? Stata reads data into memory, so the amount of memory is typically the limit. The usual advice is to have available memory of about 1.5 times the size of the Stata dataset. Reduce the size of the data as early as possible, before using Stata for analysis (see -help limits- for the maximum numbers of observations, variables, and frames).



    • #3
      You need enough core memory to hold your data, but Stata will work fine with terabytes of data if you have that much physical memory on your computer. I have run with 800GB of data with no problem at all. You don't say how many observations you have, or what the size of an observation would be. You should probably make some calculations. If you have a large dataset, it is important to be aware of those details and tailor your programs to keep within your available memory.
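As a rough illustration of the calculation suggested above, here is a minimal Python sketch that multiplies the number of observations by the bytes per observation and applies the 1.5x headroom mentioned in #2. The byte widths are those of Stata's numeric storage types; the variable layout, the observation count, and the function names are hypothetical and only for illustration. String variables and any overhead Stata adds are ignored.

```python
# Width in bytes of Stata's numeric storage types.
STATA_BYTES = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}


def dataset_gb(n_obs, var_types):
    """Approximate in-memory size in GB (10**9 bytes), numeric variables only."""
    bytes_per_obs = sum(STATA_BYTES[t] for t in var_types)
    return n_obs * bytes_per_obs / 1e9


def fits_in_memory(n_obs, var_types, ram_gb, headroom=1.5):
    """True if the dataset times the headroom factor fits in ram_gb."""
    return dataset_gb(n_obs, var_types) * headroom <= ram_gb


# Hypothetical layout: one storage type per variable (an assumption, not Revelio's schema).
layout = ["long", "int", "byte", "byte", "float", "double"]
n_obs = 500_000_000  # hypothetical observation count

print(f"Estimated size: {dataset_gb(n_obs, layout):.1f} GB")
print("Fits in 16 GB:", fits_in_memory(n_obs, layout, ram_gb=16))
print("Fits in 12 GB:", fits_in_memory(n_obs, layout, ram_gb=12))
```

Replacing the layout with the actual storage types reported by Stata's -describe- gives a first estimate of whether the full file is feasible on a given machine.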

      Stata has the advantage that it supports byte variables, while the smallest variable type in most other software is float or even double. That makes a difference of a factor of 4 or 8 in many applications. However, some Stata procedures make a copy of the dataset with all numeric variables promoted to double. Ouch! One issue in planning is that Stata doesn't document the memory usage of those problematic procedures, nor does it offer a way to see the memory used by a procedure. You can watch your job running with Task Manager (Windows) or top (Linux) and see how much memory is used. I suggest starting with a small subset of the data and observing the memory usage. Then you can extrapolate to the full dataset size to see whether you are OK to run.
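The extrapolation step can be sketched in Python as follows. This is only an illustration under the assumption that peak memory grows roughly linearly with the number of observations; the function names and the measurements are made up, and the peak figures would come from what you read off Task Manager or top. The second function gives the size of a full copy with every numeric variable promoted to double (8 bytes each), which is the worst case described above.

```python
def extrapolate_peak_gb(n_small, peak_small_gb, n_large, peak_large_gb, n_full):
    """Fit a line through two subset runs and evaluate it at n_full.

    Assumes peak memory = fixed overhead + per-observation cost * n.
    """
    per_obs_gb = (peak_large_gb - peak_small_gb) / (n_large - n_small)
    fixed_gb = peak_small_gb - per_obs_gb * n_small
    return fixed_gb + per_obs_gb * n_full


def promoted_copy_gb(n_obs, n_numeric_vars):
    """Size in GB of a copy with all numeric variables stored as double."""
    return n_obs * n_numeric_vars * 8 / 1e9


# Hypothetical measurements from two subset runs of the same procedure.
estimate = extrapolate_peak_gb(
    n_small=1_000_000, peak_small_gb=0.5,
    n_large=2_000_000, peak_large_gb=0.9,
    n_full=100_000_000,
)
print(f"Extrapolated peak for the full data: {estimate:.1f} GB")
print(f"All-double copy, 6 variables: {promoted_copy_gb(100_000_000, 6):.1f} GB")
```

Using two subset sizes rather than one separates the fixed overhead from the per-observation cost. If memory use turns out not to be linear for a given procedure, the estimate will be off, so it is worth checking with a third, larger subset.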

      I have some other suggestions at https://www.nber.org/stata/efficient/memory.html
      Last edited by Daniel Feenberg; 05 May 2024, 08:21.

