  • Workstation Consolidation

    Good Morning,

    I have an IT-related question regarding hardware to support Stata.


    We have 15 Stata users, all on individual HP Z-Workstations, and want to consolidate these aging workstations onto on-premises VMs.

    Each workstation has two Intel E5-2650 v4 CPUs with 12 cores (24 with Hyper-Threading) and 96 GB of RAM.
    Stata/MP is used, and all cores are licensed.

    We want to move these users onto Windows Server 2019 VMs.

    Has anyone performed such a consolidation?

    Our question is about VM sizing and the number of VMs:
    • Do we create 15 VMs with specs similar to the workstations (vCPU and RAM)?
    • Do we create 5 VMs with much more resources and put 3 users on each?
    • Or some other configuration?
    I am the IT person on this team, not a Stata user. I wish I had metrics to help with sizing and to show where the current bottleneck is. I understand that some Stata scripts currently take a week to complete, but because these are standalone Windows 7 workstations, those metrics are unavailable.

    I expect there will be some trial and error as we tune this new configuration, but I hope we can start with a reasonable infrastructure to minimize the churn in the organization.

    Any suggestions or tips, either for planning the new infrastructure or monitoring and tuning once we go live?

  • #2
    We are mostly Linux here, so I can't give specific advice about VMs, but I do know that it is critical to know how much memory your users are using. If Stata requests more memory than is physically present, jobs will run so slowly that they may never finish. Users usually have no idea about memory requirements, but as administrator you can probably get a good idea of what is needed by looking at the size of the existing .dta files on your system. Stata will typically load an entire .dta file before doing any other work. This isn't a foolproof estimate, since some commands will need more memory, and some users may not load the entire .dta. But it is a start for your planning. You might find that a minority of users need much more memory than others.

    Note that if your users have jobs that take a week to complete, you might try promoting https://gtools.readthedocs.io/en/latest/ as that can sometimes provide dramatic improvements. You might also look at https://www.nber.org/stata/efficient/
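As a rough illustration of the file-size survey suggested above, here is a minimal Python sketch that walks a directory tree and reports the largest .dta file under each top-level folder. It assumes each user keeps data in their own folder under a common root; the function names (`dta_sizes`, `largest_per_top_dir`, `print_report`) are hypothetical, and on-disk size is only a starting estimate of the memory Stata will need.

```python
import os
from collections import defaultdict

def dta_sizes(root):
    """Yield (path, size_in_bytes) for every .dta file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".dta"):
                path = os.path.join(dirpath, name)
                try:
                    yield path, os.path.getsize(path)
                except OSError:
                    continue  # unreadable or vanished file: skip it

def largest_per_top_dir(root):
    """Map each top-level folder under root to the size of its largest .dta file.

    Files sitting directly in root are grouped under ".".
    """
    largest = defaultdict(int)
    for path, size in dta_sizes(root):
        parts = os.path.relpath(path, root).split(os.sep)
        top = parts[0] if len(parts) > 1 else "."
        largest[top] = max(largest[top], size)
    return dict(largest)

def print_report(root):
    """Print folders sorted by their largest .dta file, biggest first."""
    report = largest_per_top_dir(root)
    for folder, size in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size / 2**30:10.2f} GiB  {folder}")
```

Calling `print_report` on the root of a data share (the path is whatever your environment uses) gives a quick view of which users are likely to need the most memory.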


    • #3
      In reply to #2 (originally posted by [email protected]):
      Is the rule of thumb 1.5 times the dataset size for RAM?

      Could a dataset be 200-300 GB?
      This is what has been reported to us, but it's third-hand information, and I wonder if it's actually the size of a share on the file server. If a current dataset were 300 GB, would you want 512 GB of RAM on that VM?
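To make the arithmetic in this question concrete, here is a small Python sketch. The 1.5 factor is only the rule of thumb asked about above, not a confirmed Stata requirement, and rounding up to a power of two is just an assumed sizing convention; the function names are hypothetical.

```python
def ram_needed_gb(dataset_gb, headroom=1.5):
    """RAM estimate under the (unconfirmed) rule of thumb: dataset size times headroom."""
    return dataset_gb * headroom

def round_up_to_power_of_two(gb):
    """Smallest power of two (in GB) that is at least gb. Assumed sizing convention."""
    size = 1
    while size < gb:
        size *= 2
    return size

def max_dataset_gb(ram_gb, headroom=1.5):
    """Largest dataset that fits in ram_gb under the same rule of thumb."""
    return ram_gb / headroom

# A 300 GB dataset needs 450 GB under this rule, which rounds up to 512 GB.
print(ram_needed_gb(300), round_up_to_power_of_two(ram_needed_gb(300)))
# By the same rule, the current 96 GB workstations would suit datasets up to 64 GB.
print(max_dataset_gb(96))
```

Under these assumptions a 200 GB dataset (300 GB of RAM) lands on the same 512 GB size, so the answer is not very sensitive to which end of the reported range is correct.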


      • #4
        You can also contact Stata technical services by email at [email protected]


        • #5
          We have lots of experience with datasets up to about 1.5 terabytes. At that size, you have to be careful what you ask for: sorting, selecting, and regressing all work fine but may take hours, and some procedures are very slow. I ask users to experiment with subsets and extrapolate to the full dataset before starting jobs like that. Mostly users don't like to do that, but it helps a lot. Presumably your users have experience with the datasets they use. You could talk with the ones with 300GB datasets and see how they fit them into 96GB now. They have complicated procedures for dividing up the files and reassembling the results, and would be thrilled to avoid that hassle if they could have a larger workspace. Stata works very well with multiple users as long as they don't ask for more memory than is available. That makes sharing a large VM more attractive than having a private VM.
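One way to do the "experiment with subsets and extrapolate" step is sketched below in Python. It assumes runtime follows a power law in the number of observations, t = a * n^b, and fits it by least squares on the logs of a few timed subset runs. That assumption is mine, not something Stata guarantees, and it breaks down once the data no longer fit in memory. The function names and the timings in the example are hypothetical.

```python
import math

def fit_power_law(sizes, seconds):
    """Least-squares fit of log(t) = log(a) + b * log(n). Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in seconds]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def predict_seconds(a, b, full_size):
    """Extrapolated runtime for full_size observations under the fitted power law."""
    return a * full_size ** b

# Hypothetical timings from three subset runs (observations, seconds).
a, b = fit_power_law([10_000, 100_000, 1_000_000], [2.0, 20.0, 200.0])
print(f"exponent b = {b:.2f}")
print(f"predicted hours at 100 million observations: {predict_seconds(a, b, 100_000_000) / 3600:.1f}")
```

An exponent well above 1 on the subsets is a warning that the full job may take far longer than a linear guess would suggest.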


          • #6
            In reply to #2 (originally posted by [email protected]):
            Would you mind sharing your high-level Linux configuration?

            We are exploring the option of either Linux VMs or a Linux environment hosted on the mainframe.


            • #7
              We have several CentOS machines, the largest with 2TB of memory. All boot from a central server and share /usr. I don't recall any compatibility issues currently or in the past. We have not used VMs, simply because we couldn't see a significant upside to compensate for the increased complexity and slight loss of efficiency. As far as I can tell, VMs become necessary when incompetent programmers write applications that can't share with other applications, perhaps because they create port conflicts, or require libraries that are incompatible with other applications or would write over one user's files with another's. Stata doesn't make those mistakes. I believe VMs might allow us to move running jobs from one server to another, which would be a nice feature to have. If we are missing anything else, we aren't aware of it yet. Most people trying to convince us that we need VMs never get past the argument that "everyone is using VMs". I'd be interested in other arguments.
