
  • Do file execution randomly terminates when run on server

    Dear all,

    I am having a problem running a do file on the server. The server seems to terminate execution of the do file at random. More details follow.

    I wrote a do file that estimates a Tobit model repeatedly over values of mu and delta. Each takes integer values from 0 to 100 independently, which yields 10,201 Tobit estimations. I then store the important statistics from each estimate. Below is the important part of the code.

    capture erase model_results.dta
    postutil clear
    tempname results
    postfile `results' str10 type_lh mu rho ll b_mu_less b_mu_more b_rho_less b_rho_more intercept_only n_total n_uncensored n_left_censored n_right_censored chisq converged using model_results
    xtset subject
    forvalues j = 0(1)100 {
        forvalues k = 0(1)100 {
            qui xttobit target_sanction mu_max_less_`j' mu_max_more_`j' rho_max_less_`k' rho_max_more_`k' $controls, ll(0) ul(10)
            post `results' ("`i'") (`j') (`k') (e(ll)) (_b[mu_max_less_`j']) (_b[mu_max_more_`j']) (_b[rho_max_less_`k']) (_b[rho_max_more_`k']) (0) (e(N)) (e(N_unc)) (e(N_lc)) (e(N_rc)) (e(chi2)) (e(converged))
        }
    }
    postclose `results'
    use model_results, clear

    This code always runs to completion on my laptop. However, my laptop is not powerful, so I have only run up to 500 iterations to check that the code works, and it always does. (Note: to change the number of iterations, I simply change the range in the forvalues j or forvalues k command, 0(1)100.)

    However, when my co-author’s RA runs it on the server, the do file fails at random: for a smaller number of iterations (say 500), it sometimes runs to completion and sometimes fails. When it fails, the log file shows no error; it simply stops while estimating a Tobit specification for some value of mu and delta. Below is a snapshot of the log file where it stopped in the middle of a regression.

    82 | 5.397164 1.568425 3.44 0.001 2.323108 8.47122
    98 | 2.334637 1.600041 1.46 0.145 -.8013868 5.47066
    99 | .9528543 1.713668 0.56 0.578 -2.405874 4.311583
    100 | -13.05145 433.5157 -0.03 0.976 -862.7267 836.6238
    101 | -3.241807 1.833857 -1.77 0.077 -6.8361 .3524858
    102 | -6.82009 2.534712 -2.69 0.007 -11.78803 -1.852147
    |
    effdev | -1.094646 .1727472 -6.34 0.000 -1.433225 -.7560678
    lagpunishs~f | -.013456 .0937485 -0.14 0.886 -.1971998 .1702878
    _cons | 3.567182 1.664893 2.14 0.032 .3040523 6.830312
    -------------+----------------------------------------------------------------
    /sigma_u | 8.51e-17 .223141 0.00 1.000 -.4373483 .4373483
    /sigma_e | 3.267571 .2357784 13.86 0.000 2.805454 3.729688
    -------------+----------------------------------------------------------------
    rho | 6.79e-34 3.56e-18 0 1
    ------------------------------------------------------------------------------

    Obtaining starting values for full model:
    Iteration 0: log likelihood = -778.46696
    Iteration 1: log likelihood = -777.88898
    Iteration 2: log likelihood = -777.8827
    Fitting full model:
    Iteration 0: log likelihood = -656.29268 (not concave)
    Iteration 1: log likelihood = -491.30636 (not concave)
    Iteration 2: log likelihood = -446.49795 (not concave)
    Iteration 3: log likelihood = -427.56551
    Iteration 4: log likelihood = -407.98293
    Iteration 5: log likelihood = -406.09502
    Iteration 6: log likelihood = -406.05963
    Iteration 7: log likelihood = -406.05507
    Iteration 8: log likelihood = -406.05419

    My co-author’s RA is running the code on the server which I am not very familiar with, so I had asked him to provide a short description of the server and the command he is using to run it. Below is his answer.

    We have CentOS, which is a Linux distribution. The Stata installation is pretty much the same as described for Linux systems on the official website. We are running Nvidia GPUs on the server, which requires a command-line-only (no GUI) interface.

    For running a Stata do file in batch mode, I first go to the directory where Stata has been installed using cd command. I then start the batch run with

    ./stata < /path_to_do_file > /path_to where_you_want_the_log_file &

    PS - The & at the end instructs it to run in the background.

    Has anyone faced a problem running a do file on a server when it runs perfectly on a laptop? By the way, I do know that all comment lines need to be erased from the do file for it to be executed on the server. Any help would be appreciated, and I am happy to provide more details if needed.

    Thanks,
    Arjun
    Attached Files

  • #2
    Without speaking to the -tobit- specific aspect of your problem, and without inspecting either of your files (not many people here will load an attachment from a posting), I have some thoughts/questions, which might be moot were I or someone else willing to inspect your attachment. I also of course don't know what you might know or have tried, so the following may seem hopelessly simplistic to you.

    1) Is there any random number use prior to the failure? If there is and you didn't set the seed, some random quantity might be different on the server vs. the laptop.

    2) I'd put in some code to print out the values of mu and delta and anything else that varies (-display "mu = ..., delta = ..."-) to look for any patterns; perhaps that is implied in what you said above. Maybe even knowing the iteration number is relevant.

    3) You might try using -capture- on tobit and displaying the error:
    Code:
    capture tobit .....
    if (_rc != 0) di "tobit failed with a return code of " _rc

    4) Any chance different versions of Stata are being used?

    5) I wonder if the program is actually not failing in the estimation, but just in the display of results. That's one of the reasons I'd suggest all the various "echos" per above to try to get some more information about what is happening. Forcing output display other than just whatever -tobit- spits out might be helpful.



    • #3
      While I doubt this is what's going on with O.P.'s program, for what it's worth, I have had the experience when running very long analyses, that sometimes they are interrupted in the middle, at some random time, and the output just comes to an end with no error messages, because the IT people who administer the server have killed my processes and force logged me off. Typically this is done when they install Windows updates or something like that. The reason I doubt this is happening in #1 is that this sort of thing happens only once in a while, and does not recur each time I run the same code.
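      One way to tell after the fact whether a job was killed from outside, rather than stopping on its own, is to record its exit status. What follows is only a minimal sketch, assuming a POSIX shell (sh or bash); the child shell that kills itself is a stand-in for the long Stata job, and the variable names are made up for illustration. By shell convention, an exit status above 128 means the process was terminated by signal number (status - 128).

```shell
#!/bin/sh
# Stand-in for the long-running job: a child shell that receives SIGKILL,
# as it would from an administrator's cleanup or the kernel's
# out-of-memory killer. In real use this line would be the batch command.
sh -c 'kill -KILL $$' &
job_pid=$!

# wait returns the exit status of the background job.
status=0
wait "$job_pid" || status=$?

# A status above 128 means "terminated by signal (status - 128)".
if [ "$status" -gt 128 ]; then
    echo "job killed by signal $((status - 128))"
else
    echo "job exited on its own with status $status"
fi
```

      In this sketch the status is 137, i.e. signal 9 (SIGKILL). Writing that one line to a file next to the log would distinguish "killed by someone or something else" from "Stata itself exited".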



      • #4
        Let me address a potential issue with the operating system command.
        Code:
        ./stata < /path_to_do_file > /path_to where_you_want_the_log_file &
        redirects the "standard output" from Stata (stdout, in operating system terminology), but it does not redirect the "standard error" (stderr), which carries error messages from the operating system (rather than from Stata) that might provide additional information. If the RA has disconnected from the server while this job is running, any such error messages will - I believe - be lost.

        The syntax for redirecting the stderr output to the same file as the stdout output varies depending on what shell the RA is running their CentOS session under. There are many possibilities, with varying syntax for redirection. Here's an example from the tcsh shell on macOS.
        Code:
        lisowskiw 8% cat foo
        This is foo
        lisowskiw 9% cat foo > output1
        lisowskiw 10% cat output1
        This is foo
        lisowskiw 11% cat foox > output2
        cat: foox: No such file or directory
        lisowskiw 12% cat output2
        lisowskiw 13% cat foox >& output3
        lisowskiw 14% cat output3
        cat: foox: No such file or directory
        You see that output2 does not contain the error message that was sent by the cat command to stderr, but output3 does contain it. Had this sequence been run in the background, disconnected from the terminal, the error message would have (a) gone somewhere unexpected, (b) been emailed to the user's email account on the CentOS system, which is likely not an email account they read, or (c) gone into /dev/null, the Unix bit-bucket for discarded information.

        I'm afraid I'm not up-to-date on what the popular shells are on Unix systems these days so the RA may have to get help from a CentOS/linux/Unix expert to adapt this advice to the shell being used.
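        For comparison, here is the same demonstration as a sketch for POSIX sh and bash, on the assumption that the RA's CentOS account uses bash (the usual default there, but worth checking with `echo $SHELL`). In these shells, `2>&1` placed after the stdout redirection sends stderr to the same file. The scratch directory and the file names job.do and job.log in the comment are hypothetical.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# stdout only: cat's error message goes to the terminal, not into output2.
cat foox > output2 || true

# stdout and stderr together: the error message lands in output3.
cat foox > output3 2>&1 || true

echo "output2 holds $(wc -c < output2) bytes"
echo "output3 holds: $(cat output3)"

# The batch command would follow the same pattern (hypothetical file names):
#   ./stata < job.do > job.log 2>&1 &
```

        The order matters in these shells: `> file 2>&1` captures both streams, whereas `2>&1 > file` leaves stderr pointing at the terminal.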

        Added in edit: crossed with #3; that is one example of the circumstances that would potentially generate a message on stderr. I'm also thinking of resource constraints causing the program to fail. Also let me add that I endorse recommendation #2 above.
        Last edited by William Lisowski; 30 Jun 2022, 13:10.



        • #5
          Dear Mike, Clyde, William,

          My apologies for the late reply. Thank you so much for suggesting different ways to tackle the problem. I have now included -capture- on the Tobit command in my do file, and I will ask my co-author’s RA to follow William’s suggestion to capture stderr. Regarding Mike’s questions:

          1) The program uses no random numbers at any point.
          2) I had looked at the log files from several failed runs on the server to see whether there are any values of delta or mu at which it specifically fails, but I could not find any pattern. It seems to stop abruptly.
          4) This I don’t know, but can a version difference abruptly stop a Tobit estimation?
          5) I think it is failing during estimation, because later commands store the statistics in a data file with a new name, and those data files are not produced when it fails.

          There is one possible reason, which I had earlier dismissed but am now reconsidering, that may explain why the do file stops abruptly on the server: for some values of mu and delta, the Tobit model may not converge. However, 1) on my laptop, when convergence is not achieved, the program does not stop but moves on to estimate the Tobit for other values of mu and delta and produces the stored-statistics data file; 2) I ran 500 iterations on my laptop with no convergence issue for any value of mu and delta, but the exact same do file with 500 iterations stopped abruptly on the server; 3) the log file posted above does not show any convergence issue at the point where it stopped. Having said that, are there still ways a convergence issue could cause the program to terminate on the server but not on the laptop?

          Thanks a lot!



          • #6
            Two further items.

            1) I note that Stata's Getting Started with Stata for Unix PDF manual (included in all Stata installations, regardless of the operating system, and available through Stata's Help menu) recommends, in Appendix B.3, against the syntax the RA is using for running Stata jobs in batch.

            2) I have been assuming that the paths given in the RA's description are to filesystems local to the server on which Stata is being run. If instead they are network-mounted drives, or Dropbox, Google, or some other cloud drive, that can be a source of unexpected termination, as has been discussed elsewhere here, although I think that may usually include Stata error messages in the log when that happens.
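            A quick way to check the second point is to ask the system what kind of filesystem a directory lives on. This is only a sketch, assuming GNU df as shipped with CentOS (the -T option is not in every Unix df); the path "." is a placeholder for whichever directory holds the do file or the log, and the list of type names is illustrative rather than exhaustive.

```shell
#!/bin/sh
# Placeholder path: replace "." with the directory holding the do file or log.
target_dir="."

# -P forces one line per filesystem; -T (GNU df) adds a column with the type.
fs_type=$(df -PT "$target_dir" | awk 'NR==2 {print $2}')
echo "filesystem type of $target_dir: $fs_type"

# Illustrative classification only; other network filesystem types exist.
case "$fs_type" in
    nfs*|cifs|smb*|fuse*) echo "network or FUSE mount: worth ruling out" ;;
    *)                    echo "looks like a local filesystem" ;;
esac
```

            If the type turns out to be a network or FUSE mount, rerunning the job with the do file and log on a local disk would show whether the filesystem is involved.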

