
  • Data being destroyed while saving

    I'm having a very weird problem with data being destroyed while writing/reading into Stata-MP on a slurm cluster.

    After data cleaning, I run isid and everything is fine; then I save the file as a .dta. But when I immediately read the exact same file back in, isid fails. One third of my 66M records have their ID values erased in the process. One of the ID variables is numeric and the other is a float (formatted as a date). Has anyone experienced this problem? Is this likely a server problem, or is there something about the ID variables themselves that I could be unclear about?


    Thank you. Explanatory code below. Unfortunately, my data are private so I can't share a sample.


    Code:
      
    
    . isid hhd_id date_DT  
    
    .  
    . * write  
    . save $use_data/household_days, replace  
    file /projects/project_name/data/use_data/household_days.dta saved  
    
    .  
    . use $use_data/household_days, clear  
    
    .  
    . isid hhd_id date_DT  
    variables hhd_id and date_DT do not uniquely identify the observations r(459);

  • #2
    No, you must share a sample. I don't care if the data are private, the FAQ explicitly says for you to either change the numbers or take measures such that the variables aren't their real values. Or, to get your problem to reproduce on a toy dataset.


    Thus far, I can't tell what the issue is aside from that duplicates exist. Show me what
    Code:
    duplicates report hhd_id date_DT
    returns

    Edit: the likely solution lies in all the code you wrote before you got to this point. So we'll need more code, and ideally data, to answer this.
    Last edited by Jared Greathouse; 11 Jul 2022, 15:41.



    • #3
      My assumption is that this has something to do with the server to which you are writing/reading, rather than any problem with code or data.

      I cannot replicate your problem when saving to a local device -- the code below returns no errors when run on my machine:

      Code:
      clear
      set obs 50000000
      egen id = seq(), from(1) block(100000)
      bys id: gen date = _n
      isid id date
      
      global dir "C:\Users\username\Desktop"
      
      save $dir/test, replace
      use $dir/test, clear
      
      isid id date



      • #4
        You might try changing your code to write to and read from a temporary file, which (barring a truly peculiar Stata configuration) should be guaranteed to exist on a local filesystem for performance reasons, or at least, on a different filesystem than your permanent dataset. In post #3 Ali Atia makes the important point that his example is run on a filesystem attached to his computer, rather than on a network drive, or on some cloud device.

        Code:
        isid hhd_id date_DT  
        tempfile foo
        save "`foo'"
        use "`foo'", clear  
        isid hhd_id date_DT
        Last edited by William Lisowski; 11 Jul 2022, 18:52.



        • #5
          Hi - Thank you very much for these helpful responses.

          I have not been able to replicate the problem with William's suggestion of using tempfiles, which gives me greater confidence that it has to do with how the data are being written to disk storage. Moreover, I learned that it is actually deleting all variables for this subset of observations, not just the identifying variables or variables of a particular type. I've also replicated the problem across MP and SE, in case it was some type of data race scenario.
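
          One way to size up the damaged subset after reloading is sketched below. It is only a sketch: hhd_id and date_DT are the variable names from post #1, and it is an assumption that the erased values appear either as missing or as zeros, so both cases are counted.

          ```stata
          * Sketch only: assumes erased values show up as missing or as zeros.
          use $use_data/household_days, clear

          * Records whose two identifiers are both missing after the reload
          count if missing(hhd_id) & missing(date_DT)

          * Records whose two identifiers were both zeroed out instead
          count if hhd_id == 0 & date_DT == 0

          * Number of surplus copies among the ID pairs
          duplicates report hhd_id date_DT

          * Missing-value counts for the numeric variables
          misstable summarize
          ```

          If the two counts are both zero while isid still fails, the erased values take some other form and the duplicates report is the place to start looking.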

          Unfortunately, my codebase is such that I can't use `tempfiles' at this juncture so I will update here if I can come up with a better answer.



          • #6
            Originally posted by Jack Reimer
            Code:
            * write
            . save $use_data/household_days, replace
            Try saving the cleaned-up dataset under a new name.

            That is, for the sake of I/O efficiency, maybe your SLURM cluster is feeding you back a more accessible cached instance of the original uncleaned dataset that has all of the problems that you thought that you had just taken care of.

            You can also check the datasignature to see whether the dataset you're using is the same one that you just saved. Inserting a few lines into what you posted above:
            Code:
            isid hhd_id date_DT  
            
            * write
            datasignature
            local new_datasignature `r(datasignature)'
            
            save $use_data/household_days, replace  
            
            use $use_data/household_days, clear
            
            datasignature
            assert "`r(datasignature)'" == "`new_datasignature'"
            
            isid hhd_id date_DT



            • #7
              Originally posted by Jack Reimer
              a greater credence that it has to do with how it is being written to disk storage.
              You can use the dataset signature features to look into that, too, for example, with datasignature set and datasignature confirm.
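
              A minimal sketch of that approach, assuming the same variable names and path macro as in post #1. datasignature set stores the signature inside the dataset, so it is written out by save and can be checked against the data that come back from use.

              ```stata
              isid hhd_id date_DT

              * Store the signature in the dataset itself so it travels with the saved file
              * (reset allows overwriting a signature that was set earlier)
              datasignature set, reset

              save $use_data/household_days, replace
              use $use_data/household_days, clear

              * Stops with an error if the reloaded data no longer match the stored signature
              datasignature confirm
              ```

              A failure at datasignature confirm would indicate that the data changed somewhere between save and use, before isid is even run.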



              • #8
                my codebase is such that I can't use `tempfiles' at this juncture
                It was not my suggestion that you actually use a tempfile for more than what it accomplished - verifying that the problem is somehow related to the filesystem to which the data is being written.

                If it is being written to a network filesystem or a filesystem linked to cloud storage, then reports of problems like yours are common on Statalist. If you need to store your data in that particular location, I think you would be well advised to write the data to a standard filesystem (not networked or linked to cloud storage) and, at the end of your job, use the Linux mv command, rather than Stata processes, to relocate the dataset to its ultimate destination.

                If you think the filesystems reported in the output of the save commands from the code in posts #1 and #5 are both standard filesystems on the slurm cluster, then you might contact the cluster support team and ask them what the difference is between those two filesystems that accounts for the problems you are seeing.
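
                The write-locally-then-move idea might look roughly like the sketch below. The path /tmp/household_days.dta is a hypothetical node-local location chosen for illustration; whether /tmp is actually local on a given cluster is an assumption to check with the support team.

                ```stata
                * Sketch only: /tmp/household_days.dta is an assumed node-local path.
                local scratch "/tmp/household_days.dta"

                * Write to the local filesystem first
                save "`scratch'", replace

                * Confirm the local copy reads back intact
                use "`scratch'", clear
                isid hhd_id date_DT

                * Relocate the file with the operating system's mv rather than with Stata
                shell mv "`scratch'" "$use_data/household_days.dta"
                ```

                The move could equally be done in the job script after Stata exits, which keeps Stata out of the transfer entirely.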



                • #9
                  Hello,

                  Ultimately, saving under a new name, as Joseph suggested, remediated the issue. I was able to confirm that the problem lies in the save process, not the `use` process, because it was already apparent when I instead read the dta file into R.

                  Quite odd, but these suggestions were all very helpful and resolved the issue.
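
                  For anyone hitting the same thing, the workaround amounts to something like the sketch below. The name household_days_v2 is hypothetical; any name that does not already exist in the target directory should serve.

                  ```stata
                  * Sketch only: household_days_v2 is a hypothetical new filename.
                  isid hhd_id date_DT

                  * No replace option: save stops with an error if the file already exists,
                  * so an older file of the same name cannot be silently overwritten
                  save $use_data/household_days_v2

                  use $use_data/household_days_v2, clear
                  isid hhd_id date_DT
                  ```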

