  • Safe to compress users' files

    Hi all

    I am an administrator of a research lab that is running out of server space. I am considering writing a script that compresses all Stata files in users' directories.

    I have read the following post in this regard:
    https://www.statalist.org/forums/for...ll-stata-files

    I am wondering if it is possible to write a script that is 100% safe and does not change our researchers' files in any material way while also reducing size. In light of Daniel's comments in that post, my script will:
    1. Find the version of Stata and save it in that version.
    2. Save orphaned labels and e(sample). Daniel, can one use both options at the same time with save? I presume so; I will test this.
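
    For concreteness, here is a minimal sketch of step 2 for a single file. It assumes the orphans and all options of save can be combined (the syntax diagram suggests so, but treat that as untested), it works on a temporary copy of auto.dta rather than a real user file, and it leaves out step 1 (detecting and preserving the .dta format version).

    ```stata
    * Sketch only: compress one dataset and save it back in place.
    * The tempfile stands in for a hypothetical user file.
    sysuse auto, clear
    tempfile demo
    save "`demo'"

    use "`demo'", clear
    compress
    * orphans: also save value labels not attached to any variable
    * all:     also save e(sample), if estimation results are present
    save "`demo'", replace orphans all
    ```

    A real script would loop this over every .dta file it finds and would want a backup or dry-run mode before using replace.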

    daniel klein, do you think this will be safe? That is, can they continue their analysis without any code breaking, results changing, compatibility issues, and so on?

    By the way Daniel, thanks for writing encodelabel for me back in the day - it has been super useful over the years!

    Regards,
    Bruce


  • #2
    Running compress on a user's dataset within a Stata session, followed by save, should be safe in the sense of preserving the data.

    I just wonder about small side effects if and when a user's code assumes that a variable has a particular storage type.
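
    A small illustration of the kind of side effect meant here, using auto.dta. This is a sketch, not a claim about how common such code is: a do-file that checks for a specific storage type passes before compress and fails afterwards.

    ```stata
    sysuse auto, clear
    local before : type mpg        // storage type before compress
    compress
    local after : type mpg         // storage type after compress
    display "mpg was `before', now `after'"

    * A hard-coded type check that succeeded on the original file:
    capture confirm int variable mpg
    local rc = _rc
    display "confirm int variable mpg returned code `rc'"
    ```

    A nonzero return code on the last line means a do-file containing that check without capture would now stop with an error.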

    You may be thinking about general compression techniques such as zip, however.


    • #3
      Hi Nick.

      Thanks for your response.
      I am only considering the Stata command "compress". I won't be using Windows zipping.

      Hmm. This is a tough call, as we badly need space but researchers will be up in arms if code that previously ran to completion no longer does. I think it is very rare for someone to write code that refers specifically to existing variable storage types. I know that datetimes need to be doubles and that users sometimes like to generate a variable in a storage type they know, but I doubt they will later refer to the storage type. So if I compress the files using Stata's compress, I think it is quite unlikely to cause an issue.

      Thinking aloud, are there such commands that refer to the precise storage type? ds only has "numeric" or "string". I am trying to gauge how likely these side-effects are.

      All the best,
      Bruce


      • #4
        Is expanding your storage space an option, possibly in conjunction with enforcing a user disk quota? This might be easier to manage on the hardware or admin side rather than with Stata.


        • #5
          ds has options to select variables by storage type and display format.
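
          A minimal sketch of how that could interact with compress, using auto.dta: the varlist returned by ds, has(type int) is collected before and after, and the difference shows which variables a type-based selection would silently stop picking up.

          ```stata
          sysuse auto, clear
          ds, has(type int)
          local ints_before `r(varlist)'

          compress
          ds, has(type int)
          local ints_after `r(varlist)'

          * Variables that were selected as int before but are not any more
          local dropped : list ints_before - ints_after
          display "no longer int after compress: `dropped'"
          ```

          Code that loops over `r(varlist)' from such a call would still run after compression, but over a different set of variables.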


          • #6
            I do not think there is a perfectly safe way [Edit: in the sense that syntax scripts will never be affected]. From Stata's perspective, although no information is ever lost, using compress might actually change the data(set) and will do so if data can be compressed. This is what I get in Stata 17:

            Code:
            . clear
            
            . sysuse auto
            (1978 automobile data)
            
            . datasignature
              74:12(71728):3831085005:1395876116
            
            . compress
              variable mpg was int now byte
              variable rep78 was int now byte
              variable trunk was int now byte
              variable turn was int now byte
              variable make was str18 now str17
              (370 bytes saved)
            
            . sysuse auto
            no; dataset in memory has changed since last saved
            r(4);
            
            . datasignature
              74:12(71728):2155345365:1865188037
            Variables' storage types are considered data in the sense of datasignature and the data is considered changed after compression. Whether that will affect old syntax scripts is a different question. In most cases, probably not. Perhaps, you could notify users about your plans?
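
            If the aim is only to confirm that the values themselves are untouched, one possible check (a sketch, assuming cf compares values and ignores differences in storage type) is to compare the compressed data in memory against a copy saved beforehand:

            ```stata
            sysuse auto, clear
            tempfile original
            save "`original'"

            compress

            * cf compares the variables in memory with those in the file;
            * a return code of 0 means no mismatches were reported
            capture noisily cf _all using "`original'"
            local cf_rc = _rc
            display "cf return code: `cf_rc'"
            ```

            This complements datasignature rather than replacing it: the signature flags that something about the dataset changed, while cf speaks only to the values.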

            Note that Stata has a zipfile command that works beyond Windows. The command is used to implement the new .dtas format. Obviously, zipping the datasets will affect syntax scripts, as datasets need to be unzipped before they can be used (merged, appended, etc.).
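
            A minimal sketch of that round trip, with hypothetical file names written to the current working directory:

            ```stata
            * Sketch only: auto_copy.dta and auto_copy.zip are made-up names
            sysuse auto, clear
            save "auto_copy.dta", replace

            zipfile "auto_copy.dta", saving("auto_copy.zip", replace)
            erase "auto_copy.dta"

            * Any do-file that says -use auto_copy- now needs this step first
            unzipfile "auto_copy.zip", replace
            use "auto_copy.dta", clear
            ```

            The extra unzipfile step is exactly the change to users' scripts mentioned above.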
            Last edited by daniel klein; 30 May 2023, 01:21.


            • #7
              Originally posted by daniel klein:
              Note that Stata has a zipfile command that works beyond Windows. The command is used to implement the new .dtas format.
              Is it documented in the dta specs that version 121 is zip-compressed? I wasn't able to find that. Stata will also default to earlier, backwards-compatible formats if the new features aren't needed, and I don't believe those versions were ever compressed.


              • #8
                Note the difference between dta and dtas. The latter are a bundle of dta files, compressed into one zip-file/folder. See

                Code:
                help dtas


                • #9
                  Ah thank you for the clarification.


                  • #10
                    Hi all

                    I am always so impressed by the amount and quality of the feedback I receive here on Statalist. Thanks for your input. I didn't know about the .dtas format and will look into that as an option.

                    We do keep their syntax so it is an option to check to see if they refer to storage types, and if not go ahead with compression (with their permission).
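
                    A rough sketch of what such a check might look like for a single do-file. The pattern list is illustrative and certainly incomplete, and the do-file is a small made-up one written to a tempfile so the example runs on its own.

                    ```stata
                    * Write a tiny hypothetical do-file to scan
                    tempfile dofile
                    tempname fh
                    file open `fh' using "`dofile'", write text
                    file write `fh' "sysuse auto, clear" _n
                    file write `fh' "confirm int variable mpg" _n
                    file write `fh' "regress price mpg" _n
                    file close `fh'

                    * Illustrative patterns: ds has(type ...), confirm <type>,
                    * the ": type" extended macro function, and recast
                    local pattern "has\(type|confirm +(byte|int|long|float|double|str)|: *type |recast"

                    local hits = 0
                    local n = 1
                    file open `fh' using "`dofile'", read text
                    file read `fh' line
                    while r(eof) == 0 {
                        if regexm(`"`macval(line)'"', "`pattern'") {
                            display as text "line `n': " as result `"`macval(line)'"'
                            local ++hits
                        }
                        file read `fh' line
                        local ++n
                    }
                    file close `fh'
                    display "lines mentioning storage types: `hits'"
                    ```

                    A hit only means the line deserves a look by a person; it does not show that the code would actually break after compress.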

                    Kind regards,
                    Bruce
