  • Cross-platform method for setting temporary directory?

    Having read the Stata FAQ and a variety of posts here, it seems that one cannot set the temporary directory from within Stata (e.g., in a do-file) — instead it has to be set as an OS-level environment variable.

    The data provider for a project I'm doing requires that the temporary directory be redirected to a directory within the encrypted disk for this project for security reasons. I cannot change this specific requirement, so if you're about to suggest I simply encrypt the boot volume and don't fuss over the default tempdir... thanks, but I tried that argument with the data provider and was unsuccessful.

    My shop is cross-platform (macOS and Windows). Is the best I can do to ensure compliance with this requirement to 1) create OS-specific scripts to launch Stata for this project; and 2) in every do-file for the project, use assert to check that c(tmpdir) is where it needs to be?

    I know how to do that. It just seems like a Rube Goldberg way to do it.
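
    For what it's worth, the guard I'd put at the top of each do-file would look something like this. The path is hypothetical, and the launch scripts would set the OS-level variable (STATATMP on Windows and, if I remember the FAQ correctly, TMPDIR on macOS) before Stata starts:
    Code:
    * required tempdir on the project's encrypted disk (hypothetical path)
    local required "/Volumes/ProjectVault/stata_tmp"
    if ("`c(tmpdir)'" != "`required'") {
        display as error "tmpdir is `c(tmpdir)'; expected `required'"
        exit 198
    }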

  • #2
    As the author of the Statalist post you linked to, I can assure you that I still have not found an alternative approach to the kludge I documented for Macintoshes.

    I'm going to add this to the Stata 17 wishlist, with a link to your post. Several issues arising from privacy and nondisclosure concerns are not well addressed by Stata.


    • #3
      I agree that it would be better if the user could specify the temporary directory from within Stata, but is that directory used for anything other than storage of user-specified temporary files? That is, anything other than
      Code:
      tempfile whatever
      save `whatever'
      use `whatever'
      and perhaps
      Code:
      preserve
      restore
      I really don't know, but if that's all the temporary directory is used for, then isn't a workaround simply not to use them? Other than perhaps append, is there much that you need to do along those lines that you can't do with frames?
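
      For instance, a preserve-style checkpoint can be kept entirely in memory with frames; a sketch, with a made-up frame name:
      Code:
      frame copy default checkpoint   // in-memory copy; nothing written to c(tmpdir)
      * ... destructive data management in the default frame ...
      frame change checkpoint         // switch to the untouched copy if needed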


      • #4
        Originally posted by Joseph Coveney View Post
        isn't a workaround not to use them?
        As long as you also avoid using any Stata command that makes use of them.


        • #5
          Joseph Coveney, preserve/restore uses frames in Stata 16/MP, but the default amount of RAM used is 1 GB; if you run out of RAM, it saves to disk, and other flavors of Stata save temp files to disk on preserve/restore. Far more vanilla Stata commands and user-written commands from SSC use preserve/restore than you probably think, so even if you never directly invoke preserve/restore, it's still likely that you're leaking data to your disk in tempfiles.

          In most contexts, that's not a problem. Tempfiles are, by definition, transitory and get cleared by your OS on the regular. But in highly regulated contexts, such as when you're dealing with personally identifiable health information or criminal justice information as I am, not knowing where every single bit of data will be stored is at best not great, and at worst a serious security risk.

          Ordinarily, my shop interacts with such data remotely, via the CLI or remote desktop sessions over secure connections to servers in my institution's datacenter that we control. Tempfile data leakage is one reason we do that, as well as paranoid-but-justifiable concerns about holding sensitive data in RAM on the local machine at all. This particular project will use data from a provider who requires that we air-gap the analysis machine, and that we take the additional precaution of redirecting tempfiles to the hardware-encrypted external device housing the data despite the boot volume (the default location for temporary directories) also being encrypted.
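
          For anyone auditing a setup like ours, you can at least see where any spill would land, and (if I have the setting's name right) raise the in-memory threshold on Stata 16/MP; the 4g figure is only illustrative:
          Code:
          display "`c(tmpdir)'"    // where tempfiles and disk-based preserves are written
          set max_preservemem 4g   // Stata/MP 16: let preserve hold more in RAM before spilling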


          • #6
            Originally posted by Troy Payne View Post
            preserve/restore uses frames in Stata 16/MP
            Didn't know that, thanks.

            Anyway, not to digress, but the kinds of activities Stata is best for seem as if they could be done further downstream, where the data wouldn't necessarily be personally identifiable. The sensitive data manipulation would be done in, for example, a relational database management system, which is typically designed with security more centrally in mind. The analyst working with Stata would be given access only to de-identified data via a view, stored procedure, user-defined function or the like.

            Again, I don't know what you and others are asked to do with Stata (such activities might involve hurdles utterly insurmountable without access to personally identifiable versions of the datasets), but there may be organizational solutions to data security more immediately available that still allow Stata to be used for what it's best for.


            • #7
              Originally posted by Joseph Coveney View Post
              The sensitive data manipulations and so on would be done in the context of, for example, relational database management systems, which are typically designed with security more centrally in mind.
              No argument; for a lot of Stata users, you're probably right. The truth of it is that we're social scientists, not DBAs or programmers, and working in an RDBMS would add complexity we don't need.

              We're an academic criminal justice research center. We don't create most of the data we work with; we partner with criminal justice agencies to do public policy research. We frequently have to, for example, take individual-level arrest data and calculate time-to-failure to the suspect's next arrest. For that sort of work, we 100% have to have PII to create the linkage between arrests.

              We work under both our university IRB and the regulatory framework our partners require. We deidentify data as quickly as we possibly can to mitigate risk, and we don't load PII into memory if we don't have to for the specific task at hand.

              It's better for replicability of our results for us to stick to one tool for both management and analysis where we can. Stata works very well in nearly every respect, and it's our primary tool.


              • #8
                I'm with Troy Payne on this. Separating "data preparation" and "data analysis" is a non-starter.

                Data that contains personally identifiable information is not restricted to, say, data with obvious identifiers like names, addresses, or social security numbers. The Wikipedia article on Personal Data tells us
                In the GDPR Personal Data is defined as:

                Any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.
                But note that those "factors specific to the ... identity of that natural person" are precisely the sort of data that social scientists study.

                With one set of survey data that I use, disclosure rules include restrictions that preclude publication of "summary statistics (including frequency distributions), tabulations, or graphs (including scatterplots and maps) that have cell sizes under 11 observations" as the price we pay for access to geographic identifiers below the state level. If I have two categorical variables with broad categories, I can produce one-way tabulations based on each. But if one combination of categories occurs but rarely, I then have to collapse categories to produce a two-way tabulation that would comply with the requirement.

                In this case, as the GDPR recognizes, what constitutes personal information is an emergent property in the analysis of unit-record data. Doing our work requires access to data that in some cases can be used to deduce characteristics of identifiable individuals, even lacking names, addresses, and other obvious identifiers. The requirement is to protect that data suitably and try to ensure that the results published from that data do not divulge personal data.
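
                As a practical matter, that kind of cell-size rule is easy to check mechanically before anything is published; a minimal sketch, with hypothetical variable names:
                Code:
                * region and offense are hypothetical category variables
                bysort region offense: gen long cellsize = _N
                assert cellsize >= 11   // fails if any two-way cell is below the disclosure threshold
                drop cellsize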


                • #9
                  Originally posted by William Lisowski View Post
                  Doing our work requires access to data that in some cases can be used to deduce characteristics of identifiable individuals, even lacking names, addresses, and other obvious identifiers. The requirement is to protect that data suitably and try to ensure that the results published from that data do not divulge personal data.
                  Yup. You'd think you could safely use town-day as a unit of analysis in a study of arrests. But I help run the Alaska Justice Information Center, and in my state it's not that unusual to have just one arrest in a town in a day. So if I report non-public details at the town-day level, I'm reporting data that can easily be reidentified using information in the public domain, which can violate professional ethics and, in some cases, the law surrounding criminal justice information disclosure. The conceptual concerns are complicated, and we'd rather focus on that complexity than on building complex IT infrastructure.

                  We also have to explain our infrastructure to non-technical agency leaders who are sometimes skeptical of outsiders doing research at all, and keeping our data in flat tables with a simple security model is very useful there.
