  • How to detect user-written programs in do-files?

    Is there a reasonable way to detect user-written commands in Stata code?

    I lead a small (5-person, including me) group of criminologists. We're trying to ensure that our work is always 100% reproducible. One of the steps I'm trying to take is to package user-written ado-files with each project, to protect against changes from updated versions of user-written commands and the disappearance of commands from SSC and other sources.

    When writing code myself, I can just create a directory in my project's folder and add it to the adopath in the project's do-files. But sometimes I honestly forget what's user-written. And explaining to my team how to find the relevant ado-file, grab all of the files in the package, and copy those files into the project is difficult and error-prone.

    Ultimately, what I'd like is a program that will read the do-files in a project, determine what (if any) user-written files are used in the do files, create an ado folder in the project, and copy the user-written commands (and their helpfiles) there.

    I'm open to different approaches to solving this problem, so if you've solved this in some clever way, I'd love to hear it.

  • #2
    Questions about detecting user-written (nowadays: community-contributed) commands in (a)do-files and reproducibility come up from time to time. Here are some thoughts:

    There is no fully automated solution and it is unlikely that there ever will be. Why? Take the innocent-looking

    Originally posted by Troy Payne
    determine what (if any) user-written files are used in the do files [...] copy the user-written commands (and their helpfiles)
    That could already turn out to be much harder than you might realize at first glance. Community-contributed commands include at least one ado-file and often include one help file; they might also include several ado-files, several help files, possibly Mata libraries, and/or plug-ins, etc. These components are typically listed in the accompanying pkg-files. Community-contributed commands might also exist in more than one place, e.g., SJ and SSC, and the different versions might or might not be identical. Potentially different versions of community-contributed commands could be installed in different places on the same machine. Stata tracks the installation of community-contributed commands in trk-files, or at least it tries to. However, there might be several trk-files in different places on the same machine. Long story short: automatically identifying which version of a community-contributed command was used in a given piece of code is almost impossible.

    Some of the potential complications that I have just listed will rarely occur. In my own experience, and from following Statalist for about 10 years, it seems that changes to community-contributed commands corrupting replication is something that happens even less often. I believe the problem of not being able to replicate results because of changes to community-contributed commands is mostly hypothetical. In practice, the benefits from updates to community-contributed commands due to improvements or bug fixes will outweigh the potential damage almost every time. Note that StataCorp will not preserve bugs in their official commands, even under version control. Therefore, bug fixes to official Stata commands would "corrupt" replication, too. Yet, you probably never thought about backing up the BASE ado-files.

    In my experience, the problems of replicating earlier work that arise from changes to community-contributed commands are often exaggerated. I do agree that it is annoying to first have to locate and install missing community-contributed commands that others have used. I have outlined what I deem a reasonable solution to that problem here.
    Last edited by daniel klein; 22 Jul 2020, 00:18.



    • #3
      I also do not think that identifying user-written commands would be easy. At a fundamental level, Stata distinguishes between built-in commands and ado commands, but I do not think that Stata knows or keeps track of the distinction between StataCorp-written and user-written commands. Here are the properties of one built-in, one StataCorp-written, and one user-written command:

      Code:
      . which summarize
      built-in command:  summarize
      
      . which regress
      C:\Program Files (x86)\Stata15\ado\base\r\regress.ado
      *! version 1.3.2  28feb2018
      
      . which cmp
      c:\ado\plus\c\cmp.ado
      *! cmp 8.3.9 4 March 2020
      *! Copyright (C) 2007-20 David Roodman 
      *! Version history at bottom
      I understand the problem Troy describes--one might forget what one has manually installed on their system, and then things crash on another system. I do not think this problem is huge--even if one has forgotten, reading through the do file will reveal which commands are user written.
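
      The path reported by which can serve as a rough heuristic. The following is a minimal sketch, not a definitive test: it assumes that official ado-files live under the BASE directory and that anything found elsewhere is community-contributed, which fails if someone has copied files into Stata's own filespace. The command names are simply the three used above.

      ```stata
      * classify a few commands by where their ado-file lives
      * (char(92) is the backslash; paths are normalized to forward slashes)
      local base = subinstr(lower(c(sysdir_base)), char(92), "/", .)
      foreach cmd in summarize regress cmp {
          capture findfile `cmd'.ado
          if _rc {
              display "`cmd': no ado-file found (built-in, or not installed)"
              continue
          }
          local fn = subinstr(lower(r(fn)), char(92), "/", .)
          if strpos("`fn'", "`base'") == 1 {
              display "`cmd': under BASE, so presumably official"
          }
          else {
              display "`cmd': outside BASE, so presumably community-contributed"
          }
      }
      ```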

      Finally, as a side note that is irrelevant to this particular thread but relates to something Daniel mentioned: I did not know that version control does not bring back bugs that Stata has since fixed, i.e., if you set Stata to an older version with version control, you will not replicate the bug. This does occasionally create problems; see here:
      https://www.statalist.org/forums/for...t-not-in-15-16



      • #4
        Originally posted by Joro Kolev
        when Stata fixes bugs, the Version control no longer works, i.e., if you set the version of Stata to older version by version control, you would not replicate the bug.
        Been using Stata daily for nearly 15 years, and that was news to me too. Good to know.

        My original thinking here was that I’ll just instruct my team to create an /ado directory in each project, and they’ll plop any user/community-written files there, and add that directory to the top of the adopath in their code.
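
        A minimal sketch of that setup, assuming the do-file is run with the project root as the working directory and that the project contains a subfolder named ado (both are assumptions for illustration):

        ```stata
        * put the project's own ado folder at the front of the search path
        adopath ++ "`c(pwd)'/ado"
        * display the search path to confirm the project folder is searched first
        adopath
        ```

        The change lasts only for the current Stata session, so the line belongs at the top of each project's main do-file.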

        But making sure that we’ve got all of the files related to the command is difficult — having just written an ado myself that has a dialog box, the which command doesn’t list the .dlg file (or the help file, or other files in the .pkg the file was installed from).
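
        For packages that were installed with net install or ssc install, the ado command can help here, because it reports what is recorded in stata.trk rather than just the location of one ado-file. A short sketch, assuming a package named unique has been installed that way:

        ```stata
        * list the installed packages recorded in stata.trk
        ado dir
        * describe one package, including the files it installed
        ado describe unique
        ```

        Files that were copied into place by hand will not show up, since they were never recorded in stata.trk.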

        One (probably bad?) idea could be to gather up all of the commands we use and keep a mirror of them in a private git repo. Our code could reference which commit was used in the project in the comments. There’d be a bit of maintenance overhead (manually checking for updates) but it could work.



        • #5
          Let's accentuate the positive here: The reason for this problem is that you have found community-contributed commands to be helpful extensions of official Stata.

          If you change your adopath temporarily to look only in Stata's own filespace, then Stata won't be able to find community-contributed commands. That's not a great solution because your do-file will fail at the first command it doesn't understand. So you found one, but it's now harder to iterate. Commenting out calls to that command will typically break the do-file.
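
          A sketch of that experiment, assuming the community-contributed commands live in the standard SITE, PLUS, PERSONAL, and OLDPLACE locations, and using analysis.do as a hypothetical file name:

          ```stata
          * temporarily remove the usual homes of community-contributed commands
          foreach place in SITE PLUS PERSONAL OLDPLACE {
              capture adopath - `place'
          }
          * show what is left on the search path
          adopath
          * run the do-file; it stops at the first command Stata cannot find
          capture noisily do analysis.do
          ```

          The change affects only the current session; restarting Stata restores the default search path. The current directory may still be on the path, so ado-files sitting next to the do-file will still be found.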

          Another way to think about it is to make a list of what extras you have installed and then search your code for those names.
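
          A rough sketch of that idea, under several assumptions: everything extra is installed under PLUS, the do-file is called analysis.do (a hypothetical name), and a whole-word match anywhere in the file counts as a hit. The last assumption means false positives are likely, for example when a variable or option shares its name with an installed command, so the result is a list of candidates to check by eye.

          ```stata
          * collect the names of all ado-files under PLUS
          * (char(92) is the backslash; the path is normalized to forward slashes)
          local plus = subinstr(c(sysdir_plus), char(92), "/", .)
          local installed
          local subdirs : dir "`plus'" dirs "*"
          foreach d of local subdirs {
              local files : dir "`plus'/`d'" files "*.ado"
              foreach f of local files {
                  local name = subinstr("`f'", ".ado", "", 1)
                  local installed `installed' `name'
              }
          }
          * read the whole do-file into a string scalar, lowercased and padded
          scalar dotext = " " + lower(fileread("analysis.do")) + " "
          * flag installed names that occur as whole words in the do-file
          local used
          foreach c of local installed {
              if regexm(scalar(dotext), "[^a-z0-9_]`c'[^a-z0-9_]") {
                  local used `used' `c'
              }
          }
          display "possibly used: `used'"
          ```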

          These approaches will surely fail if anyone is in the habit of putting community-contributed commands in Stata's filespace.

          I don't do it myself but for each project a do-file that checks that each community-contributed command needed is already installed might be good discipline.
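
          Such a check could look like the following sketch. The command names are only examples taken from this thread; each project would list its own.

          ```stata
          * commands this project needs (example names)
          local needed unique cmp
          local missing
          foreach cmd of local needed {
              capture which `cmd'
              if _rc {
                  local missing `missing' `cmd'
              }
          }
          * stop early with a clear message if anything is missing
          if "`missing'" != "" {
              display as error "not installed: `missing'"
              exit 111
          }
          ```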

          In collaborating with others either we work on my machine or we work on theirs. Then we find out the hard way if they don't have something they need.

          But how big is this problem really? Thinking logarithmically I can easily imagine people using say 3 or possibly 30 others, but not 300 or 3000.




          • #6
            Originally posted by Nick Cox
            Let's accentuate the positive here: The reason for this problem is that you have found community-contributed commands to be helpful extensions of official Stata.
            No argument. Stata's extensibility and excellent user community is fantastic, and it's a large part of why we use Stata. I'm not unhappy to have this problem.

            Originally posted by Nick Cox
            But how big is this problem really? Thinking logarithmically I can easily imagine people using say 3 or possibly 30 others, but not 300 or 3000.
            Is it a deal-breaker? No. But in the past two weeks, we've had multiple instances of one team member using a command that others didn't have installed. It's not a huge issue in the moment, in that as you say we immediately discover the issue and resolve it.

            The real concern is five years from now, can we run the code? Commands disappear from SSC (and other repositories) from time to time. I'm looking for an automated way to package dependencies together with the project.

            Given the scale of the problem (relatively small) and the size of my team (also small), it sounds like a low-tech human-powered solution is best. During code review, we look for user/community-written code and add those commands to the project manually when the analyst hasn't done so. I was hoping for a more automated approach.



            • #7
              Commands disappear from SSC (and other repositories)
              Let's split this up:

              SSC: rarely (Years ago, I asked Kit Baum to delete some then useless stuff, but never heard a cheep of puzzlement or protest. Naturally, many people contribute to SSC.)

              Stata Journal: never.

              Other places: I can't easily comment. This is where problems seem most likely.

              My impression is that, over the long term, the Stata community does pretty well on this score.



              • #8
                That's fair about SSC... it's probably more stable than I give it credit for. I'm still nervous about external dependencies, however.

                Another issue for us is that we frequently have to work on air-gapped machines and machines with very restricted internet access. That's due to data security issues relating to criminal justice information and other sensitive research data. We do occasionally use user/community-written commands on those machines, and as you can imagine we're quite careful about that.



                • #9
                  Originally posted by Troy Payne
                  One (probably bad?) idea could be to gather up all of the commands we use and keep a mirror of them in a private git repo. Our code could reference which commit was used in the project in the comments. There’d be a bit of maintenance overhead (manually checking for updates) but it could work.
                  I don't have a solution to your problem, but since you mention writing your own ado-files and packages, it would be a good idea to put those in a Git repo to secure version control and to have "approved sources" from which your team can pull. They need only update as needed, and you can also revert to older versions if necessary.



                  • #10
                    Originally posted by Leonardo Guizzetti

                    I don't have a solution to your problem, but since you mention writing your own ado-files and packages, it would be a good idea to put those in a Git repo to secure version control and to have "approved sources" from which your team can pull. They need only update as needed, and you can also revert to older versions if necessary.
                    Right. So my thinking on this so far is that I could create a git repo for tools my team uses, and have them net from there. But that requires setting up stata.toc and package files myself, or using ssc copy to grab them, and that is kinda tedious.

                    There is a solution using net, however. If I set my git repo to be the net ado folder and then use net install, Stata installs the package into the git repo folder. Stata also creates stata.trk there, which is how Stata knows what packages are installed and what their distribution dates are; it's what makes adoupdate work.

                    So then, as an example....

                    Code:
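                    * install into the project's git repo instead of PLUS
                    net set ado my_git_repo_path
                    * fetch the package; net install also records it in stata.trk
                    net install unique.pkg, from(http://repec.org/bocode/u/)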
                    My team and I can copy that into each project, and I can set that up as an initialization script that we prepopulate in each new project's repo. There's some overhead each time a command is added to the repo, but the state of user/community-written programs will be preserved in amber for that project (or, if the user wants, updated within the git repo, allowing for version control).

                    The advantage of using net install over ssc copy is that net install creates stata.trk, which is how adoupdate knows what to update. That is, if I do the above then

                    Code:
                    * reports packages with updates available; add the update option to install them
                    adoupdate, dir(my_git_repo_path)
                    the commands in the git repo will update just as they would have had they been installed in the PLUS or PERSONAL folders with ssc install. So really, the only maintenance overhead is at the initial install step.

