  • New feature for version controlling user-written packages

    I want to share a new feature for version controlling user-written packages. It reduces the task of managing ado-paths to a single line of code. Read more in this blog post.

    The workflow this new feature implements is not new. However, we have not seen a big uptake of this workflow in the research community, despite it being promoted by us in DIME Analytics as well as by other promoters of research best practices in the Stata community. We believe this is because managing the ado-path is too technically daunting for the regular researcher. The motivation for this new feature was therefore to hide that complexity behind a command option that, to the researcher, is a single, non-daunting line of code.

    This workflow has the user set the PLUS folder to a folder created in the same location as the rest of the project code. We refer to this folder as the "project ado-folder". The project ado-folder should then be included when sharing the project's code with team members and others. When other users run the project's code, they will then use the exact versions of the commands installed in the project ado-folder.
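    For readers who want to see what such a setup amounts to, here is a minimal sketch using built-in commands; the path and folder name are made up for illustration, and this is not the feature's actual implementation:

    ```stata
    * Hypothetical sketch: point the PLUS system directory at a project ado-folder.
    * The path and folder name are illustrative, not the feature's defaults.
    local project_ado "C:/myproject/ado"
    capture mkdir "`project_ado'"
    sysdir set PLUS "`project_ado'"

    * Installations now land in the project ado-folder, so the exact command
    * versions travel with the project's code when it is shared.
    ssc install estout
    ```

    Team members who receive the project would run the same sysdir set PLUS line and pick up the shipped command versions.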

    In "strict mode", which is intended to be the standard use case, this workflow also removes all other ado-paths apart from BASE. This guarantees that all user-written commands needed to reproduce the project's results are indeed included in the project-specific ado-folder. Read more about this in the blog post linked above and in the technical resources linked from that post.
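    Purely as an illustration of the idea behind strict mode (the feature's real implementation may differ), the built-in adopath command can remove the remaining search paths:

    ```stata
    * Hypothetical sketch of strict mode: remove every ado-path except BASE
    * and the project PLUS folder, so only shipped commands can be found.
    adopath - PERSONAL
    adopath - SITE
    adopath - OLDPLACE
    adopath - "."
    adopath        // list what remains
    ```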

    Any feedback from this community on this new feature is much appreciated. You are also more than welcome to promote this new feature in your channels.


  • #2
    I started reading the first linked item and got no further than the description that SSC is

    Stata’s standard platform for distributing user-written commands
    This is unfortunately likely to be misleading. Stata has no such single standard platform in any strict sense. Henceforth I use the terminology "community-contributed" interchangeably with the older "user-written": there is no difference otherwise.

    1. The community is welcome to submit code together with articles to the Stata Journal and if published such code will be distributed via the Stata Journal's website. There is no barrier to authors publishing in the Stata Journal but maintaining their software elsewhere.

    2. SSC is a community initiative started and maintained by Kit Baum over more than 25 years. It is acknowledged in a friendly and positive spirit via the ssc command in official Stata but it is not the standard platform in any other official sense.

    3. Many programmers prefer to maintain their work via GitHub.

    4. Other programmers prefer to post their work on their own websites. Indeed, when net awareness was introduced by StataCorp many years ago the guess was that this would be how most user-programmers would prefer to work!

    This set of arrangements is undoubtedly complicated but exists for good reasons, including personal convenience. For example, I use the Stata Journal and SSC and have never used methods #3 or #4 but I strongly respect those making different choices.

    The general problem of reproducibility can be time-consuming and even serious, but if authors use out-of-date versions of software they are exacerbating the problem. Often this is unwitting and unavoidable: for example, it can be a few years between doing most of the work and your work being finally published and none of that stops other people revising their code in that interval.

    We can argue whether SSC should have been based on maintaining different versions of commands indefinitely. That is possible in a limited sense but only by changing the names of commands. Having posted many versions of many packages over the time span of SSC I can't recall a serious instance in which the disappearance of an older version was flagged to me as problematic. On the whole I believe that this simplicity is in most researchers' best interests, where researchers includes all students downloading community-contributed commands.

    The above is authoritative in these senses: I am an Editor of the Stata Journal and have been involved with SSC since the beginning, being originally a joint author of the ssc commands. Naturally it can't be a complete description of the terrain.

    All that said, Kristoffer's post flags a way to reduce these problems that may match your own style of working. Needing to install a community-contributed package to manage your use of community-contributed packages is a nice touch!

    FWIW, I don't believe that Stata is unusual here. R and Matlab, among other software, have a central core and a great variety of places you can install extras from, and authors often use their own code that is not always made public. There are people with longer memories, but I spent too much time in the mid-1970s rewriting other people's Fortran programs to adjust for different versions, features not supported by the local compiler, and idiosyncratic output subroutine or function calls.




    • #3
      Originally posted by Nick Cox:
      FWIW, I don't believe that Stata is unusual here. R and Matlab among other software have a central core and a great variety of places you can install extras from and authors often use their own code that is not always made public.
      Stata differs from R in that old Stata versions are no longer available (purchasable) once a new version is released. From this perspective, you could say that if the software used to write community-contributed commands is proprietary and no longer available, there is no way anyone can (legally) guarantee reproducible results.

      I have stated elsewhere that I do not see any merit, scientifically or otherwise, in reproducing wrong or erroneous results because of bugs in older versions of the software. Overall, I believe that obtaining erroneous results unknowingly because of using outdated software is more problematic than not being able to do so.



      • #4
        Indeed: complexities and contradictions abound here.

        * I don't think older versions of R, either core or particular packages, are easily accessible either.

        * The Stata Technical Bulletin and Stata Journal have taken the idea of an archive more seriously than most places -- for more than 30 years now.

        * I can't myself check that a command will run in any version of Stata other than very recent ones. But that's chicken and egg, as there has been no personal imperative or incentive to copy previous installations from machine to machine. If someone reports that a command of mine won't work in Stata 14, or whatever, there is usually something that can be done directly. But I don't have Stata 14 to hand to check myself.

        I guess that Daniel will agree that the reproducibility problem is much more general than whether previous authors were bitten by a bug that has subsequently been fixed. The questions usually start with: What did the authors do exactly, because their results appear puzzling if not absurd? And the answer often lies in some data management issue.
        Last edited by Nick Cox; 10 Apr 2023, 06:06.



        • #5
          I don't know about Matlab, but R and Python have been developing solutions for version-controlling packages for a while, and this is key in ensuring that code in production is not affected when non-backward compatible updates are made. For both software, you can also easily install older versions. When I started learning Python, both Python 2.7 and Python 3 were widely used.



          • #6
            Luiza Andrade Thanks for #5.

            That's about the core or whatever Python calls it, is it not? The main issue here is with extras, which tend to be nearer the research frontier, and more labile.



            • #7
              Nick Cox Yes, that's right. For handling package versions, you would typically have separate environments for each project or for different groups of projects. I believe there is also a workflow in pip for version control, which I am not super familiar with. In R, there is a package called `renv` that tracks the version of each package used, regardless of its source (CRAN, GitHub, etc). The workflow for `renv` is very similar to setting the plus ado path to each project directory, which is what Kristoffer wrote a wrapper for.
              Last edited by Luiza Andrade; 10 Apr 2023, 10:31.



              • #8
                Originally posted by daniel klein:
                I have stated elsewhere that I do not see any merit, scientifically or otherwise, in reproducing wrong or erroneous results because of bugs in older versions of the software. Overall, I believe that obtaining erroneous results unknowingly because of using outdated software is more problematic than not being able to do so.
                I agree with this, but isn't comparing the different versions of the software you used to create inconsistent results the only way of knowing which one of them (if any) had bugs?



                • #9
                  Originally posted by Luiza Andrade:

                  I agree with this, but isn't comparing the different versions of the software you used to create inconsistent results the only way of knowing which one of them (if any) had bugs?
                  That would be a rather poor strategy for identifying and correcting potential bugs. A more direct and proactive approach is to use something like test-driven development or related strategies: design for known cases where code should and shouldn't work, and ensure the behaviour is as expected. I've seen everything from packages where any kind of proactive approach would have caught simple errors, to packages with very robust test suites. In Stata, this idea is what -cscript- was designed for, and it's how StataCorp tests its own commands. It is not the only way, but it provides a convenient framework, at least as a starting point.
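                  For readers unfamiliar with -cscript-, a minimal certification script could look like the sketch below; -mycmd- and the expected results are made up for illustration:

                  ```stata
                  * Hypothetical certification script for a made-up command -mycmd-.
                  cscript mycmd_test adofile mycmd

                  * A case where the command should work, with a checkable stored result:
                  sysuse auto, clear
                  mycmd price
                  assert r(N) == 74

                  * A case where the command should fail, with the expected error code:
                  rcof "noisily mycmd" == 100
                  ```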



                  • #10
                    I don't think this is about bugs at all, so I agree with Leonardo and Luiza here. User-written commands support a massive share of work in Stata, and I can't remember ever seeing a complete replication package that doesn't use at least one. In other languages, package managers and repositories archive prior versions of these kinds of dependencies. They then allow setups like the screenshotted "requirements.txt" file, which will install the correct package versions and ensure the code will always run. In Stata, there is no manager or repository that does this. It is very easy for developers to make breaking changes to user-written commands, even just through syntax changes -- we've even done it in major version updates!

                    Therefore, the basic problem is that writing "ssc install xxx" doesn't work as reproducible code if the user-written package has changed since publication, and there's no way to identify or retrieve the working version given the Stata dev ecosystem. This is regardless of whether there was any true error in the user-written package. So, in order to provide a persistent reproducibility package to, say, the AEA, the easiest way to achieve this functionality is to include the complete code of the user-written commands. This tool accomplishes that in a very simple fashion!
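                    As a sketch of how such a package might be wired into a run (the folder name and global are illustrative assumptions, not a journal requirement):

                    ```stata
                    * Hypothetical master do-file fragment for a replication package that
                    * ships the user-written commands it was run with in an "ado" subfolder.
                    global repl_root "C:/replication_package"
                    adopath ++ "${repl_root}/ado"   // search the shipped ado-folder first
                    which estout                    // resolves to the shipped copy, if one is included
                    ```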


                    Attached Files
                    Last edited by Benjamin Daniels; 10 Apr 2023, 11:48.



                    • #11
                      Originally posted by Leonardo Guizzetti:

                      That would be a rather poor strategy for identifying and correcting potential bugs. A more direct and proactive approach is to use something like test-driven development or related strategies: design for known cases where code should and shouldn't work, and ensure the behaviour is as expected. I've seen everything from packages where any kind of proactive approach would have caught simple errors, to packages with very robust test suites. In Stata, this idea is what -cscript- was designed for, and it's how StataCorp tests its own commands. It is not the only way, but it provides a convenient framework, at least as a starting point.
                      Yes, that makes perfect sense when you are the developer. But it's hard to make sure that all community-contributed commands follow the same quality-assurance standard. And if you are only using a package and your results change, you should still be able to tell whether they changed because your data changed or because the code changed. So you need to know what version of a command or package you were using to be able to rule that out.
                      Last edited by Luiza Andrade; 10 Apr 2023, 11:51.



                      • #12
                        Thank you all for your feedback.

                        I would like to keep the discussion in this thread about the tool I shared. I would like to refer anyone who is not convinced that reproducibility in general is important to this overview, where there are links to further reading. Though I think this is something we all agree on.

                        I am not saying that version controlling user-written commands is the only thing, or the most important thing, needed for reproducibility. However, I would like to refer anyone who is not convinced that this is one important component of reproducibility to, for example, the American Economic Review’s submission requirements, which say “Our policy does require […] that “programs” be archived at the AEA Data and Code Repository”.

                        I would like the rest of the discussion to be about the original topic. Is this tool useful when creating the reproducibility packages required by the top journals? If not, what can be improved? I can’t stop anyone from using this thread to discuss what top journals should or should not require, but I do not think it is helpful for me to engage with those comments.

                        I appreciate the cases you have highlighted where Stata cannot achieve this level of reproducibility (very old versions of Stata etc.). One day there might be a solution to those extreme cases as well. In the meantime, we are happy to have made it easier for users to satisfy these requirements in the more regular cases.

                        I will stay away from commenting more on SSC. However, I want to say that this feature is not only for user-written commands from SSC. The blog post also mentions net install which is used when installing commands directly from GitHub. The requirement from these top journals is that programs installed from GitHub are also included in the reproducibility package, as files and resources on GitHub can be moved or deleted.



                        • #13
                          Lars Vilhuber, the American Economic Association's Data Editor, has been working on a project that makes a daily snapshot of the SSC Archive. He has run this for more than a year. His updated command, 'ssc2', allows you to say something like

                          Code:
                          ssc2 install cmp, date(2022-01-07)

                          which installs version 8.6.7 of that routine. If installed from ssc (or from ssc2 without the date() option), Stata installs version 8.7.5 of that package. Conceptually, this would provide the ability to collect the versions of materials from SSC that were in use as of a prior date.

                          This is still under development (in particular, it should be possible to only "snapshot" changed files, rather than the entire contents), but it does provide access to superseded versions of the code.

                          Kit



                          • #14
                            Thanks Kit!

                            It is Lars Vilhuber's work in general that has been an inspiration for this new feature. He has told us about his daily SSC snapshots. I think that is a cool feature, but I see it as a complement to the feature we shared here. Our new feature creates a time capsule of all ado-programs no matter the source: SSC, GitHub, etc. If this time capsule was not provided, then ssc2 can be useful for trying to retroactively recreate it, to figure out the exact code that generated some results.

                            How would ssc2 work if you are trying to install an old version of a command of which you already have a newer version installed? Then you would have to change the ado-paths, as in my feature, anyway to install the commands in a different location, right?



                            • #15
                              Originally posted by Kristoffer Bjarkefur:
                              I would like to keep the discussion in this thread about the tool I shared. I would like to refer anyone who is not convinced that reproducibility in general is important to this overview where there are links to further reading.
                              I seriously doubt that using different versions of software contributes substantially to the replication crisis, but enough with this.


                              On the tool:

                              I do not see how

                              Code:
                              ieboilstart , version(14.1)
                              `r(version)'
                              is an improvement over

                              Code:
                              version 14.1
                              ieboilstart whatever_else_it_does
                              Having to reference a local macro (technically, a global here) is error-prone and arguably makes the code harder to read.


                              Changing the PLUS directory, which is what the adopath() option does, will not suffice for installing community-contributed commands where you want them if net ado has been set to a different place. Unfortunately, there is no easy way of obtaining the directories to which net is set (yes, you could work around that with log files). This implies that you probably do not want to net set in ieboilstart or, if you do, explicitly tell users about that.
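                              To make that interaction concrete, these are the two settings involved; combining them like this is a sketch of one possible arrangement, not ieboilstart's documented behavior:

                              ```stata
                              * Sketch: make the PLUS directory and net installations point to one place.
                              sysdir set PLUS "C:/myproject/ado"
                              net set ado PLUS        // direct subsequent net/ssc installs to PLUS
                              ```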

                              Moreover, telling users not to keep ssc install commands in their code (source) might not be the best approach. Why not set up a (sub-)command that looks into the set ado-folder and merely does nothing when a community-contributed command is already found? Then tell users to use this instead of ssc install in their code. A basic version could look like

                              Code:
                              *! version 1.0.0  11apr2023
                              program install_ado
                                  
                                  syntax anything [ , FROM(passthru) NOIsily ]
                                  
                                  if (`"`from'"' != "") local net_or_ssc net
                                  else local net_or_ssc ssc
                                  
                                  capture `noisily' which `anything'
                                  
                                  if ( !_rc ) exit
                                  
                                  if (_rc != 111) error _rc
                                  
                                  `net_or_ssc' install `anything' , `from'
                                  
                              end
                              For a related approach, see rqrs.


                              Aside from these points, your approach should work well for what you want.
                              Last edited by daniel klein; 11 Apr 2023, 15:05.

